angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License

dendrogramToClusters() returning wrong number of clusters #47

Closed seanareed closed 7 years ago

seanareed commented 7 years ago

I'm sometimes getting 5 clusters back when I'm specifying 4.

Is this expected?

Example:

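use NlpTools\Clustering\Hierarchical;
use NlpTools\Clustering\MergeStrategies\SingleLink;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Similarity\Euclidean;

// Eight points on a vertical line (x = 0), one document per point.
$tset = new TrainingSet();
$points = array(0.60, 0.61, 0.62, 0.66, 0.67, 0.70, 0.71, 0.90);
foreach ($points as $p) {
    $tset->addDocument(
        '', new TokensDocument(array('x' => 0, 'y' => $p))
    );
}
// Single-link agglomerative clustering with Euclidean distance.
$clust = new Hierarchical(
    new SingleLink(),
    new Euclidean()
);
$dendrogram = $clust->cluster($tset, new DataAsFeatures());
// Ask for 4 flat clusters.
$dclusters = Hierarchical::dendrogramToClusters($dendrogram, 4);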

Result (formatted for JSON response):

"dclusters": [
    [0, 1],
    [2],
    [3, 4],
    [5, 6],
    [7]
]

Changing the source data to the following works as I expect:

$points = array(0.60, 0.61, 0.62, 0.66, 0.67, 0.70, 0.71, 0.90, 0.91,
                0.92, 0.93, 1.01, 1.02, 1.03, 1.14, 1.15);

Result:

"dclusters": [
    [0, 1, 2],
    [3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12, 13],
    [14, 15]
]

angeloskath commented 7 years ago

Sorry for the late reply. I will check it tonight, but it does look like a bug.

angeloskath commented 7 years ago

OK, so I've looked into it a bit, and I am sorry to tell you that it is not a bug. Given just a dendrogram, it is not always possible to extract a specific number of clusters. I will try to illustrate my point using the following image.

[Image: dendrogram]

So you see, given just the dendrogram, there is no way to create 4 clusters. I'll leave the issue open for a few more days in case you want to add anything. By the way, this is also mentioned in the comment of the function dendrogramToClusters().
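
To make the point concrete, here is a minimal sketch of a level-by-level cut. It is not the library's actual code. It assumes the dendrogram is a nested array of point indices, and the tree shape below is a guess chosen to be consistent with the output reported above. flattenByLevel() is a hypothetical helper name. Because the tree carries no information about which join happened first, the sketch can only split every cluster on a level at once, so the cluster count goes 2, 3, 5 and never lands on 4.

```php
<?php
// Hypothetical dendrogram for the 8 points (assumed shape, not library output).
$dendrogram = array(
    array(
        array(array(0, 1), 2),
        array(array(3, 4), array(5, 6)),
    ),
    7,
);

// Sketch only: descend one level at a time, splitting every non-leaf
// cluster on that level, until at least $nc clusters exist.
function flattenByLevel(array $tree, $nc)
{
    $clusters = $tree;
    while (count($clusters) < $nc) {
        $next = array();
        $split = false;
        foreach ($clusters as $c) {
            if (is_array($c)) {
                foreach ($c as $child) {
                    $next[] = $child;
                }
                $split = true;
            } else {
                $next[] = $c; // a single point cannot be split
            }
        }
        if (!$split) {
            break; // only single points left
        }
        $clusters = $next;
    }

    // Replace each remaining subtree with the list of point indices under it.
    $flat = array();
    foreach ($clusters as $c) {
        $leaves = array();
        $wrapped = array($c);
        array_walk_recursive($wrapped, function ($leaf) use (&$leaves) {
            $leaves[] = $leaf;
        });
        $flat[] = $leaves;
    }

    return $flat;
}

$result = flattenByLevel($dendrogram, 4);
// $result is [[0, 1], [2], [3, 4], [5, 6], [7]]: five clusters, not four.
```

Asking the same sketch for 3 clusters gives exactly 3, since that count happens to coincide with a full level of the assumed tree.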

seanareed commented 7 years ago

Your explanation makes total sense.

No need to leave the issue open on my account, since I think it's clear that the result is a limitation of the source data and not a software issue.

Thanks!

angeloskath commented 7 years ago

Cool! But maybe I should add this as a suggestion for enhancement: if the cluster() method also returned the order of the joins in the dendrogram, it might be possible to extract exactly N clusters.
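
A rough sketch of that idea, under stated assumptions: the join order below is hypothetical (derived by hand from the gaps between the 8 points, with ties broken arbitrarily), and clustersFromJoins() is an illustrative helper, not part of the library. Replaying the joins in order and stopping once N clusters remain gives exactly N.

```php
<?php
// Hypothetical join order for the 8 points, smallest gap first.
// Each pair names one point from each of the two clusters being joined.
$joins = array(
    array(0, 1), array(1, 2), array(3, 4), array(5, 6), // gaps of 0.01
    array(4, 5),                                         // gap of 0.03
    array(2, 3),                                         // gap of 0.04
    array(6, 7),                                         // gap of 0.19
);

// Sketch only: replay joins in order until $nc clusters remain.
function clustersFromJoins($numPoints, array $joins, $nc)
{
    // Every point starts as the root of its own cluster.
    $parent = range(0, $numPoints - 1);
    $find = function ($i) use (&$parent) {
        while ($parent[$i] !== $i) {
            $i = $parent[$i];
        }
        return $i;
    };

    $remaining = $numPoints;
    foreach ($joins as $join) {
        if ($remaining <= $nc) {
            break; // reached the requested number of clusters
        }
        $a = $find($join[0]);
        $b = $find($join[1]);
        if ($a !== $b) {
            $parent[max($a, $b)] = min($a, $b); // smaller index becomes the root
            $remaining--;
        }
    }

    // Group point indices by the root of their cluster.
    $clusters = array();
    for ($i = 0; $i < $numPoints; $i++) {
        $clusters[$find($i)][] = $i;
    }

    return array_values($clusters);
}

$clusters = clustersFromJoins(8, $joins, 4);
// $clusters is [[0, 1, 2], [3, 4], [5, 6], [7]]: exactly four clusters.
```

When several joins happen at the same distance, the result for some N can depend on how the ties are ordered, so a real implementation would need a deterministic tie-breaking rule.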

Anyway, thanks for understanding and thanks for the issue.