angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License
743 stars 152 forks source link

Hierarchical Clustering #69

Closed williamjulianvicary closed 5 years ago

williamjulianvicary commented 5 years ago

I'm looking at using Hierarchical Clustering to identify groups of similar documents based on an array intersection of shared similarities.

Example data: Clustering jobs based on common applicants to each job, so if a job shares applicants with another job they are assumed to be similar.

My data is similar to this:

$data = [
   ['John Smith', 'Joe Bloggs', 'Jack Craw'],
   ['Jack Smith', 'Johnny D', 'Joseph P'],
   ['John Smith', 'Joe Bloggs', 'Jack Craw'],
   ['Jack Smith', 'Johnny D', 'Joseph P']
];

I've added a Jaccard distance calculator:

class Jaccard implements DistanceInterface
{
    public function dist(&$A, &$B)
    {
        $arrayIntersection = array_intersect( $A['urls'], $B['urls'] );
        $arrayUnion = array_unique(array_merge( $A['urls'], $B['urls'] ));

        return count( $arrayIntersection ) / count( $arrayUnion );
    }
}

The distances are calculating as expected with higher numbers indicating a closer match in the matrix however the results after the Clustering takes place are seemingly random.

Am I barking up the wrong tree by attempting to cluster in this manner?

angeloskath commented 5 years ago

Hi,

It is just that you are using the Jaccard similarity as a distance. Small distances indicate close match thus the result of the clustering is basically unrelated documents.

In addition, the JaccardIndex is implemented in NlpTools so all you need is a proper feature factory. Something like the following might work

use \NlpTools\Similarity\JaccardIndex;
use \NlpTools\FeatureFactories\FunctionFeatures;
use \NlpTools\Clustering\Hierarchical;
use \NlpTools\Clustering\MergeStrategies\CompleteLink;

// assume that we have the doc set as $tset
$clust = new Hierarchical(new CompleteLink(), new JaccardIndex());
$dendrogram = $clust->cluster($tset, new FunctionFeatures(array(function ($doc) {
    return array_keys($doc["urls"]);
})));

If for some reason you do not want to use the provided JaccardIndex you can simply return 1-count($arrayIntersection)/count($arrayUnion) and it will also work.

Cheers, Angelos

williamjulianvicary commented 5 years ago

Wow, thanks for the quick response, much appreciated! I'm not sure how I missed the JaccardIndex already available - doh!

I tweaked the code slightly as $doc was returning an empty string (seems this is $class rather than $doc), example code:

$results = [
            ['job' => 'Job Title 1', 'applicantIds' => [1,2,3,4,5] ],
            ['job' => 'Job Title 2', 'applicantIds' => [6,7,8,9,10] ],
            ['job' => 'Job Title 3', 'applicantIds' => [6,7,8,9,10] ],
            ['job' => 'Job Title 4', 'applicantIds' => [6,7,8,9,10] ],
            ['job' => 'Job Title 5', 'applicantIds' => [6,7,8,9,10] ],
            ['job' => 'Job Title 6', 'applicantIds' => [1,2,3,4,5] ],
            ['job' => 'Job Title 7', 'applicantIds' => [1,2,3,4,5] ],
            ['job' => 'Job Title 8', 'applicantIds' => [1,2,3,4,5] ],
            ['job' => 'Job Title 9', 'applicantIds' => [1,2,3,4,5] ]
        ];

        foreach ($results as $result)
        {
            $tset->addDocument('', new RawDocument($result));
        }

        $clust = new Hierarchical(
            new CompleteLink(),
            new JaccardIndex()
        );

        $dendrogram = $clust->cluster($tset, new FunctionFeatures(array(function ($class, $doc) {
            return array_values($doc->getDocumentData()['applicantIds']);
        })));

        print_r($dendrogram);

        $clusters = Hierarchical::dendrogramToClusters($dendrogram, 2);

        foreach ($clusters as $i => $cluster) {
            echo "Cluster " . $i . "\n";
            foreach ($cluster as $item) {
                echo "\t" . $results[$item]['job'] . "\n";
            }
        }

This then returns groupings as you would expect:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => Array
                        (
                            [0] => Array
                                (
                                    [0] => Array
                                        (
                                            [0] => 0
                                            [1] => 5
                                        )

                                    [1] => 8
                                )

                            [1] => Array
                                (
                                    [0] => 6
                                    [1] => 7
                                )

                        )

                    [1] => Array
                        (
                            [0] => Array
                                (
                                    [0] => Array
                                        (
                                            [0] => 1
                                            [1] => 2
                                        )

                                    [1] => 3
                                )

                            [1] => 4
                        )

                )

        )

)
Cluster 0
    Job Title 1
    Job Title 6
    Job Title 9
    Job Title 7
    Job Title 8
Cluster 1
    Job Title 2
    Job Title 3
    Job Title 4
    Job Title 5

Thank you!

Is there a method that would help to identify an ideal number of clusters to return? Or is this beyond the scope of the Hierarchical Clustering? I did see KMeans as an alternate option as it defaults to a single level of returned results, but the EuclideanCF doesn't seem compatible with the JaccardIndex (the distances are all 1), and that still requires an input cluster number.

$clust = new KMeans(
            2, // two clusters
            new JaccardIndex(),
            new EuclideanCF()
        );

Thanks for all your efforts with the code, it's a great implementation!

williamjulianvicary commented 5 years ago

On a second look I think DBScan may be a more applicable clustering approach, I see you've got something in develop - I'll look into that :-)

Thanks again!

angeloskath commented 5 years ago

I am closing the issue, if you have any more problems let me know.