angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License
743 stars 152 forks source link

Update Lda.php #46

Open hamzarabbani opened 7 years ago

hamzarabbani commented 7 years ago

There's an error on line 243. Which is resolved with try catch.

angeloskath commented 7 years ago

Hi,

Thanks for the pull request! Could you perhaps provide a test that fails without the change? If you don't feel like providing one I will look into it more when I have next week.

Thanks again, it may take some time but I will probably merge it.

Angelos

hamzarabbani commented 7 years ago

Simple example to generate error:

use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Tokenizers\WhitespaceTokenizer;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;
use NlpTools\Models\Lda;

if (file_exists($argv[1])
    $docs = file($argv[1]);
else
    die(1);
$tok = new WhitespaceTokenizer();
$tset = new TrainingSet();
foreach ($docs as $f) {
    $tset->addDocument(
        '', // the class is not used by the lda model
        new TokensDocument(
            $tok->tokenize(
                file_get_contents($f)
            )
        )
    );
}
$lda = new Lda(
    new DataAsFeatures(), // a feature factory to transform the document data
    5, // the number of topics we want
    1, // the dirichlet prior assumed for the per document topic distribution
    1  // the dirichlet prior assumed for the per word topic distribution
);
// run the sampler 50 times
$lda->train($tset,50);
print_r(
    $lda->getDocumentsPerTopicsProbabilities(10)
);

It will generate error. Because at line 243 in Lda.php it is written.

$count_topics_docs = array();
         foreach ($this->count_docs_topics as $doc=>$topics) {
             foreach ($topics as $t=>$c)
                $count_topics_docs[$doc][$t]++;
         }

$count_topics_docs[$doc][$t] has not been assigned any value at this point so we just cannot add ++ to it. I solved this problem by using try catch

$count_topics_docs = array();
         foreach ($this->count_docs_topics as $doc=>$topics) {
             foreach ($topics as $t=>$c){
                 try {
                     $count_topics_docs[$doc][$t]++;
                 } catch (\Exception $ex){
                     $count_topics_docs[$doc][$t] = 1;
                 }
             }

There's another issue as well. On line 250 & 252 it is using variable named $limit_words. Which is not defined. It should have been $limit_docs (as this variable is being passed into parameter).

if ($limit_words>0) {
 arsort($p);
 $p = array_slice($p,0,$limit_words,true); // true to preserve the keys
}

to

if ($limit_docs > 0) {
 arsort($p);
 $p = array_slice($p,0,$limit_docs,true); // true to preserve the keys
}
angeloskath commented 7 years ago

So, do you care about getting the merge or should I just go ahead and fix it?

I am thinking the following if you care to do it yourself for some reason

Tell me if you want to do this and thanks again for the issue and pull request!!!