angeloskath / php-nlp-tools

Natural Language Processing Tools in PHP
Do What The F*ck You Want To Public License
743 stars, 152 forks

getDocumentsPerTopicsProbabilities Undefined offset: 0 #64

Open slava-vishnyakov opened 6 years ago

slava-vishnyakov commented 6 years ago

I'm trying to follow http://php-nlp-tools.com/posts/introducing-latent-dirichlet-allocation.html, but calling getDocumentsPerTopicsProbabilities at the end fails:

$docs = [
    'The queen does something',
    'Queen is very good queen',
    'Mission mission mission',
    'What is mission your mission'
];

$tok = new WhitespaceTokenizer();
$tset = new TrainingSet();
foreach ($docs as $line) {
    $tset->addDocument(
        '', // the class is not used by the lda model
        new TokensDocument(
            $tok->tokenize(
                mb_strtolower($line)
            )
        )
    );
}

$lda = new Lda(
    new DataAsFeatures(), // a feature factory to transform the document data
    2, // the number of topics we want
    1, // the dirichlet prior assumed for the per document topic distribution
    1  // the dirichlet prior assumed for the per word topic distribution
);

$lda->train($tset,50);

$lda->getDocumentsPerTopicsProbabilities(2);

This results in:

Undefined offset: 0 at vendor/nlp-tools/nlp-tools/src/NlpTools/Models/Lda.php:243

This probably requires something along the lines of:

if (!isset($count_topics_docs[$doc])) {
    $count_topics_docs[$doc] = [];
}
if (!isset($count_topics_docs[$doc][$t])) {
    $count_topics_docs[$doc][$t] = 0;
}
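
As a minimal, self-contained sketch of that guard (the helper name `incrementCount` is hypothetical and not part of NlpTools; only the `$count_topics_docs[$doc][$t]` layout is taken from the snippet above), the counter can be initialised before it is incremented, and read back with a default:

```php
<?php
// Hypothetical helper, not part of NlpTools: increments a nested counter,
// creating the missing levels first so PHP never reads an undefined offset.
function incrementCount(array &$counts, int $doc, int $topic): void
{
    if (!isset($counts[$doc])) {
        $counts[$doc] = [];
    }
    if (!isset($counts[$doc][$topic])) {
        $counts[$doc][$topic] = 0;
    }
    $counts[$doc][$topic]++;
}

$count_topics_docs = [];
incrementCount($count_topics_docs, 0, 1);
incrementCount($count_topics_docs, 0, 1);
incrementCount($count_topics_docs, 2, 0);

// Reading with a default avoids the notice for pairs that were never counted.
$unseen = $count_topics_docs[1][0] ?? 0;
```

Whether the right place for such a guard is where the counts are written or where they are read depends on how Lda.php fills the array, which I have not verified.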

Also, further down there is a variable $limit_docs, which is undefined. Maybe the signature of the method, public function getDocumentsPerTopicsProbabilities($limit_docs = -1), is incorrect, or maybe it's $limit_words there?

Anyway, after running this method on this input:

$docs = [
    'The queen does something',
    'Queen is very good queen',

    'Mission mission mission',
    'What is mission your mission'
];
...
$lda->getDocumentsPerTopicsProbabilities(2);

I get this result:

[
0.3333333333333333,
0.3333333333333333,
0.3333333333333333,
0.3333333333333333
]

And I'm not sure how to interpret that... :)

Thanks!

slava-vishnyakov commented 6 years ago

One possibility is that it should be returning $p_t_d instead of $p, but that has no useful information either.

$p_t_d is
array:2 [
  0 => array:4 [
    0 => 0.33333333333333
    1 => 0.33333333333333
    2 => 0.33333333333333
    3 => 0.33333333333333
  ]
  1 => & array:4 [
    0 => 0.33333333333333
    1 => 0.33333333333333
    2 => 0.33333333333333
    3 => 0.33333333333333
  ]
]
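
One way to sanity-check a dump like this, assuming $p_t_d is indexed as [topic][document] (an assumption based on the 2 outer and 4 inner entries, matching 2 topics and 4 documents), is to add up the probabilities over all topics for each document. For a proper distribution that total should be 1. The helper name `topicMassPerDocument` below is hypothetical, and this is only a rough check, not the library's own logic:

```php
<?php
// Hypothetical helper: sums P(topic | document) over all topics, per document.
// Assumes $p_t_d is indexed as [topic][document].
function topicMassPerDocument(array $p_t_d): array
{
    $sums = [];
    foreach ($p_t_d as $topic => $docs) {
        foreach ($docs as $doc => $p) {
            $sums[$doc] = ($sums[$doc] ?? 0.0) + $p;
        }
    }
    return $sums;
}

// Values as shown in the dump.
$p_t_d = [
    0 => [0.33333333333333, 0.33333333333333, 0.33333333333333, 0.33333333333333],
    1 => [0.33333333333333, 0.33333333333333, 0.33333333333333, 0.33333333333333],
];

foreach (topicMassPerDocument($p_t_d) as $doc => $sum) {
    // A well-formed distribution gives a total close to 1.0 here.
    $verdict = abs($sum - 1.0) < 1e-6 ? 'ok' : 'does not sum to 1';
    printf("doc %d: %.3f %s\n", $doc, $sum, $verdict);
}
```

With the values above, every document totals about 0.667 rather than 1, which suggests the numbers are not a properly normalised per-document topic distribution.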

slava-vishnyakov commented 6 years ago

Ok, maybe I have figured this out in PR #67
