At line 90, the calculation of the background frequency seems wrong. Since the background frequency is essentially the probability of each word (a unigram language model probability), the correct expression should be `master.toarray().sum(0) / master.toarray().sum()`.
Here is what I found with ipdb when running the model on the AG dataset:
There is no error message when running on the AG dataset because num_doc > num_vocab. But with my smaller dataset, where num_doc < 3000, there is a dimension mismatch error during training.
I fixed the bug; the fix is exactly what I suggested: `master.toarray().sum(0) / master.toarray().sum()`.
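
For context, here is a minimal sketch of the corrected computation, assuming `master` is a scipy.sparse document-term count matrix of shape `(num_docs, num_vocab)`; the toy data, the shown wrong-axis variant, and the variable names are illustrative, not taken from the repository:

```python
# Minimal sketch of the corrected background-frequency computation.
# Assumption: `master` is a scipy.sparse document-term count matrix of
# shape (num_docs, num_vocab); the toy data here is illustrative only.
import numpy as np
from scipy import sparse

num_docs, num_vocab = 5, 8
rng = np.random.default_rng(0)
master = sparse.csr_matrix(rng.integers(0, 3, size=(num_docs, num_vocab)))

counts = master.toarray()

# Hypothetical wrong-axis version (the actual line 90 is not shown here):
# summing over axis 1 gives one value per document, not per word, so any
# downstream use expecting a length-num_vocab vector breaks when
# num_doc < num_vocab, matching the dimension mismatch described above.
# bad_bg = counts.sum(1) / counts.sum()   # shape: (num_docs,)

# Correct background frequency: per-word counts divided by the total
# count, i.e. the unigram language-model probability of each word.
bg_freq = counts.sum(0) / counts.sum()    # shape: (num_vocab,)

assert bg_freq.shape == (num_vocab,)
assert np.isclose(bg_freq.sum(), 1.0)
```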