bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License
557 stars 62 forks

Error with saving/loading PLDAModel #204

Status: Open · opened by juhopaak 1 year ago

juhopaak commented 1 year ago

When I train a PLDAModel and then save and load it, the model's properties change after loading.

For instance

from tomotopy import PLDAModel
docs = [['foo'], ['bar'], ['baz'], ['foo', 'bar'], ['baz', 'bar']]
mdl = PLDAModel(latent_topics=2)
for doc in docs:
    mdl.add_doc(doc)
mdl.train(100)
print(mdl.summary())
print(mdl.perplexity)

produces

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -1.94159
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3261328688 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [3.0139716 7.3531275]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo

6.985572470333207

but after calling

mdl.save('model.bin', full=True)
mdl = PLDAModel.load('model.bin')
print(mdl.summary())
print(mdl.perplexity)

I get

<Basic Info>
| PLDAModel (current version: 0.12.4)
| 5 docs, 7 words
| Total Vocabs: 3, Used Vocabs: 3
| Entropy of words: 1.07899
| Entropy of term-weighted words: 1.07899
| Removed Vocabs: <NA>
| Label of docs and its distribution
|
<Training Info>
| Iterations: 100, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -2.19768
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 0 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 0 (the number of top words to be removed)
| latent_topics: 2 (the number of latent topics, which are shared to all documents, between 1 ~ 32767)
| topics_per_label: 1 (the number of topics per label between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 3666141070 (random seed)
| trained in version 0.12.4
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic distributions)
|  [0.1 0.1]
| eta (Dirichlet prior on the per-topic word distribution)
|  0.01
|
<Topics>
| Latent 0 (#0) (2) : foo bar baz
| Latent 1 (#1) (5) : bar baz foo
|

9.004082581035151

The model's log-likelihood per word and perplexity diverge between before and after saving/loading, and the fitted alpha values under <Parameters> are reset from [3.0139716 7.3531275] to the initial [0.1 0.1]. I've tried this with both full=True and full=False, and with repeated saves/loads, but the issue persists.
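As a side note on how the two numbers relate: the reported perplexity appears to be exp(-log-likelihood per word), at least for the post-load summary, where exp(2.19768) ≈ 9.0041 matches the printed perplexity. So the perplexity jump is the log-likelihood drop in another form, rather than two independent symptoms. A quick sanity check in plain Python, using only the values printed above:

```python
import math

# Values copied from the post-load summary and perplexity printout above.
ll_per_word = -2.19768
reported_perplexity = 9.004082581035151

# Hypothesis: perplexity == exp(-log_likelihood_per_word).
derived = math.exp(-ll_per_word)
print(derived)  # ~9.0041
assert abs(derived - reported_perplexity) < 1e-3
```

The pre-save pair is close but not exact (exp(1.94159) ≈ 6.970 vs the printed 6.986), possibly because the summary's log-likelihood is taken from a training checkpoint, so this relationship is an observation from the numbers above, not something confirmed from tomotopy's internals.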