bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Question about SLDAModel input parameter #82

Closed: XiaoSong9905 closed this issue 3 years ago

XiaoSong9905 commented 3 years ago

Hi,

Thanks a lot for your work on tomotopy!

I'm using the SLDAModel https://bab2min.github.io/tomotopy/v0.9.0/en/#tomotopy.SLDAModel but I'm a bit confused about what its parameters mean. I hope you can help me.

I'm trying to train SLDA to predict document categories from document text. For each document I have its text and a category label (politics/travel/...), with 41 categories in total. I've set the number of SLDA topics (k) to 41, the same as the number of categories I want to predict. At inference time, I choose the topic with the highest likelihood as the category of the document.

import tomotopy as tp

model = tp.SLDAModel(vars=['l'], alpha=0.1, eta=0.01, seed=451, k=41)

for words, y in data[:500]: # assuming data is a list of (words, y) pairs
    model.add_doc(words, y) # words is a list of words, y is an int in [0, 40]
    # I'm not sure if adding y in this way is correct. Please correct me if I'm wrong

model.train(200)
doc_inst = model.make_doc('example document information separate by white lines'.split())
model.infer(doc_inst)[0].argmax() # argmax returns the index of the topic with the highest likelihood

I'm confused about how to set vars in SLDAModel. The documentation says, "The length of vars determines the number of response variables, and each element of vars determines a type of the variable." In my case, should the "number of response variables" be 1 or 41? And should the variable type be binary or linear?

Also, I'm confused about the estimate() function in SLDAModel. The documentation says, "If doc is an unseen document instance which is generated by SLDAModel.make_doc() method, it should be inferred by LDAModel.infer() method first." Does that mean I should

doc_inst = model.make_doc('example document information separate by white lines'.split())
model.infer(doc_inst) # call this function first
model.estimate(doc_inst) # before calling this function? 

or it's fine to just

doc_inst = model.make_doc('example document information separate by white lines'.split())
model.estimate(doc_inst) # without calling infer() first?

Also, why does estimate() return a floating-point number? What does this number mean?

Thanks a lot for your help.

mtchibozo commented 3 years ago

This could work, but it seems very strange to use sLDA and ignore the regression parameters for the prediction. Using model.estimate(...) combines information about the topics (including their likelihoods) with the estimated regression parameter of each topic, and (I'm guessing) should lead to a more nuanced prediction. sLDA is a mixed-membership model, meaning each document contains a mixture of topics (you can read the original paper or the tomotopy documentation for a better explanation).
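
To make the idea of "combining topics and regression parameters" concrete, here is a minimal sketch of the general sLDA formulation: a linear score is the dot product of a document's topic proportions and one coefficient per topic, and a binary response passes that score through a logistic function. This illustrates the concept only and is not tomotopy's internal code; the function names and the numbers are hypothetical.

```python
import math


def linear_response(topic_props, coefs):
    """Dot product of topic proportions and per-topic regression coefficients."""
    if len(topic_props) != len(coefs):
        raise ValueError("topic_props and coefs must have the same length")
    return sum(p * c for p, c in zip(topic_props, coefs))


def binary_response(topic_props, coefs):
    """Logistic function of the same linear score, giving a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_response(topic_props, coefs)))


# Hypothetical values for a 3-topic model (not taken from any trained model).
topic_props = [0.7, 0.2, 0.1]
coefs = [2.0, -1.0, 0.5]

score = linear_response(topic_props, coefs)
prob = binary_response(topic_props, coefs)
```

Under this reading, a floating-point estimate is a predicted response value for the document, not a topic index.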

You don't need 41 clusters to predict 41 classes, but you will need 41 y variables, which you can get by dummy-encoding the category into binary variables. In that case, the y variable for each document will be a list of 41 values, each zero or one, and sLDA will figure out the rest. Think of the model trained with 41 variables as a multivariate sLDA.
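
As a rough sketch of the dummy encoding described here, the helper below turns an integer label in [0, 40] into a list of 41 zeros and ones. The name one_hot is hypothetical, and the commented-out add_doc call assumes a model created with 41 binary response variables.

```python
def one_hot(label, num_classes=41):
    """Return a list of num_classes floats with a single 1.0 at index label."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label must be in [0, {num_classes - 1}], got {label}")
    y = [0.0] * num_classes
    y[label] = 1.0
    return y


encoded = one_hot(3)

# Assumed usage with a model defined as tp.SLDAModel(vars=['b'] * 41, ...):
# model.add_doc(words, y=one_hot(label))
```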

With the code that you wrote, sLDA will treat your y variable as a continuous variable, which is not correct if your categories aren't ordinal.

import tomotopy as tp

model = tp.SLDAModel(vars=['l'], alpha=0.1, eta=0.01, seed=451, k=41)

for words, y in data[:500]: # assuming data is a list of (words, y) pairs
    model.add_doc(words, y) # words is a list of words, y is an int in [0, 40]
    # I'm not sure if adding y in this way is correct. Please correct me if I'm wrong
    # This is correct if y is a list of 41 binary values. In that case, you should define the model by:
    # model = tp.SLDAModel(vars=['b'] * 41, alpha=0.1, eta=0.01, seed=451, k=10)
    # You might also want to use fewer clusters.
model.train(200)
doc_inst = model.make_doc('example document information separate by white lines'.split())
model.infer(doc_inst) # You have to add this if the document was not seen by the model during training.
model.infer(doc_inst)[0].argmax() # argmax returns the index of the topic with the highest likelihood
model.estimate(doc_inst) # This is what you want to use for the predictions.

See the comments in the code above.

"Also, I'm confused about the estimate() function in SLDAModel. In the document it mentioned 'If doc is an unseen document instance which is generated by SLDAModel.make_doc() method, it should be inferred by LDAModel.infer() method first.' Is that means I should"

doc_inst = model.make_doc('example document information separate by white lines'.split())
model.infer(doc_inst) # call this function first
model.estimate(doc_inst) # before calling this function? 

Yes, you should do this.
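
The call order can be wrapped in a small helper so that infer() is never forgotten for unseen documents. This is only a sketch: estimate_unseen is a hypothetical name, and _StubModel is a stand-in that lets the snippet run without tomotopy installed. With a trained SLDAModel you would pass the real model instead.

```python
def estimate_unseen(model, words):
    """Build an unseen document, infer its topics, then estimate its response."""
    doc = model.make_doc(words)
    model.infer(doc)  # must run before estimate() for unseen documents
    return model.estimate(doc)


class _StubModel:
    """Hypothetical stand-in that records the order of the calls it receives."""

    def __init__(self):
        self.calls = []

    def make_doc(self, words):
        self.calls.append("make_doc")
        return list(words)

    def infer(self, doc):
        self.calls.append("infer")
        return [1.0], 0.0  # placeholder (topic distribution, log-likelihood)

    def estimate(self, doc):
        self.calls.append("estimate")
        return [0.5]  # placeholder estimate


stub = _StubModel()
result = estimate_unseen(stub, "example document".split())
```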

Again, you might want to read the sLDA paper. The value returned by estimate() is the prediction that sLDA makes: it is a supervised method. With your current code, since you use a single linear ('l') y variable which takes values from 0 to 40, sLDA will predict a real number, and you would have to pick the class whose index is closest to it. It might work better if you use 41 binary variables as mentioned above, but that depends on your data.
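
The two ways of turning estimates into a category can be sketched as follows. Both helper names are hypothetical, and the input is assumed to be a plain list of floats (for the binary setup) or a single float (for the linear setup), whatever the exact return type of estimate() is in your tomotopy version.

```python
def class_from_binary_estimates(estimates):
    """Pick the index of the largest of the per-class estimates."""
    return max(range(len(estimates)), key=lambda i: estimates[i])


def class_from_linear_estimate(value, num_classes=41):
    """Round a single real-valued estimate to the nearest valid class index."""
    return min(max(int(round(value)), 0), num_classes - 1)


binary_pick = class_from_binary_estimates([0.1, 0.8, 0.3])
linear_pick = class_from_linear_estimate(12.4)
clamped_high = class_from_linear_estimate(57.0)
clamped_low = class_from_linear_estimate(-3.2)
```

The linear variant shows why non-ordinal labels are a poor fit: a document that sits between class 0 and class 40 could receive an estimate near 20, which is an unrelated category.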

Hope this helps.