bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

got negative Coherence for DTM, including NPMI, UCI and U_mass #218

Open Garren87 opened 2 weeks ago

Garren87 commented 2 weeks ago

This is a great project and it has helped me a lot. I am using DTM on a set of abstracts of English scientific papers (about 60,000, spanning 2000 to 2024), all on the same topic: Electrochemical Energy Storage. I am trying to choose the optimal number of topics K based on common indicators such as coherence and perplexity. However, all the coherence measures provided by tp.coherence.Coherence().get_score() come out negative, including c_npmi, c_uci and u_mass. c_v seems to work, but other users have mentioned that c_v has problems too. The results I got with pyLDAvis were also poor, with a large overlap between topics. I have tried many changes, including different values of k from 2 to 100 and different settings for parameters such as timepoint, rm_top and min_df, but the results did not improve. Does this mean that there is a problem with my corpus?

P.S. DTM training fails when k=1, with the message: Process finished with exit code -1073741819 (0xC0000005)
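For context on the sign of these scores, here is a minimal pure-Python sketch of the commonly used document-level definitions of u_mass and NPMI. It is not tomotopy's implementation, which may differ in details such as windowing and smoothing. The helper names (doc_freq, u_mass, npmi), the toy documents and the eps value are all assumptions made for illustration. Under these definitions u_mass is a log of a ratio that cannot exceed 1 (up to eps), so it is normally at or below zero, while NPMI lies in [-1, 1] and is negative whenever two words co-occur less often than chance would predict.

```python
import math
from itertools import combinations

def doc_freq(docs, *words):
    """Number of documents (sets of words) that contain every word in `words`."""
    return sum(all(w in d for w in words) for d in docs)

def u_mass(docs, topic_words, eps=1e-12):
    """Mean of log((D(w_i, w_j) + eps) / D(w_j)) over ordered pairs with j < i.
    Assumes every topic word occurs in at least one document."""
    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            scores.append(math.log((doc_freq(docs, wi, wj) + eps) / doc_freq(docs, wj)))
    return sum(scores) / len(scores)

def npmi(docs, topic_words, eps=1e-12):
    """Mean normalized PMI over unordered word pairs, using document co-occurrence.
    Edge cases (e.g. a pair that appears in every document) are not handled."""
    n = len(docs)
    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_i = doc_freq(docs, wi) / n
        p_j = doc_freq(docs, wj) / n
        p_ij = doc_freq(docs, wi, wj) / n
        pmi = math.log((p_ij + eps) / (p_i * p_j))
        scores.append(pmi / -math.log(p_ij + eps))
    return sum(scores) / len(scores)

# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"battery", "lithium", "anode"},
    {"battery", "lithium", "cathode"},
    {"battery", "capacitor"},
    {"capacitor", "electrode"},
]

related = ["battery", "lithium"]      # words that co-occur often
unrelated = ["lithium", "capacitor"]  # words that never co-occur

print("u_mass related:  ", u_mass(docs, related))
print("u_mass unrelated:", u_mass(docs, unrelated))
print("npmi related:    ", npmi(docs, related))
print("npmi unrelated:  ", npmi(docs, unrelated))
```

In this toy corpus u_mass is negative even for the related pair, and NPMI changes sign between the two pairs. So a negative score is not by itself a sign of a bug; comparing scores across values of k, or against another model on the same corpus, is more informative than the sign alone.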

Garren87 commented 2 weeks ago

Well, I have tested LDAModel, and all the coherence measures work well there. What's more, even the pyLDAvis result becomes clear and meaningful. Does this mean that my corpus is not suitable for DTM, or does it still have some problems?