bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Questions about choosing coherence measures #172

Open juneMJ opened 2 years ago

juneMJ commented 2 years ago

Hello, I'm trying several models with different coherence measures, but I have some questions I need to understand.

  1. Is the value of SLIDING_WINDOWS fixed, or can I change it within a range to compare which size works best?
  2. I'm modeling social media posts, so the posts are either long or very short. In this case, which would be better for probability estimation: DOCUMENT or SLIDING_WINDOWS?
  3. For the Pachinko Allocation model, some of the per-topic C_V coherence values come out as nan. What could be the problem?

Thank you very much.

bab2min commented 2 years ago

Hi @juneMJ

The coherence measures are defined as follows: https://github.com/bab2min/tomotopy/blob/d30964ce0610a5e34d3645cfc8c26d99536cac03/tomotopy/coherence.py#L62-L67

The second value is the default size of the sliding windows. If you don't provide the window_size argument to coherence.Coherence(), these default values are used. To find the best window_size, you would need to run experiments that evaluate how well each coherence score, computed with a specific window_size, matches human evaluation. That is costly, so it is recommended to use the default values suggested in several papers.
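A minimal sketch of such a window-size experiment, assuming you already have a trained tomotopy model `mdl`. The helper `scores_by_window` and its `coherence_cls` parameter are hypothetical names introduced for illustration; only `tomotopy.coherence.Coherence` (with its `coherence`, `window_size` and `top_n` arguments) and `get_score()` come from the library.

```python
def scores_by_window(mdl, window_sizes, preset="c_npmi", top_n=10, coherence_cls=None):
    """Return {window_size: average coherence over topics} for one preset.

    `coherence_cls` is a hypothetical hook for swapping in a stand-in class;
    when it is None, the real tomotopy.coherence.Coherence is imported lazily.
    """
    if coherence_cls is None:
        from tomotopy.coherence import Coherence as coherence_cls

    scores = {}
    for w in window_sizes:
        # Build one evaluator per candidate window size, overriding the preset's default.
        coh = coherence_cls(mdl, coherence=preset, window_size=w, top_n=top_n)
        # get_score() without arguments averages the score over all topics.
        scores[w] = coh.get_score()
    return scores
```

Calling `scores_by_window(mdl, [5, 10, 20, 50])` only gives the numbers to compare. Deciding which window size is "best" still requires checking the scores against human judgments of topic quality.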

I think it is enough to use the presets ('u_mass', 'c_uci', 'c_npmi') rather than specific combinations. 'c_v' is not recommended since it has some issues (#121, #126).
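For context on what the probability-estimation part of a combination changes, here is a pure-Python sketch of document-level versus sliding-window co-occurrence estimates. It is an illustration under stated assumptions, not tomotopy's implementation: the function names are hypothetical, and treating a document shorter than the window as a single window is an assumption of this sketch.

```python
def doc_cooccurrence_prob(docs, w1, w2):
    """P(w1, w2) as the fraction of documents that contain both words."""
    hits = sum(1 for d in docs if w1 in d and w2 in d)
    return hits / len(docs)


def window_cooccurrence_prob(docs, w1, w2, window_size):
    """P(w1, w2) as the fraction of sliding windows that contain both words.

    Assumption of this sketch: a document shorter than the window
    contributes exactly one window covering the whole document.
    """
    total = 0
    hits = 0
    for d in docs:
        n_windows = max(1, len(d) - window_size + 1)
        for i in range(n_windows):
            window = d[i:i + window_size]
            total += 1
            if w1 in window and w2 in window:
                hits += 1
    return hits / total


docs = [
    ["cat", "dog"],                                # very short post
    ["cat", "a", "b", "c", "d", "e", "f", "dog"],  # long post, words far apart
]

p_doc = doc_cooccurrence_prob(docs, "cat", "dog")
p_win = window_cooccurrence_prob(docs, "cat", "dog", window_size=3)
```

In this toy corpus the document-level estimate counts the long post as a co-occurrence even though the two words are far apart, while the sliding-window estimate only counts the short post. This is the kind of difference to keep in mind when posts vary a lot in length.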

As for PAModel, there seems to be a bug in the implementation of the Coherence module. I'll look into this further.

juneMJ commented 2 years ago

Thank you @bab2min for the clarifications!