Is there a way to 'weight' docs?

bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool

https://bab2min.github.io/tomotopy

MIT License


Is there a way to 'weight' docs? #160

Open batmanscode opened 2 years ago

batmanscode commented 2 years ago

I'm working with tweets and want to weight them by likes; I couldn't find an obvious way to do this going over the docs.

Is this possible?

bab2min commented 2 years ago

Hi @batmanscode, unfortunately there is currently no way to weight docs. I ran several experiments with different doc weightings in the past, but they didn't show any improvement over the unweighted model, so I dropped the document weighting feature from tomotopy.

However, you can simulate doc weighting by adding the same document multiple times. I recommend running an experiment with simulated weighting first: divide the documents into several sections by their number of likes, and insert each document a different number of times depending on its section, e.g. documents in the highest section 10 times each and documents in the lowest section once. If this experiment shows some improvement, I think it would be worth implementing a document weighting feature.
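The binning idea above could be sketched roughly as follows. This is only an illustration: the like thresholds, the repeat counts, and the helper names `repeats_for` and `expand_corpus` are hypothetical and not part of tomotopy.

```python
from bisect import bisect_right

# Hypothetical like thresholds and repeat counts; tune them for your data.
# Fewer than 50 likes: 1 copy, 50-999: 3 copies, 1,000-9,999: 6, 10,000+: 10.
BIN_EDGES = [50, 1_000, 10_000]
REPEATS = [1, 3, 6, 10]


def repeats_for(likes):
    """Return how many copies of a document to insert for a given like count."""
    return REPEATS[bisect_right(BIN_EDGES, likes)]


def expand_corpus(docs, likes):
    """Yield each tokenized document as many times as its like bin dictates."""
    for tokens, n_likes in zip(docs, likes):
        for _ in range(repeats_for(n_likes)):
            yield tokens


# Toy data: three tokenized tweets and their like counts.
docs = [["topic", "model"], ["viral", "tweet"], ["quiet", "tweet"]]
likes = [120, 50_000, 0]
weighted = list(expand_corpus(docs, likes))
```

Each token list in `weighted` could then be passed to the model's `add_doc` method before training, so that heavily liked tweets contribute more word counts than rarely liked ones.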

batmanscode commented 2 years ago

Very interesting @bab2min, thanks for sharing about your experiments and thanks for the suggestion!

At the moment I've simply multiplied each tweet by its number of likes, and so far this seems to give better results.

There are some considerations, however:

Weighting is most effective when the range of likes is large, e.g. 0-50k likes.
Weighting is less effective (similar to not weighting) when the range is smaller, e.g. 2.5k-50k likes.
The most accurate results seem to come from taking specific "bins" of tweets instead of multiplying by likes, e.g. 50-100 likes, 1-2k likes, etc.

Also, I noticed that when I both multiply tweets by likes and use min_df=1000, min_cf=10, I get a much better log likelihood: around -4.5 compared to -6.5. I would have thought that weighting and using min_df together might be a little redundant.

I will reply here after experimenting more to report whether weighting (or some other method) delivers better results overall. Thanks!

bab2min commented 2 years ago

@batmanscode Thank you for sharing your detailed experience! Most of what you said sounds reasonable. However, there is a pitfall in improving log likelihood by adjusting min_df and min_cf. If you set min_df and min_cf larger, more uncommon words are excluded, which naturally increases the log likelihood. Aside from that, I'll consider implementing doc weighting in tomotopy.
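The pitfall can be illustrated with a toy example. The sketch below uses a simple maximum-likelihood unigram model as a stand-in, not tomotopy's LDA, and the helper names `ll_per_word` and `drop_rare` are hypothetical. It only shows that dropping rare words raises the per-word log likelihood even though nothing about the model has improved.

```python
import math
from collections import Counter


def ll_per_word(tokens):
    """Average log likelihood per token under a maximum-likelihood unigram model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return sum(c * math.log(c / total) for c in counts.values()) / total


def drop_rare(tokens, min_cf):
    """Remove every token whose collection frequency is below min_cf."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_cf]


# Toy corpus: three common words plus five words that occur only once.
tokens = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["v", "w", "x", "y", "z"]

ll_full = ll_per_word(tokens)
ll_filtered = ll_per_word(drop_rare(tokens, min_cf=2))
print(f"full vocabulary: {ll_full:.3f}, rare words dropped: {ll_filtered:.3f}")
```

Because the two values are computed over different vocabularies, the higher number for the filtered corpus says nothing about which setup models the data better. The same caution applies when comparing log likelihoods across different min_df and min_cf settings.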

batmanscode commented 2 years ago

Right, that makes sense! I hadn't considered that, thank you.