bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

Unbalanced time periods in DTM model #115

Closed ebergam closed 3 years ago

ebergam commented 3 years ago

As I understand it, the DTM model takes a number of timepoints T, with t in [0, T), and slices the corpus evenly according to that integer. Hence, with N documents, each time period would include N/T documents. In that case, it would not be possible to have more documents in period t=1 than in t=2.

Am I misunderstanding?

If I am not, the model could be much more flexible (especially for applied work) if it were possible to pass an array (or list) of timepoints whose length equals the number of documents in the corpus, where each entry indicates the time period of the corresponding document. That way, DTModel could easily be applied to time-imbalanced datasets, which I believe represent a lot of real-world cases. What do you think?

Thanks a lot

bab2min commented 3 years ago

@ebergam Oh, I think there is a misunderstanding about what slicing the timepoints evenly means. It does not mean that all timepoints should have the same number of documents. In fact, each timepoint can have a different number of docs. Slicing the time evenly means that the spacing between adjacent timepoints is the same. This is a major limitation of DTM, whereas some models such as cDTM (continuous DTM) can accept arbitrary time intervals.

So, for example, if you use DTModel and add documents from 2000-2005 at t=0 and documents from 2005-2010 at t=1, you have to add documents from 2010-2015 at t=2.
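
To make the distinction concrete, here is a minimal sketch of the idea rather than an official recipe. It maps each document's year to an equally spaced timepoint index and shows that the number of documents per timepoint can still differ. The toy corpus, the helper names `year_to_timepoint` and `build_dtm`, and the half-open 5-year bins (so 2005 falls into t=1) are assumptions made for illustration. The `build_dtm` part assumes tomotopy's `DTModel(k=..., t=...)` constructor and `add_doc(words, timepoint=...)` method, and is only relevant if tomotopy is installed.

```python
from collections import Counter


def year_to_timepoint(year, start_year=2000, width=5):
    """Map a year to the index of its equal-width time slice.

    Slices are half-open: [2000, 2005) -> 0, [2005, 2010) -> 1, and so on.
    """
    if year < start_year:
        raise ValueError(f"year {year} is before start_year {start_year}")
    return (year - start_year) // width


# Hypothetical toy corpus: (publication year, tokenized document).
docs = [
    (2001, ["topic", "model"]),
    (2002, ["latent", "topic"]),
    (2003, ["word", "model"]),
    (2004, ["topic", "word"]),
    (2007, ["dynamic", "topic"]),
    (2012, ["dynamic", "model"]),
    (2013, ["latent", "word"]),
]

# One timepoint per document; the spacing between timepoints is fixed
# (5 years), but the number of documents per timepoint is not.
timepoints = [year_to_timepoint(year) for year, _ in docs]
docs_per_timepoint = Counter(timepoints)
num_timepoints = max(timepoints) + 1


def build_dtm(docs, k=2):
    """Sketch of feeding the documents to tomotopy (requires tomotopy)."""
    import tomotopy as tp  # imported lazily so the rest runs without it

    mdl = tp.DTModel(k=k, t=num_timepoints)
    for year, words in docs:
        # Each document carries its own timepoint index.
        mdl.add_doc(words, timepoint=year_to_timepoint(year))
    return mdl


if __name__ == "__main__":
    print(dict(docs_per_timepoint))
```

In this toy corpus, t=0 receives four documents, t=1 one, and t=2 two, while every timepoint still covers the same 5-year span.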

ebergam commented 3 years ago

Hi @bab2min, thanks a lot for your kind reply, now it's much clearer!