bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

CTModel Numerical Stability #165

Open MVP1996 opened 2 years ago

MVP1996 commented 2 years ago

Hi, big fan of the package, thanks for putting it together - I've had fantastic experiences with LDAModel and HDPModel, but am running into numerical stability issues as I'm trying out the CTModel.

A few quick clarifying questions on initialization:

Here's the numerical issue I run into: I generate 2500 samples for my posterior and set the burn-in to 2000. During burn-in, samples are generated quickly (2 minutes for 2000 samples) and there are no issues. Once the burn-in phase is over, however, sampling becomes much slower (20 minutes for 500 samples, or worse), and I frequently encounter the following warning:

D:\a\tomotopy\tomotopy\src\TopicModel\../Utils/TruncMultiNormal.hpp(56): wrong truncation range [0.195418, 0.186098]
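The range printed in this warning has a lower bound (0.195418) that is larger than its upper bound (0.186098), so there is no interval to sample from. A minimal Python sketch of that kind of check is shown here, assuming a one-dimensional truncated normal sampled by inverse CDF. The function name `sample_truncated_normal` and its parameters are hypothetical, and this is not the code in TruncMultiNormal.hpp.

```python
import random
from statistics import NormalDist


def sample_truncated_normal(mu, sigma, lower, upper, rng=random):
    """Draw one value from N(mu, sigma^2) restricted to [lower, upper].

    Hypothetical helper for illustration only; not part of tomotopy.
    """
    # An inverted or empty interval cannot be sampled from.
    if not lower < upper:
        raise ValueError(f"wrong truncation range [{lower}, {upper}]")
    dist = NormalDist(mu, sigma)
    # Map the bounds to CDF space, draw uniformly there, then invert.
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf is undefined at exactly 0 and 1, so clamp slightly inside.
    u = min(max(u, 1e-12), 1 - 1e-12)
    # Clamp the result so rounding never pushes it outside the interval.
    return min(max(dist.inv_cdf(u), lower), upper)
```

Calling it with the bounds from the warning, `sample_truncated_normal(0.0, 1.0, 0.195418, 0.186098)`, raises the `ValueError`, while a valid range such as `(0.1, 0.2)` returns a value inside that interval.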

I also occasionally encounter:

D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (106): D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (101): doc.beta.col(9) is -nan(ind) Failed to sample! Reset prior and retry!

I understand that this comes from the vertical line trick in the sampler from the Mimno (2008) paper, but I'm not sure what I can do to mitigate it. I've tried varying all of the input parameters, but haven't had any luck resolving the problem consistently. Is there something I should be doing differently? Also, is there any obvious reason why it only occurs after the burn-in phase is complete? This issue essentially renders the CTModel unusable for my application, and I'm very eager to figure out what's going on.

I've also found that if I use the tw=tp.TermWeight.PMI option instead of the default, this occurs more frequently.

I've included a zip with a reproducible example; it includes both code and a data sample: CTModel Example.zip

Any help would be greatly appreciated. Thanks in advance!

bab2min commented 2 years ago

Hello @MVP1996, thank you for sharing the details. First, to briefly answer each question:

For your third question: during burn-in, estimation of the multivariate prior distribution is skipped to make training faster. Once burn-in ends, that estimation begins, and this is where the numerical instability arises.
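The control flow described here can be sketched schematically in Python. This is only an illustration under stated assumptions: `run_sampler`, `sample_topics`, and `estimate_prior` are hypothetical names and do not come from tomotopy. The loop skips the prior-estimation step until the burn-in count is reached and records how much time each phase takes.

```python
import time


def run_sampler(total_iters, burn_in, sample_topics, estimate_prior):
    """Run a sampling loop that skips prior estimation during burn-in.

    sample_topics and estimate_prior are hypothetical callables standing in
    for the per-iteration work; they are not tomotopy functions.
    """
    timings = {"burn_in": 0.0, "post_burn_in": 0.0}
    prior_updates = 0
    for it in range(total_iters):
        start = time.perf_counter()
        sample_topics(it)
        # The extra (and potentially unstable) step runs only after burn-in.
        if it >= burn_in:
            estimate_prior(it)
            prior_updates += 1
        phase = "burn_in" if it < burn_in else "post_burn_in"
        timings[phase] += time.perf_counter() - start
    return timings, prior_updates
```

With 2500 iterations and a burn-in of 2000, `estimate_prior` would run 500 times in this sketch, which matches the pattern reported above: both the slowdown and the warnings appear only in the last 500 samples.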

Also, as you found, the model under TermWeight.PMI works with more irrational word counts, so it yields less accurate results. Taken together, the current tomotopy implementation seems to have several weaknesses in numerical stability that my test set did not reveal. Based on the sample you shared, I will dig into this issue in depth. Thanks for sharing.
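To make the point about word counts concrete, here is a small Python sketch of a PMI-style term weighting. It assumes that "irrational" refers to non-integer, real-valued weights, and that each token is weighted by max(log(p(w|d) / p(w)), 0). Both are assumptions made for illustration. The function `pmi_weights` is hypothetical and is not tomotopy's implementation.

```python
import math
from collections import Counter


def pmi_weights(docs):
    """Return, per document, a {word: per-token weight} mapping.

    Assumed weight: max(log(p(w|d) / p(w)), 0). Illustration only.
    """
    corpus_counts = Counter(w for doc in docs for w in doc)
    total = sum(corpus_counts.values())
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        n = len(doc)
        weighted.append({
            # p(w|d) = c / n, p(w) = corpus count / total tokens
            w: max(math.log((c / n) / (corpus_counts[w] / total)), 0.0)
            for w, c in tf.items()
        })
    return weighted
```

Under this weighting, a word that is no more frequent in a document than in the corpus gets weight 0, and every other weight is a positive real number rather than an integer count, so the sampler no longer operates on whole-number counts.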

MVP1996 commented 2 years ago

Thank you for the clarification; it makes a lot more sense now. I look forward to the next tomotopy update!