CTModel Numerical Stability

MVP1996 commented 2 years ago

Hi, big fan of the package, thanks for putting it together - I've had fantastic experiences with LDAModel and HDPModel, but am running into numerical stability issues as I'm trying out the CTModel.

A few quick clarifying questions on initialization:

1. For a model with a Logistic-Normal prior, why can't we specify the Mu and Sigma parameters of the prior when we initialize the model? Is this something that could be added in a future version, or is there a specific reason for starting with a fixed default Mu and Sigma?
2. When setting the prior, could you please clarify the role of "smoothing_alpha", which looks like a Dirichlet parameter? Does it get translated into initial parameters for the Logistic-Normal distribution? How should an asymmetric "smoothing_alpha" be interpreted in relation to the Logistic-Normal distribution?

Here's the numerical issue that I run into: I generate 2500 samples for my posterior, and I set the burn-in to 2000. During burn-in, the samples are generated quickly (2 minutes for 2000 samples) and there are no issues. Once the burn-in phase is over, however, sampling becomes much slower (20 minutes for 500 samples, or worse) and I frequently encounter the following warning:

D:\a\tomotopy\tomotopy\src\TopicModel\../Utils/TruncMultiNormal.hpp(56): wrong truncation range [0.195418, 0.186098]

I also occasionally encounter :

D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (106): D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (101): doc.beta.col(9) is -nan(ind) Failed to sample! Reset prior and retry!

I understand that this comes from the vertical line trick in the sampler from the Mimno (2008) paper, but I'm not sure what I can be doing to mitigate this issue. I've tried varying all of the input parameters, but haven't had any luck consistently resolving this problem. What should I be doing to mitigate this problem? Also, is there any obvious reason why it only occurs after the burn-in phase is complete? This issue essentially renders the CTModel not usable for my application, and I'm very eager to figure out what's going on.
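
For readers following along: the first warning reports a truncation interval whose lower bound (0.195418) is greater than its upper bound (0.186098), so there is nothing to sample from. The sketch below only illustrates that condition. `check_truncation_range` is a hypothetical helper, not a function from tomotopy, and it makes no claim about how the library computes the bounds.

```python
import math


def check_truncation_range(lower: float, upper: float) -> bool:
    """Return True if neither bound is NaN and lower <= upper.

    Hypothetical helper for illustration only; not part of tomotopy.
    """
    if math.isnan(lower) or math.isnan(upper):
        return False
    return lower <= upper


# Bounds copied from the warning message quoted above.
if not check_truncation_range(0.195418, 0.186098):
    print("empty truncation range: lower bound exceeds upper bound")
```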

I've also found that if I use the tw=tp.TermWeight.PMI option instead of the default, the problem occurs more frequently.

I've included a zip with a reproducible example; it contains both the code and a data sample: CTModel Example.zip

Any help would be greatly appreciated. Thanks in advance!

bab2min commented 2 years ago

Hello @MVP1996, thank you for sharing the details. First, to briefly answer each question:

1. Letting the user supply the initial values of mu and sigma seems like a good idea. The current implementation omits this because it was cumbersome, but it would be good to add it in the next update.
2. smoothing_alpha is involved in estimating the multivariate normal distribution from the samples. Specifically, see the following code: https://github.com/bab2min/tomotopy/blob/d30964ce0610a5e34d3645cfc8c26d99536cac03/src/TopicModel/CTModel.hpp#L76-L93 In that code, this->alpha holds the smoothing_alpha values. Without it, N_k sometimes becomes 0, which makes max_uk impossible to compute.
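
A minimal sketch of why a zero N_k is a problem and how an additive smoothing term avoids it. This is not the code at the link above. The names `smoothed_count` and `upper_bound_draw` are hypothetical, and the expression `u ** (1 / n_k)` is only an assumed stand-in for a computation that divides by N_k; the actual formula is in the linked source.

```python
def smoothed_count(raw_count: float, smoothing_alpha: float) -> float:
    """Add the smoothing term so the result stays positive for an unused topic."""
    return raw_count + smoothing_alpha


def upper_bound_draw(u: float, n_k: float) -> float:
    """Assumed form of a bound that depends on 1 / n_k (illustration only)."""
    return u ** (1.0 / n_k)


u = 0.5  # stands in for a uniform random draw

# A topic with no assigned words and no smoothing: n_k == 0.
try:
    upper_bound_draw(u, smoothed_count(0.0, 0.0))
except ZeroDivisionError:
    print("n_k == 0: the exponent 1 / n_k is undefined")

# With a small positive smoothing value the same computation is well defined.
print(upper_bound_draw(u, smoothed_count(0.0, 0.1)))
```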

For your third question: during burn-in, estimation of the multivariate prior distribution is skipped for faster training. Once burn-in ends, estimation begins, and that is when the numerical instability arises.
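
The schedule described here can be pictured with a toy loop. This is a sketch under stated assumptions, not tomotopy's training loop: `iterations_with_prior_estimation` is a hypothetical function, and whether estimation starts exactly at the burn-in iteration or one step later is an assumption.

```python
from typing import List


def iterations_with_prior_estimation(total_iters: int, burn_in: int) -> List[int]:
    """Return the iteration indices at which the prior would be re-estimated."""
    estimated_at = []
    for it in range(total_iters):
        # Topic sampling would run here on every iteration.
        if it >= burn_in:
            # Prior estimation is only reached once burn-in is over (assumed boundary).
            estimated_at.append(it)
    return estimated_at


# Numbers from the report above: 2500 samples with a burn-in of 2000.
steps = iterations_with_prior_estimation(total_iters=2500, burn_in=2000)
print(len(steps), steps[0])  # 500 2000
```

Under this picture, the slowdown and the warnings both show up only in the last 500 iterations because that is the only stretch in which the estimation step runs at all.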

Also, as you found, the model with TermWeight.PMI has more irrational word counts, so it yields less accurate results. Taken together, the current tomotopy implementation seems to have several numerical stability weaknesses that my test set did not reveal. I will dig into this issue in depth using the sample you shared. Thanks for sharing.

MVP1996 commented 2 years ago

Thank you for the clarification, it makes a lot more sense now. I look forward to the next tomotopy update!

bab2min / tomotopy

CTModel Numerical Stability #165