Open MVP1996 opened 2 years ago
Hello @MVP1996, thank you for sharing your details. First, to briefly answer each question:
smoothing_alpha is involved in estimating the multivariate normal distribution from the sample. Specifically, look at the following code:
https://github.com/bab2min/tomotopy/blob/d30964ce0610a5e34d3645cfc8c26d99536cac03/src/TopicModel/CTModel.hpp#L76-L93
In the above code, this->alpha holds the smoothing_alpha value. Without it, N_k sometimes becomes 0, which makes max_uk non-computable.
For your third question: during burn-in, estimation of the multivariate prior distribution is skipped to speed up training. Once burn-in ends, estimation begins, and that is when the numerical instability arises.
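To illustrate why a smoothing term matters here, the following is a minimal sketch, not the code from CTModel.hpp. It assumes a simplified setting in which per-topic counts N_k are turned into log-proportions. The function name log_topic_proportions and the formula are hypothetical, chosen only to show how a zero count breaks the computation and how adding smoothing_alpha avoids it.

```python
import math


def log_topic_proportions(counts, smoothing_alpha=0.0):
    """Hypothetical helper: log of (optionally smoothed) topic proportions."""
    # Adding smoothing_alpha to every count keeps each proportion above zero.
    total = sum(counts) + smoothing_alpha * len(counts)
    return [math.log((n_k + smoothing_alpha) / total) for n_k in counts]


counts = [5, 0, 3]  # the second topic has N_k = 0

try:
    # Without smoothing, the zero count leads to log(0), which is undefined.
    log_topic_proportions(counts)
except ValueError as err:
    print("without smoothing:", err)

# With a small smoothing value, every term is finite.
print("with smoothing:", log_topic_proportions(counts, smoothing_alpha=0.1))
```

The actual computation of max_uk in tomotopy may differ. The point is only that a small positive constant keeps the zero-count case well defined.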
Also, as you found out, because the model with TermWeight.PMI works with weighted, non-integer word counts, it yields less accurate results.
Taken together, the current implementation of tomotopy seems to have several weaknesses in numerical stability that did not show up in my test set.
Based on the sample you shared, I will dig into this issue in depth. Thanks for sharing.
Thank you for the clarification, it makes a lot more sense now. I look forward to the next tomotopy update!
Hi, big fan of the package, thanks for putting it together - I've had fantastic experiences with LDAModel and HDPModel, but am running into numerical stability issues as I'm trying out the CTModel.
A few quick clarifying questions on initialization:
Here's the numerical issue that I run into: I generate 2500 samples for my posterior, with the burn-in set to 2000. During burn-in, the samples are generated quickly (2 minutes for 2000 samples) and there are no issues. Once the burn-in phase is over, however, sampling becomes much slower (20 minutes for 500 samples, or worse) and I frequently encounter the following warning:
D:\a\tomotopy\tomotopy\src\TopicModel\../Utils/TruncMultiNormal.hpp(56): wrong truncation range [0.195418, 0.186098]
I also occasionally encounter:
D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (106): D:\a\tomotopy\tomotopy\src\TopicModel\CTModel.hpp (101): doc.beta.col(9) is -nan(ind)
Failed to sample! Reset prior and retry!
I understand that this comes from the vertical line trick in the sampler from the Mimno (2008) paper, but I'm not sure what I can be doing to mitigate this issue. I've tried varying all of the input parameters, but haven't had any luck consistently resolving this problem. What should I be doing to mitigate this problem? Also, is there any obvious reason why it only occurs after the burn-in phase is complete? This issue essentially renders the CTModel not usable for my application, and I'm very eager to figure out what's going on.
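For reference, the first warning quoted above reports an interval whose lower bound (0.195418) is larger than its upper bound (0.186098), so no value can fall inside it. A minimal sketch of that check follows. The function name is hypothetical, and this is not the code in TruncMultiNormal.hpp.

```python
def is_valid_truncation_range(lower, upper):
    """Hypothetical check: a truncation interval needs lower <= upper."""
    return lower <= upper


# Bounds copied from the warning message.
lower, upper = 0.195418, 0.186098

if not is_valid_truncation_range(lower, upper):
    # The interval is empty, so a truncated sample cannot be drawn from it.
    print(f"empty truncation range [{lower}, {upper}]")
```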
I've also found that if I use the tw=tp.TermWeight.PMI option instead of the default, this occurs more frequently.
I've included a zip with a reproducible example: it includes both code and a data sample. CTModel Example.zip
Any help would be greatly appreciated. Thanks in advance!