Savvysherpa / slda

Cython implementations of Gibbs sampling for supervised LDA
MIT License

Non-replicable results #11

Open elcilorien opened 4 years ago

elcilorien commented 4 years ago

Hi, I've run into an issue where the output of the sLDA model (i.e., predicted values, topic assignments, etc.) is different when I re-run the exact same code on the exact same input data. My understanding was that if the random seed variable was unchanged, I should get the same output. This is an issue because I want to be able to go back and use the exact same model to create predictions for a new set of out-of-sample documents. Can you help me figure out what I should be doing to make sure the model doesn't change? Thanks!

bearnshaw commented 4 years ago

Hi @elcilorien, I'm not sure why you are experiencing this. Note that the only randomness is in create_topic_lookup and create_rands, and you can verify that both use seed if it is not None. Are you sure seed is not None?
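The seed behavior described here can be checked in isolation. This is a minimal sketch using NumPy's generator API, not the repo's actual `create_topic_lookup`/`create_rands` internals (the `make_rng` helper is hypothetical): a fixed integer seed reproduces the same random stream, while `None` pulls fresh OS entropy each run.

```python
import numpy as np

def make_rng(seed=None):
    # Hypothetical helper for illustration: a fixed seed yields a
    # reproducible stream; seed=None differs from run to run.
    return np.random.default_rng(seed)

# Same seed -> identical draws across "runs"
a = make_rng(42).random(5)
b = make_rng(42).random(5)
assert np.array_equal(a, b)

# seed=None -> draws are (almost surely) different each time
c = make_rng(None).random(5)
d = make_rng(None).random(5)
```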

elcilorien commented 4 years ago

Hi @bearnshaw. It turns out there were actually some differences in the input files, so that was completely my fault. However, I've run into a different issue I was hoping you could help me with... I've been running sLDA models with about 5,000 documents and 150 topics, and I keep getting quite negative R-squared values; in one case it was -2. Is this a sign that the model is overfitting (150 topics is too many?), or do my texts simply have no predictive power? Thanks!
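For context on the negative values: R² compares the model against a constant baseline that always predicts the mean of the observed responses, so anything below zero means the model's predictions are worse than that baseline. A minimal sketch of the standard definition (not the repo's evaluation code):

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot. It is negative whenever the model's
    # squared error exceeds that of simply predicting mean(y_true).
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])   # tracks the targets
bad = np.array([4.0, 3.0, 2.0, 1.0])    # anti-correlated predictions

print(r_squared(y_true, good))  # 0.98: near-perfect fit
print(r_squared(y_true, bad))   # -3.0: worse than predicting the mean
```

An R² around -2 on held-out data is therefore consistent with either overfitting (too many topics relative to 5,000 documents) or with the topic proportions carrying little signal about the response; comparing train vs. held-out R² would distinguish the two.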