bstewart / stm

An R Package for the Structural Topic Model

searchK generates different "results" table every time #35

Closed towashington closed 7 years ago

towashington commented 7 years ago

Running searchK repeatedly on the same raw dataset generates a different "results" table each time. That is,

search_output = searchK(...)
search_output$results

prints a different table each time. Why is that? (It matters because different "results" tables may suggest different choices of K.)

Thanks!

bstewart commented 7 years ago

The results table has all the statistics used in the plots (in case people want to automate some decision or otherwise use said results). The results are slightly different every time because it's holding out different documents to do the held-out likelihood calculations, which will in turn slightly perturb the topic model.
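As a rough illustration of why the held-out set changes between runs, here is a minimal base-R sketch. The helper `pick_heldout` is hypothetical and is not stm's actual code; it only mimics the idea of randomly choosing a fraction of documents to hold out, with an optional seed.

```r
# Hypothetical helper (not part of stm): choose which documents are held out.
# With seed = NULL the choice depends on the current RNG state, so it varies
# from call to call; with a fixed seed the same documents are chosen each time.
pick_heldout <- function(n_docs, frac = 0.1, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  sort(sample(seq_len(n_docs), floor(frac * n_docs)))
}

# Unseeded: two calls will almost surely select different documents.
unseeded_1 <- pick_heldout(1000)
unseeded_2 <- pick_heldout(1000)

# Seeded: two calls select exactly the same documents.
seeded_1 <- pick_heldout(1000, seed = 2138)
seeded_2 <- pick_heldout(1000, seed = 2138)
identical(seeded_1, seeded_2)
```

Because every model in the search is fit on whatever documents remain, a different held-out set means slightly different fitted topics, and so different values in every column of the table.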

towashington commented 7 years ago

Thank you for the prompt and helpful reply! I suppose the differences you describe appear not just in the "heldout" column, right? In my case the "exclus" and "semcoh" columns differ each time too, sometimes not so slightly. Does it have anything to do with whether or not I set the seed before running searchK?

Thanks again!

bstewart commented 7 years ago

Yeah, everything changes because the topic model is actually being fit on different documents each time. If your corpus is really large this might not change much, but if it's smaller it can change a lot more.

If you set the held-out seed (the heldout.seed argument in searchK), it will get you the same answer every time. I will say, though, that if you are getting very different answers every time, your corpus is likely too small to be choosing this on purely computational grounds, and you may need to focus more on substantive interpretability for topic number choice (which is perhaps a good idea anyway!).
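A minimal sketch of what that might look like, assuming the stm package (plus the tm and SnowballC packages that textProcessor relies on) is installed. The gadarian example data that ships with stm stands in for the real corpus; the seed value and the candidate K values are arbitrary choices for illustration.

```r
library(stm)

# Prepare the example corpus bundled with stm (a stand-in for your own data).
processed <- textProcessor(documents = gadarian$open.ended.response,
                           metadata = gadarian, verbose = FALSE)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     verbose = FALSE)

# Fixing heldout.seed makes searchK hold out the same documents on every run.
# With the default spectral initialization, repeated runs should then produce
# matching results tables.
run1 <- searchK(out$documents, out$vocab, K = c(5, 10), heldout.seed = 2138)
run2 <- searchK(out$documents, out$vocab, K = c(5, 10), heldout.seed = 2138)

# Compare the two tables rather than assuming they match.
all.equal(run1$results, run2$results)
```

To gauge how sensitive the choice of K is on a small corpus, one option is to repeat the search with a few different heldout.seed values and see whether the statistics point to the same K.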

towashington commented 7 years ago

Got it. Thanks for the explanation --- and the great package!