cemoody / lda2vec


The top words are very similar after 5-6 epochs #37

Open yg37 opened 8 years ago

yg37 commented 8 years ago
[Screenshot: topic-term distribution after 1 epoch on 20_newsgroups]

I was rerunning the script for 20_newsgroups and this is the topic-term distribution after 1 epoch. From the picture, we can see that the top words for each topic are actually very similar. Is this normal, or did I implement something wrong? I encountered the same issue when I ran the script on other corpora: after 10 epochs, the top words were still almost identical across topics, dominated by "the", "a", etc.

cprevosteau commented 8 years ago

I have the same problem. I ran the 20_newsgroups script on both the original corpus and one of my own, and after just one epoch the topics' top words are identical. I tried changing each of the hyperparameters, but the results were the same.

`Top words in topic 0 invoke out_of_vocabulary out_of_vocabulary the . to , a i

Top words in topic 1 invoke out_of_vocabulary out_of_vocabulary the . to , a i

Top words in topic 2 invoke out_of_vocabulary out_of_vocabulary the . to , a i

Top words in topic 3 invoke out_of_vocabulary out_of_vocabulary the . to , a i

...

Top words in topic 19 invoke out_of_vocabulary out_of_vocabulary the . , to a i`

yg37 commented 8 years ago

I had it running on the server last night, and the top words diverged after around 20 epochs. I'm not sure why the initial topic-term distribution behaves that way; maybe it has something to do with the prior?

agtsai-i commented 8 years ago

I consistently get out_of_vocabulary as the top word across all topics; any suggestions on what I should look for? This happens even when I set the min and max vocab count thresholds to None.

radekrepo commented 8 years ago

Hi all,

In my experience, you can set a more aggressive down-sampling rule to remove out_of_vocabulary and similarly redundant tokens from at least some of the topics, if not all. I lowered the down-sampling threshold on my dataset and the stop words largely disappeared from the top topic-word lists. An alternative, which I haven't tried, is to clean the data before feeding the tokens to the model; that way you can remove the out_of_vocabulary token as well as other meaningless tokens from modelling entirely. Data cleaning could possibly lead to improved results (it does for pure LDA, at least), although I don't know the maths behind lda2vec well enough to make a strong case for that.
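Roughly the kind of rule I have in mind (a minimal sketch, not lda2vec's own preprocessing code; the `threshold` value and the `out_of_vocabulary` marker are just placeholders):

```python
import numpy as np
from collections import Counter

def downsample(docs, threshold=1e-5, oov_token="out_of_vocabulary", seed=0):
    """Word2vec-style frequency sub-sampling: randomly drop very frequent
    tokens and strip the OOV marker before the tokens reach the model."""
    rng = np.random.default_rng(seed)
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    freq = {tok: c / total for tok, c in counts.items()}

    def keep(tok):
        if tok == oov_token:
            return False          # drop the OOV marker entirely
        f = freq[tok]
        # Keep probability shrinks as the token's corpus frequency grows.
        p_keep = min(1.0, (threshold / f) ** 0.5 + threshold / f)
        return rng.random() < p_keep

    return [[tok for tok in doc if keep(tok)] for doc in docs]

docs = [["the", "the", "cat", "sat", "out_of_vocabulary"],
        ["the", "dog", "barked", "the", "the"]]
print(downsample(docs, threshold=0.1))
```

Lowering `threshold` makes the rule more aggressive, so very frequent stop words like "the" get dropped more often while rarer content words are mostly kept.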

I personally gave up on lda2vec in the end because each time you use it, the model requires a lot of time to fine-tune the topic results. Standard word2vec or text2vec with some form of unsupervised semantic clustering is probably a less time-consuming alternative, because they work regardless of the dataset or the type of machine you use, apart from the fact that model optimisation itself may run more quickly. Moreover, lda2vec was a real pain to install on my Windows machine a couple of months ago. lda2vec may be useful, but you should have very specific reasons for using it.

agtsai-i commented 8 years ago

Thanks @nanader! I'll play with the down-sampling threshold. I believed I had removed the out_of_vocabulary tokens entirely by setting the vocab count thresholds to None (at least, that's what my reading of the code suggests should happen), so I was surprised to still see them pop up.

So far I've tried doc2vec and word2vec + earth mover's distance, but haven't had stellar results. I like the approach used here for documents (in principle) more than the other two, and of course the published examples look amazing. I'd really like lda2vec to work out with the data I have.

I installed lda2vec on an AWS GPU instance, and that wasn't too horrible.

yg37 commented 8 years ago

I recently tried Topic2Vec as an alternative:
http://arxiv.org/abs/1506.08422 https://github.com/scavallari/Topic2Vec/blob/master/Topic2Vec_20newsgroups.ipynb I tried it on Simple Wikipedia data and it performed very well.

agtsai-i commented 8 years ago

Oh interesting, thank you!

radekrepo commented 8 years ago

Ah, by the way, @agtsai-i: you can also use the vector space to label topics with the token vectors nearest to each topic vector by cosine distance, instead of relying on the most common topic-word assignments. The lda2vec model results allow for it. That way you could skip tuning the topic-model results entirely and get as many or as few topics as you want. It depends on what you want to achieve, really. I hope that helps.
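Something like this, assuming you've already pulled the learned topic matrix, word-vector matrix, and vocabulary out of a trained model (the variable names here are placeholders, not lda2vec's actual attribute names):

```python
import numpy as np

def label_topics(topic_vectors, word_vectors, vocab, top_n=10):
    """Label each topic with the tokens whose word vectors are closest
    to the topic vector by cosine similarity."""
    # Normalise rows so a plain dot product equals cosine similarity.
    t = topic_vectors / np.linalg.norm(topic_vectors, axis=1, keepdims=True)
    w = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)
    sims = t @ w.T                              # (n_topics, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :top_n]
    return [[vocab[i] for i in row] for row in top]

# Toy example with random vectors standing in for trained embeddings.
rng = np.random.default_rng(0)
vocab = ["space", "nasa", "hockey", "team", "bike", "ride"]
labels = label_topics(rng.normal(size=(3, 8)), rng.normal(size=(6, 8)), vocab, top_n=3)
for k, words in enumerate(labels):
    print("Topic {}: {}".format(k, ", ".join(words)))
```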

agtsai-i commented 8 years ago

True, but if I did that, wouldn't I be discarding a lot of the novelty of lda2vec and essentially just using word2vec?

*Never mind, I see what you're saying. Much appreciated

gracegcy commented 7 years ago

Hi @agtsai-i and @yg37, did you resolve this issue in the end? Could you kindly share the solution, if any? Thanks a lot.

yg37 commented 7 years ago

@gracegcy Keep training for more epochs and the top words will diverge.

ghost commented 5 years ago

@radekrepo @agtsai-i @yg37 Have you noticed that the result of the down-sampling step is never actually used? No wonder I kept getting a lot of stop words (OoV, punctuation, etc.) no matter how much I lowered the threshold. I've written up my attempt at fixing this here: https://github.com/cemoody/lda2vec/issues/92