dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

question about GloVe params and co-occurrence matrix #305

Closed fkrauer closed 4 years ago

fkrauer commented 4 years ago

Hi

I just started using text2vec. I am trying to fit the GloVe model to my (German) text. I have around 28,000 words in my text (after pruning). I have two questions:

  1. What value should I use for x_max? I found anything between 10 and 100 on the internet, but I don't know what x_max relates to.
  2. Is it better to use the full running text (i.e. complete sentences), or is it more appropriate to use a pruned text? I made my co-occurrence matrix using only the nouns, because they contain the best information about topics discussed in the book (most verbs/adverbs are repeated often and don't contribute much information).

I tried to vary the parameters of the model, but the predicted similar words (using sim2() with cosine similarity) are not really similar when I check them.
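For context, here is roughly the pipeline I mean (a minimal sketch; `tokens` stands in for my list of noun tokens per document, and the query word "buch" is just an example):

```r
library(text2vec)

# tokens: a list of character vectors, one per document
it <- itoken(tokens, progressbar = FALSE)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5)

# "rank" in text2vec >= 0.6; older versions call this word_vectors_size
glove <- GlobalVectors$new(rank = 100, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
word_vectors <- wv_main + t(glove$components)

# inspect nearest neighbours of a query word
sims <- sim2(word_vectors, word_vectors["buch", , drop = FALSE],
             method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 10)
```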

Thanks for any help. Best, Fabienne

dselivanov commented 4 years ago

Hi.

> What value should I use for x_max? I found anything between 10 and 100 on the internet, but I don't know what x_max relates to.

I don't know, actually. 100 is used in the original GloVe paper; it seems they set it empirically. My intuition is that it should be set at some quantile (say the 95th percentile) of the co-occurrence count distribution.
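If you want to try that heuristic, a sketch (assuming `tcm` is the sparse matrix returned by `create_tcm()`; its nonzero entries live in the `@x` slot and are distance-weighted counts by default):

```r
# pick x_max as the 95th percentile of the nonzero co-occurrence values
x_max_95 <- quantile(tcm@x, probs = 0.95)
glove <- GlobalVectors$new(rank = 100, x_max = x_max_95)
```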

> Is it better to use the full running text (i.e. complete sentences), or is it more appropriate to use a pruned text? I made my co-occurrence matrix using only the nouns, because they contain the best information about topics discussed in the book (most verbs/adverbs are repeated often and don't contribute much information).

I would go with the full text (and then use only the word vectors for nouns if needed). This should better preserve information about how words occur together.
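Something along these lines (a sketch; `full_tokens` is your full running text tokenized, and `nouns` is whatever character vector of nouns you extracted):

```r
# fit on the full running text
it_full <- itoken(full_tokens, progressbar = FALSE)
vocab_full <- prune_vocabulary(create_vocabulary(it_full), term_count_min = 5)
tcm_full <- create_tcm(it_full, vocab_vectorizer(vocab_full), skip_grams_window = 5)

glove <- GlobalVectors$new(rank = 100, x_max = 10)
word_vectors <- glove$fit_transform(tcm_full, n_iter = 20) + t(glove$components)

# keep only the noun vectors afterwards
noun_vectors <- word_vectors[intersect(rownames(word_vectors), nouns), ]
```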

> I tried to vary the parameters of the model, but the predicted similar words (using sim2() with cosine similarity) are not really similar when I check them.

Is it possible that the model overfitted/underfitted? Maybe try starting with smaller embedding sizes. I believe this issue can only be resolved with more experimentation...
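E.g. compare a few embedding sizes and eyeball the nearest neighbours for each (a sketch; the ranks are arbitrary and "buch" is just a placeholder query):

```r
# try progressively larger embeddings; small corpora often work better with small ranks
for (r in c(25, 50, 100)) {
  glove <- GlobalVectors$new(rank = r, x_max = 10)
  wv <- glove$fit_transform(tcm, n_iter = 20, convergence_tol = 0.01)
  wv <- wv + t(glove$components)
  sims <- sim2(wv, wv["buch", , drop = FALSE], method = "cosine", norm = "l2")
  cat("rank =", r, "\n")
  print(head(sort(sims[, 1], decreasing = TRUE), 5))
}
```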

dselivanov commented 4 years ago

Hope this helped.