dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Reducing the dimensionality of a dtm #55

Closed zachmayer closed 8 years ago

zachmayer commented 8 years ago

The GloVe model is great for reducing the dimensionality of a tcm. What would be a good approach for reducing the dimensionality of a dtm? SVD (say, using the irlba package) works well, but I was wondering if there is a good way to apply GloVe vectors to a dtm?
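One common way to use word vectors here (not a text2vec feature, just a general sketch) is to project the dtm onto the word-vector matrix, i.e. represent each document as a weighted sum of the GloVe vectors of its terms. A minimal sketch, assuming `dtm` is a documents x terms sparse matrix and `word_vectors` is a terms x k GloVe matrix whose row names match the dtm's column names (the TF-IDF call uses text2vec's current API, which may differ from the version discussed in this thread):

```r
library(text2vec)
library(Matrix)

# Assumed to exist already:
#   dtm          - documents x terms sparse matrix (e.g. from create_dtm())
#   word_vectors - terms x k matrix of GloVe vectors, rownames = terms

# keep only the terms that actually have a vector
common_terms <- intersect(colnames(dtm), rownames(word_vectors))

# optional: down-weight very frequent terms with TF-IDF before projecting
tfidf <- TfIdf$new()
dtm_weighted <- fit_transform(dtm, tfidf)

# document embeddings = weighted sum of word vectors (n_docs x k)
doc_embeddings <- as.matrix(dtm_weighted[, common_terms] %*% word_vectors[common_terms, ])
```

Since this is just a linear map from term space into the embedding space, it is cheap and easy to apply to new documents, at the cost of ignoring word order.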

dselivanov commented 8 years ago

It depends on the task and the particular cost function you want. PCA (and SVD) is optimal in the mean-squared-error sense. I don't know of any methods based on the GloVe framework. There is some code in glove-python, but I didn't test that approach. You can try PLSA and LDA factorizations. For LDA in R I usually use the lda package. There is also topicmodels, but I found lda to be faster/better. It is also worth trying some non-negative matrix factorization techniques... It really depends on the task you are trying to solve.
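For reference, a rough sketch of the lda-package route mentioned above. The dtm-to-list conversion is only illustrative (and not efficient for very large matrices); `K`, `alpha` and `eta` are arbitrary example values:

```r
library(Matrix)
library(lda)

# Assumes `dtm` is a documents x terms sparse matrix with column names.
# lda::lda.collapsed.gibbs.sampler() expects each document as a 2 x n matrix:
# row 1 = 0-based term indices, row 2 = term counts.
docs <- lapply(seq_len(nrow(dtm)), function(i) {
  row <- dtm[i, ]
  idx <- which(row > 0)
  rbind(as.integer(idx - 1L), as.integer(row[idx]))
})

fit <- lda.collapsed.gibbs.sampler(
  documents      = docs,
  K              = 20,            # number of topics (example value)
  vocab          = colnames(dtm),
  num.iterations = 500,
  alpha          = 0.1,
  eta            = 0.02
)

# topic proportions per document: the reduced (n_docs x K) representation
doc_topics <- t(fit$document_sums)
doc_topics <- doc_topics / rowSums(doc_topics)
```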

zachmayer commented 8 years ago

FYI: in the problem I was working on, irlba gave really good results.
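For anyone landing here later, the irlba approach is essentially truncated SVD (LSA) of the dtm. A minimal sketch, with `k` chosen arbitrarily and `dtm` assumed to be a documents x terms sparse matrix (e.g. from text2vec::create_dtm, optionally TF-IDF weighted):

```r
library(Matrix)
library(irlba)

k <- 100                       # target dimensionality (example value)
svd_fit <- irlba(dtm, nv = k)  # truncated SVD of the sparse dtm

# documents in the reduced space: U %*% diag(d), an n_docs x k dense matrix
dtm_reduced <- svd_fit$u %*% diag(svd_fit$d)
```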

dselivanov commented 8 years ago

@zachmayer, can you share any additional details?