dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Roadmap for text2vec 0.4 #91

Closed dselivanov closed 7 years ago

dselivanov commented 8 years ago

This first post will be updated regularly. Please write your suggestions in the comments.

In this issue I will aggregate all thoughts about the 0.4 release; see the 0.4 Milestone for related topics.

GloVe improvements

There are several choices for the current framework.

  1. Without word embeddings - factorizations of the DTM (not as new-fangled as deep learning, but usually more useful):
  2. With word embeddings
    • [ ] sum/average/weighted average of word vectors (trivial - dtm_matrix %*% word_vectors_matrix)
    • [ ] investigate FastSent embeddings, see #73
    • [ ] investigate paragraph vectors from glove-python
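
The sum/average item above can be sketched in base R. This is only an illustration under stated assumptions: the tiny document-term matrix and the two-dimensional word vectors are invented data, and the object names (`dtm`, `word_vectors`, `doc_sum`, `doc_avg`) are hypothetical, not part of the text2vec API.

```r
# Toy document-term matrix: 2 documents x 3 terms (counts are invented).
dtm <- matrix(c(1, 0, 2,
                0, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("cat", "dog", "fish")))

# Toy word vectors: 3 terms x 2 dimensions, rows ordered like the DTM columns.
word_vectors <- matrix(c(1, 0,
                         0, 1,
                         1, 1),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("cat", "dog", "fish"), NULL))

# The rows of word_vectors must line up with the columns of the DTM.
stopifnot(identical(colnames(dtm), rownames(word_vectors)))

# Sum of word vectors per document: term counts act as weights.
doc_sum <- dtm %*% word_vectors

# Average: divide each document row by its token count.
# pmax() guards against division by zero for empty documents.
doc_avg <- doc_sum / pmax(rowSums(dtm), 1)
```

A weighted average follows the same pattern, with a weighted DTM in place of the raw counts and the row sums of that weighted matrix as the divisor.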

Similarity

  • [x] relaxed word mover's distance #92

POS tagger

  • [ ] see #77. It would be great if someone took a shot at this. Postponed to a later release.

Phrase detection

  • [ ] ~~see #99~~

Licence

  • [x] update licence to GPL (>= 2).

pommedeterresautee commented 8 years ago

Regarding document vectors: for LSA, RSpectra is much faster than irlba on my dataset and works out of the box. I have not tried LDA, but from my understanding text2vec is already compatible with the lda package.
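
For context, LSA reduces a document-term matrix with a truncated SVD, which is the step irlba and RSpectra are being compared on here. A minimal sketch using base R's dense `svd()` on invented data; `k` and the variable names are illustrative assumptions, and this is not how either package is called.

```r
# Toy document-term matrix: 4 documents x 5 terms (counts are invented).
dtm <- matrix(c(2, 1, 0, 0, 0,
                1, 2, 0, 0, 1,
                0, 0, 3, 1, 0,
                0, 0, 1, 2, 0),
              nrow = 4, byrow = TRUE)

k <- 2  # number of latent dimensions to keep

# Dense SVD, returning only the first k left and right singular vectors.
s <- svd(dtm, nu = k, nv = k)

# Document embeddings: U_k scaled by the top-k singular values.
doc_vectors <- s$u %*% diag(s$d[1:k], nrow = k)

# Term embeddings: V_k.
term_vectors <- s$v
```

Base `svd()` computes the full set of singular values of a dense matrix; irlba and RSpectra compute only the leading ones, which is what makes them practical for large sparse DTMs.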

So I would say neural-network document vectors would add more value to the package. I have never tried FastSent, but it seems to give interesting results according to comments and papers.

The issue with a POS tagger is that, to be interesting, it needs to support several languages, and that is quite complex (maybe supporting the new Universal format from Stanford NLP would be an interesting way to cover several languages).

dselivanov commented 8 years ago

> Not tried LDA but from my understanding, text2vec is already compliant with LDA package.

Thanks for your thoughts. I kept the lda package's API and data structures in mind while developing text2vec. Here I want to create a single, unified interface for document vectors.

zachmayer commented 8 years ago

I'd just like to add a 👍 for fastsent. I can do SVD via irlba myself, but I still have to jump over to Python for fastsent. =D

Personally, I found fastsent vectors to be extremely useful in the Home Depot Kaggle competition. They're the only form of "document vectors" I've ever found to be good (doc2vec in gensim is terrible).

lmullen commented 8 years ago

Even though text2vec is compatible with the lda package, I think that having an LDA implementation in text2vec would be good. There could be a lot of improvement to the interface.

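The wordVectors package (https://github.com/bmschmidt/wordVectors) by @bmschmidt has a nice interface for working with word2vec vectors, as explained here: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html. I think a similar interface for working with GloVe vectors from text2vec would be a good addition.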

TommyJones commented 8 years ago

You probably don't need to look into textmineR for LSA or LDA. textmineR's functions for LSA and for LDA with Gibbs sampling are just wrappers for irlba and lda, which you're already looking into, so there is no need to test twice. The LDA implementation with variational expectation maximization (VEM) and the correlated topic models both come from the topicmodels package. textmineR is really just a wrapper so that users can give a dgCMatrix as input and get similarly formatted output instead of learning the syntax and object structures of many libraries.

If you're considering your own implementation of LDA, may I make a couple of suggestions?

There isn't any R implementation (yet) that allows for asymmetric (vector) priors. (I butchered the lda C code to get asymmetric priors a couple of years ago, but never saw it through to production. Happy to share the code I have, if you want it. I have it in a "someday" pile to fix and put into Rcpp.)

Also, there is a paper about distributed Gibbs sampling for LDA that is much faster and, the authors claim, comes with a guarantee of convergence: https://papers.nips.cc/paper/3330-distributed-inference-for-latent-dirichlet-allocation.pdf I was thinking RcppParallel might be a good place to make that happen. There are distributed LDA implementations in Gensim and Mallet. The mallet package in R (https://cran.r-project.org/web/packages/mallet/index.html) is a wrapper for Mallet, but it relies on Java.


pommedeterresautee commented 8 years ago

The way FastSent works is close to w2v, so maybe the easiest thing is to take the C source code and adapt it? And if you take the latest version of w2v, we will have support for doc2vec too :-)

dselivanov commented 8 years ago

@pommedeterresautee which C version do you mean?

pommedeterresautee commented 8 years ago

The one from Mikolov. There was a second version supporting paragraph2vec. It was distributed through the forum of the Google repository and was supposed to be a kind of beta.

dselivanov commented 8 years ago

@pommedeterresautee, personally I don't think it will be easy to embed the original C code into text2vec or into a separate package. All the wrappers of the original code that I have seen were simple, not customisable and hard to maintain (because the original word2vec contains dirty hacks here and there).

pommedeterresautee commented 8 years ago

Maybe this version is easier to customize? https://bitbucket.org/yoavgo/word2vecf

(After all, it's all about custom contexts, like CBOW, plus several words to predict, like skip-gram.)

pommedeterresautee commented 8 years ago

A feature that is easy to implement and very useful, whatever the model is (TF-IDF, GloVe, LSA...): phrase collocation!

Gensim's code is quite simple and is based on the formula used by Mikolov in his word2vec paper. Gensim doc: https://radimrehurek.com/gensim/models/phrases.html

However, there are many ways of computing it.

Other interesting sources:
http://www.nltk.org/howto/collocations.html
http://www.nltk.org/_modules/nltk/collocations.html
http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Edit: after more testing, the results are not that good. Lots of the_something / a_something :-(
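
For reference, the score from the word2vec phrases paper is score(a, b) = (count(ab) - delta) / (count(a) * count(b)), and word pairs scoring above a threshold are merged into a single token. Below is a rough base R sketch of that formula on an invented corpus. The function name `phrase_score` and the default `delta = 1` are assumptions for illustration; this is not Gensim's code or anything in text2vec.

```r
# Tokenized toy corpus (invented).
docs <- list(c("new", "york", "is", "big"),
             c("i", "love", "new", "york"),
             c("the", "new", "car", "is", "very", "big"))

# Unigram counts over the whole corpus.
unigrams <- table(unlist(docs))

# Adjacent word pairs within each document, joined with "_".
bigram_list <- lapply(docs, function(w) paste(head(w, -1), tail(w, -1), sep = "_"))
bigrams <- table(unlist(bigram_list))

# score(a, b) = (count(ab) - delta) / (count(a) * count(b))
# delta discounts pairs that occur only a handful of times.
phrase_score <- function(a, b, delta = 1) {
  ab <- paste(a, b, sep = "_")
  n_ab <- if (ab %in% names(bigrams)) bigrams[[ab]] else 0
  (n_ab - delta) / (unigrams[[a]] * unigrams[[b]])
}

phrase_score("new", "york")  # (2 - 1) / (3 * 2)
phrase_score("is", "big")    # (1 - 1) / (2 * 2)
```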

dselivanov commented 8 years ago

In the near future I won't have a lot of time for development, so I want to release 0.4 soon. It already has a lot of useful features (and an important redesign of the create_* functions, which now do not modify input iterators). Dear watchers, please drop me a few lines if you have something to add.

lmullen commented 8 years ago

@dselivanov What date are you thinking of releasing 0.4? If I have time before that date I'll see if I can review the documentation like I did for 0.3, but if it's in the next week or so I won't be able to.

dselivanov commented 8 years ago

@lmullen I don't have any special time constraints, so I can wait for you. It would be awesome if you are able to check the docs. Also, I want to add that all the models are currently implemented as closures. But since I have already introduced R6 as a dependency, I want to convert them to R6 classes (and maybe add documentation, but I don't know yet how roxygen2 can be used with R6).
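
To illustrate what "implemented as closures" means in general, here is a hypothetical base R sketch of a closure-based model. The name `make_tfidf_model`, its `fit`/`transform` functions and the plain IDF weighting are assumptions made for the example, not text2vec's actual code.

```r
# Hypothetical closure-based "model": fitted state lives in the
# enclosing environment and is shared by the returned functions.
make_tfidf_model <- function() {
  idf <- NULL  # filled in by fit()

  fit <- function(dtm) {
    # Number of documents each term appears in.
    doc_freq <- colSums(dtm > 0)
    # <<- assigns to idf in the enclosing environment, not a local copy.
    idf <<- log(nrow(dtm) / doc_freq)
    invisible(NULL)
  }

  transform <- function(dtm) {
    stopifnot(!is.null(idf))   # fit() must be called first
    sweep(dtm, 2, idf, `*`)    # scale each term column by its IDF
  }

  list(fit = fit, transform = transform)
}

# Usage on an invented 2 x 2 document-term matrix.
toy_dtm <- matrix(c(1, 0,
                    1, 1),
                  nrow = 2, byrow = TRUE)
model <- make_tfidf_model()
model$fit(toy_dtm)
weighted <- model$transform(toy_dtm)
```

An R6 class would keep the same state in a field and expose `fit` and `transform` as methods, which makes the object's structure explicit.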

lmullen commented 8 years ago

Okay. I'll do my best to get to this as soon as I can. It will probably take a couple weeks.



dselivanov commented 7 years ago

Release date for 0.4 is 2016-10-03.

Looking forward to PRs, doc refinements, etc.