dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Combine vocabularies #260

Closed: illusive-git closed this issue 6 years ago

illusive-git commented 6 years ago

Hello,

I suggest allowing document-class (or category) dependent pruning of vocabularies. Example: I have a highly imbalanced multi-class document classification problem at hand, with n(class a) = 100, n(class b) = 1,000, n(class c) = 50,000. If I now simply prune with term_count_min = 100 or doc_count_min = 100, I could eliminate the single word that perfectly identifies class a and occurs nowhere else. Ideally, I could specify a minimum proportion per class, so that a word is pruned only if it falls in the bottom 10% of every class.

From the code and my limited C++ knowledge, I assume this would need to be changed not in prune_vocabulary but already in create_vocabulary, with a per-class word count? I'd love to contribute, but C++ is not my field.

Thanks for the great package!

dselivanov commented 6 years ago

You can create a vocabulary for each of the classes using the corresponding subset of documents, then prune them separately and combine them into a single vocabulary with `v = text2vec:::combine_vocabulary(v1, v2, v3)`.
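
A minimal sketch of this workflow, assuming `docs` is a character vector of documents and `labels` is a parallel factor of class labels (both hypothetical names):

```r
library(text2vec)

# Build and prune a vocabulary from one class's documents.
vocab_for_class <- function(texts, term_count_min) {
  it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
  prune_vocabulary(create_vocabulary(it), term_count_min = term_count_min)
}

# Class-specific thresholds, scaled to the class sizes from the example above.
v1 <- vocab_for_class(docs[labels == "a"], term_count_min = 2)
v2 <- vocab_for_class(docs[labels == "b"], term_count_min = 20)
v3 <- vocab_for_class(docs[labels == "c"], term_count_min = 100)

v <- text2vec:::combine_vocabulary(v1, v2, v3)
```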

illusive-git commented 6 years ago

Perfect. Thanks a lot for the quick answer!!! Keep up the great work.

pommedeterresautee commented 6 years ago

As a reminder and for others: if you are using combine_vocabulary, don't forget to re-set the stopwords and sep_ngram attributes; they are removed during the merge process. @dselivanov is there a reason for that?

dselivanov commented 6 years ago

Should not be the case! I will check.

dselivanov commented 6 years ago

Ok, now I see what happens. combine_vocabulary is not part of the public API and is used mainly internally with parallel itoken iterators, so the logic for the stopwords and sep_ngram attributes lives inside create_vocabulary. It is not clear to me how to combine these attributes from several vocabularies in general. For stopwords we could take the union, but what about sep_ngram if the values differ? Any thoughts @pommedeterresautee?

illusive-git commented 6 years ago

I would also suggest allowing NULLs to be combined with vocabularies; that makes building combined vocabularies procedurally easier. I tried using empty vocabularies with create_vocabulary(NULL), but that raised some errors; I'll check it later.
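
The pattern I am after, roughly (`document_chunks` is a hypothetical list of character vectors; it is the NULL starting value that currently errors out):

```r
# Accumulate vocabularies chunk by chunk; the is.null() check would be
# unnecessary if combine_vocabulary accepted NULL arguments.
v <- NULL
for (chunk in document_chunks) {
  it      <- itoken(chunk, tokenizer = word_tokenizer)
  v_chunk <- create_vocabulary(it)
  v <- if (is.null(v)) v_chunk else text2vec:::combine_vocabulary(v, v_chunk)
}
```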

pommedeterresautee commented 6 years ago

First, this function is very useful for very large datasets. I am working on bigrams with a very large dataset, and it's impossible to build the vocabulary in a first step and prune it afterwards: not enough memory. So I build vocabularies on parts of the dataset, apply light pruning, merge them, and then prune with the real parameters to get the best trade-off between quality and size. For what it's worth, in my case quality is slightly degraded with the hashing trick, and the size after pruning (done manually, missing an API there) with the hashing trick is bigger than with a real vocabulary (collisions...).
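
A sketch of that workflow, assuming `chunks` is a hypothetical list of character vectors:

```r
# Per-chunk vocabularies with light pruning keep memory bounded;
# the final prune applies the real thresholds to the merged counts.
vocabs <- lapply(chunks, function(x) {
  it <- itoken(x, tokenizer = word_tokenizer)
  v  <- create_vocabulary(it, ngram = c(1L, 2L))   # unigrams + bigrams
  prune_vocabulary(v, term_count_min = 2)          # light pruning
})
v <- do.call(text2vec:::combine_vocabulary, vocabs)
v <- prune_vocabulary(v, term_count_min = 50)      # real parameters
```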

For these reasons I think this function should be public. It took me some time to discover it :-)

Because it should be public, it should manage the stop list and separator. I would just document that it keeps the stop list and separator of the first argument. After all, it doesn't make sense to merge two vocabularies built with different stop lists and separators, and it's quite easy to update them, so I would just document the way it works. You just need a predictable function. No magic inside.
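
Updating them by hand is one line per attribute; a sketch, assuming the merged vocabularies share the same settings and the attributes carry the names of the corresponding create_vocabulary arguments (stopwords and sep_ngram):

```r
# Carry the first vocabulary's attributes over to the merged result.
v <- text2vec:::combine_vocabulary(v1, v2)
attr(v, "stopwords") <- attr(v1, "stopwords")
attr(v, "sep_ngram") <- attr(v1, "sep_ngram")
```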

dselivanov commented 6 years ago

Reopening as a reminder to document it and make it public.

pommedeterresautee commented 6 years ago

In the same spirit, do you have an idea of how to apply tf-idf in a scalable way?

dselivanov commented 6 years ago

@pommedeterresautee it is possible. In the end we only count how many documents contain each word, so it would be possible to add a partial_fit method. Please open a separate issue for that.
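
To make the counting argument concrete, a sketch (not a text2vec API): the vocabulary's doc_count column already holds how many documents contain each term, so idf weights follow from the merged vocabulary plus the total document count, both of which accumulate across chunks:

```r
# Total documents across all chunks (`chunks` as in the sketch above).
N <- sum(lengths(chunks))

# Smoothed inverse document frequency per term of the combined vocabulary v.
idf <- log(N / (1 + v$doc_count))
names(idf) <- v$term
```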