MaxHalford / maxhalford.github.io

:house_with_garden: Personal website
https://maxhalford.github.io

blog/unsupervised-text-classification/ #16

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Unsupervised text classification with word embeddings - Max Halford

Addendum: since writing this article, I have discovered that the method I describe is a form of zero-shot learning. So I guess you could say that this article is a tutorial on zero-shot learning for NLP. I recently watched a lecture by Adam Tauman Kalai on stereotype bias in text data. The lecture is very good, but something that had nothing to do with the lecture’s main topic caught my attention.

https://maxhalford.github.io/blog/unsupervised-text-classification/

xeruf commented 2 years ago

Thank you for this guide! Copying your commands for the English data works, but if I try the German model from https://spacy.io/models, the lexemes never have vectors, which makes the approach unusable.

gprvasd commented 2 years ago

Thanks for this nice article. You mentioned that "training a word embedding model from scratch on our documents" can improve the results. I used "from gensim.models import Word2Vec" and prepared my own model: "model = Word2Vec(sentences2, size=100, min_count=5, workers=6, iter=30)". Now I do not know what to do next or how to use "from sklearn import neighbors", because the "nlp" object used in this article and the gensim model are different objects. Thank you for any suggestions.

MaxHalford commented 2 years ago

@gprvasd what I meant by that is that training your own embedding model, rather than using a pre-trained one, might result in better performance. I have actually never done that, so I won't be of much help, but I'm sure you'll find dozens of tutorials to do so elsewhere.

gprvasd commented 2 years ago

Thank you for your response. I could not find an example that uses gensim's Word2Vec together with sklearn's neighbors module, so I modified your embed() function to produce a centroid from the gensim vectors. It seems to work, but I am not sure how accurate it is. Thanks.
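
For readers with the same question, here is a minimal sketch of one way the centroid idea could be adapted to a custom embedding model. It is not the article's code: the `embed` signature, the `predict` helper, and the `toy_vectors` dictionary are all hypothetical names introduced for illustration. The sketch assumes only that the vector store supports `token in vectors` and `vectors[token]`, which a plain dict does and which gensim's `model.wv` is expected to do as well.

```python
import numpy as np


def embed(tokens, vectors, width):
    """Average the vectors of the tokens that `vectors` knows about."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No token has a vector: fall back to a zero vector.
        return np.zeros(width)
    return np.mean(known, axis=0)


def predict(doc_tokens, label_tokens, vectors, width):
    """Return the label whose centroid has the highest cosine similarity."""
    labels = list(label_tokens)
    label_matrix = np.array(
        [embed(label_tokens[label], vectors, width) for label in labels]
    )
    doc_vector = embed(doc_tokens, vectors, width)
    # Cosine similarity; the small epsilon avoids dividing by zero.
    norms = np.linalg.norm(label_matrix, axis=1) * np.linalg.norm(doc_vector)
    similarities = label_matrix @ doc_vector / (norms + 1e-12)
    return labels[int(np.argmax(similarities))]


# Toy stand-in for a trained model (assumption). With gensim, the idea would
# be to pass `model.wv` and `model.wv.vector_size` instead.
toy_vectors = {
    "goal": np.array([1.0, 0.1]),
    "match": np.array([0.9, 0.2]),
    "sport": np.array([1.0, 0.0]),
    "stock": np.array([0.1, 1.0]),
    "market": np.array([0.0, 0.9]),
    "finance": np.array([0.0, 1.0]),
}
label_tokens = {"sport": ["sport"], "finance": ["finance"]}

prediction = predict(["goal", "match", "unknown"], label_tokens, toy_vectors, width=2)
```

The argmax over cosine similarities plays the role of a one-nearest-neighbour lookup, so fitting `sklearn.neighbors.NearestNeighbors` on the matrix of label centroids should be a drop-in alternative for that step. Note also that in gensim 4 and later, the `size` and `iter` arguments quoted above are named `vector_size` and `epochs`.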

nivi0 commented 1 year ago

This is so nice and perfectly explained. But I can't use it for unsupervised DNA sequence classification, right?

MaxHalford commented 1 year ago

Hey @nivi0, cheers! I'm not familiar with DNA sequence classification, so I can't say for sure. I'd be happy to discuss it though.

Wnjoki commented 7 months ago

Thanks for the good explanation.