MaxHalford / maxhalford.github.io

:house_with_garden: Personal website
https://maxhalford.github.io
MIT License

blog/document-classification/ #7

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Classifying documents without any training data - Max Halford

I recently watched a lecture by Adam Tauman Kalai on stereotype bias in text data. The lecture is very good, but something that had nothing to do with the lecture’s main topic caught my attention. At 19:20, Adam explains that word embeddings can be used to classify documents when no labeled training data is available. Note that in this article I’ll be using word embeddings and word vectors interchangeably. The idea is to exploit the fact that document labels are often textual.
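The idea from the post can be sketched in a few lines: embed the document, embed each textual label, and pick the most similar label. The word vectors, vocabulary, and labels below are made up for illustration; in practice you would load pre-trained embeddings such as GloVe or fastText.

```python
import numpy as np

# Toy 3-d word vectors, made up for illustration. Real usage would
# load pre-trained embeddings (GloVe, fastText, word2vec, ...).
EMBEDDINGS = {
    "sports":   np.array([0.9, 0.1, 0.0]),
    "football": np.array([0.8, 0.2, 0.1]),
    "match":    np.array([0.7, 0.1, 0.2]),
    "politics": np.array([0.1, 0.9, 0.0]),
    "election": np.array([0.2, 0.8, 0.1]),
    "vote":     np.array([0.1, 0.7, 0.2]),
}

def embed(words):
    """Average the vectors of the words we have embeddings for."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return np.mean(vectors, axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def classify(document, labels):
    """Pick the label whose embedding is closest to the document's."""
    doc_vec = embed(document.lower().split())
    return max(labels, key=lambda label: cosine(doc_vec, EMBEDDINGS[label]))

print(classify("the football match", ["sports", "politics"]))  # sports
```

No labeled training data is involved: the only supervision comes from the fact that the labels themselves are words with embeddings.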

https://maxhalford.github.io/blog/document-classification/

david26694 commented 3 years ago

A similar application is doing NER using word embeddings. If you have to identify the entities person, place and organisation, you can do the following: For each entity, take some examples that you can think of (I don't think the list needs to be very large, but think about some names of people, some places and some organisations) and compute the centroid of the embeddings of these words.

Now you have a word vector representing person, another one representing place and a third one representing organisation. For each word in your corpus, you take its embedding and compute the similarity with each entity vector. If it's very similar to one of them, you can classify that as an entity of that kind.

It's very similar to what you just described, but applied to NER instead of document classification.
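The centroid-based NER idea described above can be sketched as follows. Again, the word vectors, seed words, and similarity threshold are illustrative assumptions, not a reference implementation; a real setup would use pre-trained embeddings and tune the threshold.

```python
import numpy as np

# Made-up 3-d word vectors for illustration; in practice you'd use
# pre-trained embeddings such as GloVe or fastText.
EMBEDDINGS = {
    "alice":  np.array([0.9, 0.1, 0.0]),
    "bob":    np.array([0.8, 0.0, 0.1]),
    "paris":  np.array([0.1, 0.9, 0.1]),
    "london": np.array([0.0, 0.8, 0.2]),
    "google": np.array([0.1, 0.1, 0.9]),
    "apple":  np.array([0.2, 0.0, 0.8]),
    "maria":  np.array([0.85, 0.05, 0.05]),
}

# Step 1: a handful of example words per entity type...
SEEDS = {
    "person": ["alice", "bob"],
    "place": ["paris", "london"],
    "organisation": ["google", "apple"],
}

# ...and one centroid per entity type.
centroids = {
    entity: np.mean([EMBEDDINGS[w] for w in words], axis=0)
    for entity, words in SEEDS.items()
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def tag(word, threshold=0.9):
    """Assign the most similar entity type, or None if nothing is close."""
    vec = EMBEDDINGS[word]
    entity, sim = max(
        ((e, cosine(vec, c)) for e, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return entity if sim >= threshold else None

print(tag("maria"))  # person
```

The threshold is what keeps ordinary words from being tagged as entities: a word only gets a label if it sits close enough to one of the centroids.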

MaxHalford commented 3 years ago

Thanks @david26694, that makes a lot of sense! As I mentioned, I'm not very well-versed in NLP, so I'm probably pushing at open doors. But then again it is rewarding to find out about these things by myself.

From what I can tell, the approach taken by the winning solutions to the Google Landmark Recognition challenge is somewhat related. Essentially, they use pre-trained neural networks to extract embeddings from images, and then use a nearest neighbours approach to match images with each other.
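The embed-then-match step can be sketched like this; the random vectors stand in for embeddings that a pre-trained CNN would produce, and the dimensions and labels are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for embeddings extracted by a pre-trained network: one
# 128-d vector per gallery image (dimensions chosen arbitrarily).
gallery = rng.normal(size=(100, 128))
labels = [f"landmark_{i % 10}" for i in range(100)]

def nearest_neighbour(query, gallery):
    """Return the index of the gallery embedding closest to the query."""
    # Cosine similarity via normalised dot products.
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return int(np.argmax(g @ q))

# A query that is a slightly noisy copy of gallery image 7
# should match that image.
query = gallery[7] + rng.normal(scale=0.05, size=128)
idx = nearest_neighbour(query, gallery)
print(labels[idx])  # landmark_7
```

As with the word-embedding examples above, the heavy lifting is done by the pre-trained model; the matching itself is just a similarity search.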