facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License
25.85k stars 4.71k forks

classification of new sentence which is not in the trained data #346

Open lucky050619 opened 6 years ago

lucky050619 commented 6 years ago

Example: train.txt
__labelA__ sentence1
__labelB__ sentence2

If I train a classification model on these two sentences, classify(sentence3), where sentence3 is totally different from both sentence1 and sentence2 (no words or semantics in common), returns label A with 60% probability. But I need the label to be NULL. Is there any way to solve this?

creat89 commented 6 years ago

Well, this behavior is expected, because the model will still try to build a vector for documents that it has not seen and that share no features with the training data. This is the strength of this classifier: it does not break when an unseen document lacks known features.

Setting aside my question of whether you are really looking for a classifier or for a vocabulary check, I may have a solution for your problem.

What you can do is add a third label to the training data, attached to an empty sentence. Example:

__labelA__, ID, sentence1
__labelB__, ID, sentence2
__labelC__, ID,

You may want to know why I propose this. I found (by mistake) that including an empty line with a certain label makes the classifier assign that label to sentences without known features.

In other words, it looks like fastText learns that the absence of features is linked to the label of the empty line. I don't know whether this method will work in your case, but it could. I also don't know whether one empty line would suffice or whether multiple empty lines would be necessary. And don't forget that the labels should appear in random order in the training file.
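To make the suggestion above concrete, here is a minimal sketch of how such a training file could be assembled. It is only an illustration under assumptions: it uses fastText's default `__label__` prefix rather than the `__labelA__` spelling in the thread, and the helper name `build_training_lines`, the `NULL` label, and the number of empty examples are all hypothetical choices, not something fastText prescribes.

```python
import random


def build_training_lines(examples, null_label="NULL", n_empty=5, seed=0):
    """Return shuffled fastText-style training lines.

    examples: list of (label, text) pairs.
    null_label, n_empty: hypothetical choices for the "no features" class;
    the thread does not say how many empty lines are needed.
    """
    # Assumption: the default fastText label prefix "__label__" is used.
    lines = [f"__label__{label} {text}" for label, text in examples]
    # Lines that carry only a label and no text, as suggested above.
    lines += [f"__label__{null_label}" for _ in range(n_empty)]
    # Shuffle so that the labels appear in random positions.
    random.Random(seed).shuffle(lines)
    return lines


lines = build_training_lines([("A", "sentence1"), ("B", "sentence2")])
print("\n".join(lines))
```

The resulting lines could be written to a file and passed to `fasttext.train_supervised(input=...)`; whether the empty-line class behaves as hoped would still need to be checked on real data.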

lucky050619 commented 6 years ago

Thanks @creat89
It worked for me, as it has no features to train on. How can I use pretrained vectors like word2vec, GloVe, or others to find the semantic similarity of two sentences? Example: my training set has the label health for "What would be a realistic plan to lose weight?"

I have to classify the sentence below: "How can I efficiently lose weight and become fit?"

How can I solve this type of semantic similarity query in order to classify?

spate141 commented 6 years ago

> How can i use pre trained vectors like word2vec/glove/other to find semantic similarity of two sentences.

Cosine similarity of sentence vectors can be a good start!
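As a rough illustration of that idea, the sketch below averages word vectors into a sentence vector and compares two sentences by cosine similarity. The embedding table `TOY_EMBEDDINGS` and its values are made up for the example; in practice the vectors would come from a pretrained model (with a trained fastText model, `model.get_sentence_vector(text)` could supply the sentence vectors instead).

```python
import math
import re

# Hypothetical 3-dimensional embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "lose": [0.9, 0.1, 0.0],
    "weight": [0.8, 0.2, 0.1],
    "fit": [0.7, 0.3, 0.0],
    "plan": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.9],
}


def sentence_vector(sentence, embeddings):
    """Average the vectors of the words that have an embedding.

    Returns None when no word of the sentence is known.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


train = sentence_vector("What would be a realistic plan to lose weight?", TOY_EMBEDDINGS)
query = sentence_vector("How can I efficiently lose weight and become fit?", TOY_EMBEDDINGS)
other = sentence_vector("Is the stock market open?", TOY_EMBEDDINGS)

print(cosine_similarity(train, query))  # health vs. health
print(cosine_similarity(train, other))  # health vs. unrelated
```

With these toy values the two weight-loss sentences score much closer to each other than to the unrelated one. A sentence with no known words yields `None`, which the caller has to handle separately.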

creat89 commented 6 years ago

@balaram-bhukya I do not understand the objective of creating a classifier that puts a null label on documents with unseen features while, at the same time, determining the similarity between two sentences that have a similar meaning but no common features.

Although the answer of @spate141 is correct, in my opinion you are not using the power and purpose of a classifier. My answer (and question) is: why not use the trained classifier to determine the label of "How can I efficiently lose weight and become fit?" If the classifier is well trained, it should be able to determine that the unseen sentence belongs to the health class without calculating the similarity to any other sentence.

You can also include (if I remember correctly) an external set of pretrained word embeddings to improve the text classification done by fastText.
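A hedged sketch of how that might look with the Python bindings: fastText accepts a `pretrainedVectors` argument pointing at a text `.vec` file, and the model's `dim` has to match the dimension of those vectors. The helpers below (`read_vec_dimension` and `supervised_kwargs` are hypothetical names) only read the `.vec` header, which has the form `<number of words> <dimension>`, and prepare matching training arguments. The file paths are placeholders.

```python
def read_vec_dimension(path):
    """Read the '<n_words> <dim>' header line of a text .vec file."""
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().split()
    if len(header) != 2:
        raise ValueError(f"unexpected .vec header in {path!r}: {header}")
    return int(header[0]), int(header[1])


def supervised_kwargs(train_path, vec_path):
    """Build keyword arguments whose dim matches the pretrained vectors."""
    _, dim = read_vec_dimension(vec_path)
    return {"input": train_path, "pretrainedVectors": vec_path, "dim": dim}
```

Assuming the `fasttext` package is installed, the arguments could then be used as `fasttext.train_supervised(**supervised_kwargs("train.txt", "vectors.vec"))`. Whether this improves results on a given dataset is something to verify empirically.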