facebookresearch / StarSpace

Learning embeddings for classification, retrieval and ranking.
MIT License

starspace not recognizing labels trained by fasttext #251

Open tharangni opened 5 years ago

tharangni commented 5 years ago

I am trying to do a multi-label document classification task, and I want to use pretrained word vectors from fastText as my initial model weights.

However, the labels are not recognized as distinct labels. For example, if I have a label __label__science in my dataset, the __label__ prefix is stripped and a vector is generated only for science, so the information that it is a label in fastText is lost.

Therefore, when I try to load such a model into StarSpace, no labels are recognized from the pretrained vectors (num labels in model = 0), which defeats my original classification objective. Is there a way to work around this problem?

#163 was also referenced, but I don't think it addresses this issue.

ledw commented 5 years ago

@tharangni Hi, thanks for reporting. That is not the expected behavior. Did you set the -label parameter to be __label__? In addition, make sure that '-fileFormat' is set to 'fastText'.

tharangni commented 5 years ago

@ledw I did that as mentioned, but the problem still isn't resolved.

ledw commented 5 years ago

@tharangni Sorry for the delay in replying; this slipped through. I just tried a toy example with something like

    hello 0.1 0.2 0.3
    world -0.1 0.0 0.5
    __label__1 0.9 -1.2 -0.5

and the model is able to load 2 words and 1 label. Is your fastText pretrained embedding in the same format? If you can share the pretrained embeddings or the data you used, I can help look into it further.
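
To check whether an embedding file contains any label entries before loading it, a quick script along these lines may help. It is only a sketch and not StarSpace's own loader: it assumes one entry per line in the form "token v1 v2 ...", split on whitespace, with no header line, and the function name count_words_and_labels is made up for illustration.

```python
from typing import Iterable, Tuple


def count_words_and_labels(
    lines: Iterable[str], label_prefix: str = "__label__"
) -> Tuple[int, int]:
    """Count word and label entries in lines of the form 'token v1 v2 ...'.

    Assumption: whitespace-separated fields and no header line.
    This is not StarSpace's actual parsing logic.
    """
    words = 0
    labels = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and lines without a vector
        if parts[0].startswith(label_prefix):
            labels += 1
        else:
            words += 1
    return words, labels


# The toy example from this thread: two words and one label.
toy = [
    "hello 0.1 0.2 0.3",
    "world -0.1 0.0 0.5",
    "__label__1 0.9 -1.2 -0.5",
]
print(count_words_and_labels(toy))  # (2, 1)
```

If the label count comes back as 0 for the real file, the prefix was already lost before StarSpace read it, which would match the "num labels in model = 0" symptom described above.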