epfml / sent2vec

General purpose unsupervised sentence representations

Should I clean and pre-process the text before prediction? #100

Closed yananchen1989 closed 4 years ago

yananchen1989 commented 4 years ago

Hello, do I need to clean the text before using the model to infer embeddings, for example by removing all punctuation and non-ASCII characters, lowercasing the entire text, or even removing stop words? I find that if the text is lowercased, the predicted embedding differs from the one for the text in its original case.

My texts are all news scripts.

Thanks.

kaushikacharya commented 4 years ago

@yananchen1989 sent2vec is built upon the FastText code. I don't think you need to do any such pre-processing of the text. If removing non-ASCII characters were required, FastText would not be able to build models for non-English languages, whose characters are usually not part of the ASCII character set.

You can have a look at Edouard Grave's answer to a similar question at https://github.com/facebookresearch/fastText/issues/441
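
To illustrate why lowercasing changes the embedding, here is a minimal toy sketch in plain Python. It is not the sent2vec or FastText API: `toy_word_vector` and `toy_sentence_embedding` are hypothetical helpers, and a hash stands in for the learned lookup table. The only point it makes is that tokens are matched as exact strings, so "Stocks" and "stocks" map to different vectors, and that non-ASCII tokens such as "Zürich" are handled like any other token.

```python
import hashlib
from typing import List

DIM = 8  # toy embedding size, chosen arbitrarily for this sketch


def toy_word_vector(token: str) -> List[float]:
    """Return a deterministic pseudo-vector for a token.

    Assumption: the hash is a stand-in for a learned embedding table.
    The lookup is by exact string, so case and accents both matter.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:DIM]]


def toy_sentence_embedding(sentence: str) -> List[float]:
    """Average the toy word vectors of the whitespace-separated tokens."""
    tokens = sentence.split()
    if not tokens:
        return [0.0] * DIM
    vectors = [toy_word_vector(token) for token in tokens]
    return [sum(column) / len(tokens) for column in zip(*vectors)]


original = "Zürich Stocks Rally"
lowered = original.lower()

# The same input always gives the same embedding ...
same_input_matches = toy_sentence_embedding(original) == toy_sentence_embedding(original)
# ... while the lowercased text consists of different tokens.
lowered_matches = toy_sentence_embedding(original) == toy_sentence_embedding(lowered)

print("same input matches:", same_input_matches)
print("lowercased matches:", lowered_matches)
```

In a real model the vectors are learned rather than hashed, but the same reasoning applies: changing the case of the input changes which tokens are looked up, so a different embedding is expected rather than a sign of a bug.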

martinjaggi commented 4 years ago

Many thanks, Kaushik! I'm closing the issue.