Closed: yananchen1989 closed this issue 4 years ago
@yananchen1989 sent2vec is built on the FastText code. I don't think you need to do any such pre-processing of the text. If removing non-ASCII characters were required, FastText would not be able to build models for non-English languages, whose characters usually fall outside the ASCII range.
You can have a look at Edouard Grave's answer to a similar question at https://github.com/facebookresearch/fastText/issues/441
Many thanks, Kaushik! I'm closing the issue.
Hello, do I need to do some cleaning work, such as removing all punctuation and non-ASCII characters, lowercasing the entire text, or even removing stop words, before using the model to infer embeddings? I find that if the text is lowercased, the predicted embedding differs from the one for the non-lowercased text.
My texts are all news scripts.
Thanks.
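(The behavior described above can be illustrated with a minimal sketch. The toy vocabulary, vectors, and `embed_sentence` helper below are all hypothetical, not part of sent2vec's API; the point is only that a case-sensitive vocabulary stores separate entries for "Hello" and "hello", so averaging token vectors, roughly how sent2vec-style models compose sentence embeddings, gives different results for differently cased input.)

```python
import numpy as np

# Hypothetical case-sensitive vocabulary: "Hello" and "hello" are
# distinct tokens with distinct vectors, as in a model trained on
# uncased (i.e., not lowercased) text.
vocab = {
    "Hello": np.array([0.9, 0.1]),
    "hello": np.array([0.2, 0.8]),
    "world": np.array([0.5, 0.5]),
}
unk = np.zeros(2)  # out-of-vocabulary tokens map to a zero vector here


def embed_sentence(sentence):
    """Average the token vectors to get a sentence embedding
    (a simplified stand-in for a real sent2vec model)."""
    tokens = sentence.split()
    return np.mean([vocab.get(t, unk) for t in tokens], axis=0)


a = embed_sentence("Hello world")
b = embed_sentence("hello world")
print(np.allclose(a, b))  # False: casing changed the token lookups
```

This is why inference on lowercased text should match whatever casing convention was used when the model was trained.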