Closed: limiao2 closed this issue 6 years ago
Hi @limiao2,
Thank you for asking this question.
The only pre-processing we applied before learning word vectors is tokenization. For our latest English word vectors, which were trained on News, Wikipedia and Common Crawl, we used the Moses tokenizer (available at https://github.com/moses-smt/mosesdecoder). This tokenizer separates the $ sign from the following number. Hence, your example is tokenized as:
give me $ 5,000
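(For reference, a minimal sketch of this behavior using the sacremoses Python port of the Moses tokenizer; this port is an assumption for illustration, since the vectors themselves were produced with the Perl scripts from the mosesdecoder repository linked above.)

```python
# Minimal sketch: approximating the Moses tokenization with the
# `sacremoses` Python port (pip install sacremoses). The official
# fastText pipeline used the Perl tokenizer from mosesdecoder.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
print(mt.tokenize("give me $5,000"))
# Expected output: ['give', 'me', '$', '5,000']
```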
Best, Edouard.
Hi,
I have a specific question about how the training text data are preprocessed, in terms of ITN (Inverse Text Normalization).
For example, if the lexical text is 'give me five thousand dollars', then its ITN format is 'give me $5,000'. Which text do you use for training the fastText word embeddings?
The reason I am curious is that in fastText, I found an embedding vector for the sign '$' and also one for '5,000', but none for '$5,000'. Did you separate the sign and the number with a space during training, like '$ 5,000'? Or is there any other specific text preprocessing that you did for the fastText training?
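(As an illustration of the check described above, here is a rough sketch of how one can look up which tokens appear in a published .vec file using gensim; the file name cc.en.300.vec and the gensim >= 4 API are assumptions, not part of the original question.)

```python
# Rough sketch: checking which tokens exist in a published fastText
# .vec file (word2vec text format) with gensim >= 4. The file name
# below is an assumption about which vectors were downloaded.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("cc.en.300.vec")
for token in ["$", "5,000", "$5,000"]:
    print(token, token in kv.key_to_index)
```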
Thanks!