Closed: limiao2 closed this issue 6 years ago
Hi @limiao2,
Thank you for asking this question.
The only pre-processing we applied before learning word vectors is tokenization. For our latest English word vectors, which were trained on News, Wikipedia and Common Crawl, we used the Moses tokenizer (available at https://github.com/moses-smt/mosesdecoder). This tokenizer separates the $ sign from the following number. Hence, your example is tokenized as:
give me $ 5,000
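(For reference, a minimal sketch of this behavior using the sacremoses Python port of the Moses tokenizer; this port is an assumption for illustration, since the vectors themselves were produced with the Perl scripts from the mosesdecoder repository linked above.)

```python
# Minimal sketch: approximating the Moses tokenization with the
# `sacremoses` Python port (pip install sacremoses). The official
# fastText pipeline used the Perl tokenizer from mosesdecoder.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
print(mt.tokenize("give me $5,000"))
# Expected output: ['give', 'me', '$', '5,000']
```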
Best, Edouard.
Hi,
I have a specific question about how the training text data are preprocessed, in terms of ITN (Inverse Text Normalization).
For example, if the lexical text is 'give me five thousand dollars', then its ITN format is 'give me $5,000'. Which text do you use for training the fastText word embeddings?
The reason I am curious is that in fastText, I found an embedding vector for the sign '$' and also one for '5,000', but none for '$5,000'. Did you separate the sign and the number with a space during training, like '$ 5,000'? Or is there any other specific text preprocessing that you did for the fastText training?
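(As an illustration of the check described above, here is a rough sketch of how one can look up which tokens appear in a published .vec file using gensim; the file name cc.en.300.vec and the gensim >= 4 API are assumptions, not part of the original question.)

```python
# Rough sketch: checking which tokens exist in a published fastText
# .vec file (word2vec text format) with gensim >= 4. The file name
# below is an assumption about which vectors were downloaded.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("cc.en.300.vec")
for token in ["$", "5,000", "$5,000"]:
    print(token, token in kv.key_to_index)
```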
Thanks!