facebookresearch / fastText

Library for fast text representation and classification.
https://fasttext.cc/
MIT License

Pre-trained models preproc script normalises away all digits, some quotes #281

Open bittlingmayer opened 7 years ago

bittlingmayer commented 7 years ago

The offending part of the script is tr 0-9 " ", which replaces every digit with a space. As a result, tokens such as '1st' and '3D' are not in wiki.en.vec.

'It won 1st place in the 3D film contest.'
-> 'it won st place in the d film contest .'
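
A minimal Python sketch of the digit step alone, for anyone who wants to reproduce the effect without running the shell script. It assumes that tr 0-9 " " maps each ASCII digit to a single space; the function names are hypothetical, and the rest of the script (lowercasing, punctuation splitting) is deliberately not modelled.

```python
# Assumption: tr 0-9 " " turns every ASCII digit into one space.
DIGITS_TO_SPACE = str.maketrans("0123456789", " " * 10)


def strip_digits(text):
    """Replace each ASCII digit with a space (models only the tr step)."""
    return text.translate(DIGITS_TO_SPACE)


def tokens_after_strip(text):
    """Whitespace-split the text after the digit step."""
    return strip_digits(text).split()


if __name__ == "__main__":
    sentence = "It won 1st place in the 3D film contest."
    # '1st' and '3D' lose their digits and survive only as 'st' and 'D'.
    print(tokens_after_strip(sentence))
```

Since the digits are gone before the vocabulary is built, no token containing a digit can receive a vector.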

Another bug is the final substitution, which removes '«'. That is probably because on English pages it mostly appears in navigation elements such as breadcrumbs.

But in most European languages, it is an ordinary quotation mark. (Opening in some, closing in others.)

'Г. Шмидт, можно сказать «Давай давай!»?'
-> 'г . шмидт , можно сказать давай давай ! » ?'

'Dann stammerte er »Was... was fr a Witz soll des denn sein?«'
-> 'dann stammerte er »was . . . was fr a witz soll des denn sein ?'

This is quite odd: only one half of each quotation pair is removed, so the other half is left dangling.
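
A short Python sketch of the quote problem in the two examples above. It assumes the final substitution replaces every '«' with a space and leaves '»' untouched; the helper names are hypothetical and nothing else in the script is modelled.

```python
def drop_left_guillemet(text):
    """Assumed behaviour of the final substitution: '«' becomes a space."""
    return text.replace("«", " ")


def guillemets_balanced(text):
    """True when '«' and '»' occur equally often."""
    return text.count("«") == text.count("»")


if __name__ == "__main__":
    samples = [
        "Г. Шмидт, можно сказать «Давай давай!»?",
        "Dann stammerte er »Was... was fr a Witz soll des denn sein?«",
    ]
    for sample in samples:
        cleaned = drop_left_guillemet(sample)
        # Balanced before the substitution, unbalanced after it.
        print(guillemets_balanced(sample), guillemets_balanced(cleaned))
```

Whether '«' opens the quotation (Russian) or closes it (German), the matching '»' is left behind on its own.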

See also: https://github.com/facebookresearch/fastText/issues/161

bittlingmayer commented 6 years ago

Any update on this? It would be ideal to avoid these issues before the next re-training.