facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Tokenization issue in to-En bilingual dictionaries #182

Open kellymarchisio opened 3 years ago

kellymarchisio commented 3 years ago

Hi all, FYI: there appears to be a tokenization issue in the *-to-En bilingual dictionaries. Entries of the form "word," are common, where the trailing comma was not stripped during tokenization. I see this in de-en, fi-en, it-en, and ru-en, at least.
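For anyone who wants to check their own copy, here is a rough sketch of how the affected entries could be found and cleaned. It assumes each dictionary line holds one whitespace-separated source/target pair; the function names and the `de-en.txt` filename in the usage note are hypothetical, not part of MUSE.

```python
def find_comma_entries(lines, chars=","):
    """Return (line_number, source, target) for pairs where either word
    ends with one of `chars`. Assumes one whitespace-separated pair per line."""
    flagged = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if len(parts) != 2:
            # Skip blank or unexpectedly formatted lines rather than guessing.
            continue
        src, tgt = parts
        if src != src.rstrip(chars) or tgt != tgt.rstrip(chars):
            flagged.append((lineno, src, tgt))
    return flagged


def clean_pair(src, tgt, chars=","):
    """Strip the trailing punctuation in `chars` from both words of a pair."""
    return src.rstrip(chars), tgt.rstrip(chars)


if __name__ == "__main__":
    # Made-up sample lines for illustration, not taken from the real dictionaries.
    sample = ["haus house", "hund, dog", "katze cat,"]
    for lineno, src, tgt in find_comma_entries(sample):
        print(lineno, (src, tgt), "->", clean_pair(src, tgt))
```

To run it on a real file, something like `find_comma_entries(open("de-en.txt", encoding="utf-8"))` should work, since a file object iterates over lines. Only trailing commas are stripped by default; pass a different `chars` value if other punctuation turns out to be affected, and note that cleaning may create duplicate pairs that then need deduplicating.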