facebookresearch / MUSE

A library for Multilingual Unsupervised or Supervised word Embeddings

Is there a bug in processing the vocab? #139

Open YankeeMarco opened 5 years ago

YankeeMarco commented 5 years ago

grep -E "^有问题" wiki.multi.en.vec
有问题吗 -0.0722421 0.003276 0.0232957 0.0377831 -0.0524567 -0.0871994 0.0290384 -0.0429422 -0.0542887 0.0547511 -0.0975471 0.000825248 -0.0330909 -0.0819634 0.0410098 0.0373118 -0.00135475 0.000818792 0.14323 0.00739884 -0.00820092 0.0452824 0.0288035 0.0637914 -0.122232 -0.0123121 0.00424665 -0.0311599 -0.0598393 0.0196687 -0.0665083 0.0142472 -0.0301036 0.0199317 -0.0595084 0.079112 -0.0528335 0.0443886 -0.00980627 0.00606932 -0.0338872 -0.0829769 -0.00788328 0.0687998 0.0213559 -0.0165972 0.017098 5.34953e-05 0.0442527 -0.0719274 -0.017324 -0.0745483 -0.0461659 -0.110819 -0.0414707 -0.107673 -0.0431018 0.00167493 0.00362319 0.118791 -0.05303 0.018048 0.0548915 0.0210722 -0.0687746 -0.0310432 0.06937 0.0370799 -0.0270513 -0.0415062 -0.04227 0.00212067 0.0233356 0.0213943 0.0297549 0.17163 0.00794518 0.0614644 -0.0379235 -0.0344915 0.0479772 0.0878937 0.0221271 -0.0101811 0.0258886 0.00243166 0.0825352 -0.107217 0.0412344 0.0105162 0.0598821 0.0263598 0.026973 -0.0641134 -0.0371301
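
The grep above finds one such entry by hand. A minimal Python sketch of a broader scan follows, assuming the file uses the usual text layout of one token followed by space-separated floats per line, optionally preceded by a "count dim" header. The function names `is_cjk` and `find_cjk_tokens` are hypothetical helpers, not part of MUSE, and the check covers only the basic CJK Unified Ideographs block.

```python
def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
    # is checked; extension blocks and other scripts are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def find_cjk_tokens(lines, limit=20):
    """Return up to `limit` tokens that contain at least one CJK ideograph."""
    hits = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip("\n").split(" ")
        # Skip an optional "count dim" header on the first line.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        token = parts[0]
        if any(is_cjk(ch) for ch in token):
            hits.append(token)
            if len(hits) >= limit:
                break
    return hits
```

To run it on the file from this issue, open it with `open("wiki.multi.en.vec", encoding="utf-8")` and pass the file object as `lines`; it is read one line at a time, so the whole file is not loaded into memory.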

sabetAI commented 5 years ago

That's not the only bug: there are also duplicate tokens (e.g. ','), and words with a trailing comma are treated as separate tokens (e.g. 'life' and 'life,').
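
Both problems can be checked with a short script. This is only a sketch, assuming the same text layout (token first, then space-separated floats, with an optional "count dim" header). `read_tokens` and `audit_vocab` are hypothetical names, and the punctuation set is an arbitrary choice for illustration.

```python
from collections import Counter


def read_tokens(lines):
    """Collect the first field of each line, skipping blanks and a header."""
    tokens = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue
        parts = line.rstrip("\n").split(" ")
        # Skip an optional "count dim" header on the first line.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        tokens.append(parts[0])
    return tokens


def audit_vocab(tokens, punct=",.;:!?"):
    """Return (duplicates, variants).

    duplicates: tokens that appear more than once.
    variants: (base, token) pairs where `token` is `base` plus trailing
    punctuation and both are present in the vocabulary.
    """
    counts = Counter(tokens)
    duplicates = sorted(t for t, n in counts.items() if n > 1)
    vocab = set(counts)
    variants = sorted(
        (t.rstrip(punct), t)
        for t in vocab
        if t.rstrip(punct) != t and t.rstrip(punct) in vocab
    )
    return duplicates, variants
```

For the cases described here, a file containing ',' twice plus both 'life' and 'life,' would report ',' as a duplicate and ('life', 'life,') as a variant pair. Note that `read_tokens` holds every token in memory, which may matter for large embedding files.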