Open cgallegu opened 2 years ago
Rae.es has a corpus: http://corpus.rae.es/frec/CREA_total.zip
Check license.
Inspect the data. At a first look, I saw some single-letter consonants in the list.
This is the first time I've played the Spanish version. The word was "total", and in position 998/1000 there is "bancroft", which is not a Spanish word. Other non-Spanish words: "shinecock" (989), "waco" (979). Also, the top 20 words bear little resemblance to "total". It looks like an issue with the dataset; I didn't want to open a new issue because this one seems appropriate.
Hi, thanks for the report! I'm hearing this same comment from several people. Issue #5 should fix this.
I'll update that issue with the current state of things. There's some progress but still some work needed to be able to roll the fix out without breaking the game for a day.
The top 1000 words from the word2vec dataset likely include a lot of garbage. Semantle-en uses a "real word" list to filter them out. I left that out when generating the db because 1) I didn't have such a file at hand and 2) I didn't understand how it would affect the game experience.
Thanks to @novalis for explaining how not having the file affects the game.
This could make #5 less pressing.
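For reference, a rough sketch of what a "real word" filter could look like when building the top 1000 list. This is not the Semantle-en code: the function name `filter_real_words`, the `(word, score)` input shape, and the toy scores are all assumptions made up for illustration. The idea is just to walk the ranked neighbours and drop anything that is not in the allowed word list.

```python
def filter_real_words(ranked_neighbors, real_words, limit=1000):
    """Keep only neighbours found in real_words, preserving rank order.

    ranked_neighbors: iterable of (word, score) pairs, best match first
                      (hypothetical shape, not the actual db format).
    real_words:       iterable of words considered valid.
    limit:            maximum number of neighbours to keep.
    """
    # Case-insensitive membership test against the allowed list.
    allowed = {w.casefold() for w in real_words}
    kept = []
    for word, score in ranked_neighbors:
        if word.casefold() in allowed:
            kept.append((word, score))
            if len(kept) == limit:
                break
    return kept


# Toy data: the similarity scores below are invented for this example.
neighbors = [
    ("suma", 0.71),
    ("bancroft", 0.70),
    ("conjunto", 0.66),
    ("waco", 0.65),
    ("global", 0.61),
]
real_words = {"suma", "conjunto", "global", "total"}

top = filter_real_words(neighbors, real_words, limit=1000)
print([word for word, _ in top])
```

With a filter like this, entries such as "bancroft" and "waco" would be skipped and the next valid words would move up to fill the list, so the quality of the result depends entirely on how good the real word list is.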