A12Studios / semantle-es

Source code for "Semantle en español"
http://semantle-es.cgk.cl
GNU General Public License v3.0

gx: implement "real word" list #12

Open cgallegu opened 2 years ago

cgallegu commented 2 years ago

The top-1000 words in the word2vec dataset likely include a lot of garbage. Semantle-en implements a "real word" list to filter them out. I left that out when generating the db because 1) I didn't have a word-list file at hand and 2) I didn't understand how it would affect the game experience.

Thanks to @novalis for explaining how not having the file affects the game.

This could make #5 less pressing.
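A minimal sketch of the filtering idea, not the actual semantle-es or Semantle-en code: given neighbors already ranked by similarity, keep only those found in a "real word" set until the top N are collected. The function name `filter_neighbors`, the sample words, and the similarity scores are all made up for illustration.

```python
def filter_neighbors(ranked, real_words, top_n=1000):
    """Return the first top_n (word, score) pairs whose word is in real_words.

    ranked is assumed to be sorted by similarity, most similar first.
    """
    kept = []
    for word, score in ranked:
        if word.lower() in real_words:
            kept.append((word, score))
            if len(kept) == top_n:
                break
    return kept


# Hypothetical neighbors of "total"; the scores are invented.
ranked = [
    ("suma", 0.71),
    ("bancroft", 0.70),
    ("entero", 0.69),
    ("waco", 0.68),
    ("completo", 0.66),
]
# Hypothetical "real word" set, e.g. built from a frequency corpus.
real_words = {"suma", "entero", "completo", "total"}

top = filter_neighbors(ranked, real_words, top_n=2)
```

Filtering before truncating to N matters: dropping garbage after taking the top 1000 would leave fewer than 1000 entries.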

cgallegu commented 2 years ago

Rae.es has a corpus: http://corpus.rae.es/frec/CREA_total.zip

Check license.

cgallegu commented 2 years ago

Citation format:

http://corpus.rae.es/citar.htm

Looks like citation is good enough.

cgallegu commented 2 years ago

Inspected the file and saw some one-letter consonants in the list.
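One possible way to handle those entries, as a sketch only: drop single-letter tokens unless they are actual one-letter Spanish words ("a", "e", "o", "u", "y"). This assumes the corpus has already been parsed into plain word strings; the file's real layout and encoding are not covered here, and `clean_word_list` is a hypothetical helper name.

```python
# One-letter tokens that are genuine Spanish words (prepositions/conjunctions).
ONE_LETTER_WORDS = {"a", "e", "o", "u", "y"}


def clean_word_list(words):
    """Normalize words and drop stray single letters such as "b" or "x"."""
    cleaned = set()
    for raw in words:
        word = raw.strip().lower()
        if not word:
            continue  # skip blank entries
        if len(word) == 1 and word not in ONE_LETTER_WORDS:
            continue  # stray single letter, e.g. a lone consonant
        cleaned.add(word)
    return cleaned


# Hypothetical sample input, not taken from the corpus.
sample = ["de", "b", "y", " Total ", "x", "año", ""]
result = clean_word_list(sample)
```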

matiasg commented 2 years ago

This is the first time I've played the Spanish version. The word was "total", and in position 998/1000 there is "bancroft", which is not a Spanish word. Other non-Spanish words: "shinecock" (989), "waco" (979). Also, the top 20 words have little resemblance to "total". It seems to be an issue with the dataset; I didn't want to open a new issue because this one seems appropriate.

cgallegu commented 2 years ago

Hi, thanks for the report! I'm hearing this same comment from several people. Issue #5 should fix this.

I'll update that issue with the current state of things. There's some progress, but more work is needed before the fix can be rolled out without breaking the game for a day.