billthefarmer / gurgle

Fairly simple android word game
https://billthefarmer.github.io/gurgle/
GNU General Public License v3.0
56 stars 18 forks

Update French.txt #19

Closed A-Nicoladie closed 2 years ago

A-Nicoladie commented 2 years ago

I saw your recent commit: thanks for adding the French language! The file contains a lot of uncommon words (or words I didn't know), which sometimes made the game complicated and forced me to guess the word at random. I propose a correction with a list of more "common" words (which can probably still be modified).


Based on the "Lexique" v3.83 database (http://www.lexique.org) (140 000 words)

- Removal of words with fewer or more than 5 letters (nblettres=5)
- Removal of words that contain quotes or dashes
- Removal of accents and diacritics (à->a / ç->c / ê->e ...)
- Removal of duplicates (the one with the highest total frequency of occurrence is kept)
- Removal of uncommon words (total frequency of occurrence <0.5) or poorly known words (deflem<80)

Result: 3238 words
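The filtering steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script actually used to build the list: the row keys `word` and `freq` are hypothetical names (only `nblettres` and `deflem` are mentioned in this thread), and the sample rows and their numbers are invented for the example.

```python
import unicodedata


def strip_accents(word):
    # Decompose accented letters (NFD) and drop the combining marks,
    # so "à" -> "a" and "ç" -> "c". Ligatures such as "œ" have no
    # canonical decomposition and are left unchanged by this helper.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_word_list(rows, length=5, min_freq=0.5, min_deflem=80):
    # rows: dicts with hypothetical keys "word", "freq" and "deflem".
    best = {}  # unaccented word -> row with the highest frequency
    for row in rows:
        word = row["word"].lower()
        if len(word) != length:
            continue  # keep only 5-letter words
        if "'" in word or "-" in word:
            continue  # drop words containing quotes or dashes
        plain = strip_accents(word)
        kept = best.get(plain)
        if kept is None or row["freq"] > kept["freq"]:
            best[plain] = row  # duplicates: keep the most frequent one
    # Finally drop uncommon or poorly known words.
    return sorted(
        plain
        for plain, row in best.items()
        if row["freq"] >= min_freq and row["deflem"] >= min_deflem
    )


# Invented sample rows, for illustration only.
sample_rows = [
    {"word": "tâche", "freq": 20.0, "deflem": 100},
    {"word": "tache", "freq": 10.0, "deflem": 100},
    {"word": "leçon", "freq": 15.0, "deflem": 100},
    {"word": "école", "freq": 50.0, "deflem": 100},
    {"word": "maison", "freq": 90.0, "deflem": 100},  # 6 letters
    {"word": "abaca", "freq": 0.1, "deflem": 40},     # too rare
    {"word": "c'est", "freq": 99.0, "deflem": 100},   # contains a quote
]

print(build_word_list(sample_rows))
```

The sketch applies the steps in the order given above, so a duplicate is resolved before the frequency and `deflem` thresholds are checked.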

A-Nicoladie commented 2 years ago

@tatref Do you agree with this?

tatref commented 2 years ago

I agree that a lot of the current words are not very common. I also wanted to dig into Lexique to improve the list ;-)

That seems very good to me!

billthefarmer commented 2 years ago

Good idea! However, the original English app has two dictionaries: a shorter one, like the one you have just created, for the words to guess, and a longer one with more obscure words for the allowed guesses. I will use your shorter one for the words to guess. If you think the dictionary from Lexique is better than the one I got from Lexica, I could use that for the longer dictionary.
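To make the two-dictionary idea concrete, here is a minimal sketch. It is not the app's actual code: the function names and the tiny word sets are hypothetical, chosen only to show that targets come from the short list while guesses are checked against both lists.

```python
import random


def pick_target(answers, rng=random):
    # The word to guess always comes from the short list of common words.
    return rng.choice(sorted(answers))


def is_allowed_guess(guess, answers, allowed):
    # A guess is accepted if it appears in either dictionary, so obscure
    # words can be typed but are never chosen as the target.
    guess = guess.lower()
    return guess in answers or guess in allowed


# Tiny invented word sets, for illustration only.
answers = {"tache", "lecon", "ecole"}
allowed = {"abaca"}

print(is_allowed_guess("abaca", answers, allowed))
print(is_allowed_guess("zzzzz", answers, allowed))
```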

billthefarmer commented 2 years ago

OK, I have retained the longer French dictionary and merged your PR. I looked at Lexique, but could not work out how to download the French dictionary.

A-Nicoladie commented 2 years ago

You can go here: http://www.lexique.org/shiny/openlexicon/

Apply all the filters you want (regexes such as ^[a-zàâæçéèêëîïôœùûüÿ]{5}$ also work in the "word" column), then download the corresponding file (button at the bottom of the table).

For information, this database is the result of a huge meta-analysis of 218 literary texts and 9474 subtitles of movies or series (more than 64 million words in total), which gives an overview of how frequently each word is used (along with many other properties).

billthefarmer commented 2 years ago

Thank you, that's very useful. An easier regex is ^\w{5}$. However, the site is very frustrating: you get disconnected just when you've got what you want on the screen.

A-Nicoladie commented 2 years ago

Be careful with the regex: the two don't return the same list. With ^\w{5}$, you don't get the accented characters.
For the disconnection, I'm not sure I understand what you mean... 🤔
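The difference between the two regexes can be tried out locally. The sketch below uses Python's `re` module, which is an assumption: the thread does not say which regex engine the Open Lexicon site uses. In Python, `\w` on text is Unicode-aware by default and only loses accented letters under `re.ASCII`, which is the behaviour reported above for the site. Either way, `\w` also accepts digits and underscores, which the explicit character class does not.

```python
import re

# Explicit class from the thread: lowercase French letters only.
EXPLICIT = re.compile(r"^[a-zàâæçéèêëîïôœùûüÿ]{5}$")
# \w in Python's default (Unicode-aware) mode.
WORD_UNICODE = re.compile(r"^\w{5}$")
# \w restricted to [a-zA-Z0-9_], which drops accented letters.
WORD_ASCII = re.compile(r"^\w{5}$", re.ASCII)

# Invented samples, for illustration only.
samples = ["leçon", "tache", "ab_12", "maison"]


def matching(pattern, words):
    # Return the words that the pattern accepts, in input order.
    return [w for w in words if pattern.search(w)]


for name, pattern in [
    ("explicit", EXPLICIT),
    ("unicode \\w", WORD_UNICODE),
    ("ascii \\w", WORD_ASCII),
]:
    print(name, matching(pattern, samples))
```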

billthefarmer commented 2 years ago

Yes, you're right, that doesn't work, but I didn't get far enough to find out. This is what I meant...

[Screenshot 2022-02-26 at 11-11-27: Open Lexicon]