enz / german-wordlist

German wordlist for Tanglet and other wordgames.
Creative Commons Zero v1.0 Universal

Is this really limited to the German alphabet? #1

Closed by tube42 4 years ago

tube42 commented 4 years ago

I am trying to add this to my open-source word list collection (https://gitlab.com/tube42/wordlists/).

I was assuming the alphabet would be English + ßäöü but it seems to be much larger than that:

abcdefghijklmnopqrstuvwxyzäöüßàáâåçčéèêēëīíïîłñōóõœšūûú

Can I safely ignore words including àáâåçčéèêēëīíïîłñōóõœšūûú?

enz commented 4 years ago

There are many words in German that use foreign letters. How you want to handle this depends on your game.

Tanglet uses only the Latin alphabet for German words: it applies the transcriptions common in German crossword puzzles (ä→ae, ö→oe, ü→ue, ß→ss) and drops all diacritics from foreign letters. These transcriptions are done in the program when it reads the word list, so the list itself keeps the original spellings. Tanglet also uses those original, unmodified spellings when adding lookup links to Wiktionary in its word solution lists.
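
For anyone who wants to apply the same kind of transcription when loading the list, here is a minimal Python sketch. It is not Tanglet's actual code; the function name `normalize` and both mapping tables are assumptions made for illustration. In particular, the handling of œ and ł (which do not decompose into a base letter plus an accent) is a guess at a sensible crossword-style result.

```python
import unicodedata

# German transcriptions as listed above.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

# Letters that Unicode decomposition does not split into base letter + accent.
# Assumption: this table only covers letters mentioned in this thread.
SPECIAL_MAP = {"œ": "oe", "ł": "l"}


def normalize(word: str) -> str:
    """Return a lowercase a-z transcription of a word (hypothetical helper)."""
    # Compose first so that "a" + combining diaeresis is seen as "ä".
    word = unicodedata.normalize("NFC", word.lower())
    out = []
    for ch in word:
        if ch in GERMAN_MAP:
            out.append(GERMAN_MAP[ch])
        elif ch in SPECIAL_MAP:
            out.append(SPECIAL_MAP[ch])
        else:
            # Decompose and drop the combining marks (accents, cedilla, ...).
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)
```

With this sketch, `normalize("Straße")` gives `"strasse"` and `normalize("Crêpes")` gives `"crepes"`. The German umlauts have to be handled before the generic accent stripping, otherwise ä would become a rather than ae.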

tube42 commented 4 years ago

Just checked the files: of 447863 unique words, 469 (about 0.1%) contain foreign letters. I guess dropping them would be acceptable.

I doubt, however, that all letters in àáâåçčéèêēëīíïîłñōóõœšūûú are used; I think it's pretty much just é.
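
A quick way to check which of these letters really occur is to count them. The following Python sketch assumes the words are available as an iterable of strings; the function name `foreign_letter_stats` and the choice of base alphabet (a-z plus äöüß) are assumptions for illustration, not part of the repository.

```python
import unicodedata
from collections import Counter

# Assumed base alphabet: English letters plus the German additions.
BASE_ALPHABET = set("abcdefghijklmnopqrstuvwxyzäöüß")


def foreign_letter_stats(words):
    """Return (number of unique words with foreign letters, per-letter counts).

    Each letter is counted once per unique word it appears in.
    """
    letter_counts = Counter()
    foreign_words = 0
    # Compose to NFC so accented letters are single code points, then dedupe.
    unique = {unicodedata.normalize("NFC", w) for w in words}
    for word in unique:
        extras = {ch for ch in word.lower()
                  if ch.isalpha() and ch not in BASE_ALPHABET}
        if extras:
            foreign_words += 1
            letter_counts.update(extras)
    return foreign_words, letter_counts
```

Calling `letter_counts.most_common()` on the result lists the foreign letters by how many words use them, which shows directly whether é dominates or the others appear as well.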

d-mal commented 4 years ago

> I doubt, however, that all letters in àáâåçčéèêēëīíïîłñōóõœšūûú are used; I think it's pretty much just é.

It actually looks like they are all used. I have not tried to check all the words ;-) but most often these are foreign words that are still sometimes used in German without any change in spelling (except for the capitalization of nouns). Most of these words can also be found in the "Duden" (https://www.duden.de/woerterbuch), which is the closest thing there is to an official German dictionary.

A huge number of the foreign words in the list are from English, but of course they only have "normal" letters. The majority of the foreign accented letters do indeed come from French words (e.g. Crêpes, Protegé, Noël, Œuvre, Garçon...). Names of languages or ethnicities and some related adjectives also show up. The only Polish word in the list is "Złoty". Many, but not all, of these words are in the list twice, once with accents and once without.

ē only shows up in "Nasobēm" and derivations. That word was invented by a poet, and is indeed in "Duden", but with e instead of ē.

You could definitely drop these words, but you could also just "normalize" the accented letters which is what is done in German crossword puzzles.

d-mal commented 4 years ago

(ETA: I'm a native German speaker.) I think most of the foreign words in the list are indeed used in the German language. Some are very common (Café and the related compound words), others are probably used only in universities, many sound a bit stuffy and formal, and others are old-fashioned but not yet forgotten.