Better German words - Githubissues

trapicki commented 7 months ago

While practicing, I encountered the strange non-German word "lhr" (L-H-R).

The word list is currently 1500 words long and seems to come from https://www.lingq.com/de/deutsch-lernen-online/courses/836064/top-5000-1-6649638/. The original source of the words is unclear, and the list contains words like the notorious "lhr" and "claire", which is a name and certainly not one of the 1500 most frequent words.

1500 words certainly cover a good part for a start, but we can improve on that.

A good corpus with lists of words by frequency can be found at https://wortschatz.uni-leipzig.de/de and word lists at https://wortschatz.uni-leipzig.de/de/download/German

Hint: They also have word lists for tons of other languages.

aradzie commented 7 months ago

Getting a list of common words is a surprisingly tricky problem.

Actually, I took the list of German words from a movie subtitles database. Unfortunately, there are a few problems with this approach:

Spelling mistakes, like the words you mentioned.
US English specific words, such as person names, etc.
Profanity and vulgarity.

I took a look at the corpus that you suggested, and although I don't speak German, I noticed that some words, like corona-virus, are over-represented.

So I took a different route. I downloaded a dozen books from Project Gutenberg, then parsed them to extract a list of words sorted by frequency. Still, the result was not ideal. Again, some words, like character names, are over-represented. A single novel about Japan (written in German) can repeat a Japanese name many times, inflating that word's frequency.
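
A minimal sketch of this kind of frequency count, assuming each book is already loaded as a plain string. The names `WORD_RE`, `word_frequencies` and `top_words` are made up for illustration and are not taken from the keybr code:

```python
import re
from collections import Counter

# Letters allowed in a German word; an assumption made for this sketch.
WORD_RE = re.compile(r"[A-Za-zÄÖÜäöüß]+")


def word_frequencies(texts):
    """Count how often each word occurs across an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text))
    return counts


def top_words(texts, limit):
    """Return the `limit` most frequent (word, count) pairs, most frequent first."""
    return word_frequencies(texts).most_common(limit)
```

Because the counts are summed over all books, a name that is repeated often in a single novel ends up ranked next to genuinely common words, which is the bias described above.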

I think I need help from a native speaker. Can you please take a look at this document and remove the words that you don't like? Maybe you can filter out words quickly with the help of a spell-checker.

cunktuskaktus commented 6 months ago

Hello aradzie,

One thing that I find peculiar: in German, names, nouns and nominalizations are capitalized, and so is the beginning of a sentence. Why give the option to write german_noun.lower when it has to be german_noun.title?

I am aware that there is a capitalization feature, but unless you put it on 100%, which doesn't feel natural in flow, there will often enough be a german_noun.lower.

It's all about muscle memory, isn't it?

Cheers

torbengb commented 6 months ago

I checked your document and made some minor changes, like removing very weird person names ("Klothilde"?) and some spelling mistakes (the reformed spelling reduces the use of the ß character).

I notice that some words are capitalized properly but most are not. Is there any pattern to this? Is there a guideline? Should all words be lowercase, or should all words be "proper" case?

cunktuskaktus commented 6 months ago

@torbengb I started to capitalize nouns, but then stopped, since it might be superfluous. Also, I ran out of time.

torbengb commented 6 months ago

@aradzie I have a feeling that using Gutenberg as a source has the side effect of getting a corpus of language as it was used a long time ago. In contrast, the corpus from https://wortschatz.uni-leipzig.de/de/download/German as suggested by @trapicki is equally biased toward modern language use, and that's desirable, in my opinion.

I downloaded the 2023 version of the "news" corpus and ended up with about 30000 words after removing anything containing various non-letter characters. Sure, this still contains names and places, but I feel it's much more representative.

Here are my results: Google Sheet

I edited the source file using Notepad++ search and replace (see also Regex101 results):

Remove the leading and trailing columns, leaving only the middle column with the word list itself:
(\d+\t)*(\t\d+)*
Remove any lines that contain anything besides the actual letters (blank) and any lines that contain consecutive uppercase letters (upper):
^(?!([a-zäöüßA-ZÄÖÜ][a-zäöüß]+)$).*
Remove all blank lines:
Notepad++ > Edit > Line operations > Remove empty lines (containing blank characters)
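
For anyone who prefers a script, the same three steps can be approximated in Python. This is only a sketch: it assumes every line of the downloaded file has exactly three tab-separated columns (number, word, number), and `clean_word_lines` is a made-up name:

```python
import re

# The word pattern from the second step, used the other way round:
# a line is kept when the word matches, instead of deleted when it does not.
KEEP_RE = re.compile(r"[a-zäöüßA-ZÄÖÜ][a-zäöüß]+")


def clean_word_lines(lines):
    """Yield the middle column of each line if it looks like a plain word."""
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 3:
            continue  # blank line or unexpected layout: skip it
        word = columns[1]
        if KEEP_RE.fullmatch(word):
            yield word
```

Like the original regex, this drops single-letter entries and anything with an uppercase letter after the first position, such as "USA" or "iPhone".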

aradzie commented 6 months ago

Finding a good word frequency list is not an easy task. The ones you can find on the internet are plagued with all kinds of problems:

Word lists without frequencies.
Word lists too short.
Badly formatted word lists (PDF document with pictures).
Licensing issues.
etc

It might be a good idea to build our own word frequency list by scanning a corpus of German text.

@torbengb, I took a look at your spreadsheet. Unfortunately, you only included words without frequencies. I need word frequencies in order to train the phonetic model which generates pseudo-words. I want the list of common words and the phonetic model to be consistent with each other.

Like you, I also tried to scan Wortschatz Leipzig, and I found it to be heavily biased toward modern Internet slang. The word iPhone is in the top 200 most frequent words. It is also infested with many English words and US company and city names, like Apple, Seattle, etc.

I found another large corpus of German text, a cleaned-up dump of German Wikipedia. I think it should provide a more neutral list of words, although this corpus comes with its own set of issues. For example, the word Sowjetunion is in the top 1000 most frequent words ¯\_(ツ)_/¯.

I came to realize that any such list must be manually censored.

I moved the development of word frequency lists to a separate repository. It has a bunch of scripts to scan the texts for most frequent words in different languages. To remove the bad words while scanning we have a few heuristics:

1. An explicit stop list of words to exclude.
2. Exclude English words.
3. Exclude ALL UPPERCASE words.
4. Exclude words without vowels.
5. Exclude words with punctuation characters (dashes, apostrophes).
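
As a rough illustration, these heuristics could be combined into one predicate like the sketch below. This is not the code from the corpus repository. The function name, the vowel set (which treats y as a vowel) and the two word sets passed in are assumptions made for the example:

```python
import string

# Vowels for German, treating y as a vowel; an assumption for this sketch.
VOWELS = set("aeiouyäöü")


def keep_word(word, stop_words=frozenset(), english_words=frozenset()):
    """Return True if the word passes all of the heuristics listed above."""
    lower = word.lower()
    if lower in stop_words:
        return False  # explicit stop list
    if lower in english_words:
        return False  # known English word
    if word.isupper():
        return False  # ALL UPPERCASE, e.g. acronyms
    if not any(ch in VOWELS for ch in lower):
        return False  # no vowels
    if any(ch in string.punctuation for ch in word):
        return False  # dashes, apostrophes and other punctuation
    return True
```

One caveat with the English check: a word such as "die" exists in both languages, so a plain English dictionary used as `english_words` would also remove valid German words.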

To preserve capitalization, we count word occurrences in different cases. If the word Kopf occurs 10 times as kopf and 100 times as Kopf, then we keep the latter.

In the coming days I'll complete my scanning. I think I will combine Wortschatz Leipzig, as suggested by @trapicki, with the Wikipedia dump.
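
A small sketch of that case-merging step, assuming the input is a mapping from spelling to count. Adding the counts of all variants together is an assumption here, since the comment only says which spelling is kept, and `merge_case_variants` is a made-up name:

```python
from collections import Counter, defaultdict


def merge_case_variants(counts):
    """Collapse case variants into the most frequent spelling.

    The merged entry carries the total count of all variants (an assumption).
    """
    variants = defaultdict(Counter)
    for word, count in counts.items():
        variants[word.lower()][word] += count
    merged = {}
    for forms in variants.values():
        best_spelling, _ = forms.most_common(1)[0]
        merged[best_spelling] = sum(forms.values())
    return merged
```

With the numbers from the example, `{"kopf": 10, "Kopf": 100}` becomes a single entry for "Kopf".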

Here is a preview of what I've got so far -- words-de.csv

torbengb commented 6 months ago

@aradzie thank you for reviewing my suggestion and for explaining the details you need. I can tell that your words-de.csv is already a very very good list, much better than what you started with <3

For what it's worth, I have updated the Google Sheet with a second page including the frequencies.

Smart move to have the word lists in a separate repo - I will go check it out! Perhaps I can contribute other languages, too.

May I suggest that your rule "3. Exclude ALL UPPERCASE words" on your checklist could be sharpened to say exclude words with any uppercase after the first letter?

And your rule "5. Exclude words with punctuation" could say exclude words that contain non-letters, which for German would be anything outside [a-zäöüßA-ZÄÖÜ] in RegEx terms. This rule should be language-specific, like da-DK=[a-zæøåA-ZÆØÅ] and sv-SE=[a-zäöåA-ZÄÖÅ].
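
To make the idea concrete, here is a sketch of a per-language word pattern that also covers the "no uppercase after the first letter" rule. The short language keys and the function names are made up for the example, and single-letter words are allowed here, which is a choice the comment above does not specify:

```python
import re

# Extra lowercase letters per language, on top of a-z (hypothetical keys).
EXTRA_LOWER = {"de": "äöüß", "da": "æøå", "sv": "äöå"}


def uppercase_letters(letters):
    """Uppercase forms of the letters, skipping ß, whose upper() is 'SS'."""
    return "".join(ch.upper() for ch in letters if len(ch.upper()) == 1)


def word_pattern(lang):
    """Regex: one letter of either case, then lowercase letters only."""
    lower = "a-z" + EXTRA_LOWER[lang]
    upper = "A-Z" + uppercase_letters(EXTRA_LOWER[lang])
    return re.compile(f"[{lower}{upper}][{lower}]*")


def is_word(word, lang):
    """True if the whole word consists of letters valid for the language."""
    return word_pattern(lang).fullmatch(word) is not None
```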

aradzie commented 6 months ago

@torbengb These are good suggestions; I have updated the corpus project with better word filters.

I've seen your spreadsheet. Where is the data coming from? Can I use this data in keybr as a word frequency list? Or should I proceed with scanning the Wikipedia dump?

--- edit ---

As far as I understand, your spreadsheet comes from https://wortschatz.uni-leipzig.de/de, right? It has to be censored carefully. See for yourself:

832,Putin,17
837,Trump,17
860,Apple,16

I don't think we want the word list to be a history lesson or a news article. We want it to be simple, contemporary, everyday language.

aradzie commented 6 months ago

One final touch is to pass the resulting list of words through aspell to remove all unknown and extraneous words.
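
A sketch of how such a pass might look from Python, assuming the `aspell` command and a German dictionary are installed. It uses the `list` command, which prints the words aspell does not recognise. The function names are made up, and this is not necessarily how the corpus scripts do it:

```python
import subprocess


def unknown_words(words, lang="de"):
    """Return the set of words that aspell does not recognise.

    Requires the aspell binary and the dictionary for `lang` to be installed.
    """
    result = subprocess.run(
        ["aspell", f"--lang={lang}", "--encoding=utf-8", "list"],
        input="\n".join(words),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return set(result.stdout.split())


def drop_unknown(frequencies, unknown):
    """Keep only the entries whose word is not in the `unknown` set."""
    return {word: n for word, n in frequencies.items() if word not in unknown}
```

Usage would be along the lines of `drop_unknown(freqs, unknown_words(freqs))`, where `freqs` maps each word to its count.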

So now we have a list of words from a Wikipedia dump:

that come from a dictionary
with proper frequencies
and proper capitalization

I think we can close the issue.

aradzie commented 4 months ago

This is just a reminder to myself, and to anyone else who is trying to add new languages, that a good corpus of modern, simple, everyday language can be found in the OpenSubtitles database. These are available in many spoken languages.

The only caveat is that the corpus must be censored manually, because it contains a lot of profane and vulgar language.

aradzie / keybr.com

Better German words #99