KomodoOpenLab / TypeOver

TypeOver facilitates typing on your iOS device when using external switch interfaces compatible with VoiceOver.
Unigram frequencies in en_wordlist.xml appear to have been flattened #207

Open tnantais opened 11 years ago

tnantais commented 11 years ago

I'm not sure where en_wordlist.xml came from, but the spread of unigram frequencies is extremely narrow: the most popular word, "the", has frequency 222, while the 50,000th most popular word, "exude", has frequency 66. This suggests either a very small training corpus or, more likely, some kind of log() flattening function. Flattened frequencies are acceptable for ordinary unigram prediction, since relative ordering is largely preserved, but for our adaptation purposes we need raw frequencies.
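
To illustrate the concern, here is a minimal sketch of how a log-style flattening compresses a wide frequency range into a narrow one. Everything in it is an assumption made for illustration: the `flatten` function, its `scale` parameter, and the Zipf-like raw counts are hypothetical and are not taken from en_wordlist.xml or the TypeOver code.

```python
import math


def flatten(raw_count, scale=20.0):
    """Hypothetical log flattening: compress a raw count into a small score.

    The real function used to build en_wordlist.xml (if any) is unknown;
    this only shows the general effect of a log() transform.
    """
    return round(scale * math.log(raw_count))


# Hypothetical Zipf-like raw counts, where count is proportional to 1 / rank.
C = 50_000_000
raw_the = C / 1          # rank 1
raw_exude = C / 50_000   # rank 50,000

flat_the = flatten(raw_the)
flat_exude = flatten(raw_exude)

# Ratio between the most frequent and the 50,000th word, before and after.
raw_ratio = raw_the / raw_exude
flat_ratio = flat_the / flat_exude

print(f"raw:       the={raw_the:.0f}  exude={raw_exude:.0f}  ratio={raw_ratio:.0f}")
print(f"flattened: the={flat_the}  exude={flat_exude}  ratio={flat_ratio:.2f}")
```

Under these assumptions the ordering of the two words is preserved, but the ratio between them shrinks by several orders of magnitude, which is roughly the pattern seen in the file (222 vs. 66). Any adaptation step that adds observed user counts to the stored values would then weight new observations very differently than it would against raw counts.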