I'm not sure where en_wordlist.xml came from, but the spread of its unigram frequencies is extremely narrow: the most popular word, "the", has frequency 222, while the 50,000th most popular word, "exude", has frequency 66. This suggests either a very small training corpus or, more likely, some kind of log() flattening function. Flattening the frequencies is acceptable for ordinary unigram prediction, since relative ordering is largely preserved, but for our adaptation purposes we need raw frequencies.
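
To put the narrowness in numbers, here is a rough sketch comparing the observed ratio (222 / 66) with what raw counts might look like. It assumes raw counts follow Zipf's law (count proportional to 1/rank), which is only an approximation for real corpora. The functions `zipf_count` and `log_flatten`, and the `top_count` and `scale` values, are hypothetical and chosen purely for illustration; the actual function used to produce en_wordlist.xml is unknown.

```python
import math

# Values reported for en_wordlist.xml.
OBSERVED_TOP = 222    # "the", rank 1
OBSERVED_TAIL = 66    # "exude", rank 50,000
TAIL_RANK = 50_000


def zipf_count(rank, top_count=1_000_000_000):
    """Hypothetical raw count at a given rank, assuming count ~ 1/rank (Zipf)."""
    return top_count / rank


def log_flatten(count, scale=16.0):
    """Hypothetical flattening: scale * ln(count), rounded to an integer.

    This is an assumed stand-in, not the function actually used for the file.
    """
    return round(scale * math.log(count))


raw_top = zipf_count(1)
raw_tail = zipf_count(TAIL_RANK)

flat_top = log_flatten(raw_top)
flat_tail = log_flatten(raw_tail)

observed_ratio = OBSERVED_TOP / OBSERVED_TAIL   # ratio seen in the file
raw_ratio = raw_top / raw_tail                  # ratio under the Zipf assumption
flat_ratio = flat_top / flat_tail               # ratio after log flattening

print(f"observed ratio:       {observed_ratio:.2f}")
print(f"Zipf raw ratio:       {raw_ratio:.0f}")
print(f"log-flattened ratio:  {flat_ratio:.2f}")
```

Under these assumptions the raw ratio is four orders of magnitude larger than the observed one, while the log-flattened ratio lands in the same single-digit range as the file. Ordering is preserved (the top word still scores higher than the tail word), but the relative magnitudes that adaptation would rely on are lost, and they cannot be recovered without knowing the exact flattening function and its constants.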