first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

Frequency Fail #1

Closed (trans closed this issue 11 years ago)

trans commented 11 years ago

I have a hard time believing "information" is more frequent than "when".

Also, there are numerous entries for single letters like "x" and state abbreviations like "sd", which IMO are not useful entries.

first20hours commented 11 years ago

You'll have to ask Peter Norvig (http://norvig.com/) about that: it's his data. He's the director of research at Google and a careful and trustworthy guy, so I trust the data. Here's the original source if you're interested: http://norvig.com/ngrams/count_1w.txt. Linked from this page: http://norvig.com/ngrams/
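If you want to check a word's position yourself, something like the following might help. It's only a rough sketch: it assumes each line of count_1w.txt is a word and a count separated by a tab, and the function names `load_counts` and `rank_of` are made up for this example. The sample counts are placeholders, not real numbers from the corpus.

```python
def load_counts(lines):
    """Parse 'word<TAB>count' lines into a dict (assumed file format)."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        counts[word] = int(count)
    return counts


def rank_of(word, counts):
    """Return the 1-based rank of word by descending count, or None if absent."""
    if word not in counts:
        return None
    ordered = sorted(counts, key=counts.get, reverse=True)
    return ordered.index(word) + 1


# Placeholder data for illustration only; these are NOT real corpus counts.
sample = ["alpha\t300", "beta\t200", "gamma\t100"]
counts = load_counts(sample)
print(rank_of("beta", counts))

# With a downloaded copy of the file you could do, for example:
#   with open("count_1w.txt", encoding="utf-8") as f:
#       counts = load_counts(f)
#   print(rank_of("information", counts), rank_of("when", counts))
```

Sorting the whole dict on every lookup is wasteful for repeated queries, but for a one-off comparison of two words it keeps the sketch short.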

Re: letters and state abbreviations - you're more than welcome to take them out if you like... it's not that hard. I'm using this list for typing training, so I left them in.
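For anyone who does want them out, here's roughly what that could look like. This is a sketch, not something that ships with the repo: the `filter_words` name, the short `STATE_ABBREVIATIONS` set, and the choice to keep "a" and "i" as real one-letter words are all assumptions you'd adjust to taste.

```python
# Hypothetical, incomplete exclusion set; extend it with the abbreviations you want dropped.
STATE_ABBREVIATIONS = {"sd", "nd", "ca", "ny", "tx"}


def filter_words(words, excluded=STATE_ABBREVIATIONS, keep_single=("a", "i")):
    """Drop single letters (except those in keep_single) and excluded entries.

    The original order of the list is preserved, so the result is still
    sorted by frequency if the input was.
    """
    kept = []
    for word in words:
        w = word.strip().lower()
        if not w:
            continue  # skip blank lines
        if len(w) == 1 and w not in keep_single:
            continue  # single letters like "x"
        if w in excluded:
            continue  # abbreviations like "sd"
        kept.append(w)
    return kept


print(filter_words(["the", "x", "sd", "a", "when"]))
```

Note that a blanket exclusion set will also remove abbreviations that double as ordinary words (two-letter codes such as "in" or "or"), so it's worth listing only the ones you actually consider noise.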