first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

Why are there ~1500 duplicate words here? #6

Closed: farzher closed this issue 8 years ago

farzher commented 9 years ago

Shouldn't the list be deduplicated?

kylemcdonald commented 9 years ago

Yes, it looks like 20k.txt has 1470 duplicates, and the USA file has 10:

$ wc -l < 20k.txt 
   19999
$ sort 20k.txt | uniq | wc -l
   18529
$ wc -l < google-10000-english-usa.txt 
    9999
$ sort google-10000-english-usa.txt | uniq | wc -l
    9989
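
A quick follow-up using the same standard tools, to see which words actually repeat:

$ sort 20k.txt | uniq -d | head                # sample of the duplicated words
$ sort 20k.txt | uniq -c | sort -rn | head     # most-repeated words, with counts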
whitten commented 9 years ago

I don't know. Is it a case-sensitivity issue? Do sort and uniq keep only one of "this" and "This"?
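
One way to test that theory, assuming GNU or BSD coreutils: fold case before deduplicating and compare against the case-sensitive count.

$ sort -f 20k.txt | uniq -i | wc -l            # unique count, ignoring case

If this still prints 18529, no two entries differ only by case, so the duplicates are exact.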

farzher commented 9 years ago

It's not a case issue. It's exact duplicates. Check using any random dedupe tool.

[screenshot: dedupe tool output on the word list]

Apparently "word" is in there 9 times.
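
That's easy to spot-check with grep (exact whole-line match):

$ grep -cx word 20k.txt                        # count lines that are exactly "word"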

koseki commented 8 years ago

It seems 20k.txt combines two different sources.

I checked the frequency ranks of the words in 20k.txt, and this is the result:

[plot freq-g20k: word frequency by list position for 20k.txt, with a visible discontinuity]

The original count_1w.txt produces a straight graph:

[plot freq: word frequency by rank for the original count_1w.txt]

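A non-graphical check of the two-sources theory: if 20k.txt really is two ranked lists concatenated, the first repeated word should not appear until the second list begins, i.e. thousands of lines in.

$ awk 'seen[$0]++ { print NR; exit }' 20k.txt  # line number of the first repeat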

worldlywisdom commented 8 years ago

Great catch - not sure why the original source has duplicates. I appreciate the fix.
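
For anyone who needs to fix this locally, a minimal order-preserving dedupe (the output filename is just an example):

$ awk '!seen[$0]++' 20k.txt > 20k-deduped.txt  # keep the first occurrence of each line

Unlike sort | uniq, this keeps the words in their original frequency order.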