dimazest / google-ngram-downloader

Google published new ngrams, 20200217 #22

Open lahosken opened 3 years ago

lahosken commented 3 years ago

See https://storage.googleapis.com/books/ngrams/books/datasetsv3.html . As an example URL, one file of n-grams is at http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz

7shoe commented 3 years ago

Indeed. I tried to debug this: the necessary code changes seem to be restricted to util.py, but new problems arise. Let me use the German v3 2-grams (version 20200217) as a reference. The challenges are:

  1. There is currently no variable in the code to pick the version.
  2. The URLs the data is downloaded from follow a different naming scheme, see your ...00016-of-00024.gz URL above.
  3. The line structure of the Google n-gram v3 files seems to have changed compared to v2.

Changing version = '20120701' to another value in def iter_google_store(...) in util.py causes a new bug: the file template no longer matches. For the new version it should be FILE_TEMPLATE_GER_NEW = '{ngram_len}-{index}-of-{full_number}.gz' instead of
FILE_TEMPLATE = 'googlebooks-{lang}-all-{ngram_len}gram-{version}-{index}.gz'. In addition, the assert len(data) == 4 in def readline_google_store has to be commented out.
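As a minimal sketch of the new naming scheme, the example URL from the first comment can be reproduced from a template. The names `URL_TEMPLATE_V3` and `v3_url` are hypothetical and not part of the library, and the zero-padding to five digits is inferred from that single URL only.

```python
# Hypothetical template, inferred from the one example URL quoted above.
URL_TEMPLATE_V3 = (
    'http://storage.googleapis.com/books/ngrams/books/'
    '{version}/{lang}/{ngram_len}-{index:05d}-of-{total:05d}.gz'
)


def v3_url(lang, ngram_len, index, total, version='20200217'):
    """Build the URL of one v3 n-gram file (hypothetical helper)."""
    return URL_TEMPLATE_V3.format(
        version=version,
        lang=lang,
        ngram_len=ngram_len,
        index=index,
        total=total,
    )


print(v3_url('eng', 1, 16, 24))
# http://storage.googleapis.com/books/ngrams/books/20200217/eng/1-00016-of-00024.gz
```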

In def iter_google_store(...) we need to get full_number right to build the proper file URLs. This number depends on the language lang (German in my case) and on ngram_len; for German it is:

#version = '20120701'
version = '20200217'  # New: v3
session = requests.Session()

# Case-by-case lookup of the total number of files per n-gram length.
# Note: every branch must be part of one if/elif chain, otherwise the
# final else overwrites the value set for ngram_len == 1.
if version == '20200217' and ngram_len == 1:
    full_number = '00008'
elif version == '20200217' and ngram_len == 2:
    full_number = '00181'
elif version == '20200217' and ngram_len == 3:
    full_number = '01369'
elif version == '20200217' and ngram_len == 4:
    full_number = '01003'
elif version == '20200217' and ngram_len == 5:
    full_number = '02262'
else:
    full_number = 0
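The same lookup could also be written as a table, which avoids the long `if`/`elif` chain and fails loudly for combinations that have not been checked. This is only a sketch: `V3_FILE_COUNTS` and `file_count` are hypothetical names, and the table contains just the German counts listed above.

```python
# File counts keyed by (version, lang, ngram_len).
# Only the German v3 values quoted in this comment are included;
# other languages would need their own entries.
V3_FILE_COUNTS = {
    ('20200217', 'ger', 1): 8,
    ('20200217', 'ger', 2): 181,
    ('20200217', 'ger', 3): 1369,
    ('20200217', 'ger', 4): 1003,
    ('20200217', 'ger', 5): 2262,
}


def file_count(version, lang, ngram_len):
    """Return the zero-padded file count as used in v3 file names."""
    key = (version, lang, ngram_len)
    if key not in V3_FILE_COUNTS:
        raise ValueError('unknown file count for {!r}'.format(key))
    return '{:05d}'.format(V3_FILE_COUNTS[key])


print(file_count('20200217', 'ger', 2))  # 00181
```

Raising an error for unknown combinations, rather than falling back to 0, makes a missing entry visible before any download is attempted.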

Printing a line of the old version (20120701) yields 0 0005_NUM 1901 1 1, which has 4 fields (n-gram, year, count, publications: here the 2-gram 0 0005_NUM followed by three numbers), as asserted in the code and mentioned in the documentation. The first line of the new version has 29 entries, though. It took me some time to figure out that the entries after the n-gram are all year/count/publication triplets, e.g. '1929,1,1', '1930,5,3', etc.
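To illustrate the line structure, here is a small sketch that expands one v3 line into one record per year, i.e. into the shape a v2 line already had. It assumes the fields are tab-separated; `YearRecord` and `iter_v3_line` are hypothetical names, and the n-gram `foo bar` in the sample line is made up (only the two triplets come from the description above).

```python
from collections import namedtuple

# Same four fields as a v2 line: n-gram, year, count, publications.
YearRecord = namedtuple('YearRecord', 'ngram year count publications')


def iter_v3_line(line):
    """Yield one record per year from a single v3 line.

    Assumes tab-separated fields: the n-gram first, then any number
    of 'year,count,publications' triplets.
    """
    fields = line.rstrip('\n').split('\t')
    ngram = fields[0]
    for triplet in fields[1:]:
        year, count, publications = map(int, triplet.split(','))
        yield YearRecord(ngram, year, count, publications)


# Made-up sample line with the two triplets mentioned above.
sample = 'foo bar\t1929,1,1\t1930,5,3\n'
for record in iter_v3_line(sample):
    print(record)
```

Keeping one record per year preserves the yearly information, at the cost of yielding several records per line.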

I summed the counts/publications up across years and used the first year of appearance as the year, i.e.

ngram = data[0]
if version == '20200217' and lang == 'ger':
    # v3: every field after the n-gram is a 'year,count,publications' triplet.
    triplets = [[int(x) for x in data_loc.split(',')] for data_loc in data[1:]]
    min_year = min(y for y, _, _ in triplets)  # first year of appearance
    count = sum(c for _, c, _ in triplets)     # counts summed across years
    pubs = sum(p for _, _, p in triplets)      # publications summed across years
    other = [min_year, count, pubs]
else:
    # older version (v2/v1)
    assert len(data) == 4
    other = map(int, data[1:5])

yield Record(ngram, *other)

Note that this branch is only taken for the German v3 n-grams (i.e. version = '20200217').

dimazest commented 3 years ago

Thanks for the analysis. I'll have a look at what v3 has to offer.

Pull requests are welcome.