first20hours / google-10000-english

This repo contains a list of the 10,000 most common English words in order of frequency, as determined by n-gram frequency analysis of Google's Trillion Word Corpus.

Unclear license #11

l0b0 closed 7 years ago

l0b0 commented 7 years ago

LICENSE.md doesn't actually contain a license, but rather attribution. Is any or all of this material in the public domain, all rights reserved, or something in between?

worldlywisdom commented 7 years ago

Good question.

The original list with word frequencies was published here: http://norvig.com/ngrams/. On that page, Norvig states:

"Code copyright (c) 2008-2009 by Peter Norvig. You are free to use this code under the MIT license."

It's not clear to me whether or not the word lists count as "code."

The full original 1T 5-gram corpus is distributed by the Linguistic Data Consortium here: https://catalog.ldc.upenn.edu/LDC2006T13.

The LDC has a license which allows for "limited excerpts from the Data" for "linguistic education and research," which appears to make Norvig's use (and by extension, this repo) acceptable for non-commercial purposes.

Bottom line: if you intend to use this for commercial purposes, I'd recommend getting a license from the LDC for the full corpus. Personal non-commercial use should be okay.