Source of language corpus

Mimino666 / langdetect

Port of Google's language-detection library to Python.

Source of language corpus #69

Open DonaldTsang opened 4 years ago

DonaldTsang commented 4 years ago

Where is the source text dataset for the n-gram profiles of the 55 supported languages? I would like to see whether it differs from the UDHR corpus discussed in https://github.com/wooorm/franc/issues/78, and whether it gives more accurate results.

Apparently the profiles were built from Wikipedia, but it is not documented how.
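
For anyone wanting to compare the two, here is a minimal sketch of summarizing one n-gram profile. It assumes each profile is a JSON object with `name`, `n_words`, and `freq` keys, which is my understanding of the format and should be checked against the shipped files. The sample data and the helper names `summarize_profile` and `top_ngrams` are invented for illustration. A profile like this holds only counts, so it can show how two profiles differ but not which text they were built from.

```python
import json
from collections import Counter

# Invented sample in the ASSUMED profile layout; not real langdetect data.
SAMPLE_PROFILE = json.dumps({
    "name": "xx",
    "n_words": [100, 80, 60],
    "freq": {"a": 40, "b": 10, "ab": 25, "th": 30, "abc": 5},
})


def summarize_profile(raw):
    """Summarize a profile given as a JSON string (hypothetical helper)."""
    profile = json.loads(raw)
    # Sum the n-gram counts separately for each n-gram length.
    count_by_length = Counter()
    for ngram, count in profile["freq"].items():
        count_by_length[len(ngram)] += count
    return {
        "name": profile["name"],
        "n_words": profile["n_words"],
        "distinct_ngrams": len(profile["freq"]),
        "count_by_length": dict(count_by_length),
    }


def top_ngrams(raw, length, k=3):
    """Return the k most frequent n-grams of the given length."""
    freq = json.loads(raw)["freq"]
    matching = [(g, c) for g, c in freq.items() if len(g) == length]
    # Sort by count, highest first; break ties alphabetically.
    return sorted(matching, key=lambda item: (-item[1], item[0]))[:k]


if __name__ == "__main__":
    print(summarize_profile(SAMPLE_PROFILE))
    print(top_ngrams(SAMPLE_PROFILE, length=2))
```

To try this on a real profile, read the file's contents into a string and pass that in place of `SAMPLE_PROFILE`. Where the profile files live inside the installed package is something I have not verified.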