chattylabs / language-detector

Package to detect the language of a given text (focusing on short SMS-style text used in tweets, Facebook, WhatsApp, etc.)
Apache License 2.0

source of language corpus #3

Open DonaldTsang opened 4 years ago

DonaldTsang commented 4 years ago

Where is the source text dataset for the n-grams of those 73 languages? I would like to see whether it differs from the UDHR corpus discussed in wooorm/franc#78, and whether it gives more accurate results.

danielantelo commented 4 years ago

It is in data/resources, which contains thousands of tweets scraped using the script provided in the bin folder.

You could feed the datasets from franc to our scripts and see what they output. In our final implementation we supply anonymised WhatsApp messages instead, as we wanted to detect SMS-type text, but tweets were working well and are what we provide in the library.
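For anyone who wants a rough idea of how far apart two corpora are before running the real scripts, here is a minimal sketch in Python. It is not this library's code: the function names `trigram_profile` and `profile_overlap`, the `top_n` cutoff, and the sample sentences are all hypothetical, and character trigrams are assumed only because they are a common choice for this kind of detector.

```python
from collections import Counter


def trigram_profile(texts, top_n=300):
    """Return the top_n most frequent character trigrams across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, collapse whitespace, and pad with spaces so that
        # word boundaries contribute trigrams too.
        padded = " " + " ".join(text.lower().split()) + " "
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return [gram for gram, _ in counts.most_common(top_n)]


def profile_overlap(profile_a, profile_b):
    """Jaccard overlap of two profiles: 1.0 for identical sets, 0.0 for disjoint."""
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical stand-ins for a tweet-style corpus and a formal UDHR-style corpus.
tweets = ["omg cant wait 4 the weekend lol", "u coming 2nite? lol"]
formal = ["All human beings are born free and equal in dignity and rights."]

overlap = profile_overlap(trigram_profile(tweets), trigram_profile(formal))
print(f"trigram profile overlap: {overlap:.2f}")
```

A low overlap for the same language would suggest that profiles built from formal text and from SMS-style text really do differ, which is the comparison being asked about here. It says nothing about detection accuracy on its own; that would need labelled test messages.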

DonaldTsang commented 4 years ago

franc cites http://unicode.org/udhr/ as the basis for its system.