Open DonaldTsang opened 4 years ago
It is in data/resources, which contains thousands of tweets scraped using the script provided in the bin folder.
You could feed the datasets from franc to our scripts and see what they output. In our final implementation we used anonymised WhatsApp messages, because we wanted to detect SMS-style text, but tweets worked well and are what we provide in the library.
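As a rough illustration of that kind of comparison, here is a minimal sketch that builds a character trigram profile from a list of texts and measures how much two profiles overlap. The function names `trigram_profile` and `profile_overlap`, the `top_n` cutoff, and the Jaccard overlap measure are assumptions made for this example; they are not the scripts in the bin folder and not franc's method.

```python
from collections import Counter


def trigram_profile(texts, top_n=300):
    """Return the top_n most frequent character trigrams across texts.

    Hypothetical helper: the cutoff of 300 is an arbitrary choice.
    """
    counts = Counter()
    for text in texts:
        # Lowercase, collapse whitespace, and pad with spaces so that
        # word boundaries contribute trigrams too.
        padded = " " + " ".join(text.lower().split()) + " "
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return [gram for gram, _ in counts.most_common(top_n)]


def profile_overlap(profile_a, profile_b):
    """Jaccard overlap between two trigram profiles, from 0.0 to 1.0."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Running `trigram_profile` once on the tweets and once on the UDHR text for the same language, then passing both results to `profile_overlap`, would give a first impression of how different the two sources are. It says nothing about detection accuracy, which would need a labelled test set.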
franc cites http://unicode.org/udhr/ as the basis for its system.
Where is the source text dataset for the n-grams of those 73 languages? I would like to see whether it differs from the UDHR data discussed in wooorm/franc#78, and whether it gives more accurate results.