No additional input needed presently. Just need to do this.
Later tweets have a "lang" identifier detected by Twitter. The accuracy of this field is questionable and it is not available for all data.
I plan to use the Compact Language Detector 2 (CLD2), which is an open source language detection library released by Google. It is what powers the language detection of webpages in Chrome. Previous research has shown it to be fairly accurate on tweets, provided that URLs, hashtags, and @mentions are removed first.
Running CLD on each cleaned tweet will give the following output fields:
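The cleaning step could look something like the sketch below. The regular expressions and the function name clean_tweet are my own illustrative assumptions, not taken from any existing pipeline, and real tweets may need more careful handling (for example, email addresses or unusual URL forms).

```python
import re

# Hypothetical patterns for illustration; tune them against real tweet data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text: str) -> str:
    """Remove URLs, @mentions, and hashtags, then collapse whitespace."""
    # URLs go first so that a '#' or '@' inside a URL is removed with it.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

For example, clean_tweet("Hello @bob check https://t.co/abc #fun world") returns "Hello check world". Dropping hashtags entirely (rather than keeping the word after the #) follows the description above, though it does discard some text that might carry a language signal.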
cld_reliable -- does CLD consider its language detection reliable? (true|false)
cld_bytes -- the number of bytes of text
cld_lang1 -- the most prevalent detected language
cld_lang1_percent -- the percentage of cld_bytes detected to be in the most prevalent language
cld_lang2 -- the second most prevalent detected language
cld_lang2_percent -- the percentage of cld_bytes detected to be in the second most prevalent language
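As a rough sketch, the fields above could be filled in from a detection result like this. It assumes the result has the shape I believe the pycld2 Python binding returns from its detect call: a reliability flag, a byte count, and a ranked sequence of (language name, language code, percent, score) tuples. That shape should be checked against whichever CLD binding is actually used. The function name to_cld_fields and the choice to store language codes rather than names are hypothetical.

```python
from typing import Any, Dict, Sequence, Tuple

# Assumed shape of one ranked entry: (language name, language code, percent, score).
LangDetail = Tuple[str, str, int, float]

# Placeholder used when fewer than two languages are reported (an assumption).
UNKNOWN: LangDetail = ("Unknown", "un", 0, 0.0)


def to_cld_fields(
    is_reliable: bool,
    bytes_found: int,
    details: Sequence[LangDetail],
) -> Dict[str, Any]:
    """Map a detection result onto the cld_* output fields."""
    # Keep the top two entries and pad so both slots always exist.
    top = list(details[:2])
    top += [UNKNOWN] * (2 - len(top))
    return {
        "cld_reliable": bool(is_reliable),
        "cld_bytes": bytes_found,
        "cld_lang1": top[0][1],
        "cld_lang1_percent": top[0][2],
        "cld_lang2": top[1][1],
        "cld_lang2_percent": top[1][2],
    }
```

With a binding of that shape, the call would be roughly to_cld_fields(*detect(cleaned_text)), where detect stands in for the binding's detection function.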