computermacgyver / redhen_twitter

Processing and summarising Twitter data for RedHen

Finish language detection #1

Closed computermacgyver closed 5 years ago

computermacgyver commented 5 years ago

No additional input needed presently. Just need to do this.

Later tweets have a "lang" identifier detected by Twitter. The accuracy of this field is questionable and it is not available for all data.

I plan to use the Compact Language Detector 2 (CLD2), an open-source language detection library released by Google. It is what powers the language detection of webpages in Chrome. Previous research has shown it to be fairly accurate on tweets, provided URLs, hashtags, and @mentions are removed.
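As a rough sketch of the cleaning step mentioned above, the following strips URLs, hashtags, and @mentions from a tweet before it is handed to a language detector. The function name `clean_tweet` and the regular expressions are illustrative assumptions, not code from this repository, and the patterns are deliberately simple rather than a full implementation of Twitter's entity rules.

```python
import re

# Hypothetical patterns: simple approximations of Twitter entities.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text):
    """Remove URLs, @mentions, and hashtags, then collapse whitespace."""
    text = URL_RE.sub(" ", text)      # URLs first, so "#" or "@" inside a URL is not split
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_tweet("@user Bonjour le monde #salut https://t.co/abc123"))
```

The cleaned string would then be passed to whichever CLD2 binding is chosen. Tweets that consist only of links, mentions, and hashtags come back empty, so they probably need to be marked as undetermined rather than sent to the detector.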

The language detection step will give the following output: