SocialHarvest / harvester

The Social Harvest server that exposes an API and harvests data from the web to be analyzed.

Language detection #58

Open tmaiaroto opened 10 years ago

tmaiaroto commented 10 years ago

Many services lie. Well, they don't lie exactly. What happens is that users report their own locale/language in their profile (on Twitter, let's say). These people may actually speak, and post in, multiple languages. So you end up with a message tagged "en" that isn't really English.

Then you have people simply choosing the wrong locale (perhaps on purpose, perhaps not).

Sometimes the language doesn't even come back for certain networks. So you have no clue.

This has led to problems. I've looked up the top hashtags for certain things with the language condition set to "en" and back comes Japanese or something. Oftentimes you'll get a bunch of spam in another language, which gets in the way of the actual results.

So spam detection, fake account detection, that's important. It would be nice to (optionally) skip saving messages from those shady accounts (another ticket). But what is really needed is language detection.

There are various machine learning approaches for checking this. I'm not sure I'll need a full-blown neural network, but there should be something. That way, filtering results by "en" would truly show only English content.
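
To make the idea concrete, here is a rough sketch of the simplest kind of check that could sit in front of anything heavier. It is not existing harvester code: `LooksEnglish`, the stopword list, and both thresholds (0.9 and 0.2) are made up for illustration. It assumes a message is plausibly English when its letters are almost all Latin script and a reasonable share of its words are common English stopwords.

```go
package main

import (
	"strings"
	"unicode"
)

// englishStopwords is a small hand-picked sample (an assumption for this
// sketch). A real detector would use a larger list or n-gram profiles.
var englishStopwords = map[string]bool{
	"the": true, "and": true, "is": true, "of": true, "to": true,
	"in": true, "that": true, "it": true, "for": true, "with": true,
	"this": true, "you": true, "are": true, "was": true, "on": true,
}

// LooksEnglish is a hypothetical helper: it reports whether text is
// plausibly English, regardless of what the network claims.
func LooksEnglish(text string) bool {
	letters, latin := 0, 0
	for _, r := range text {
		if unicode.IsLetter(r) {
			letters++
			if unicode.Is(unicode.Latin, r) {
				latin++
			}
		}
	}
	// Reject text with no letters, or text that is mostly non-Latin script
	// (Japanese, Cyrillic, Arabic, and so on).
	if letters == 0 || float64(latin)/float64(letters) < 0.9 {
		return false
	}

	// Split on anything that is not a letter, so "#hashtag" becomes "hashtag".
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	if len(words) == 0 {
		return false
	}

	hits := 0
	for _, w := range words {
		if englishStopwords[w] {
			hits++
		}
	}
	// Require at least 20% of the words to be English stopwords.
	return float64(hits)/float64(len(words)) >= 0.2
}
```

A harvester could call something like this before saving a message and store the detected language next to the reported one, instead of trusting "en" from the profile. The obvious weakness is short messages: a tweet with three words and no stopwords gets rejected even when it is English, which is where a trained n-gram model would probably do better.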