Many services lie. Well, they don't lie exactly. What happens is that users report their own locale/language in their profile (on Twitter, let's say). Those people may actually speak multiple languages, and post in multiple languages. So you end up with a message tagged "en" that isn't really English.
Then you have people simply choosing the wrong locale (perhaps on purpose, perhaps not).
Sometimes the language doesn't even come back for certain networks. So you have no clue.
This has led to problems. I've looked up the top hashtags for certain topics with the language condition set to "en", and back comes Japanese or something. Oftentimes you'll get a bunch of spam in another language, which clutters the results.
So spam detection and fake account detection are important. It would be nice to (optionally) skip saving messages from those shady accounts (that's another ticket). But what is really needed is language detection.
There are various machine learning approaches to this. I'm not sure I'll need a full-blown neural network, but there should be something. That way, filtering results by "en" would truly show only English content.
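As a rough idea of what "something" short of a neural network could look like, here is a minimal sketch in Python. It is not a proposed implementation, just an illustration: it combines two cheap heuristics (the share of Latin-script letters and the share of common English words) to decide whether a message is plausibly English. The function names, the thresholds, and the tiny stopword list are all assumptions made up for this example.

```python
import unicodedata

# Hypothetical, deliberately tiny stopword list; a real one would be much larger.
ENGLISH_STOPWORDS = frozenset(
    "the a an and or but of to in on for with is are was were be it this that "
    "i you he she we they my your not have has do at from as by".split()
)


def latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name starts with LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters)


def stopword_ratio(text: str) -> float:
    """Fraction of whitespace-separated words found in ENGLISH_STOPWORDS."""
    words = [w.strip(".,!?;:\"'()#@").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words)


def looks_english(text: str, min_latin: float = 0.9,
                  min_stopwords: float = 0.15) -> bool:
    """Heuristic check; the thresholds are guesses and would need tuning."""
    return (latin_ratio(text) >= min_latin
            and stopword_ratio(text) >= min_stopwords)


if __name__ == "__main__":
    for sample in (
        "This is the best day of my life",
        "今日はいい天気ですね #sunny",
        "Das ist nicht mein Problem",
    ):
        print(looks_english(sample), sample)
```

The script check catches the "asked for en, got Japanese" case, and the stopword check catches other Latin-script languages. Very short messages (a lone hashtag or a link) will likely be misclassified, so a trained language detector would probably still be the better long-term answer. A check like this could also cover the case where a network returns no language at all.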