coverified / backend

Backend for the CoVerified Widget
BSD 3-Clause "New" or "Revised" License

Language detection in feed crawler #16

Open schliflo opened 4 years ago

schliflo commented 4 years ago

We currently serve all feed entries to users regardless of the article language. This leads to situations where users get served "mixed" content:

[Screenshot 2020-05-11 at 15 33 23]

This could be solved by using some kind of language detection. Ideally the API would provide a language filter argument or language-specific endpoints.

johanneshiry commented 4 years ago

From an API perspective, providing a language filter isn't a big deal. The harder part is determining the language from the headline alone when we cannot be sure that a whole feed is in a single language. (If every feed were single-language, it would be very easy: just add a language field to the database.)

I'll check whether the feeds contain mixed languages. If they do, it would make sense to discuss whether we want to spend some time on automated language detection or only use single-language feeds in the future.
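A rough way to run that check, as a sketch only: count the detected language of each headline in a feed and flag the feed as mixed when no single language dominates. The names `language_counts` and `is_mixed`, the `title` key, and the 0.9 threshold are assumptions for illustration, not part of the current backend. The `detect` argument is any callable that maps text to a language code or `None`.

```python
from collections import Counter
from typing import Callable, Iterable, Mapping, Optional


def language_counts(
    entries: Iterable[Mapping[str, str]],
    detect: Callable[[str], Optional[str]],
) -> Counter:
    """Count detected languages over the headlines of one feed."""
    counts: Counter = Counter()
    for entry in entries:
        # "title" is an assumed field name for the headline.
        title = (entry.get("title") or "").strip()
        if not title:
            continue  # nothing to run detection on
        lang = detect(title)
        if lang is not None:
            counts[lang] += 1
    return counts


def is_mixed(counts: Counter, dominant_share: float = 0.9) -> bool:
    """True if no single language covers at least `dominant_share` of entries."""
    total = sum(counts.values())
    if total == 0:
        return False  # no evidence either way
    top = counts.most_common(1)[0][1]
    return top / total < dominant_share
```

Feeds that come out as not mixed could get a single language field in the database. Only the mixed ones would need per-entry detection.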

schliflo commented 4 years ago

Maybe this lib is an easy solution for now: https://pypi.org/project/langdetect/
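For reference, a minimal sketch of how that library could be wrapped. `detect` and `DetectorFactory.seed` are part of langdetect's documented interface. The wrapper name `detect_language` and the optional `detector` parameter are assumptions added here so the helper can be tried with a stub and without the package installed.

```python
from typing import Callable, Optional


def detect_language(
    text: str,
    detector: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return a language code such as 'de', or None if detection fails."""
    if not text or not text.strip():
        return None
    if detector is None:
        # Imported lazily so a stub detector works without langdetect installed.
        from langdetect import DetectorFactory, detect

        # langdetect is non-deterministic unless a seed is set.
        DetectorFactory.seed = 0
        detector = detect
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException when the text has no usable
        # features (for example only digits); treat that as "unknown".
        return None
```

Short headlines give the detector little to work with, so results on a title alone may be less reliable than on title plus summary.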

johanneshiry commented 4 years ago

Just took a quick look, but it seems promising to me. Not sure it's worth investigating if we plan to fully overhaul the current backend implementation, though ...?!

schliflo commented 3 years ago

@johanneshiry maybe it's feasible to port the language detection logic used in https://github.com/coverified/platform_crawler - we basically only need to filter out all non-German entries
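The filtering step itself could look roughly like the sketch below. This is not the platform_crawler logic, just an illustration under assumptions: entries are dicts with hypothetical `title` and `summary` keys, and `detect` is any callable returning a language code or `None` (for example a thin wrapper around langdetect).

```python
from typing import Callable, Iterable, List, Mapping, Optional


def filter_by_language(
    entries: Iterable[Mapping[str, Optional[str]]],
    detect: Callable[[str], Optional[str]],
    wanted: str = "de",
    keep_unknown: bool = False,
) -> List[Mapping[str, Optional[str]]]:
    """Keep entries whose detected language equals `wanted`."""
    kept = []
    for entry in entries:
        # Join whatever text is available; "title" and "summary" are assumed keys.
        parts = (entry.get("title"), entry.get("summary"))
        text = " ".join(p for p in parts if p)
        lang = detect(text) if text.strip() else None
        if lang == wanted or (lang is None and keep_unknown):
            kept.append(entry)
    return kept
```

Whether entries with an undetectable language should be kept or dropped is a product decision. The sketch drops them by default and exposes `keep_unknown` for the other choice.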

johanneshiry commented 3 years ago

Any reason why we don't do a full switch to https://github.com/coverified/platform_crawler? Maybe that would make more sense. However, I could also provide a small fix here. What's your preferred solution?