Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
https://smartreader.inre.me
Apache License 2.0

Adding support for Language Identification #16

Open gabriele-tomassetti opened 4 years ago

gabriele-tomassetti commented 4 years ago

fastText would let us implement effective language identification with little space and few resources. This could be useful for the many pages that declare no language, and also for content that mixes multiple languages.

theolivenbaum commented 4 years ago

@gabriele-tomassetti @ftomassetti I'm the maintainer of an open-source C# NLP library that has two models for language detection: https://github.com/curiosity-ai/catalyst/. If you want, I can either port the code from there or add it as a dependency to cover this need.

gabriele-tomassetti commented 4 years ago

Thanks for your offer to help on this issue, too. Honestly, I was mostly looking at this issue as an excuse to work on an NLP library, but if your library can do it better and sooner, I see no reason not to use it.

I think we only have two requirements:

theolivenbaum commented 4 years ago

We could make it a callback that the user provides, and add an example to the Wiki showing how to use it with Catalyst.
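
A rough sketch of what the callback idea could look like. It is written in Python only to keep it short and language-neutral; every name in it (`LanguageDetector`, `Article`, `extract_article`, `toy_detector`) is hypothetical and is not part of SmartReader's or Catalyst's actual API. The assumption is that the library keeps any language the page already declares and calls the user-supplied detector only when none is declared.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical callback type: receives the extracted text and
# returns a language code (e.g. "en") or None if it cannot tell.
LanguageDetector = Callable[[str], Optional[str]]


@dataclass
class Article:
    text: str
    language: Optional[str] = None


def extract_article(
    text: str,
    declared_language: Optional[str] = None,
    detect_language: Optional[LanguageDetector] = None,
) -> Article:
    """Stand-in for the extraction step; only the language handling is shown."""
    language = declared_language
    # Fall back to the caller-supplied detector only when the page
    # declares no language and there is some text to inspect.
    if not language and detect_language is not None and text.strip():
        language = detect_language(text)
    return Article(text=text, language=language)


def toy_detector(text: str) -> Optional[str]:
    """Toy detector for illustration; a real one would wrap a library's model."""
    words = set(text.lower().split())
    if words & {"the", "and", "is"}:
        return "en"
    if words & {"il", "che", "è"}:
        return "it"
    return None


article = extract_article("the cat is here", detect_language=toy_detector)
print(article.language)
```

With this shape the reader library takes no dependency on any particular NLP package: the user wires in whichever detector they prefer, and a Wiki page can show one concrete wiring.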

gabriele-tomassetti commented 4 years ago

That's a really smart idea. I will work on it.