Adding support for Language Identification

Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla

https://smartreader.inre.me

Apache License 2.0


Adding support for Language Identification #16

Open gabriele-tomassetti opened 4 years ago

gabriele-tomassetti commented 4 years ago

fastText would let us implement effective language identification with little space and few resources. This can be useful for the large amount of content that declares no language, and also for content that contains multiple languages.

theolivenbaum commented 4 years ago

@gabriele-tomassetti @ftomassetti I'm the maintainer of an open-source C# NLP library that has two models for language detection (https://github.com/curiosity-ai/catalyst/). If you want, I can either port the code from there or add it as a dependency to cover this need.

gabriele-tomassetti commented 4 years ago

Thanks for your offer to help on this issue, too. Honestly, I was mostly looking at this issue as an excuse to work on an NLP library, but if your library can do it better and sooner, I see no reason not to use it.

I think we only have two requirements:

- We need to add it as a dependency rather than port the code, because there is no need to take on more code to maintain if we can avoid it.
- We need to make sure that the library does not require much space. Otherwise, I think we would need to put this functionality in a separate NuGet package.

theolivenbaum commented 4 years ago

We could add it as a callback that the user provides, and add an example on the Wiki showing how to use it with Catalyst.
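
To make the callback idea concrete, here is a minimal sketch of the pattern. It is written in Python only to keep it short; SmartReader itself is a C# library, and every name below (`extract_article`, `LanguageDetector`, `toy_detector`) is hypothetical rather than part of SmartReader's or Catalyst's API. The point is that the extractor takes no dependency on any detection library: it only calls a function supplied by the user when the page declares no language.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type: any function that maps text to a language code (or None).
LanguageDetector = Callable[[str], Optional[str]]


@dataclass
class Article:
    text: str
    language: Optional[str] = None


def extract_article(
    text: str,
    declared_language: Optional[str] = None,
    detect_language: Optional[LanguageDetector] = None,
) -> Article:
    """Build an Article, calling the user's detector only if no language is declared."""
    language = declared_language
    if not language and detect_language is not None:
        language = detect_language(text)
    return Article(text=text, language=language)


def toy_detector(text: str) -> Optional[str]:
    """Stand-in detector for illustration; a real one would wrap an NLP library."""
    padded = f" {text.lower()} "
    if " the " in padded:
        return "en"
    if " il " in padded:
        return "it"
    return None
```

With this shape, a caller who wants detection passes `detect_language=toy_detector` (or a wrapper around a real library), and a caller who does not want it pays no cost in package size, which would address the space requirement raised above.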

gabriele-tomassetti commented 4 years ago

That's a really smart idea. I will work on it.