Language Identification

DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Language Identification #34

Closed by gsingers 12 years ago

gsingers commented 12 years ago

Solr has a nice language detection module that is pluggable with detectors from Tika, IBM, etc. It would be nice if we could hook this into the Hadoop side of things.

jnioche commented 12 years ago

I was thinking that we should make the language identification pluggable in Tika. Its current resources are both slow and inaccurate, and there are better alternatives. We could indeed also add the language detection directly as a Behemoth module.

gsingers commented 12 years ago

On Mar 20, 2012, at 5:26 AM, Julien Nioche wrote:

> I was thinking that we should make the language identification pluggable in Tika. Its current resources are both slow and inaccurate and there are better alternatives. We could also add the language detection directly as a Behemoth module indeed

We made Solr's pluggable as well, so I think this makes sense.

Grant Ingersoll http://www.lucidimagination.com

gsingers commented 12 years ago

http://code.google.com/p/language-detection/ is a pretty good LangId tool that we also use in Solr.

jnioche commented 12 years ago

I have started a discussion in Tika land about making the language detection pluggable but it might take a bit of time. Having a wrapper for this library should not be too difficult and would provide the functionality until we get it from Tika for free.
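
To illustrate what "pluggable" could mean here, the following is a minimal sketch in Python (chosen for brevity; Behemoth itself is Java). Everything in it is hypothetical: the registry, the `register_detector` and `annotate_language` names, the dict-based document shape, and the toy stopword backend are assumptions made for illustration, not the API of Behemoth, Tika, Solr, or the language-detection library. A real wrapper would register a backend that delegates to one of those libraries instead.

```python
from typing import Callable, Dict, Optional

# A detector backend maps raw text to a language code, or None if undecided.
Detector = Callable[[str], Optional[str]]

# Hypothetical registry: backends are looked up by name, so they can be swapped.
_DETECTORS: Dict[str, Detector] = {}


def register_detector(name: str, detector: Detector) -> None:
    """Make a detector backend available under the given name."""
    _DETECTORS[name] = detector


# Toy stopword lists, purely illustrative; far too small for real use.
_STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "fr": {"le", "la", "et", "est"},
    "de": {"der", "die", "und", "ist"},
}


def stopword_detector(text: str) -> Optional[str]:
    """Toy backend: pick the language whose stopwords occur most often."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in _STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Return None rather than guess when no stopword matched at all.
    return best if scores[best] > 0 else None


register_detector("stopwords", stopword_detector)


def annotate_language(doc: dict, backend: str = "stopwords") -> dict:
    """Return a copy of doc with metadata['lang'] set by the chosen backend."""
    lang = _DETECTORS[backend](doc["text"])
    metadata = dict(doc.get("metadata", {}))
    if lang is not None:
        metadata["lang"] = lang
    return {**doc, "metadata": metadata}
```

Calling `annotate_language({"text": "le chat est sur la table"})` goes through whichever backend is registered under the default name, so swapping the detector only requires another `register_detector` call and leaves the calling code unchanged.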

gsingers commented 12 years ago

Cool. It's been a while since I've been in Tika land, but would be great to have it. Naturally, we could very easily add it here, too.

jnioche commented 12 years ago

Implemented in the language-id module.
