Closed gsingers closed 12 years ago
I was thinking that we should make language identification pluggable in Tika. Its current resources are both slow and inaccurate, and there are better alternatives. Indeed, we could also add language detection directly as a Behemoth module.
On Mar 20, 2012, at 5:26 AM, Julien Nioche wrote:
> I was thinking that we should make language identification pluggable in Tika. Its current resources are both slow and inaccurate, and there are better alternatives. Indeed, we could also add language detection directly as a Behemoth module.
We made Solr's pluggable as well, so I think this makes sense.
Grant Ingersoll http://www.lucidimagination.com
http://code.google.com/p/language-detection/ is a pretty good LangId tool that we also use in Solr.
I have started a discussion in Tika land about making the language detection pluggable but it might take a bit of time. Having a wrapper for this library should not be too difficult and would provide the functionality until we get it from Tika for free.
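To make the "pluggable" idea concrete, here is a minimal sketch of what such a wrapper could look like. All names (`LanguageIdentifier`, `StopwordLanguageIdentifier`, `LanguageAnnotator`) are hypothetical and do not come from Tika, Solr, Behemoth, or the language-detection library. The stopword counter is only a toy stand-in so the example runs on its own; a real implementation would delegate to the chosen library behind the same interface.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical plug-in point: any detector can sit behind this interface.
interface LanguageIdentifier {
    // Returns a language code such as "en", or "unknown" if nothing matches.
    String identify(String text);
}

// Toy detector (an assumption for illustration): counts stopword hits per language.
class StopwordLanguageIdentifier implements LanguageIdentifier {
    private final Map<String, Set<String>> stopwords = new LinkedHashMap<>();

    StopwordLanguageIdentifier() {
        stopwords.put("en", new HashSet<>(Arrays.asList("the", "and", "is", "of")));
        stopwords.put("fr", new HashSet<>(Arrays.asList("le", "la", "et", "est")));
    }

    @Override
    public String identify(String text) {
        String best = "unknown";
        int bestHits = 0;
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        for (Map.Entry<String, Set<String>> entry : stopwords.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            // Keep the language with the most stopword hits seen so far.
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}

// The processing step depends only on the interface, so detectors can be swapped.
class LanguageAnnotator {
    private final LanguageIdentifier identifier;

    LanguageAnnotator(LanguageIdentifier identifier) {
        this.identifier = identifier;
    }

    // Prefixes the text with its detected language code, tab-separated.
    String annotate(String text) {
        return identifier.identify(text) + "\t" + text;
    }
}
```

Because the interface has a single method, a different detector can be plugged in without touching `LanguageAnnotator`, for example `new LanguageAnnotator(new StopwordLanguageIdentifier())` today and a library-backed implementation later.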
Cool. It's been a while since I've been in Tika land, but it would be great to have it. Naturally, we could very easily add it here, too.
Implemented in module language-id
Solr has a nice language detection module that is pluggable with detectors from Tika, IBM, etc. It would be nice if we could hook this into the Hadoop side of things.