DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Mahout : add Lucene Tokenisation #20

Closed: jnioche closed this issue 12 years ago

jnioche commented 13 years ago

The Lucene tokenisation has been replaced with annotation types/values taken from the Behemoth documents. It would be good to add the Lucene tokenisation back, as in the original Mahout class, so that users who need Behemoth mostly for converting from Nutch or parsing with Tika don't have to use the GATE or UIMA modules just to get tokens.
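
A rough sketch of the fallback being asked for, not the actual Behemoth or Mahout code: when no annotation type is configured, tokens come straight from the document text; otherwise they come from the annotation values. The class `TokenSourceSketch`, the `Annotation` holder and both method names are hypothetical, and a simple regex tokenizer stands in for the Lucene analyzer so the example has no dependencies beyond the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSourceSketch {

    /** Hypothetical stand-in for a Behemoth annotation: a type plus one feature value. */
    static final class Annotation {
        final String type;
        final String value;

        Annotation(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Runs of letters and digits count as one token. */
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    /**
     * Plain-text fallback. Assumption: in the real class a Lucene Analyzer
     * would do this job; the regex is only a dependency-free placeholder.
     */
    static List<String> tokenizeText(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    /**
     * Uses annotation values when an annotation type is configured,
     * otherwise tokenizes the raw text so GATE/UIMA are not required.
     */
    static List<String> tokens(String text, List<Annotation> annotations, String annotationType) {
        if (annotationType == null || annotationType.isEmpty()) {
            return tokenizeText(text);
        }
        List<String> tokens = new ArrayList<>();
        for (Annotation a : annotations) {
            if (annotationType.equals(a.type)) {
                tokens.add(a.value);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Annotation> none = new ArrayList<>();
        // No annotation type configured: falls back to text tokenisation.
        System.out.println(tokens("Parsed with Tika, converted from Nutch.", none, null));
    }
}
```

With this shape, a document that only went through the Nutch converter or the Tika parser, and so carries no token annotations, still yields tokens for vectorisation.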

jnioche commented 12 years ago

done