DigitalPebble / behemoth

Behemoth is an open-source platform for large-scale document analysis based on Apache Hadoop.

Tika Components #8

gsingers closed 14 years ago

gsingers commented 14 years ago

I've got some components for a MapReduce job for dealing with rich documents and converting them to Behemoth docs.

jnioche commented 14 years ago

Are your components a driver and mapper using the TikaProcessor? We could definitely add them to the tika package in the same way as we have a driver and mapper for the other components. Please clone and send a pull request. Thanks!

gsingers commented 14 years ago

Yes, they do use TikaProcessor, and they should make it easy for people to extend it too, since not everyone will just want text.

I've forked the repository and will push my changes to the fork soon, then submit a pull request.