DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Upgrade to Mahout 0.9 #52

Closed by lewismc 9 years ago

lewismc commented 9 years ago

A current limitation of using Mahout is that it requires Hadoop 0.20.203 to be installed. This is a Mahout-specific dependency. I propose upgrading to Mahout 0.9, as this will enable use of Hadoop 1.2.1, in line with the current Hadoop dependency in Behemoth.

lewismc commented 9 years ago

Hi Julien, I am working on this. Two issues I am having are as follows

I'm trying to find some of this stuff out over on user@mahout

jnioche commented 9 years ago

Hi Lewis. I knew Mahout 0.9 was not a straightforward upgrade. Unless you have a specific need for Mahout with Behemoth, I wouldn't bother too much. Mahout's scope has changed a lot and IMHO they've given up trying to compete with Spark MLlib, which makes a lot of sense. I was considering removing the Mahout module altogether and instead finding a way of generating an output that could be ingested by Spark.

I haven't had the time to work on Azazello as much as I'd love to. I expect it would replace Behemoth and would then interact nicely with the Spark MLlib stuff.

Of course if you definitely need Mahout 0.9, I'd be happy to keep the Mahout module a bit longer.

Thanks

lewismc commented 9 years ago

I'm going to finish the upgrade, then I will most likely also move on to learning more about Azazello. Thanks for the reference.

jnioche commented 9 years ago

Thanks. You'll find that Azazello is pretty empty; hopefully it will get worked on at some point.