DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Conversion of Sequence Files #32

Closed gsingers closed 12 years ago

gsingers commented 12 years ago

This pull request adds an IO job that converts Writables into sequence files containing Behemoth docs. I started sprinkling in a bit more of Mahout's Hadoop utilities, as I find they save a lot of boilerplate code in terms of job creation.

gsingers commented 12 years ago

Hi Julien,

Starting to use this a bit more... commit 17d992b converts the CorpusGenerator to use Paths, so that, in theory, one could have the original content stored in HDFS/S3, etc. and have it convert over to BehDocs.

Also, I took a few liberties by sprinkling in usage of the new mapreduce APIs, as well as changes in various other places. Not sure if it fits with where you see the lib going.

-Grant

gsingers commented 12 years ago

I should add, I also changed CorpusGenerator to be programmatically accessible instead of having to invoke solely via main().

Also, I used the new map reduce APIs. Not sure if this is a good thing. Probably should stick to the old ones.

jnioche commented 12 years ago

Hi Grant,

CorpusGenerator to use Paths = excellent idea! Will have a closer look

Moving to the new API is planned indeed. Why do you think we should stick to the old ones?

Am a bit reluctant to introduce a dependency on Mahout in the IO module but will have a look at the code first.

BTW don't know if you use Eclipse but there is an eclipse-format.xml file that I use for Behemoth.

gsingers commented 12 years ago

On Feb 23, 2012, at 5:32 AM, Julien Nioche wrote:

> Hi Grant,
>
> CorpusGenerator to use Paths = excellent idea! Will have a closer look

Cool. Glad it is helpful.

> Moving to the new API is planned indeed. Why do you think we should stick to the old ones?

No reason, other than they are undeprecated. Mahout's Hadoop utils mostly assume the new API, so I tend to favor/think in those terms.

> Am a bit reluctant to introduce a dependency on Mahout in the IO module but will have a look at the code first.

I'm working on getting the necessary bits decoupled from Mahout into a standalone jar. We have some useful utilities for dealing with Hadoop.

> BTW don't know if you use Eclipse but there is an eclipse-format.xml file that I use for Behemoth.

IntelliJ. I use the same format that we use for Lucene/Solr/Mahout, etc. I'll try to avoid reformatting.

gsingers commented 12 years ago

Added reporting capabilities to CorpusGenerator, as I need to be able to report how many files were added.
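
For readers following the thread, here is a rough sketch of the two CorpusGenerator changes mentioned above: making the generator callable from code rather than only via main(), and reporting how many files were added. It is not the code from this pull request. The class name CorpusGeneratorSketch and the Reporter interface are hypothetical, it uses java.nio.file.Path instead of Hadoop's Path so that it runs without any dependencies, and it writes a plain tab-separated listing rather than a sequence file of Behemoth docs.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.stream.Stream;

/** Hypothetical stand-in for illustration; NOT Behemoth's actual CorpusGenerator. */
class CorpusGeneratorSketch {

    /** Hypothetical callback, standing in for whatever reporting hook the real job uses. */
    interface Reporter {
        void fileAdded(Path file);
    }

    private final Path input;
    private final Path output;

    CorpusGeneratorSketch(Path input, Path output) {
        this.input = input;
        this.output = output;
    }

    /**
     * Walks the input directory and writes one "uri<TAB>size" line per regular file.
     * The output file is assumed to live outside the input directory.
     *
     * @return the number of files added to the output
     */
    long generate(Reporter reporter) throws IOException {
        long count = 0;
        try (Stream<Path> files = Files.walk(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            Iterator<Path> it = files.filter(Files::isRegularFile).sorted().iterator();
            while (it.hasNext()) {
                Path file = it.next();
                out.write(file.toUri() + "\t" + Files.size(file));
                out.newLine();
                reporter.fileAdded(file); // report each file as it is added
                count++;
            }
        }
        return count;
    }

    /** Command-line entry point that only delegates to generate(). */
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: CorpusGeneratorSketch <inputDir> <outputFile>");
            System.exit(1);
        }
        CorpusGeneratorSketch generator =
                new CorpusGeneratorSketch(Paths.get(args[0]), Paths.get(args[1]));
        long added = generator.generate(file -> { });
        System.out.println(added + " files added");
    }
}
```

Keeping main() as a thin wrapper around generate() means other code can drive the generator directly and read the returned count, which is the kind of use described in the comments above.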

jnioche commented 12 years ago

Have merged your commits, thanks Grant