gsingers closed 12 years ago
Hi Julien,
Starting to use this a bit more. Commit 17d992b converts the CorpusGenerator to use Paths so that, in theory, the original content could be stored in HDFS, S3, etc. and be converted to BehDocs from there.
Also, I took a few liberties by sprinkling in usage of the new mapreduce APIs in various places. Not sure if it fits w/ where you see the lib going.
-Grant
I should add, I also changed CorpusGenerator to be programmatically accessible instead of having to invoke solely via main().
Also, I used the new map reduce APIs. Not sure if this is a good thing. Probably should stick to the old ones.
Hi Grant,
CorpusGenerator to use Paths = excellent idea! Will have a closer look
Moving to the new API is planned indeed. Why do you think we should stick to the old ones?
Am a bit reluctant to introduce a dependency on Mahout in the IO module but will have a look at the code first.
BTW don't know if you use Eclipse but there is an eclipse-format.xml file that I use for Behemoth.
On Feb 23, 2012, at 5:32 AM, Julien Nioche wrote:
> Hi Grant,
> CorpusGenerator to use Paths = excellent idea! Will have a closer look
Cool. Glad it is helpful.
> Moving to the new API is planned indeed. Why do you think we should stick to the old ones?
No reason, other than they are undeprecated. Mahout's Hadoop utils mostly assume the new API, so I tend to favor/think in those terms.
> Am a bit reluctant to introduce a dependency on Mahout in the IO module but will have a look at the code first.
I'm working on getting the necessary bits decoupled from Mahout into a standalone jar. We have some useful utilities for dealing with Hadoop.
> BTW don't know if you use Eclipse but there is an eclipse-format.xml file that I use for Behemoth.
IntelliJ. I use the same format that we use for Lucene/Solr/Mahout, etc. I'll try to avoid reformatting.
Added reporting capabilities to CorpusGenerator, as I need to be able to report how many files were added.
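For readers following along, here is a rough sketch of the two CorpusGenerator changes mentioned in this thread: a programmatic entry point with main() as a thin wrapper, and a count of added files returned to the caller. It is not the code from the commits. The class name `CorpusGeneratorSketch` and the method `generate()` are hypothetical, and `java.nio.file` stands in for Hadoop's `Path`/`FileSystem` so the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical sketch, not Behemoth's actual CorpusGenerator.
// java.nio.file is an assumed stand-in for Hadoop's Path/FileSystem.
class CorpusGeneratorSketch {
    private final Path input;

    CorpusGeneratorSketch(Path input) {
        this.input = input;
    }

    // Programmatic entry point: walks the input tree and returns the number
    // of regular files visited, so the caller can report how many were added.
    // A real implementation would convert each file to a document here.
    long generate() {
        try (Stream<Path> files = Files.walk(input)) {
            return files.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // main() only parses arguments and delegates to generate().
    public static void main(String[] args) {
        long added = new CorpusGeneratorSketch(Paths.get(args[0])).generate();
        System.out.println(added + " files added");
    }
}
```

Calling code can then use `new CorpusGeneratorSketch(dir).generate()` directly and decide for itself how to log or surface the count.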
Have merged your commits, thanks Grant
This pull adds an IO job that can convert Writables over to sequence files containing Behemoth Docs. I started sprinkling in a bit more of Mahout's Hadoop utilities, as I find they save a lot of boilerplate code in terms of job creation.