DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

Common crawl #35

Closed by gsingers 12 years ago

gsingers commented 12 years ago

Here's the start of a job to convert Common Crawl (http://www.commoncrawl.org) data to BehemothDocument. It still needs to be tested, but I figured: request early, request often.

I'll have an update on this soon, since we still need to handle the fact that the input is gzipped.
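
For reference, handling gzipped input can be sketched with the plain JDK as below. This is only an illustration under stated assumptions: the class and method names (`GzipLines`, `readGzipLines`, `gzip`) are hypothetical, the sample records are made up, and nothing here is the Common Crawl or Behemoth API.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Behemoth or the Common Crawl code.
class GzipLines {

    /** Reads every line from a gzip-compressed stream, decoding it as UTF-8. */
    public static List<String> readGzipLines(InputStream compressed) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Demo helper: gzips a string in memory so the example needs no files. */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample content standing in for a compressed input file.
        byte[] data = gzip("first record\nsecond record\n");
        for (String line : readGzipLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

In a real job the `InputStream` would come from the input split rather than from memory; how that is wired up depends on the input format in use.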

jnioche commented 12 years ago

Apart from the unzip, doesn't the existing code for WARC files in the IO module do the job already?

gsingers commented 12 years ago

I seem to recall the formats being slightly different. Also, this version relies on the Common Crawl code itself, so I suppose it's guaranteed to support their format. That being said, I am having trouble getting it working with Behemoth at the moment due to some conflicts with the JetS3t library.
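
One generic way to investigate a library conflict like this is to print which jar a class is actually loaded from at runtime. The sketch below is an assumption-laden illustration, not a diagnosis of this particular problem: `WhichJar` and `locate` are hypothetical names, and `org.jets3t.service.S3Service` is used only as an example class to look up.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Behemoth.
class WhichJar {

    /** Returns the location a class was loaded from, or a short note if it cannot be determined. */
    public static String locate(String className) {
        try {
            // 'false' avoids running static initializers; we only want the location.
            Class<?> cls = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                // Typical for classes that ship with the JDK itself.
                return "(bootstrap or unknown location)";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // The default class name is an example; pass another name as the first argument.
        String name = args.length > 0 ? args[0] : "org.jets3t.service.S3Service";
        System.out.println(name + " -> " + locate(name));
    }
}
```

Running this with the same classpath as the failing job would show whether the class comes from the jar bundled with Hadoop or from another dependency.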

jnioche commented 12 years ago

Any luck with this? Did you find the source of the JetS3t problem?

gsingers commented 12 years ago

Haven't had time to track it down. I did recently upgrade Hadoop to 1.0.2, but haven't checked which version of JetS3t it ships with, so maybe the problem just goes away for me.

jnioche commented 12 years ago

Available as a separate repo, fully working and tested: https://github.com/DigitalPebble/behemoth-commoncrawl