Closed by gsingers 12 years ago
Apart from the unzip, doesn't the existing code for WARC files in the IO module do the job already?
I seem to recall the formats being slightly different. Also, this version relies on the Common Crawl code itself, so I suppose it's guaranteed to support their format. That being said, I am having trouble getting it working with Behemoth at the moment due to some conflicts with the JetS3t library.
Any luck with this? Did you find the source of the JetS3t problem?
Haven't had time to track it down. I did recently upgrade Hadoop to 1.0.2, but I haven't checked which version of JetS3t it ships with, so maybe the problem just goes away for me.
Available as a separate repo: https://github.com/DigitalPebble/behemoth-commoncrawl (fully working and tested).
Here's the start of a job to convert Common Crawl (http://www.commoncrawl.org) data to BehemothDocument. It still needs to be tested, but I figured: request early, request often.
I'll have an update on this soon, as we need to handle the fact that the input is gzip-compressed.