DigitalPebble / behemoth

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.

WARC converter to allow custom metadata #63

Closed by jnioche 6 years ago

jnioche commented 6 years ago

The WARC converter should accept custom metadata, similar to what the CorpusGenerator does.