dkpro / dkpro-c4corpus

DKPro C4CorpusTools is a collection of tools for processing the CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
https://dkpro.github.io/dkpro-c4corpus
Apache License 2.0

NullWritable as mapper's output key in Phase1 may slow things down #14

Closed: habernal closed this issue 8 years ago

habernal commented 8 years ago

See https://support.pivotal.io/hc/en-us/articles/202810986-Mapper-output-key-value-NullWritable-can-cause-reducer-phase-to-move-slowly

habernal commented 8 years ago

Confirmed. The job runs faster with intermediate compression enabled, and no single-reducer bottleneck for particular keys was observed (run over the entire CommonCrawl).
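For anyone wondering why a constant mapper output key could matter at all, here is a minimal plain-Java sketch (no Hadoop dependency). It reuses the arithmetic of Hadoop's default `HashPartitioner`, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The class `PartitionSkewSketch`, the helper `countPerReducer`, and the `ConstantKey` stand-in are hypothetical names invented for illustration, not code from this project. `ConstantKey` only assumes that a singleton key such as `NullWritable` hashes to the same value for every record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSkewSketch {

    // Same arithmetic as Hadoop's default HashPartitioner.
    static int partitionFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many records each reducer would receive for the given keys.
    static int[] countPerReducer(List<?> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (Object key : keys) {
            counts[partitionFor(key, numReducers)]++;
        }
        return counts;
    }

    // Hypothetical stand-in for a singleton key (assumption: constant hash code).
    static final class ConstantKey {
        @Override
        public int hashCode() {
            return 0;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof ConstantKey;
        }
    }

    public static void main(String[] args) {
        int numReducers = 4;
        List<Object> constantKeys = new ArrayList<>();
        List<Object> distinctKeys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            constantKeys.add(new ConstantKey());
            distinctKeys.add(Integer.valueOf(i));
        }
        // Every record with the constant key lands on reducer 0.
        System.out.println(Arrays.toString(countPerReducer(constantKeys, numReducers)));
        // prints "[1000, 0, 0, 0]"
        // Distinct integer keys spread evenly across reducers.
        System.out.println(Arrays.toString(countPerReducer(distinctKeys, numReducers)));
        // prints "[250, 250, 250, 250]"
    }
}
```

This only illustrates the partitioning skew the linked article warns about. Whether it actually becomes a bottleneck depends on the job, and per the comment above it was not observed here.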