What's the difference between Apache Nutch and the commoncrawl fork?

adulau commented 1 year ago

A small question: we were wondering why the Apache Foundation version of Nutch and the Common Crawl fork have diverged. Is there a plan to merge it back?

sebastian-nagel commented 1 year ago

Improvements written for CC's fork of Nutch are continuously pushed upstream into mainline Nutch but also into related projects, for example crawler-commons.

There are a couple of reasons why we need to keep CC's fork separate:

- quickly apply hot fixes or work-arounds
- pick selectively from upstream, for example, to ensure that no changes break our post-processing pipeline or that all dependencies conform with the Hadoop version used in our cluster setup

Generally speaking, there are specific needs for CC's crawler which it would be difficult to convince the Nutch community to integrate. And yes, there are also features which have long been waiting to be pushed upstream, mostly because some work remains to be done, even if it's only writing good enough documentation. There's a lot to do...

On the other hand, Nutch isn't a monolithic web crawler. It's an extensible toolbox with a plugin system and many jobs/components. It is designed to be customized and adapted to various crawling workflows.

Nevertheless, the differences between CC's fork and upstream Nutch aren't big. Nutch has a huge code base: 100k lines of Java code (git ls-files | grep '\.java$' | xargs wc -l). After the last merge of upstream Nutch into the branch cc (0fae6b5, Aug 22), the two differ in only 48 files and over 8000 lines of code, and that is mostly because of the addition of specific, custom classes. The command git diff 0fae6b5 cc --stat shows the differences.
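For anyone who wants to reproduce these numbers, the two commands can be wrapped as below. This is only a sketch: the function names count_java_lines and diff_summary are made up for illustration, and it assumes you run it inside a clone where the branch cc and the commit 0fae6b5 are available. The line count pipes the files through cat so that a single total is printed even if xargs splits a long file list into several batches.

```shell
#!/bin/sh

# Hypothetical helper: total number of lines in all tracked Java files.
# -z / -0 pass NUL-separated paths, so file names with spaces are handled.
# The '*.java' pathspec also matches files in subdirectories.
count_java_lines() {
    git ls-files -z -- '*.java' | xargs -0 cat | wc -l
}

# Hypothetical helper: one-line summary of how far a branch has moved
# away from a given commit (files changed, insertions, deletions).
#   $1 = base commit, $2 = branch
diff_summary() {
    git diff --shortstat "$1" "$2"
}

# Example calls (run inside the repository):
#   count_java_lines
#   diff_summary 0fae6b5 cc
```

git diff 0fae6b5 cc --stat remains the better choice when you want the per-file breakdown rather than just the totals.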

adulau commented 1 year ago

Awesome answer! Thank you very much.
