commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

What's the difference between Apache Nutch and the commoncrawl fork? #25

Closed adulau closed 12 months ago

adulau commented 1 year ago

A small question: we were wondering why the Common Crawl fork diverges from the Apache Foundation version of Nutch. Is there a plan to merge it back?

sebastian-nagel commented 12 months ago

Improvements written for CC's fork of Nutch are continuously pushed upstream, into mainline Nutch and also into related projects, for example crawler-commons.

There are a couple of reasons why we need to keep CC's fork separate:

Generally speaking, CC's crawler has specific needs, and it would be difficult to convince the Nutch community to integrate them. And yes, there are also features which have long been waiting to be pushed upstream, mostly because some work remains to be done, even if it's only good enough documentation. There's a lot to do...

On the other hand, Nutch isn't a monolithic web crawler. It's an extensible toolbox with a plugin system and many jobs/components. It is designed to be customized and adapted to various crawling workflows.

Nevertheless, the differences between CC's fork and upstream Nutch aren't big. Nutch has a huge code base: 100k lines of Java code (git ls-files | grep '\.java$' | xargs wc -l). After the last merge of upstream Nutch into the branch cc (0fae6b5, Aug 22), there are differences in only 48 files, over 8000 lines of code, but that is mostly because of the addition of specific, custom classes. The command git diff 0fae6b5 cc --stat shows the differences.
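For anyone who wants to repeat these two measurements in a local clone, here is a minimal sketch. The function names count_java_lines and fork_divergence are made up for illustration and are not part of Nutch or the fork. It is a variation on the commands above, not the exact ones: the line count pipes all files through cat so that a single total is printed, and the comparison uses git diff --shortstat for a one-line summary instead of the per-file --stat listing.

```shell
# Count the lines of all tracked Java sources in the current repository.
# NUL-delimited paths (-z / -0) keep file names containing spaces intact.
count_java_lines() {
    git ls-files -z -- '*.java' | xargs -0 cat | wc -l
}

# Summarize how far a fork branch has diverged from a base commit.
#   $1: base commit or ref (0fae6b5 in the comment above)
#   $2: fork branch (cc in the comment above)
fork_divergence() {
    git diff --shortstat "$1" "$2"
}
```

Run inside a clone of the fork, count_java_lines prints the total number of Java lines, and fork_divergence 0fae6b5 cc prints the number of changed files together with the inserted and deleted line counts.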

adulau commented 12 months ago

Awesome answer! Thank you very much.