apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Should we normalize the URLs further? #120

Closed GuiForget closed 8 years ago

GuiForget commented 9 years ago

Based on this page http://en.wikipedia.org/wiki/URL_normalization, I was wondering if we should update BasicURLNormalizer to implement the first two sections.
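
For reference, those first two sections of the article cover normalizations that preserve semantics (or usually do): lowercasing the scheme and host, uppercasing percent-encoded triplets, decoding percent-encoded unreserved characters, removing the default port and removing dot-segments. A small illustrative snippet (the pairs below are hand-written examples of those rules, not the output of any existing StormCrawler class):

```java
public class NormalizationExamples {
    // Illustrative before/after pairs for the rules described in the
    // Wikipedia article; not produced by BasicURLNormalizer.
    public static void main(String[] args) {
        String[][] examples = {
            // lowercase the scheme and host
            {"HTTP://Example.COM/Path", "http://example.com/Path"},
            // uppercase percent-encoded triplets
            {"http://example.com/a%c2%b1b", "http://example.com/a%C2%B1b"},
            // decode percent-encoded unreserved characters
            {"http://example.com/%7Eusername/", "http://example.com/~username/"},
            // remove the default port
            {"http://example.com:80/", "http://example.com/"},
            // remove dot-segments from the path
            {"http://example.com/a/./b/../c", "http://example.com/a/c"}
        };
        for (String[] pair : examples) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```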

kkrugler commented 9 years ago

I believe the Nutch URL normalizer already does many of these, as does what's in Bixo. And (of course) I'd recommend pushing that support down into crawler-commons :) No sense in having 8 versions of the same basic functionality.

jnioche commented 9 years ago

+1 to having that in BasicURLNormalizer via a standalone class that we could donate to crawler-commons. Our definition of a [URLFilter](https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/storm/crawler/filtering/URLFilter.java) is probably richer than what we'd need in CC, so a simple `public static String normalize(URL url)` would probably do.
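
For illustration, a minimal sketch of what such a standalone helper could look like (the class name and exact behaviour are placeholders, not an existing StormCrawler or crawler-commons API):

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Locale;

// Illustrative helper with the "public static String normalize(URL url)" shape
// mentioned above; not an existing StormCrawler or crawler-commons class.
public final class SimpleURLNormalizer {

    private SimpleURLNormalizer() {}

    public static String normalize(URL url) {
        try {
            // Remove dot-segments from the path via java.net.URI
            URI uri = url.toURI().normalize();
            // Lowercase the scheme and host, which URI.normalize() leaves as-is
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost() == null ? null : uri.getHost().toLowerCase(Locale.ROOT);
            URI rebuilt = new URI(scheme, uri.getUserInfo(), host, uri.getPort(),
                    uri.getPath(), uri.getQuery(), uri.getFragment());
            return rebuilt.toString();
        } catch (URISyntaxException e) {
            // If the URL cannot be represented as a URI, return it unchanged
            return url.toExternalForm();
        }
    }
}
```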

jnioche commented 9 years ago

Just found out about the handy normalize() method in java.net.URI. We should definitely add this one-liner to the normaliser.
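
For context, URI.normalize() only removes dot-segments from the path; it does not touch case or default ports, so it covers just part of the normalization discussed above. A quick illustration:

```java
import java.net.URI;

public class UriNormalizeDemo {
    public static void main(String[] args) {
        // URI.normalize() collapses "." and ".." segments in the path...
        System.out.println(URI.create("http://example.com/a/./b/../c").normalize());
        // -> http://example.com/a/c

        // ...but it does not lowercase the scheme/host or drop the default port
        System.out.println(URI.create("HTTP://Example.COM:80/a/../b").normalize());
        // -> HTTP://Example.COM:80/b
    }
}
```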

kkrugler commented 9 years ago

I've got a minor concern about using URL for the normalize() method parameter, and using the URI.normalize() method. When we were doing big web crawls, having to instantiate a URL class for each URL wound up generating a lot of GC activity.

This was typically only an issue (as a percentage of all object creation) when we had a map task that was just doing something simple with a URL...we'd get it as text, then have to create the URL to process it. So we wound up creating our own custom class to implement some of the required functionality (extract query parameters, get the domain, etc).

Not sure if that's enough of a concern here, but when processing billions of anything it's easy to run into unexpected bottlenecks.
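
As a purely illustrative sketch of that approach (not the custom class referred to above), simple operations such as pulling out the host can be done on the raw string without allocating a java.net.URL at all:

```java
// Sketch of string-based extraction that avoids allocating java.net.URL;
// illustrative only, and deliberately simplistic (e.g. it ignores userinfo).
public final class LightweightUrlOps {

    private LightweightUrlOps() {}

    // Returns the host part of an absolute http(s) URL, or null if not found.
    public static String getHost(String url) {
        int start = url.indexOf("://");
        if (start < 0) {
            return null;
        }
        start += 3;
        // the host ends at the first '/', '?', '#', ':' or at the end of the string
        int end = start;
        while (end < url.length() && "/?#:".indexOf(url.charAt(end)) < 0) {
            end++;
        }
        return end > start ? url.substring(start, end) : null;
    }

    public static void main(String[] args) {
        // prints "example.com"
        System.out.println(getHost("http://example.com:8080/path?q=1"));
    }
}
```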

jnioche commented 9 years ago

We could have a simple approach to begin with (using URI.normalize) and a more efficient one later on if it really becomes an issue. In the context of storm crawler, with all the stuff happening at the same time, I don't think the overhead of using URIs will be very noticeable. As you pointed out this might be different for more specific use cases if we do that in CC.

BTW Nutch has code based on ORO to do that [https://github.com/apache/nutch/blob/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java]. We could benchmark it against using URI.normalize, but I am pretty sure the latter will be faster, not to mention easier to maintain and read. Of course, this does not mean it can't be done better without URIs.
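
A rough sketch of how such a comparison could be wired up (a real measurement should use a proper harness such as JMH; the class below is illustrative only, and the Nutch normalizer would be plugged in as a second candidate):

```java
import java.net.URI;
import java.util.function.UnaryOperator;

// Very rough timing harness for comparing normalizer implementations.
public class NormalizerBench {

    static long time(UnaryOperator<String> normalizer, String[] urls, int rounds) {
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            for (String url : urls) {
                normalizer.apply(url);
            }
        }
        return (System.nanoTime() - start) / 1_000_000; // milliseconds
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a/./b/../c",
            "http://example.com/x/y/../../z?q=1"
        };
        // Candidate 1: java.net.URI.normalize()
        UnaryOperator<String> viaUri = u -> URI.create(u).normalize().toString();
        System.out.println("URI.normalize: " + time(viaUri, urls, 1_000_000) + " ms");
        // Candidate 2: a regex-based normalizer (e.g. Nutch's BasicURLNormalizer)
        // could be wrapped in another UnaryOperator<String> and timed the same way.
    }
}
```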

jnioche commented 9 years ago

FYI see [https://issues.apache.org/jira/browse/NUTCH-1990]

sebastian-nagel commented 9 years ago

Has anyone tried [http://src.chromium.org/viewvc/chrome/trunk/src/url/]? There are Python and Perl bindings.

jnioche commented 8 years ago

[c9d0f01] lowercases the protocol and hostname
[3aa3391] URI normalise