Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

ignore www prefix #614

Open jetnet opened 5 years ago

jetnet commented 5 years ago

We need to crawl many Internet sites and ran into an issue with the www prefix: some sites redirect to their domain without www, others the other way around. Unfortunately, such cases cannot be handled by NC in a general way (globally): we can normalize URLs by removing the www prefix, but if a site then redirects back to www.some.site, the collector will follow it, since it is configured to follow sub-domains. On the other hand, there will be cases where a site is available with the www prefix only (e.g. https://www.pony.at/ does not work without www), so we would still miss such sites. So I'm looking for a general solution to this problem. Any ideas are very welcome! Thank you!

Common requirements for a crawler:

<startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false" includeSubdomains="true">
jetnet commented 5 years ago

I'm not alone :) #596

essiembre commented 5 years ago

One solution could be to define two crawlers, one with URL normalization set to always use www and the other set to always drop it. You would then have to test each start URL to figure out which crawler it belongs under.
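If you do not want to sort the start URLs by hand, a small helper can do that test for you. The sketch below is plain Java, not part of Norconex, and the class and method names are just illustrative: it sends one HEAD request to the www variant of each host and prints which crawler the URL should go to.

```java
import java.net.HttpURLConnection;
import java.net.URI;

public class StartUrlSorter {

    // True if the "www." variant of the URL's host answers with a non-error
    // status, meaning the URL should go to the crawler that keeps/adds www.
    static boolean wwwVariantExists(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null || host.startsWith("www.")) {
            return true;
        }
        try {
            HttpURLConnection con = (HttpURLConnection) URI
                    .create(uri.getScheme() + "://www." + host + "/")
                    .toURL().openConnection();
            con.setRequestMethod("HEAD");
            return con.getResponseCode() < 400;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String url : args) {
            System.out.println((wwwVariantExists(url)
                    ? "www-crawler: " : "plain-crawler: ") + url);
        }
    }
}
```

This assumes one check per start URL is enough; if a site serves both variants you would still have to decide which crawler gets it.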

If you do not know up front all the domains that will be crawled, it could get tricky for sure. We could make this a feature request, but I am not sure what solution could be generic enough. Especially knowing that www could technically be a subdomain that serves totally different content (even if I have never encountered such a site).

Maybe we could have a smart URL Normalizer where you can indicate your preference (www or not) and upon seeing a domain for the first time, if it does not suit your preference, it will first test if its alternate version exists before actually doing the normalization (making an extra call). I guess this could work as long as we can assume doing that test once per domain is valid for all URLs on that domain. An example:

  1. Let's say you prefer www
  2. The crawler encounters https://www.aaa.com/111.html, so it leaves it unchanged and remembers that domain to be OK.
  3. The crawler encounters https://aaa.com/222.html, it knows you prefer www and it knows it already exists, so it normalizes it to https://www.aaa.com/222.html.
  4. The crawler encounters https://bbb.com/333.html. It does not know if www exists for it so it makes an extra call to find out:
    • If it exists, it normalizes it to https://www.bbb.com/333.html and remembers it.
    • If it does not exist, it leaves it as is and remembers not to check again for that domain (never convert URLs on that domain to www).
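A rough sketch of that flow in plain Java (the class and method names are mine, not the collector's normalizer API): it prefers the www form, probes a domain the first time it shows up, and caches the verdict so the extra call happens only once per domain.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Illustrative "prefer www, probe once per domain" normalizer sketch.
public class SmartWwwNormalizer {

    // Verdict per bare domain: true = the www variant exists, so convert.
    private final Map<String, Boolean> domainHasWww = new ConcurrentHashMap<>();

    public String normalize(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null || host.startsWith("www.")) {
            return url; // already matches the www preference
        }
        // First time we see this domain, make one extra call; afterwards
        // reuse the cached answer for every URL on that domain.
        boolean hasWww = domainHasWww.computeIfAbsent(host,
                h -> probe(uri.getScheme() + "://www." + h + "/"));
        if (!hasWww) {
            return url; // www variant does not exist: never convert this domain
        }
        return url.replaceFirst("//" + Pattern.quote(host), "//www." + host);
    }

    private boolean probe(String url) {
        try {
            HttpURLConnection con = (HttpURLConnection)
                    URI.create(url).toURL().openConnection();
            con.setRequestMethod("HEAD");
            return con.getResponseCode() < 400;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Hooking something like this into the collector would mean wrapping it in whatever URL-normalizer extension point the version in use exposes, and as noted above it only works if one verdict per domain is valid for all URLs on that domain.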

Could an (optional) feature like that do it, do you think?

jetnet commented 5 years ago

Maybe we could simplify the logic as follows: I'd add two new options:

I assume it would not be that easy to implement an includeParentDomains option that allows going all the way up to the top-level domain. I'd be happy with includeParentDomains=[true|false|www] if it allowed only a single parent domain (either any parent, or only when the current domain starts with www).
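To make the www variant of that proposal concrete, the scope check for a hypothetical includeParentDomains=www setting could look roughly like this (the option and the method are made up for illustration, not existing collector features):

```java
// Hypothetical scope check for includeParentDomains=www (not a real option).
static boolean inScope(String startHost, String candidateHost) {
    if (candidateHost.equals(startHost)) {
        return true;                               // same host
    }
    if (candidateHost.endsWith("." + startHost)) {
        return true;                               // subdomain, as with includeSubdomains
    }
    // www mode: allow the single parent domain, but only when the
    // start host is that parent prefixed with "www."
    return startHost.equals("www." + candidateHost);
}
```

So a page on www.pony.at could link back to pony.at and stay in scope, while shop.example.com would not pull in example.com unless includeParentDomains=true.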

What do you think? Thanks a lot!

essiembre commented 5 years ago

Plenty of good ideas. I just marked this as a feature request.