Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

GenericMetadataFetcher & "HEAD - Method Not Allowed" #655

Closed · jetnet closed this issue 4 years ago

jetnet commented 4 years ago

We crawl many sites using the same configuration templates and have configured the GenericMetadataFetcher globally for all sites. Some sites do not allow HEAD requests, and the crawler stops at the very beginning, e.g.:

www.erneuerbare-energien.de: 2020-01-11 12:59:17 INFO -       REJECTED_BAD_STATUS: https://www.erneuerbare-energien.de/ (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=405, reasonPhrase=Method Not Allowed])

Could the crawler be configured to ignore such errors and continue crawling? Thank you!
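For context, a minimal sketch of how the metadata fetcher is typically declared in an HTTP Collector 2.x crawler config (the `validStatusCodes` value shown here is illustrative; the actual shared config may differ):

```xml
<!-- Hypothetical excerpt: global metadata fetcher declaration.
     The fetcher issues an HTTP HEAD request before downloading each
     document; a 405 response from the server is treated as a bad
     status and the document is rejected. -->
<metadataFetcher
    class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <validStatusCodes>200</validStatusCodes>
</metadataFetcher>
```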

essiembre commented 4 years ago

Can you share your config? I cannot reproduce it; I get a Connection reset error instead.

jetnet commented 4 years ago

Here is the config without the importer stuff, but the issue is still reproducible. Thanks!

essiembre commented 4 years ago

I get Not Found when trying to download your config.

jetnet commented 4 years ago

It seems transfer.sh is having some issues at the moment. Please try another one: file.io/Uz01ZR

LeMoussel commented 4 years ago

@jetnet , Out of curiosity, how do you use these variables?

set($domain = "www.erneuerbare-energien.de")

set($domain_xn = "www.erneuerbare-energien.de")

because I can't find any other reference to these two variables in this configuration file.

essiembre commented 4 years ago

@jetnet still a 404. Can you simply attach it to this ticket?

jetnet commented 4 years ago

@essiembre, attaching is too simple :) Let's try another link: 0x0.st/isqT.xml. BTW, the first link above is working now. @LeMoussel, I haven't included the importer section in the configuration as it's not relevant. Those variables are used for replacements during importing, as Norconex cannot handle domains with non-ASCII characters, e.g. www.müller.de. But that's another issue.
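Since the importer section was not shared, here is only a hypothetical sketch of how such variables might be used: a `ReplaceTransformer` in the importer section swapping the punycode (ASCII) form of a host for its Unicode form in document content. The handler class and XML shape follow the Norconex Importer 2.x conventions; the variable usage itself is an assumption.

```xml
<!-- Hypothetical sketch only: $domain / $domain_xn substituted via
     Velocity variables into an importer replace handler. Here
     $domain_xn would hold the punycode form (e.g. www.xn--mller-kva.de)
     and $domain the Unicode form (e.g. www.müller.de). -->
<importer>
  <preParseHandlers>
    <transformer
        class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
      <replace>
        <fromValue>${domain_xn}</fromValue>
        <toValue>${domain}</toValue>
      </replace>
    </transformer>
  </preParseHandlers>
</importer>
```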

essiembre commented 4 years ago

I was finally able to reproduce it. I added a new feature on the metadata fetcher called "skipOnBadStatus". Upon failure, it will skip HTTP HEAD-related activities and continue, instead of rejecting the document (a warning is issued instead).

Config sample:

  <metadataFetcher 
        class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" 
        skipOnBadStatus="true" />

This new configuration option is in the latest HTTP Collector snapshot just released. Please confirm it works for you.

jetnet commented 4 years ago

Thank you very much! It works as expected :)