Closed jetnet closed 4 years ago
Can you share your config? I cannot reproduce this, as I get a Connection reset error instead.
here's the config w/o the importer stuff, but the issue is reproducible. Thanks!
I get Not Found when trying to download your config.
it seems transfer.sh is having some issues at the moment.
please try another one: file.io/Uz01ZR
@jetnet, out of curiosity, how do you use these variables?
#set($domain = "www.erneuerbare-energien.de")
#set($domain_xn = "www.erneuerbare-energien.de")
because I can't find any other reference to these 2 variables in this configuration file.
@jetnet still a 404. Can you simply attach it to this ticket?
@essiembre, attaching is too simple :) let's try another link: 0x0.st/isqT.xml. BTW, the first link above is working now.
@LeMoussel, I haven't included the import section in the configuration as it's not relevant. Those variables are used for replacement during importing, as Norconex cannot handle domains with non-ASCII characters, e.g. www.müller.de. But that's another issue.
I was finally able to reproduce. I added a new feature on the metadata fetcher called "skipOnBadStatus". When the HTTP HEAD request fails, it skips the HEAD-related activities and continues, as opposed to rejecting the document (it issues a warning instead).
Config sample:
<metadataFetcher
class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"
skipOnBadStatus="true" />
This new configuration option is in the latest HTTP Collector snapshot just released. Please confirm it works for you.
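For templates shared across many sites, the option could presumably be set once under crawlerDefaults instead of on each crawler. A minimal sketch, assuming the 2.x configuration layout; the collector id, crawler id, and start URL are hypothetical placeholders, not taken from the attached config:

```xml
<httpcollector id="example-collector"> <!-- hypothetical id -->
  <crawlerDefaults>
    <!-- Applies to every crawler below unless a crawler overrides it. -->
    <metadataFetcher
        class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"
        skipOnBadStatus="true" />
  </crawlerDefaults>
  <crawlers>
    <crawler id="example-site"> <!-- hypothetical id -->
      <startURLs>
        <url>https://www.example.com/</url> <!-- placeholder URL -->
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>
```

With this in place, a site that rejects HEAD should only log a warning, and the crawler should then go on to the regular GET for the document.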
Thank you very much! It works as expected :)
We crawl many sites using the same configuration templates and have configured the GenericMetadataFetcher globally for all sites. Some sites do not allow the HEAD request, and the crawler stops at the very beginning, e.g.:
Could the crawler be configured to ignore such errors and continue crawling? Thank you!