Closed by jetnet 2 years ago
Please try with the newly released snapshot and confirm.
Thanks a lot! It is working now.
One thought regarding `WARN [MetadataFetcherStage] Bad HTTP response when fetching metadata`. The message can be disabled via Log4j (`log4j.logger.com.norconex.collector.http.pipeline.importer.MetadataFetcherStage=ERROR`), but unsupported requests are still being sent.
Do you mean it would make sense to disable MetadataFetcher automatically when HEAD is not supported? E.g. send a HEAD request; if it fails, then send a GET, and if the second one returns a good status, then HEAD is not supported and the metadata fetcher can be disabled.
We can't really do that because a failed HEAD request is not an absolute indicator that all HEAD requests will fail for a site. Plus, large sites could have different sections rendered in different ways, with some pages supporting HEAD and some not. Also, crawlers can crawl multiple sites at once.
Ideally, we would have the option to disable metadata fetching only for pages matching a given pattern. I'll make this a feature request.
The latest 3.0.0 snapshot version now has this feature. You can now have `HEAD` performed for some URL patterns only. You can also make `HEAD` optional, meaning if it has a bad status, it will continue with a `GET` instead of rejecting that URL.
Please refer to #654 for a usage example.
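For readers who want the gist of that behaviour without opening the configuration, here is a rough Python sketch of the decision flow as described in this comment. It is not the collector's code: `fetch_document`, `head_pattern`, `head_optional` and the `fetch` callback are hypothetical names made up for illustration, and treating "no pattern" as "HEAD for every URL" is an assumption.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical callback: takes (method, url) and returns an HTTP status code.
Fetcher = Callable[[str, str], int]


def is_good(status: int) -> bool:
    """Treat any 2xx status as a good response."""
    return 200 <= status < 300


def fetch_document(
    url: str,
    fetch: Fetcher,
    head_pattern: Optional[str] = None,  # assumption: None means HEAD for all URLs
    head_optional: bool = False,         # assumption: models the "optional HEAD" idea
) -> Tuple[bool, List[str]]:
    """Return (accepted, methods_sent) for a single URL."""
    methods: List[str] = []
    wants_head = head_pattern is None or re.search(head_pattern, url) is not None
    if wants_head:
        methods.append("HEAD")
        if not is_good(fetch("HEAD", url)) and not head_optional:
            # Mandatory HEAD with a bad status: the URL is rejected.
            return False, methods
    # Either HEAD was skipped, succeeded, or was optional: continue with GET.
    methods.append("GET")
    return is_good(fetch("GET", url)), methods
```

With a fake `fetch` that answers 400 to HEAD and 200 to GET, an optional HEAD ends in an accepted GET, a mandatory HEAD rejects the URL, and a URL that does not match `head_pattern` goes straight to GET.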
Hello Pascal,
we encountered another issue with the metadata fetcher: some websites that do not support `HEAD` requests not only reply `Bad Request`, but also add a `Location` HTTP response header pointing to an error page. It would be great if the new parameter `skipOnBadStatus="true"` would skip the redirect instruction as well. Example: `curl -IsL http://www.hamburgwasser.de`
Thank you!
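To make the request above concrete, here is a small Python sketch of the behaviour being asked for, under the assumption that a `Location` header should only count as a redirect when the status is 3xx. The function name and the returned labels are hypothetical, and the `skip_on_bad_status` flag only mirrors the `skipOnBadStatus` parameter mentioned above; it does not show how the collector actually behaves.

```python
from typing import Mapping


def next_step_after_head(
    status: int,
    headers: Mapping[str, str],
    skip_on_bad_status: bool,
) -> str:
    """Decide what to do after a HEAD response (illustrative labels only)."""
    # Header names are case-insensitive, so compare in lower case.
    has_location = any(name.lower() == "location" for name in headers)
    if 200 <= status < 300:
        return "use_head_metadata"
    if 300 <= status < 400 and has_location:
        return "follow_redirect"
    # Bad status: any Location header is ignored here, which is the
    # behaviour requested in the comment above.
    if skip_on_bad_status:
        return "continue_with_get"
    return "reject"
```

In this sketch a `400` response carrying `Location: /error` leads to `continue_with_get` when the flag is set, instead of following the redirect to the error page.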