Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

GenericMetadataFetcher: Bad Request with Redirect #695

Closed — jetnet closed this issue 2 years ago

jetnet commented 4 years ago

hello Pascal,

we encountered another issue with the metadata fetcher: some websites that do not support HEAD requests not only reply with Bad Request, but also add a Location HTTP response header pointing to an error page. It would be great if the new parameter skipOnBadStatus="true" skipped the redirect instruction as well. Example: curl -IsL http://www.hamburgwasser.de Thank you!
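For reference, a minimal sketch of how the requested behavior might be configured (skipOnBadStatus is the parameter name mentioned above; its exact placement on GenericMetadataFetcher is an assumption, not confirmed syntax):

```xml
<!-- Hypothetical sketch: attribute placement is assumed, not verified -->
<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"
    skipOnBadStatus="true">
  <validStatusCodes>200</validStatusCodes>
</metadataFetcher>
```

With such a setting, a 400 response to the HEAD request would be skipped entirely, including any Location redirect header it carries, and crawling would proceed with the regular GET.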

essiembre commented 4 years ago

Please try with the newly released snapshot and confirm.

jetnet commented 4 years ago

Thanks a lot! It is working now. One thought regarding WARN [MetadataFetcherStage] Bad HTTP response when fetching metadata: the message can be disabled via Log4j (log4j.logger.com.norconex.collector.http.pipeline.importer.MetadataFetcherStage=ERROR), but the unsupported requests are still being sent. Do you think it would make sense to disable the metadata fetcher automatically when HEAD is not supported? E.g. send a HEAD request; if it fails, then send a GET, and if that second request returns a good status, conclude that HEAD is not supported and disable the metadata fetcher.

essiembre commented 4 years ago

We can't really do that, because a single failed HEAD request is not a reliable indicator that all HEAD requests will fail for a site. Plus, large sites can have different sections rendered in different ways, with some pages supporting HEAD and some not. Also, a crawler can crawl multiple sites at once.

Ideally, we would have the option to disable metadata fetching only for pages matching a given pattern. I'll make this a feature request.

essiembre commented 3 years ago

The latest 3.0.0 snapshot version now has this feature. You can now have HEAD performed for some URL patterns only. You can also make HEAD optional: if it returns a bad status, the crawler will continue with a GET instead of rejecting that URL.
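A minimal sketch of what this could look like in a 3.0.0 crawler configuration (element name and values are assumptions based on this description; verify against the actual v3 documentation and the linked example):

```xml
<!-- Sketch only: names/values assumed, check the v3 configuration reference -->
<crawler id="example-crawler">
  <!-- OPTIONAL: on a bad HEAD status, fall back to a GET
       instead of rejecting the URL -->
  <fetchHttpHead>OPTIONAL</fetchHttpHead>
  <!-- ... rest of crawler configuration ... -->
</crawler>
```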

Please refer to #654 for a usage example.