Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystem and store it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Issue with Redirect #273

Closed shubhamsamy closed 8 years ago

shubhamsamy commented 8 years ago

Hi, I am getting an issue when the crawler tries to crawl redirected URLs. I am trying to crawl a URL from a non-standard sitemap page. The link from the sitemap page leads to a secure page. While trying to crawl the secure page, the crawler throws this error: Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@6acc88b5

Please find attached the config and log. Please have a look and let me know if there is anything wrong in the configuration. Regards, Subba

config_other_www.txt WWW.txt

essiembre commented 8 years ago

I could not find the error you are referring to in the log you attached. Did you attach the right log?

Seeing the full error in context, I may be able to answer for sure, but that message is likely harmless. If you are truly talking about a redirect... know that the original URL in a redirect gets "rejected" while the target URL gets queued for processing. If the target URL does not get rejected because of whatever rules are in your config, it will be processed just fine. Do you have cases where it shows otherwise? Please give a specific URL as an example.
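To illustrate the kind of config rule that can reject a redirect target, here is a hypothetical sketch (the filter class is the standard reference filter from collector-core, and example.com is a placeholder): an include filter pinned to http would reject the https URL a redirect points to.

    <referenceFilters>
      <!-- Hypothetical: an include filter limited to "http://" would
           reject the "https://" target URL of a redirect. -->
      <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
              onMatch="include">
        http://www\.example\.com/.*
      </filter>
    </referenceFilters>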

shubhamsamy commented 8 years ago

Hi Pascal, Thanks for your response. Please find attached my complete configuration and logs; I am sending them by email for security reasons. The sitemap page is https, whereas all the URLs in the sitemap are http. The URLs inside the sitemap page are being redirected to https. Nothing happens after the crawler queues the URL for processing. The error shows: Pipeline execution stopped at stage: com.norconex.collector.http.pipeline.importer.DocumentFetcherStage@5975307d. Not sure if something is wrong with the importer configuration. Please have a look and let me know what the reason could be. Regards, Subba


essiembre commented 8 years ago

@shubhamsamy, it looks like you replied to the GitHub notification email, so your reply was added to GitHub but your attachments did not make it. If you want to send me something privately, check my GitHub profile for my email.

shubhamsamy commented 8 years ago

Hi Pascal, The issue is resolved. We added the following configuration:

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer" disabled="false">
      <normalizations>
        secureScheme
      </normalizations>
    </urlNormalizer>

Now all the URLs in the sitemap page that had http are converted to https, and the connector is able to crawl them. Regards, Subba

essiembre commented 8 years ago

Glad you found a solution. Just know that by explicitly adding "just" secureScheme, you are removing the other defaults (see GenericURLNormalizer for the list of defaults). If you want to add your entry to the list of defaults, here is the list you should have instead:

removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence, 
decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters, 
secureScheme
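For reference, a sketch of the full urlNormalizer block with those defaults kept and secureScheme appended (same tag and class as in your configuration above):

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        secureScheme
      </normalizations>
    </urlNormalizer>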

There is also the option to allow the crawler to accept both http and https like this:

<startURLs ... stayOnProtocol="false">
...
</startURLs>

But if you want to prevent ending up with a mix of both, the normalization approach is better.
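For completeness, a minimal sketch of that option in context (the URL is a placeholder, and stayOnDomain is an assumed companion setting, not taken from the original config):

    <startURLs stayOnDomain="true" stayOnProtocol="false">
      <!-- Placeholder URL. stayOnProtocol="false" lets the crawler follow
           a redirect from http to https instead of rejecting it. -->
      <url>http://www.example.com/sitemap-page.html</url>
    </startURLs>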