Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystem and store it in various data repositories, such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Unsupported HTTP response - 301 #181

Closed. AntonioAmore closed this issue 8 years ago.

AntonioAmore commented 8 years ago

During launch of the collector I got the following messages in the log:

Unsupported HTTP response HTTP/1.1 301 Moved Permanently
REJECTED_REDIRECTED: http://www.site.com (Subject: HttpFetchResponse [crawlState=REDIRECT, statusCode=301, reasonPhrase=Moved Permanently (https://www.site.com)])

startUrls comes in the config without any attributes, which means, according to the documentation, stayOnDomain="false", stayOnPort="false", and stayOnProtocol="false". So it should handle such a redirect correctly, I guess.
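For illustration, a rough sketch of what those defaults look like when written out explicitly (element and attribute names assumed from the 2.x configuration reference; the URL is the placeholder used throughout this thread):

<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
    <!-- Placeholder start URL; the real site is not shown in this thread. -->
    <url>http://www.site.com/</url>
</startURLs>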

Using the 2.3.0 release. The reference filter isn't set to pass http://: `^.*www\.site\.com/somefilterlogic$`. The start URL is http://, but it shouldn't cause problems.

essiembre commented 8 years ago

Yes, but what tells you the redirect target is not processed? Do you have nothing crawled after that message? REJECTED_REDIRECTED means the URL http://www.site.com is being rejected because it redirects somewhere, but the redirect is followed and the target URL should be processed normally. In this case, the target is https://www.site.com (https). Does the log mention anything about https://www.site.com?

AntonioAmore commented 8 years ago

It also writes:

Cannot fetch document: .... url (handshake alert: unrecognized_name )
   REJECTED_ERROR: ... url

essiembre commented 8 years ago

Handshake-related errors usually have to do with failing to validate the SSL certificate: either it has expired, you need to tell Java you trust it, or some other SSL issue applies. One easy way around this is usually to accept all certificates (generally a bad idea, but for crawling you can often live with that). Try:

<httpClientFactory>
    ...
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>

AntonioAmore commented 8 years ago

I added the following lines to crawlerDefaults, but it keeps reporting unrecognized_name. I'm sure there is no override in the crawler configuration.
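For reference, a minimal sketch of how that setting might be nested, assuming the usual 2.x layout where crawlerDefaults can hold an httpClientFactory element (all other settings omitted):

<crawlerDefaults>
    <httpClientFactory>
        <!-- Accept all certificates, including ones Java cannot validate. -->
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
    </httpClientFactory>
</crawlerDefaults>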

AntonioAmore commented 8 years ago

Do you need the stack trace from the log? I can post it here.

essiembre commented 8 years ago

Your full config would help the most.

AntonioAmore commented 8 years ago

May I send it to your email?

essiembre commented 8 years ago

Sure thing. You can find it in my profile.

essiembre commented 8 years ago

I was able to replicate it. It turns out this problem started appearing only after a certain version of Java.

Java introduced the Server Name Indication (SNI) extension in Java 7, and it is enabled by default. Some improperly configured servers send an "Unrecognized Name" warning during the SSL handshake. Most clients ignore this warning (as they should), but Java fails on it. You can disable SNI by adding the following JVM system property when launching the collector:

java -Djsse.enableSNIExtension=false 

This won't resolve everything though, because the site you crawl also seems to be using an older, insecure algorithm (MD2), which Java no longer supports out of the box. You can force it to accept it by editing this file:

JDK_HOME/jre/lib/security/java.security

In it, change this line:

jdk.certpath.disabledAlgorithms=MD2

to this one:

jdk.certpath.disabledAlgorithms=

I will try to find a permanent fix in HTTP Collector itself, but in the meantime, the above instructions should allow you to keep crawling.

essiembre commented 8 years ago

I made a new snapshot release with a fix. Setting trustAllSSLCertificates to true will now also disable the SNI extension and enable unsafe algorithms, so you no longer have to change your java.security file. Please confirm.

AntonioAmore commented 8 years ago

Thanks a lot, I'll provide feedback as soon as I can.

AntonioAmore commented 8 years ago

ERROR - www.site.com: Could not process document: https://www.site.com (javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints)
com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints
    at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:173)
    at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:48)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
...

I used the 2.4.0 snapshot. Is there any additional info I can provide?

essiembre commented 8 years ago

Not sure why it is not working for you. Which JRE version are you using? Can you attach your config?

AntonioAmore commented 8 years ago

Java version "1.8.0_05". I've already sent the config once, by email.

essiembre commented 8 years ago

Luckily I still had it. I just tested with 1.8 and was able to reproduce the exception; it appears the solution put in place works fine only under 1.7.x. I will investigate when I have a chance, but in the meantime you may want to use the JRE-tweaking measures suggested before to enable that site with its insecure protocol (or, less fun: downgrade your JRE).

AntonioAmore commented 8 years ago

Thank you for the detailed response. I'll try to downgrade JRE and provide feedback.

essiembre commented 8 years ago

I finally managed to have it working on both Java 7 and Java 8 without having to change the JRE java.security file or add a JVM argument. I made a new snapshot release with this fix. Please test it to confirm when you have a chance.

AntonioAmore commented 8 years ago

Thanks a lot! I confirm it no longer throws SSL exceptions.

essiembre commented 8 years ago

Awesome! That was a nasty one.