AntonioAmore closed this issue 8 years ago
Yes, but what tells you the redirect target is not processed? You have nothing crawled after that message? REJECTED_REDIRECTED means the URL http://www.site.com is being rejected because it redirects somewhere, but the redirect is followed and the target URL should be processed normally. In this case, the target is https://www.site.com. Does the log mention anything about https://www.site.com?
It also writes:
Cannot fetch document: .... url (handshake alert: unrecognized_name )
REJECTED_ERROR: ... url
Handshake-related errors usually have to do with failing to resolve the SSL certificate. Either it expired, you need to tell java you trust it, or whatever other possible SSL error. One easy way to get around this usually is to accept all certificates (usually a bad idea but for crawling you often can live with that). Try:
<httpClientFactory>
...
<trustAllSSLCertificates>true</trustAllSSLCertificates>
</httpClientFactory>
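For context, here is a sketch of where that setting fits in a full config (the `GenericHttpClientFactory` class shown is the default factory in HTTP Collector 2.x; surrounding elements depend on your own configuration):

```xml
<!-- Sketch: trustAllSSLCertificates inside crawlerDefaults.
     Applies to every crawler unless overridden per crawler. -->
<crawlerDefaults>
  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
  </httpClientFactory>
</crawlerDefaults>
```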
I added the following lines to crawlerDefaults, but it keeps delivering unrecognized_name.
I'm sure there is no override in crawler configuration.
Do you need stack trace from the log? I may post it here.
Your full config would help the most.
May I send it to your email?
Sure thing. You can find it in my profile.
I was able to replicate it. It turns out this problem started appearing only after a certain version of Java. Java introduced the Server Name Indication (SNI) extension, and it is enabled by default. Some improperly configured servers send an "Unrecognized Name" warning during the SSL handshake. Most clients ignore this warning (as they should), but Java fails on it. You can disable SNI by adding the following JVM system property when launching the collector:
java -Djsse.enableSNIExtension=false
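If you cannot change the launch command, the same property can also be set programmatically. This is a sketch (not part of the collector itself); note it must run before any JSSE/SSL classes are loaded, or the setting is ignored:

```java
// Sketch: programmatic equivalent of -Djsse.enableSNIExtension=false.
// Must execute before the first SSL connection is attempted.
public class DisableSni {
    public static void main(String[] args) {
        System.setProperty("jsse.enableSNIExtension", "false");
        System.out.println(System.getProperty("jsse.enableSNIExtension"));
    }
}
```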
This won't resolve everything, though, because the site you crawl also seems to use a certificate signed with an older, insecure algorithm (MD2). Java no longer supports that algorithm out of the box. You can force it to accept it by editing this file:
JDK_HOME/jre/lib/security/java.security
In it, change this line:
jdk.certpath.disabledAlgorithms=MD2
to this one:
jdk.certpath.disabledAlgorithms=
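Alternatively, the same security property can be blanked at runtime via `java.security.Security.setProperty` (a sketch; it must run before the certificate path validator is first used, and re-enabling MD2 carries the same security risk as editing the file):

```java
import java.security.Security;

// Sketch: runtime equivalent of blanking jdk.certpath.disabledAlgorithms
// in java.security, so MD2-signed certificates are accepted again.
public class AllowMd2 {
    public static void main(String[] args) {
        Security.setProperty("jdk.certpath.disabledAlgorithms", "");
        System.out.println(Security.getProperty("jdk.certpath.disabledAlgorithms"));
    }
}
```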
I will try to find a permanent fix in HTTP Collector itself, but in the meantime, the above instructions should allow you to keep crawling.
I released a new snapshot release with a fix. Setting trustAllSSLCertificates to true will now also disable the SNI extension and enable unsafe algorithms. You no longer have to change your java.security file. Please confirm.
Thanks a lot, I'll provide feedback as soon as I can.
ERROR - www.site.com: Could not process document: https://www.site.com (javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints)
com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: Certificates does not conform to algorithm constraints
at com.norconex.collector.http.fetch.impl.GenericDocumentFetcher.fetchDocument(GenericDocumentFetcher.java:173)
at com.norconex.collector.http.pipeline.importer.DocumentFetcherStage.executeStage(DocumentFetcherStage.java:48)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
...
I used the 2.4.0 snapshot. Is there any additional info I can provide?
Not sure why it is not working for you. Which JRE version are you using? Can you attach your config?
java version "1.8.0_05". I've already sent the config once, by email.
Luckily I still had it. I just tested with 1.8 and I was able to reproduce the exception. It appears the solution put in place works fine under 1.7.x. I will investigate when I have a chance, but in the meantime, you may want to use the JRE-tweaking measures suggested before to enable that site with an insecure protocol (or less fun: downgrade your JRE).
Thank you for the detailed response. I'll try to downgrade JRE and provide feedback.
I finally managed to have it working on both Java 7 and Java 8 without having to change the JRE java.security file or add a JVM argument. I made a new snapshot release with this fix. Please test it to confirm when you have a chance.
Thanks a lot! I confirm there are no more SSL exceptions.
Awesome! That was a nasty one.
During launch of the collector I got the following messages in the log:
startUrls appears in the config without any attributes, which means, according to the documentation, stayOnDomain="false", stayOnPort="false", stayOnProtocol="false". So it should handle such a redirect correctly, I guess.
Using the 2.3.0 release. The reference filter isn't set to pass http://: `^.*www\.site\.com/somefilterlogic$`. The start URL is http://, but that shouldn't cause problems.
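For reference, here is a sketch of how those defaults would look if spelled out explicitly (the element and attribute names follow the HTTP Collector 2.x config style; the URL is a placeholder):

```xml
<!-- Sketch: startURLs with the stay-on attributes made explicit.
     All three default to false when omitted. -->
<startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
  <url>http://www.site.com</url>
</startURLs>
```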