Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Handshake failure with Java8_172 #561

Closed: essiembre closed this issue 5 years ago

essiembre commented 5 years ago

Why do I still get a "handshake_failure" alert with the following crawler configuration on Java 8u172?

<crawler id="sac_formalazio">
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>https://sac.formalazio.it/login.php</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/sac_formalazio</workDir>

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeTrailingSlash, secureScheme
        </normalizations>
    </urlNormalizer>

    <httpClientFactory class="$httpClientFactory">
        <cookiesDisabled>false</cookiesDisabled>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <expectContinueEnabled>true</expectContinueEnabled>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
    </httpClientFactory>  

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>10</maxDepth>

    <!-- REQUIRED for this PRL channel! -->
    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <!-- delay default="2000" / -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
        <!-- schedule dayOfWeek="from Monday to Sunday" 
            time="from 8:00 to 20:30">86400</schedule -->
    </delay>

    <!-- keep downloaded pages/files to your filesystem './sac_formalazio/downloads/' folder -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URL BEFORE any download. Classes must implement 
     com.norconex.collector.core.filter.IReferenceFilter, 
     like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json,p7m</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*sac.formalazio.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*|.*p7m.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>
            <!-- If your target repository does not support arbitrary fields,
       make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">sac_formalazio_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>

By contrast, with a curl command (win64 on Windows) or the RESTClient add-on for Firefox, the page downloads immediately, e.g.: curl -X GET -i "https://sac.formalazio.it/login.php"

Originally posted by @ciroppina in https://github.com/Norconex/collector-http/issues/446#issuecomment-462331602

essiembre commented 5 years ago

@ciroppina, it seems that your version of Java does not have the latest cipher suites required to communicate with your server via SSL.

I was able to reproduce the problem and could not find a workaround until I installed the latest Java 8 (u202) and removed all custom settings under <httpClientFactory>. It worked just fine after that.
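When diagnosing a handshake_failure like this, it can help to see which protocols and cipher suites the running JVM actually enables by default. The following standalone snippet is a hypothetical diagnostic (not part of the collector); class and output wording are my own:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Prints the TLS protocols and cipher suites the current JVM enables by
// default. If TLSv1.2 (or a suite the server requires) is missing from
// this list, a handshake_failure on the crawler side is expected.
public class TlsDiagnostics {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        SSLParameters params = ctx.getDefaultSSLParameters();

        System.out.println("Enabled protocols:");
        for (String p : params.getProtocols()) {
            System.out.println("  " + p);
        }

        System.out.println("Enabled cipher suites:");
        for (String c : params.getCipherSuites()) {
            System.out.println("  " + c);
        }
    }
}
```

Comparing this output between Java 8u172 and 8u202 would show whether the upgrade changed the available ciphers.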

ciroppina commented 5 years ago

I updated the Java SDK and JRE to u202, added the TLS 1.2 certificate for "sac.formalazio.it:443" to Java's security/cacerts, and removed all custom settings from the collector crawler config file,

but I still get, for https://sac.formalazio.it/login.php: com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

After removing the "sac.formalazio.it:443" TLS certificate from Java's security/cacerts, I get the same exception.

essiembre commented 5 years ago

Hmm... this is puzzling. I suggest you try importing the security certificate manually with the Java keytool before running the crawler. It is not always the easiest thing to do, but you can find a few tutorials online. Here is one from Oracle:

https://docs.oracle.com/javase/tutorial/security/toolfilex/rstep1.html
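For reference, the fetch-and-import steps look roughly like this. This is a sketch under assumptions: the alias and file names are made up, the cacerts path assumes a Java 8 JDK layout, and the store password is the stock default changeit:

```shell
# Fetch the server's certificate (requires openssl).
openssl s_client -connect sac.formalazio.it:443 -servername sac.formalazio.it \
  </dev/null 2>/dev/null | openssl x509 -outform PEM > sac_formalazio.pem

# Import it into the trust store of the JVM that runs the crawler.
# On Java 8 the cacerts file lives under jre/lib/security.
keytool -importcert -noprompt \
  -alias sac_formalazio \
  -file sac_formalazio.pem \
  -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
  -storepass changeit

# Verify the entry was added.
keytool -list -alias sac_formalazio \
  -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
  -storepass changeit
```

Note that if the server presents an incomplete chain, the intermediate CA certificate may need to be imported as well, which is a common cause of "PKIX path building failed".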

ciroppina commented 5 years ago

Dear Pascal,

please consider this issue closed; it was resolved by setting the following:

        <httpClientFactory class="$httpClientFactory">
            <cookiesDisabled>false</cookiesDisabled>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
            <expectContinueEnabled>true</expectContinueEnabled>
            <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
        </httpClientFactory>  

Thanks a lot

essiembre commented 5 years ago

Thanks for the update.