Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Handshake failure with Java8_172 #561

Closed: essiembre closed this issue 5 years ago

essiembre commented 5 years ago

Why do I still get a "handshake_failure" alert with the following crawler configuration on Java 8u172?

<crawler id="sac_formalazio">
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>https://sac.formalazio.it/login.php</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/sac_formalazio</workDir>

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeTrailingSlash, secureScheme
        </normalizations>
    </urlNormalizer>

    <httpClientFactory class="$httpClientFactory">
        <cookiesDisabled>false</cookiesDisabled>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <expectContinueEnabled>true</expectContinueEnabled>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
    </httpClientFactory>  

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>10</maxDepth>

    <!-- REQUIRED for this PRL channel! -->
    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <!-- delay default="2000" / -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
        <!-- schedule dayOfWeek="from Monday to Sunday" 
            time="from 8:00 to 20:30">86400</schedule -->
    </delay>

    <!-- keep downloaded pages/files to your filesystem './sac_formalazio/downloads/' folder -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URL BEFORE any download. Classes must implement 
     com.norconex.collector.core.filter.IReferenceFilter, 
     like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json,p7m</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*sac.formalazio.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*|.*p7m.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>
            <!-- If your target repository does not support arbitrary fields,
       make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">sac_formalazio_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>

By contrast, with a curl command (win64 on Windows) or the RESTClient add-on for Firefox, the page downloads immediately, e.g.: curl -X GET -i "https://sac.formalazio.it/login.php"

Originally posted by @ciroppina in https://github.com/Norconex/collector-http/issues/446#issuecomment-462331602

essiembre commented 5 years ago

@ciroppina, it seems that your version of Java does not have the latest cipher suites required to communicate with your server via SSL.

I was able to reproduce the problem and could not find a workaround until I installed the latest Java 8 (u202) and removed all custom settings under <httpClientFactory>. It worked just fine after that.
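When diagnosing a handshake_failure like this, it can help to see which protocols and cipher suites the running JVM actually enables by default. The following standalone snippet is a hypothetical diagnostic (not part of the collector); class and output wording are my own:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Prints the TLS protocols and cipher suites the current JVM enables by
// default. If TLSv1.2 (or a suite the server requires) is missing from
// this list, a handshake_failure on the crawler side is expected.
public class TlsDiagnostics {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        SSLParameters params = ctx.getDefaultSSLParameters();

        System.out.println("Enabled protocols:");
        for (String p : params.getProtocols()) {
            System.out.println("  " + p);
        }

        System.out.println("Enabled cipher suites:");
        for (String c : params.getCipherSuites()) {
            System.out.println("  " + c);
        }
    }
}
```

Comparing this output between Java 8u172 and 8u202 would show whether the upgrade changed the available ciphers.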

ciroppina commented 5 years ago

I updated the Java SDK and JRE to u202, added the TLS 1.2 certificate for "sac.formalazio.it:443" to Java's security/cacerts, and removed all custom settings from the collector crawler config file,

but I still get, for https://sac.formalazio.it/login.php: com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

After removing the "sac.formalazio.it:443" TLS certificate from Java's security/cacerts, I get the same exception.

essiembre commented 5 years ago

Hmm... this is puzzling. I suggest you try importing the security certificate manually with the Java keytool before running the crawler. It is not always the easiest thing to do, but you can find a few tutorials online. Here is one from Oracle:

https://docs.oracle.com/javase/tutorial/security/toolfilex/rstep1.html
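For reference, the fetch-and-import steps look roughly like this. This is a sketch under assumptions: the alias and file names are made up, the cacerts path assumes a Java 8 JDK layout, and the store password is the stock default changeit:

```shell
# Fetch the server's certificate (requires openssl).
openssl s_client -connect sac.formalazio.it:443 -servername sac.formalazio.it \
  </dev/null 2>/dev/null | openssl x509 -outform PEM > sac_formalazio.pem

# Import it into the trust store of the JVM that runs the crawler.
# On Java 8 the cacerts file lives under jre/lib/security.
keytool -importcert -noprompt \
  -alias sac_formalazio \
  -file sac_formalazio.pem \
  -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
  -storepass changeit

# Verify the entry was added.
keytool -list -alias sac_formalazio \
  -keystore "$JAVA_HOME/jre/lib/security/cacerts" \
  -storepass changeit
```

Note that if the server presents an incomplete chain, the intermediate CA certificate may need to be imported as well, which is a common cause of "PKIX path building failed".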

ciroppina commented 5 years ago

Dear Pascal,

please consider this issue closed; it was resolved by setting the following:

        <httpClientFactory class="$httpClientFactory">
            <cookiesDisabled>false</cookiesDisabled>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
            <expectContinueEnabled>true</expectContinueEnabled>
            <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
        </httpClientFactory>  

Thanks a lot

essiembre commented 5 years ago

Thanks for the update.