Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystems and store it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

"handshake_failure" alert Exception (raised again) #560

Closed · ciroppina closed this issue 5 years ago

ciroppina commented 5 years ago

Why does this crawler configuration always return a "handshake_failure" alert and a javax.net.ssl.SSLHandshakeException?

<crawler id="sapp2_formalazio">
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>https://sapp2.formalazio.it/sapp/login</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/sapp2_formalazio</workDir>

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeTrailingSlash, secureScheme
        </normalizations>
    </urlNormalizer>

    <httpClientFactory class="$httpClientFactory">
        <cookiesDisabled>false</cookiesDisabled>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <expectContinueEnabled>true</expectContinueEnabled>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
    </httpClientFactory>  

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>10</maxDepth>

    <!-- REQUIRED per questo canale del PRL !!! -->
    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <!-- delay default="2000" / -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
        <!-- schedule dayOfWeek="from Monday to Sunday" 
            time="from 8:00 to 20:30">86400</schedule -->
    </delay>

    <!-- keep downloaded pages/files to your filesystem './sapp2_formalazio/downloads/' folder -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URL BEFORE any download. Classes must implement 
     com.norconex.collector.core.filter.IReferenceFilter, 
     like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json,p7m</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*sapp2.formalazio.*</filter>
        <filter class="$filterRegexRef">.*/sapp/.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*|.*p7m.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>
            <!-- If your target repository does not support arbitrary fields,
       make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">sapp2_formalazio_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>

By contrast, issuing the same URL with a curl command (win_64 build on Windows) and/or with the RESTClient add-on for Firefox downloads the page immediately and correctly, e.g.: curl -X GET -i "https://sapp2.formalazio.it/sapp/login"
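
Note that curl and the Firefox RESTClient use their own TLS stacks, so their succeeding while the JVM fails suggests a client-side mismatch rather than a server problem; a common culprit on older JREs is a missing cipher suite the server insists on. Running the collector with the JVM option -Djavax.net.debug=ssl,handshake prints the full handshake exchange and shows exactly where it is aborted.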

essiembre commented 5 years ago

This looks like the same issue as #561. Please confirm.

ciroppina commented 5 years ago

Yes, I confirm. Closing this one, then.