Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Cannot fetch document: Handshake Failure #446

Closed: jamieshiz closed this issue 6 years ago

jamieshiz commented 6 years ago

I have the exact same configuration for another domain, just with the domain-specific settings swapped out. I get the following error when trying to run the indexer.

ERROR [GenericDocumentFetcher] Cannot fetch document: https://www.site_name.com/ (Received fatal alert: handshake_failure)

I have verified that the cert is on the server and working properly. Any ideas on what is missing?

essiembre commented 6 years ago

By setting the org.apache.* log level to DEBUG in your log4j.properties file, you may get more information that can help you.
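
For example, a single line like this in log4j.properties should surface the HTTP client internals (a sketch assuming the log4j 1.x syntax the collector uses; logger names are hierarchical, so no wildcard is needed):

log4j.logger.org.apache=DEBUG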

There could be a few reasons, but two common issues with HTTPS sites are Java not accepting the site certificate, or the SSL protocol used by the site not being supported by your Java version.

The following settings can help (they go under your crawler config):

<httpClientFactory>
   <trustAllSSLCertificates>true</trustAllSSLCertificates>
   <sslProtocols>(comma-separated list)</sslProtocols>
</httpClientFactory>

For the sslProtocols, here is the relevant excerpt from the GenericHttpClientFactory:

Sets the supported SSL/TLS protocols, such as SSLv3, TLSv1, TLSv1.1, and TLSv1.2. Note that specifying a protocol not supported by your underlying Java platform will not work.
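
To see which protocols your Java runtime actually supports and enables by default, a quick check like the following can help (a diagnostic sketch, not part of the crawler configuration):

import java.util.Arrays;
import javax.net.ssl.SSLContext;

// Prints the SSL/TLS protocols the current JVM supports and enables by default.
public class TlsCheck {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        System.out.println("Supported: "
                + Arrays.toString(ctx.getSupportedSSLParameters().getProtocols()));
        System.out.println("Enabled by default: "
                + Arrays.toString(ctx.getDefaultSSLParameters().getProtocols()));
    }
}

On Java 7, for instance, TLSv1.1 and TLSv1.2 are supported but not enabled by default on the client side, which is a common cause of this kind of handshake failure.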

jamieshiz commented 6 years ago

@essiembre - thanks for your input. I have tried adding the httpClientFactory settings to my config but am still getting the error. I have also added 'log4j.logger.org.apache.*log=DEBUG' to the log4j.properties file and am not seeing any additional errors:

REJECTED_ERROR: https://www.sitename.com (com.norconex.collector.core.CollectorException: javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure)

Below is the site config file I am using, and I have confirmed that the SSL cert is valid and uses TLS 1.2.

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2015 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and basic recommendations
     to run a crawler.  
     -->
<httpcollector id="www-sitename-com">

  <!-- Decide where to store generated files. -->
  <progressDir>/siteconfigs/workdir/www-sitename-com/progress</progressDir>
  <logsDir>/siteconfigs/workdir/www-sitename-com/logs</logsDir>

  <crawlers>
    <crawler id="CloudSearch">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.sitename.com/</url>
      </startURLs>
      <!-- Generic implementation of IURLNormalizer that should satisfy most URL normalization needs
           The following adds a normalization to add "www." to URL domains when missing, removes
           trailing slash from url, ensures all URLS are secure -->
      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          addWWW, removeTrailingSlash, secureScheme
        </normalizations>
      </urlNormalizer>
      <httpClientFactory>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <sslProtocols>TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
      </httpClientFactory>
      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>/siteconfigs/workdir/www-sitename-com</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>3</maxDepth>
      <maxDocuments>-1</maxDocuments>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <!-- Since 2.3.0: -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter" 
                onMatch="exclude"
                caseSensitive="false">png,gif,jpg,jpeg,js,css</filter>
        <filter 
          class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" 
          onMatch="exclude" 
          caseSensitive="false">RequestADemo</filter>
      </referenceFilters>

      <!-- Document importing -->
      <importer>
        <preParseHandlers>
          <tagger class="biz.reddoor.norconex.importer.handler.tagger.impl.DocumentReferenceIDTagger" field="referenceid" />
        </preParseHandlers>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,description,document.reference,referenceid,content,keywords</fields>
          </tagger>
        </postParseHandlers>
      </importer>

     <!-- Decide what to do with your files by specifying a Committer. 
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>/siteconfigs/workdir/www-sitename-com/crawled-files</directory>
      </committer> -->

      <committer class="com.norconex.committer.cloudsearch.CloudSearchCommitter">

        <!-- Mandatory: -->
        <documentEndpoint>aws-cloudsearch-url</documentEndpoint>

        <!-- Mandatory if not configured elsewhere: -->
        <accessKey>
          access-key-here
        </accessKey>
        <secretKey>
          secret-key-here
        </secretKey>

        <sourceReferenceField keep="true">referenceid</sourceReferenceField>

        <!-- Optional settings: 
        <sourceReferenceField keep="[false|true]">
          (Optional name of field that contains the document reference, when 
          the default document reference is not used.  The reference value
          will be mapped to CloudSearch "id" field, which is mandatory.
          Once re-mapped, this metadata source field is 
          deleted, unless "keep" is set to true.)
        </sourceReferenceField>
        <sourceContentField keep="[false|true]">
          (If you wish to use a metadata field to act as the document 
          "content", you can specify that field here.  Default 
          does not take a metadata field but rather the document content.
          Once re-mapped, the metadata source field is deleted,
          unless "keep" is set to true.)
        </sourceContentField>
        <targetContentField>
          (CloudSearch target field name for a document content/body.
            Default is: content)
        </targetContentField>
        <commitBatchSize>
            (Max number of docs to send CloudSearch at once. If you experience
            memory problems, lower this number.  Default is 100.)
        </commitBatchSize>
        <queueDir>(Optional path where to queue files)</queueDir>
        <queueSize>
            (Max queue size before committing. Default is 1000.)
        </queueSize>
        <maxRetries>
            (Max retries upon commit failures. Default is 0.)
        </maxRetries>
        <maxRetryWait>
            (Max delay between retries. Default is 0.)
        </maxRetryWait>
        -->
      </committer>

    </crawler>
  </crawlers> 
</httpcollector>
essiembre commented 6 years ago

I am not able to say what the issue is without being able to reproduce it. Feel free to share credentials privately if you want.

About the log4j line, it should rather be like this:

log4j.logger.org.apache.http=DEBUG

Maybe with more logging, we'll get a better idea.
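
For handshake failures specifically, JSSE debugging is often more revealing than the HTTP client logs. Passing this system property to the JVM when launching the collector prints the full handshake exchange (how exactly you pass JVM options depends on the launch script you use):

-Djavax.net.debug=ssl,handshake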

jamieshiz commented 6 years ago

Sent you an email with additional details.

jamieshiz commented 6 years ago

I was able to resolve the issue by upgrading to Java 8 on the docker image I was using. Thanks for your help @essiembre

ciroppina commented 5 years ago

Hello, why do I still get the "handshake_failure" alert with the following crawler configuration and Java 8u172?

<crawler id="sac_formalazio">
    <startURLs stayOnDomain="true" stayOnPort="false" stayOnProtocol="false">
        <url>https://sac.formalazio.it/login.php</url>
    </startURLs>

    <userAgent>Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0</userAgent>

    <!-- Specify a crawler default directory where to generate files. -->
    <workDir>./tasks-output/sac_formalazio</workDir>

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
        <normalizations>
          removeTrailingSlash, secureScheme
        </normalizations>
    </urlNormalizer>

    <httpClientFactory class="$httpClientFactory">
        <cookiesDisabled>false</cookiesDisabled>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
        <expectContinueEnabled>true</expectContinueEnabled>
        <sslProtocols>SSLv3, TLSv1, TLSv1.1, TLSv1.2</sslProtocols>
    </httpClientFactory>  

    <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
    <maxDepth>10</maxDepth>

    <!-- REQUIRED for this PRL channel !!! -->
    <robotsTxt ignore="true"/>
    <robotsMeta ignore="true"/>

    <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
    <sitemapResolverFactory ignore="true" />

    <!-- Be as nice as you can with sites you crawl. -->
    <!-- delay default="2000" / -->
    <delay default="2000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
        <!-- schedule dayOfWeek="from Monday to Sunday" 
            time="from 8:00 to 20:30">86400</schedule -->
    </delay>

    <!-- keep downloaded pages/files to your filesystem './sac_formalazio/downloads/' folder -->
    <keepDownloads>false</keepDownloads>

    <!-- Optionally filter URL BEFORE any download. Classes must implement 
     com.norconex.collector.core.filter.IReferenceFilter, 
     like the following examples.
    -->
    <referenceFilters>
        <!-- exclude extension filter -->
        <filter class="$filterExtension" onMatch="exclude" >
            jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json,p7m</filter>
        <!-- regex filters -->
        <filter class="$filterRegexRef">.*sac.formalazio.*</filter>
        <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
        <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*|.*p7m.*</filter>
    </referenceFilters>

    <!-- Document importing -->
    <importer>
        <postParseHandlers>
            <!-- If your target repository does not support arbitrary fields,
       make sure you only keep the fields you need. -->
            <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
              <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
            </tagger>
            <!-- adds a constant metadata field: FromTask -->
            <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                <constant name="FromTask">sac_formalazio_task</constant>
            </tagger>
        </postParseHandlers>
    </importer>
</crawler>

By contrast, with a curl command (win64 on Windows) or the RESTClient add-on for Firefox, the page is downloaded immediately, e.g.: curl -X GET -i "https://sac.formalazio.it/login.php"
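
One way to compare what the server will negotiate against what the JVM offers is to probe it directly, for example with openssl (a diagnostic sketch; it assumes an openssl build is available on the machine running the crawler):

openssl s_client -connect sac.formalazio.it:443 -tls1_2

If this negotiates TLSv1.2 but the crawler still fails, the mismatch is usually in the cipher suites or missing SNI rather than the protocol version.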

essiembre commented 5 years ago

This ticket is closed. Please see #561.