Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Connection Refused when trying to crawl a certain website #554

Closed: ciroppina closed this issue 5 years ago

ciroppina commented 5 years ago

Dear Sirs,

I am successfully crawling dozens of websites with the 2.8.1 Collector-Http, sending/committing their contents to my Solr 7.5.0 schema.

But one (Italian) website always returns Connection Refused at the start URL, and the collector terminates early. My config is the following:

<crawler id="lazioeuropa"> <!-- CRAWLING DOES NOT WORK: always CONNECTION REFUSED !!! -->
        <!-- UNCOMMENT THE START URLS TO MAKE THIS CRAWLER WORK -->

            <!-- Requires at least one start URL (or urlsFile). 
            Optionally limit crawling to same protocol/domain/port as 
            start URLs. 
            -->
            <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
                <url>http://www.lazioeuropa.it</url>
                <url>http://www.lazioeuropa.it/sitemap/</url>
            </startURLs>

            <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0</userAgent>

            <!-- Specify a crawler default directory where to generate files. -->
            <workDir>./tasks-output/lazioeuropa</workDir>

            <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
            <maxDepth>10</maxDepth>

            <!-- Be as nice as you can with sites you crawl. -->
            <!-- delay default="2000" / -->
            <delay default="3000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
                <!-- schedule dayOfWeek="from Monday to Sunday" 
                    time="from 8:00 to 20:30">86400</schedule -->
            </delay>

            <!-- keep downloaded pages/files to your filesystem './rl_agricoltura/downloads/' folder -->
            <keepDownloads>false</keepDownloads>      

            <!-- Optionally filter URL BEFORE any download. Classes must implement 
             com.norconex.collector.core.filter.IReferenceFilter, 
             like the following examples.
            -->
            <referenceFilters>
                <!-- exclude extension filter -->
                <filter class="$filterExtension" onMatch="exclude" >
                    jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
                <!-- regex filters -->
                <filter class="$filterRegexRef">.*lazioeuropa.*</filter>
                <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
                <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
            </referenceFilters>

            <!-- Document importing -->
            <importer>
                <postParseHandlers>
                    <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                      <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
                    </tagger>
                    <!-- adds a constant metadata field: FromTask -->
                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                        <constant name="FromTask">lazioeuropa_task</constant>
                    </tagger>
                </postParseHandlers>
            </importer>
        </crawler>

while the default configuration section is:

<crawlerDefaults>
        <!-- Identify yourself to sites you crawl.  It sets the "User-Agent" HTTP 
             request header value.  This is how browsers identify themselves for
             instance.  Sometimes required to be certain values for robots.txt 
             files.
          -->
        <userAgent>progetto 'KMS NUR' (2018-2019), unified_PRL HTTP Collector</userAgent>

        <numThreads>4</numThreads>

            <!-- Stop crawling after how many successfully processed documents.  
         A successful document is one that is either new or modified, that was 
         not rejected, not deleted, or did not generate any error.  As an
         example, this is a document that will end up in your search engine. 
         Default is -1 (unlimited)
            -->
        <maxDocuments>-1</maxDocuments>

        <httpClientFactory class="$httpClientFactory">
            <connectionTimeout>300000</connectionTimeout>
            <connectionRequestTimeout>300000</connectionRequestTimeout>
            <socketTimeout>120000</socketTimeout>
            <cookiesDisabled>false</cookiesDisabled>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
        </httpClientFactory>  

        <!-- Indicates if a target URL is ready for recrawl or not.
         Default implementation is the following.
        -->
        <recrawlableResolver class="$recrawlResolver" />

        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <sitemapResolverFactory ignore="true" />

        <!-- Optionally filter URL BEFORE any download. Classes must implement 
         com.norconex.collector.core.filter.IReferenceFilter, 
         like the following examples.
        -->
        <referenceFilters>
            <!-- exclude extension filter -->
            <filter class="$filterExtension" onMatch="exclude" >
                jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
            <!-- regex filters -->
            <filter class="$filterRegexRef">.*regione.lazio.it.*</filter>
            <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
        </referenceFilters>

        <robotsMeta ignore="true" class="$robotsMeta" />

        <!-- Extract links from a document.  Classes must implement
         com.norconex.collector.http.url.ILinkExtractor. 
         Default implementation is the following.
        -->
        <linkExtractors>
            <extractor class="${linkExtractor}"  maxURLLength="2048" 
            ignoreNofollow="false" commentsEnabled="false">
                <contentTypes> <!-- all html and document content-types -->
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp, application/pdf, 
application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document, 
application/vnd.openxmlformats-officedocument.wordprocessingml.template, 
application/vnd.ms-word.document.macroEnabled.12, 
application/vnd.ms-word.template.macroEnabled.12,  
application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, 
application/vnd.openxmlformats-officedocument.spreadsheetml.template, 
application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12, 
application/vnd.ms-excel.addin.macroEnabled.12, application/vnd.ms-excel.sheet.binary.macroEnabled.12, 
application/vnd.ms-powerpoint, application/vnd.ms-powerpoint, 
application/vnd.openxmlformats-officedocument.presentationml.presentation, 
application/vnd.openxmlformats-officedocument.presentationml.template, 
application/vnd.openxmlformats-officedocument.presentationml.slideshow, 
application/vnd.ms-powerpoint.addin.macroEnabled.12, 
application/vnd.ms-powerpoint.presentation.macroEnabled.12, 
application/vnd.ms-powerpoint.template.macroEnabled.12, 
application/vnd.ms-powerpoint.slideshow.macroEnabled.12,  
application/vnd.ms-access, 
application/vnd.oasis.opendocument.text, application/vnd.oasis.opendocument.text-template, 
application/vnd.oasis.opendocument.text-web, application/vnd.oasis.opendocument.text-master, 
application/vnd.oasis.opendocument.graphics, application/vnd.oasis.opendocument.graphics-template, 
application/vnd.oasis.opendocument.presentation, application/vnd.oasis.opendocument.presentation-template, 
application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.spreadsheet-template, 
application/vnd.oasis.opendocument.chart, application/vnd.oasis.opendocument.formula, 
application/vnd.oasis.opendocument.database, application/vnd.oasis.opendocument.image, 
application/vnd.openofficeorg.extension
                </contentTypes>
                <tags>
                    <tag name="a" attribute="href" />
                    <tag name="link" attribute="href" />
                    <tag name="frame" attribute="src" />
                    <tag name="iframe" attribute="src" />
                    <tag name="img" attribute="src" />
                    <tag name="meta" attribute="http-equiv" />
                </tags>
            </extractor>
        </linkExtractors>

        <!-- Decide what to do with your files by specifying a Committer. -->
        <committer ...>

        </committer>
</crawlerDefaults>

and the log says:

unified_95_PRL_32_HTTP_32_Collector.log

abolotnov commented 5 years ago

Can you curl this site's pages from that host at all?
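
A minimal connectivity check from the crawler host could look like the following; the second command is only relevant if the host must go out through a proxy, and proxy.example.local:8080 is a placeholder, not a value from this thread:

# Plain request against the start URL; "Connection refused" here points to a
# network/firewall issue on the crawler host rather than a collector setting.
curl -v http://www.lazioeuropa.it/

# Same check routed through a proxy (placeholder address), in case the host
# only has outbound access via a proxy.
curl -v -x http://proxy.example.local:8080 http://www.lazioeuropa.it/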

essiembre commented 5 years ago

I just tested and could crawl it without problems, so I think it may be a connectivity issue, as suggested by @abolotnov.

ciroppina commented 5 years ago

Sorry, maybe the pasted image is not visible

[attached image ("immagine") did not render]


ciroppina commented 5 years ago

Solved by adding some proxy settings:

<httpClientFactory> <!-- for proxy settings -->
   <proxyHost>proxy.regione.abruzzo.it</proxyHost>
   <proxyPort>8080</proxyPort>
   <proxyScheme>http</proxyScheme>
</httpClientFactory>
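
For reference, a sketch of how those proxy settings could be combined with the httpClientFactory options already defined in the crawlerDefaults section earlier in this thread; all values below are taken from the snippets above:

<httpClientFactory class="$httpClientFactory">
    <!-- Timeout/SSL options copied from the crawlerDefaults section above -->
    <connectionTimeout>300000</connectionTimeout>
    <connectionRequestTimeout>300000</connectionRequestTimeout>
    <socketTimeout>120000</socketTimeout>
    <cookiesDisabled>false</cookiesDisabled>
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
    <!-- Proxy settings from the fix above -->
    <proxyHost>proxy.regione.abruzzo.it</proxyHost>
    <proxyPort>8080</proxyPort>
    <proxyScheme>http</proxyScheme>
</httpClientFactory>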

Closing