Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Connection Refused when trying to crawl a certain website #554

Closed: ciroppina closed this issue 5 years ago

ciroppina commented 5 years ago

Dear Sirs,

I am successfully crawling dozens of websites with the 2.8.1 Collector-Http, sending/committing their contents to my Solr 7.5.0 schema.

But one (Italian) website always returns Connection Refused at the start URL, and the collector terminates early. My config is the following:

<crawler id="lazioeuropa"> <!-- CRAWLING DOES NOT WORK: always CONNECTION REFUSED !!! -->
        <!-- UNCOMMENT THE START URLS TO MAKE THIS CRAWLER WORK -->

            <!-- Requires at least one start URL (or urlsFile). 
            Optionally limit crawling to same protocol/domain/port as 
            start URLs. 
            -->
            <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="false">
                <url>http://www.lazioeuropa.it</url>
                <url>http://www.lazioeuropa.it/sitemap/</url>
            </startURLs>

            <userAgent>Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0</userAgent>

            <!-- Specify a crawler default directory where to generate files. -->
            <workDir>./tasks-output/lazioeuropa</workDir>

            <!-- Put a maximum depth to avoid infinite crawling (default: -1). -->
            <maxDepth>10</maxDepth>

            <!-- Be as nice as you can with sites you crawl. -->
            <!-- delay default="2000" / -->
            <delay default="3000" ignoreRobotsCrawlDelay="true" class="$delayResolver">
                <!-- schedule dayOfWeek="from Monday to Sunday" 
                    time="from 8:00 to 20:30">86400</schedule -->
            </delay>

            <!-- keep downloaded pages/files to your filesystem './rl_agricoltura/downloads/' folder -->
            <keepDownloads>false</keepDownloads>      

            <!-- Optionally filter URL BEFORE any download. Classes must implement 
             com.norconex.collector.core.filter.IReferenceFilter, 
             like the following examples.
            -->
            <referenceFilters>
                <!-- exclude extension filter -->
                <filter class="$filterExtension" onMatch="exclude" >
                    jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
                <!-- regex filters -->
                <filter class="$filterRegexRef">.*lazioeuropa.*</filter>
                <filter class="$filterRegexRef">.*regione.lazio.it/binary/.*</filter>
                <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
            </referenceFilters>

            <!-- Document importing -->
            <importer>
                <postParseHandlers>
                    <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                      <fields>title,keywords,description,document.reference,document.contentType,collector.referenced-urls</fields>
                    </tagger>
                    <!-- adds a constant metadata field: FromTask -->
                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                        <constant name="FromTask">lazioeuropa_task</constant>
                    </tagger>
                </postParseHandlers>
            </importer>
        </crawler>

while the default configuration section is:

<crawlerDefaults>
        <!-- Identify yourself to sites you crawl.  It sets the "User-Agent" HTTP 
             request header value.  This is how browsers identify themselves for
             instance.  Sometimes required to be certain values for robots.txt 
             files.
          -->
        <userAgent>progetto 'KMS NUR' (2018-2019), unified_PRL HTTP Collector</userAgent>

        <numThreads>4</numThreads>

            <!-- Stop crawling after how many successfully processed documents.  
         A successful document is one that is either new or modified, that was 
         not rejected, not deleted, or did not generate any error.  As an
         example, this is a document that will end up in your search engine. 
         Default is -1 (unlimited)
            -->
        <maxDocuments>-1</maxDocuments>

        <httpClientFactory class="$httpClientFactory">
            <connectionTimeout>300000</connectionTimeout>
            <connectionRequestTimeout>300000</connectionRequestTimeout>
            <socketTimeout>120000</socketTimeout>
            <cookiesDisabled>false</cookiesDisabled>
            <trustAllSSLCertificates>true</trustAllSSLCertificates>
        </httpClientFactory>  

        <!-- Indicates if a target URL is ready for recrawl or not.
         Default implementation is the following.
        -->
        <recrawlableResolver class="$recrawlResolver" />

        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <sitemapResolverFactory ignore="true" />

        <!-- Optionally filter URL BEFORE any download. Classes must implement 
         com.norconex.collector.core.filter.IReferenceFilter, 
         like the following examples.
        -->
        <referenceFilters>
            <!-- exclude extension filter -->
            <filter class="$filterExtension" onMatch="exclude" >
                jpg,gif,png,ico,bmp,tiff,svg,jpeg,css,js,less,json</filter>
            <!-- regex filters -->
            <filter class="$filterRegexRef">.*regione.lazio.it.*</filter>
            <filter class="$filterRegexRef" onMatch="exclude">.*image.*|.*gallery.*|.*json.*|.*ical=.*|.*/css/.*</filter>
        </referenceFilters>

        <robotsMeta ignore="true" class="$robotsMeta" />

        <!-- Extract links from a document.  Classes must implement
         com.norconex.collector.http.url.ILinkExtractor. 
         Default implementation is the following.
        -->
        <linkExtractors>
            <extractor class="${linkExtractor}"  maxURLLength="2048" 
            ignoreNofollow="false" commentsEnabled="false">
                <contentTypes> <!-- all html and document content-types -->
text/html, application/xhtml+xml, vnd.wap.xhtml+xml, x-asp, application/pdf, 
application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document, 
application/vnd.openxmlformats-officedocument.wordprocessingml.template, 
application/vnd.ms-word.document.macroEnabled.12, 
application/vnd.ms-word.template.macroEnabled.12,  
application/vnd.ms-excel, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, 
application/vnd.openxmlformats-officedocument.spreadsheetml.template, 
application/vnd.ms-excel.sheet.macroEnabled.12, application/vnd.ms-excel.template.macroEnabled.12, 
application/vnd.ms-excel.addin.macroEnabled.12, application/vnd.ms-excel.sheet.binary.macroEnabled.12, 
application/vnd.ms-powerpoint, application/vnd.ms-powerpoint, 
application/vnd.openxmlformats-officedocument.presentationml.presentation, 
application/vnd.openxmlformats-officedocument.presentationml.template, 
application/vnd.openxmlformats-officedocument.presentationml.slideshow, 
application/vnd.ms-powerpoint.addin.macroEnabled.12, 
application/vnd.ms-powerpoint.presentation.macroEnabled.12, 
application/vnd.ms-powerpoint.template.macroEnabled.12, 
application/vnd.ms-powerpoint.slideshow.macroEnabled.12,  
application/vnd.ms-access, 
application/vnd.oasis.opendocument.text, application/vnd.oasis.opendocument.text-template, 
application/vnd.oasis.opendocument.text-web, application/vnd.oasis.opendocument.text-master, 
application/vnd.oasis.opendocument.graphics, application/vnd.oasis.opendocument.graphics-template, 
application/vnd.oasis.opendocument.presentation, application/vnd.oasis.opendocument.presentation-template, 
application/vnd.oasis.opendocument.spreadsheet, application/vnd.oasis.opendocument.spreadsheet-template, 
application/vnd.oasis.opendocument.chart, application/vnd.oasis.opendocument.formula, 
application/vnd.oasis.opendocument.database, application/vnd.oasis.opendocument.image, 
application/vnd.openofficeorg.extension
                </contentTypes>
                <tags>
                    <tag name="a" attribute="href" />
                    <tag name="link" attribute="href" />
                    <tag name="frame" attribute="src" />
                    <tag name="iframe" attribute="src" />
                    <tag name="img" attribute="src" />
                    <tag name="meta" attribute="http-equiv" />
                </tags>
            </extractor>
        </linkExtractors>

        <!-- Decide what to do with your files by specifying a Committer. -->
        <committer ...>

        </committer>
</crawlerDefaults>

and the log says:

unified_95_PRL_32_HTTP_32_Collector.log

abolotnov commented 5 years ago

Can you curl this site's pages from that host at all?
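
A minimal connectivity check from the crawler host could look like the following; the second command is only relevant if the host must go out through a proxy, and proxy.example.local:8080 is a placeholder, not a value from this thread:

# Plain request against the start URL; "Connection refused" here points to a
# network/firewall issue on the crawler host rather than a collector setting.
curl -v http://www.lazioeuropa.it/

# Same check routed through a proxy (placeholder address), in case the host
# only has outbound access via a proxy.
curl -v -x http://proxy.example.local:8080 http://www.lazioeuropa.it/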

essiembre commented 5 years ago

I just tested and could crawl it without problems, so I think it may be a connectivity issue, as suggested by @abolotnov.

ciroppina commented 5 years ago

Sorry, maybe the pasted image is not visible

[attached image ("immagine") did not render]


ciroppina commented 5 years ago

Solved by adding some proxy settings:

<httpClientFactory> <!-- for proxy settings -->
   <proxyHost>proxy.regione.abruzzo.it</proxyHost>
   <proxyPort>8080</proxyPort>
   <proxyScheme>http</proxyScheme>
</httpClientFactory>
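
For reference, a sketch of how those proxy settings could be combined with the httpClientFactory options already defined in the crawlerDefaults section earlier in this thread; all values below are taken from the snippets above:

<httpClientFactory class="$httpClientFactory">
    <!-- Timeout/SSL options copied from the crawlerDefaults section above -->
    <connectionTimeout>300000</connectionTimeout>
    <connectionRequestTimeout>300000</connectionRequestTimeout>
    <socketTimeout>120000</socketTimeout>
    <cookiesDisabled>false</cookiesDisabled>
    <trustAllSSLCertificates>true</trustAllSSLCertificates>
    <!-- Proxy settings from the fix above -->
    <proxyHost>proxy.regione.abruzzo.it</proxyHost>
    <proxyPort>8080</proxyPort>
    <proxyScheme>http</proxyScheme>
</httpClientFactory>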

Closing