Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers that collect, parse, and manipulate data from the web or filesystems and store it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

HTTP Collector v3 RC1 - REJECTED_BAD_STATUS > Connection Pool Shut Down #770

Closed: adesso-thomas-lippitsch closed this issue 2 years ago

adesso-thomas-lippitsch commented 2 years ago

I am experiencing a strange problem with the HTTP Collector v3 RC1, which could be a bug.

This is an example config based on the minimal setup included in the examples folder of Norconex v3 RC1:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<httpcollector id="Minimum Config HTTP Collector">

  <workDir>./workdir/example</workDir>
  <maxConcurrentCrawlers>1</maxConcurrentCrawlers>

  <crawlerDefaults>

    <robotsTxt ignore="true" />
    <robotsMeta ignore="true" />
    <maxDepth>1</maxDepth>
    <sitemapResolver ignore="true" />
    <delay default="5 seconds" />

    <committers>
      <committer class="XMLFileCommitter">
        <indent>4</indent>
      </committer>
    </committers>

  </crawlerDefaults>

  <crawlers>
    <crawler id="Example 1">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.example.com</url>
      </startURLs>
    </crawler>
    <crawler id="Example 2">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.example1.com</url>
      </startURLs>
    </crawler>
    <crawler id="Example 3">
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>http://quotes.toscrape.com/tag/inspirational/</url>
      </startURLs>
    </crawler>
  </crawlers>

</httpcollector>

This configuration throws the following error for all crawlers except the first one ("Example 1"):

09:30:36.845 [Example 3#1] ERROR HttpFetchClient - Fetcher GenericHttpFetcher failed to execute request.
com.norconex.collector.http.fetch.HttpFetchException: Could not fetch document: http://quotes.toscrape.com/tag/inspirational/
        at com.norconex.collector.http.fetch.impl.GenericHttpFetcher.fetch(GenericHttpFetcher.java:492) ~[norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.http.fetch.HttpFetchClient.fetch(HttpFetchClient.java:102) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.http.pipeline.importer.HttpFetchStage.executeStage(HttpFetchStage.java:50) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.http.pipeline.importer.AbstractHttpMethodStage.executeStage(AbstractHttpMethodStage.java:45) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91) [norconex-commons-lang-2.0.0-RC1.jar:2.0.0-RC1]
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:375) [norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        at com.norconex.collector.core.crawler.Crawler.processNextQueuedCrawlData(Crawler.java:605) [norconex-collector-core-2.0.0-RC1.jar:2.0.0-RC1]
        at com.norconex.collector.core.crawler.Crawler.processNextReference(Crawler.java:550) [norconex-collector-core-2.0.0-RC1.jar:2.0.0-RC1]
        at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:917) [norconex-collector-core-2.0.0-RC1.jar:2.0.0-RC1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_311]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_311]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_311]
Caused by: java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34) ~[httpcore-4.4.13.jar:4.4.13]
        at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:196) ~[httpcore-4.4.13.jar:4.4.13]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:268) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.10.jar:4.5.10]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.10.jar:4.5.10]
        at com.norconex.collector.http.fetch.impl.GenericHttpFetcher.fetch(GenericHttpFetcher.java:425) ~[norconex-collector-http-3.0.0-RC1.jar:3.0.0-RC1]
        ... 13 more
09:30:36.846 [Example 3#1] INFO  REJECTED_BAD_STATUS - http://quotes.toscrape.com/tag/inspirational/ - 0 null - GenericHttpFetcher

But as soon as I set <maxConcurrentCrawlers>-1</maxConcurrentCrawlers>, it works like a charm. The same is true when the config contains only a single crawler.
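
For completeness, that workaround is the only change to the config above; everything else stays identical:

  <workDir>./workdir/example</workDir>
  <!-- Workaround: -1 removes the limit so all configured crawlers start together -->
  <maxConcurrentCrawlers>-1</maxConcurrentCrawlers>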

Something seems to go wrong when there is more than one crawler in the config and they are not all started at once (i.e., when maxConcurrentCrawlers caps how many run concurrently).
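
For reference, the "Connection pool shut down" message itself is generic Apache HttpClient behaviour rather than anything Norconex-specific: once a connection manager has been shut down, any request executed through a client that still uses it fails exactly like in the stack trace above. A minimal standalone sketch (plain HttpClient 4.5; the class and variable names are made up for illustration):

import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PoolShutDownDemo {
    public static void main(String[] args) throws Exception {
        PoolingHttpClientConnectionManager cm =
                new PoolingHttpClientConnectionManager();
        CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build();

        // Works: the pool is still open.
        client.execute(new HttpGet("http://quotes.toscrape.com/")).close();

        // Shutting down the shared connection manager invalidates the pool
        // for everything still holding a reference to the client.
        cm.shutdown();

        // Fails with java.lang.IllegalStateException: Connection pool shut down
        // (thrown from AbstractConnPool.lease, as in the trace above).
        client.execute(new HttpGet("http://quotes.toscrape.com/")).close();
    }
}

That would be consistent with the later-starting crawlers ending up with a client whose pool was already shut down, but that is only my guess.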

Best regards, Tom

essiembre commented 2 years ago

FYI, I was able to reproduce. Working on a fix.

essiembre commented 2 years ago

I just made a new snapshot release with a fix. Please try and confirm.

adesso-alex commented 2 years ago

Hi @essiembre,

I can confirm that this is now fixed with the newest snapshot release.

Thank you very much for the quick fix! :)