Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

HTTP collector never exits when committing to elasticsearch #3

Closed niels closed 7 years ago

niels commented 8 years ago

When committing to Elasticsearch (see the config below), the collector-http.sh script never terminates, even though the crawler run has already ended. I have to kill the process manually using CTRL+C or kill.

This is using the norconex-collector-http-2.4.0-20151209.033143-7 snapshot and norconex-committer-elasticsearch-2.0.1 against Elasticsearch 1.7.3. I understand that there might be an incompatibility between the committer and ES 1.7, but the documents are committed to ES just fine. Please close this issue if it relates to #2 after all.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <maxDocuments>1</maxDocuments>

      <committer
        class="com.norconex.committer.elasticsearch.ElasticsearchCommitter"
      >
        <indexName>crawler_test</indexName>
        <typeName>test_doc</typeName>
        <queueSize>1</queueSize>
        <commitBatchSize>1</commitBatchSize>
      </committer>

      <startURLs stayOnDomain="true">
        <url>https://biertastisch.de</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Normal output:

$ ./collector-http.sh -a start -c test.xml
INFO  [AbstractCollectorConfig] Configuration loaded: id=test-collector; logsDir=./logs; progressDir=./progress
INFO  [JobSuite] JEF work directory is: ./progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] No previous execution detected.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.4.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.3.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.4.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.0.7 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.0.3 (Norconex Inc.)
INFO  [AbstractCollector] Version: "ElasticsearchCommitter" version is undefined.
INFO  [JobSuite] Running test-crawler: BEGIN (Wed Dec 09 18:11:13 CET 2015)
INFO  [MapDBCrawlDataStore] Initializing reference store ./work/crawlstore/mapdb/test-crawler/
INFO  [MapDBCrawlDataStore] ./work/crawlstore/mapdb/test-crawler/: Done initializing databases.
INFO  [HttpCrawler] test-crawler: RobotsTxt support: true
INFO  [HttpCrawler] test-crawler: RobotsMeta support: true
INFO  [HttpCrawler] test-crawler: Sitemap support: true
INFO  [HttpCrawler] test-crawler: Canonical links support: true
INFO  [HttpCrawler] test-crawler: User-Agent: <None specified>
INFO  [SitemapStore] test-crawler: Initializing sitemap store...
INFO  [SitemapStore] test-crawler: Done initializing sitemap store.
ERROR [StandardSitemapResolver] Could not obtain sitemap: https://biertastisch.de/sitemap.xml.  Expected status code 200, but got 301
ERROR [StandardSitemapResolver] Could not obtain sitemap: https://biertastisch.de/sitemap_index.xml.  Expected status code 200, but got 301
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] test-crawler: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://biertastisch.de
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://biertastisch.de
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/datenschutz
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/datenschutz
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/copyright
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/copyright
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/warenkorb
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/warenkorb
INFO  [CrawlerEventManager]           REJECTED_FILTER: https://biertastisch.de/info/agb
INFO  [CrawlerEventManager]       REJECTED_ROBOTS_TXT: https://biertastisch.de/info/agb
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://biertastisch.de
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://biertastisch.de
INFO  [AbstractCommitter] Max queue size reached (1). Committing
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [InternalNode] [Scarecrow] version[1.5.0], pid[1496], build[5448160/2015-03-23T14:30:58Z]
INFO  [InternalNode] [Scarecrow] initializing ...
INFO  [PluginsService] [Scarecrow] loaded [], sites []
INFO  [InternalNode] [Scarecrow] initialized
INFO  [InternalNode] [Scarecrow] starting ...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [TransportService] [Scarecrow] bound_address {inet[/0:0:0:0:0:0:0:0:9303]}, publish_address {inet[/192.168.178.27:9303]}
INFO  [DiscoveryService] [Scarecrow] elasticsearch/cpjCKhykSGmGpUAkH-aw2g
INFO  [CrawlerEventManager]            URLS_EXTRACTED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [CrawlerEventManager]         DOCUMENT_IMPORTED: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [AbstractCommitter] Max queue size reached (1). Committing
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [InternalNode] [Artie] version[1.5.0], pid[1496], build[5448160/2015-03-23T14:30:58Z]
INFO  [InternalNode] [Artie] initializing ...
INFO  [PluginsService] [Artie] loaded [], sites []
INFO  [InternalNode] [Artie] initialized
INFO  [InternalNode] [Artie] starting ...
INFO  [TransportService] [Artie] bound_address {inet[/0:0:0:0:0:0:0:0:9304]}, publish_address {inet[/192.168.178.27:9304]}
INFO  [DiscoveryService] [Artie] elasticsearch/EmR8maD2R3e0jyHtjq2RCw
INFO  [InternalClusterService$UpdateTask] [Scarecrow] detected_master [Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]], added {[Belathauzer][HGPuMZyJTK2oLzLgIXnAKw][ordielite][inet[/192.168.178.27:9302]]{client=true, data=false},[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]],[Sweetface][X14YYANjSf25y7d5W8wAlg][ordielite][inet[/192.168.178.27:9301]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [HttpServer] [Scarecrow] bound_address {inet[/0:0:0:0:0:0:0:0:9203]}, publish_address {inet[/192.168.178.27:9203]}
INFO  [InternalNode] [Scarecrow] started
INFO  [InternalClusterService$UpdateTask] [Artie] detected_master [Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]], added {[Belathauzer][HGPuMZyJTK2oLzLgIXnAKw][ordielite][inet[/192.168.178.27:9302]]{client=true, data=false},[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]],[Sweetface][X14YYANjSf25y7d5W8wAlg][ordielite][inet[/192.168.178.27:9301]]{client=true, data=false},[Scarecrow][cpjCKhykSGmGpUAkH-aw2g][ordielite][inet[/192.168.178.27:9303]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [InternalClusterService$UpdateTask] [Scarecrow] added {[Artie][EmR8maD2R3e0jyHtjq2RCw][ordielite][inet[/192.168.178.27:9304]]{client=true, data=false},}, reason: zen-disco-receive(from master [[Book][o73B3EdZQSaYRomBwSeP7A][ordielite][inet[/192.168.178.27:9300]]])
INFO  [HttpServer] [Artie] bound_address {inet[/0:0:0:0:0:0:0:0:9204]}, publish_address {inet[/192.168.178.27:9204]}
INFO  [InternalNode] [Artie] started
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://biertastisch.de/bier-pakete/das-kolner-bucht-paket
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [CrawlerEventManager]    DOCUMENT_COMMITTED_ADD: https://biertastisch.de
INFO  [AbstractCrawler] test-crawler: 3% completed (2 processed/60 total)
INFO  [AbstractCrawler] test-crawler: Maximum documents reached: 1
INFO  [AbstractCrawler] test-crawler: Maximum documents reached: 1
INFO  [AbstractCrawler] test-crawler: Max documents reached. Not reprocessing orphans (if any).
INFO  [AbstractCrawler] test-crawler: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 1 files
INFO  [ElasticsearchCommitter] Sending 1 operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending operations to Elasticsearch.
INFO  [AbstractCrawler] test-crawler: 2 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] test-crawler: Crawler completed.
INFO  [AbstractCrawler] test-crawler: Crawler executed in 10 seconds.
INFO  [MapDBCrawlDataStore] Closing reference store: ./work/crawlstore/mapdb/test-crawler/
INFO  [JobSuite] Running test-crawler: END (Wed Dec 09 18:11:13 CET 2015)

The much longer debug output can be found here. Note that everything following INFO [JobSuite] Running test-crawler: END (Wed Dec 09 18:24:12 CET 2015) is only printed to the console and not appended to the log file.

essiembre commented 8 years ago

Can you try with the new 2.0.2 release, which was recently updated to support Elasticsearch 1.7.4?

If you are interested in using Elasticsearch 2.1, there is now a new 2.1.0 snapshot release that supports it as well.

essiembre commented 8 years ago

@niels, is this ticket still relevant or can we close?

pascaldimassimo commented 8 years ago

I'd like to keep it open because I want to reproduce it.

pascaldimassimo commented 8 years ago

I was able to reproduce this issue. Using the latest snapshot of norconex-collector-http-2.4.0, I ran the Elasticsearch committer 2.0.2 against Elasticsearch server 1.7.4 and got the same issue. But using the latest 2.1.0 snapshot of the committer against Elasticsearch 2.1.1 did not exhibit the issue. Since we did not specifically address this issue in version 2.1.0 of the committer, my guess is that the problem lies in one of the dependencies. Maybe that dependency was upgraded to a later version in the latest committer, which fixed the issue. I will need to do some tests to confirm (or refute) that theory.

pascaldimassimo commented 8 years ago

I did some further tests. I never noticed before that the ES Node client has a close() method. If I call it after committing documents, the crawler terminates properly with the ES committer 2.0.2. So it seems that when using ES 1.7 within a client application, we must call this close() method for the client to terminate properly. It also looks like ES 2.1 automatically detects that the client is going away and terminates properly without an explicit close() call (though relying on that might not be good practice, because the latest documentation still says to call close() when done with the client).

So it would be best to add a call to this close() method when we are done with the client, for both ES committer 2.0.x (using ES 1.7) and 2.1.x (using ES 2.1). The problem is knowing when it's time to call close(). I have not seen a close() method or anything similar on the ICommitter or AbstractCommitter classes. @essiembre, is there any way for a committer to know when crawling is done so it can do proper cleanup?
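The hang described above is consistent with the Node client holding non-daemon worker threads: the JVM cannot exit while such threads are alive, so the collector appears stuck until close() is called or the process is killed. A minimal, self-contained sketch of that mechanism (FakeNodeClient is a hypothetical stand-in, not the actual ES client):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HangingClientDemo {

    // Hypothetical stand-in for a client (like the ES 1.7 Node client)
    // that owns non-daemon worker threads.
    static final class FakeNodeClient implements AutoCloseable {
        // Executors.newFixedThreadPool creates non-daemon threads by default.
        private final ExecutorService pool = Executors.newFixedThreadPool(2);

        void send(Runnable op) {
            pool.submit(op);
        }

        boolean isClosed() {
            return pool.isTerminated();
        }

        @Override
        public void close() throws InterruptedException {
            // Without this shutdown, the pool's threads keep the JVM alive.
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        FakeNodeClient client = new FakeNodeClient();
        client.send(() -> System.out.println("operation sent"));
        client.close(); // comment this out and the JVM never exits
        System.out.println("clean exit");
    }
}
```

Running main with the close() call commented out reproduces the symptom: the crawl logic finishes, but the process never terminates.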

essiembre commented 8 years ago

Normally, commit() is only called at the very end by crawlers. The challenge is that most committers extend AbstractBatchCommitter and call commit() internally once in a while during the process. So we would have to change the design a bit so that committer implementations do not invoke commit() directly from internal code, but rather a new method such as internalCommit(); then we would know commit() is only called at the end. I am not sure how feasible that is with the Elasticsearch committer right now without going back and changing committer-core (which I think is what will need to happen).

The only "hack" I can imagine in the meantime is copying the content of most of the abstract classes the Elasticsearch committer uses and fixing this just for the Elasticsearch committer.

I will create a new ticket under committer-core to address this.
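The proposed split could be sketched roughly as follows. This is a hypothetical illustration, not the actual committer-core API: SketchBatchCommitter, sendBatch(), and cleanup() are made-up names; only internalCommit() mirrors the method name suggested above. Periodic flushes go through internalCommit(), while commit() is reserved for the crawler's single final call, where resources such as an ES client can be closed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: internalCommit() handles periodic batch flushes,
// and commit() is called once by the crawler at the end, triggering cleanup.
abstract class SketchBatchCommitter {
    private final List<String> queue = new ArrayList<>();
    private final int batchSize;

    SketchBatchCommitter(int batchSize) {
        this.batchSize = batchSize;
    }

    void add(String ref) {
        queue.add(ref);
        if (queue.size() >= batchSize) {
            internalCommit(); // periodic flush during crawling
        }
    }

    // Flushes the current batch; called internally, never by the crawler.
    protected void internalCommit() {
        if (!queue.isEmpty()) {
            sendBatch(new ArrayList<>(queue));
            queue.clear();
        }
    }

    // Called once by the crawler at the very end: final flush, then cleanup.
    public final void commit() {
        internalCommit();
        cleanup(); // e.g. an Elasticsearch committer would close its client here
    }

    protected abstract void sendBatch(List<String> batch);

    protected abstract void cleanup();
}
```

With this shape, a committer knows unambiguously when crawling is done, because commit() fires exactly once at the end.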

essiembre commented 7 years ago

The new 4.0.0 snapshot (which uses the REST API) fixes this issue.

essiembre commented 7 years ago

4.0.0 released.