Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

No error when connecting to Elasticsearch even though the Elasticsearch service is stopped? #26

Closed lemmikens closed 6 years ago

lemmikens commented 6 years ago

Hi, I've been messing around with the Elasticsearch committer for a week or so, and for the life of me, I can't get the collector to commit to Elasticsearch. There is no error when I run the collector, which makes it very difficult to troubleshoot... I have a feeling I'm just missing something small, but it could possibly be a bug.

Below is the XML config:

<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- Requires at least one start URL (or urlsFile).
           Optionally limit crawling to same protocol/domain/port as
           start URLs. -->
      <startURLs stayOnDomain="false" stayOnPort="true" stayOnProtocol="true">
        <url>http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf</url>
        <!--<urlsFile>/home/ec2-user/finalWebsiteList.txt</urlsFile> -->
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/testBucket</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>-1</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="false" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <indexName>brooks-test-app1</indexName>
        <typeName>test</typeName>
        <nodes>http://localhost:9200</nodes>
        <ignoreResponseErrors>false</ignoreResponseErrors>
        <queueSize>10</queueSize>
        <commitBatchSize>10</commitBatchSize>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

and here are the logs when I run it:

INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./examples-output/minimum/logs; progressDir=./examples-output/minimum/progress
INFO  [JobSuite] JEF work directory is: ./examples-output/minimum/progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex HTTP Collector 2.7.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.8.2 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.7.2 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.1 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Elasticsearch 4.0.0 (Norconex Inc.)
INFO  [JobSuite] Running Norconex Minimum Test Page: BEGIN (Mon Nov 20 16:43:33 UTC 2017)
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsTxt support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: RobotsMeta support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Sitemap support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: Canonical links support: true
INFO  [HttpCrawler] Norconex Minimum Test Page: User-Agent: <None specified>
INFO  [SitemapStore] Norconex Minimum Test Page: Initializing sitemap store...
INFO  [SitemapStore] Norconex Minimum Test Page: Done initializing sitemap store.
INFO  [StandardSitemapResolver] Resolving sitemap: http://s3.amazonaws.com/sitemap.xml
INFO  [StandardSitemapResolver]          Resolved: http://s3.amazonaws.com/sitemap.xml
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawling references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/AGENDA-_Biodiversity_Protection-_Implementation_and_Reform_of_the12.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: Reprocessing any cached/orphan references...
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/Coyote_pet-graphic.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 50% completed (2 processed/4 total)
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/coyote-killing-infographic.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 75% completed (3 processed/4 total)
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/jouranimallawvol4_p59.pdf
INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]            URLS_EXTRACTED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [CrawlerEventManager]       REJECTED_UNMODIFIED: http://s3.amazonaws.com/brooks-institute-test-bucket/naturecona92queensland.pdf
INFO  [AbstractCrawler] Norconex Minimum Test Page: 100% completed (5 processed/5 total)
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: 5 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler completed.
INFO  [AbstractCrawler] Norconex Minimum Test Page: Crawler executed in 20 seconds.
INFO  [SitemapStore] Norconex Minimum Test Page: Closing sitemap store...
INFO  [JobSuite] Running Norconex Minimum Test Page: END (Mon Nov 20 16:43:33 UTC 2017)
essiembre commented 6 years ago

I am not sure if that is your main issue, but your logs show that nothing is being sent to Elasticsearch because every file is being rejected as unmodified since the previous run. The crawler does "incremental" indexing: on subsequent runs, only documents that are new, modified, or deleted are sent to your Committer.
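A related tip while testing a Committer: the incremental behavior can be turned off so that every run resends all documents. A sketch, assuming the 2.x `disabled` attribute on the checksummers (verify these exact element/attribute names against your version's documentation):

```xml
<!-- Inside the <crawler> element: disabling both checksummers turns off
     change detection, so every crawled document is sent to the Committer
     on every run (useful while debugging, wasteful in production). -->
<metadataChecksummer disabled="true" />
<documentChecksummer disabled="true" />
```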

To start fresh and forget about previous runs, delete your "workdir" and start again (or at a minimum, delete the "crawlstore").

Please give it a try and confirm.
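For the config posted above, starting fresh would look something like this (the crawl store location is an assumption based on the default layout, where it lives under the workDir):

```shell
# Start fresh: remove the crawler work directory from the config above
# so the collector forgets previous runs and resends everything.
rm -rf ./examples-output/testBucket

# Less drastic alternative: delete only the crawl store database
# (assumed to live under the workDir in the default layout).
# rm -rf ./examples-output/testBucket/crawlstore
```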

lemmikens commented 6 years ago

Deleting the work directory portion of the XML did it! Thank you! It must have been trying to commit to that instead of Elasticsearch.

essiembre commented 6 years ago

I was actually suggesting to delete the "workdir" directory itself, not the config entry. :-) Removing the config entry made the crawler fall back to the default location, which appears to the Collector as a clean workdir. That's fine for now, but if you need to do this again, you will have to delete the directory next time.
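On the original title question, a quick way to confirm the node configured in `<nodes>` is actually up before suspecting the Committer (a sketch, assuming the default `http://localhost:9200` endpoint from the config above):

```shell
# Probe the Elasticsearch HTTP endpoint; curl fails fast when the
# service is stopped, instead of the crawl silently "succeeding".
if curl -s --max-time 5 http://localhost:9200 >/dev/null; then
  echo "Elasticsearch is reachable"
else
  echo "Elasticsearch is NOT reachable (service stopped or wrong host/port?)"
fi
```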