Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

Error: java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator; #17

Closed · jmrichardson closed this issue 7 years ago

jmrichardson commented 7 years ago

Hello, when running the crawler with multiple threads, I get the following error:

INFO  [AbstractCrawler] WM Search: 100% completed (23343 processed/23343 total)
INFO  [AbstractCrawler] WM Search: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] WM Search: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 124 files
INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Sending 24 commit operations to Elasticsearch.
INFO  [AbstractCrawler] WM Search: Crawler executed in 45 minutes 15 seconds.
FATAL [JobSuite] Fatal error occured in job: WM Search
INFO  [JobSuite] Running WM Search: END (Thu Sep 21 23:08:17 EDT 2017)
FATAL [JobSuite] Job suite execution failed: WM Search
java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.extractResponseErrors(ElasticsearchCommitter.java:493)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.handleResponse(ElasticsearchCommitter.java:469)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:442)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:159)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:233)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:273)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:227)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:183)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)

When I run with 1 thread it completes successfully:

INFO  [AbstractCrawler] WM Search: 100% completed (23343 processed/23343 total)
INFO  [AbstractCrawler] WM Search: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] WM Search: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 124 files
INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Sending 24 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Done sending commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] WM Search: 23343 reference(s) processed.
INFO  [CrawlerEventManager]          CRAWLER_FINISHED
INFO  [AbstractCrawler] WM Search: Crawler completed.
INFO  [AbstractCrawler] WM Search: Crawler executed in 1 hour 4 minutes 34 seconds.
INFO  [JobSuite] Running WM Search: END (Fri Sep 22 12:15:54 EDT 2017)

In both cases I started clean by removing the index in ES and deleting the committer-queue and workdir files (just to be sure nothing was left over from previous runs). Here is my environment:

[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Filesystem Collector 2.7.2-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Collector Core 1.9.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex JEF 4.1.0 (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Committer Core 2.1.2-SNAPSHOT (Norconex Inc.)
[non-job]: 2017-09-22 12:15:54 INFO - Version: Norconex Committer Elasticsearch 4.0.0 (Norconex Inc.)

and my config file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->

<fscollector id="Text Files">

  <logsDir>${workdir}/logs</logsDir>
  <progressDir>${workdir}/progress</progressDir>

  <crawlers>
    <crawler id="WM Search">

      <workDir>${workdir}</workDir>

      <startPaths>
        <path>c:\Clients</path>
      </startPaths>

      <numThreads>1</numThreads>

      <keepDownloads>false</keepDownloads>

      <documentFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
          (.*\/~.+|.*umbs\.db|.*\.shs|.*\.lnk|.*/\%23.+)
        </filter> 
      </documentFilters>

      <importer>
        <parseErrorsSaveDir>${workdir}/errors</parseErrorsSaveDir>
        <postParseHandlers>

          <tagger class="com.norconex.importer.handler.tagger.impl.ScriptTagger">
            <script><![CDATA[
              metadata.setString('document.filename', 
              metadata.getString('document.reference').replace(/\.[^/.]+$/, "").replace(/^.*[\\\/]/,""));
            ]]></script>
          </tagger>

          <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger"
              field="crawl_date" format="yyyy-MM-dd HH:mm" />

          <tagger class="com.norconex.importer.handler.tagger.impl.TitleGeneratorTagger"
              overwrite="true" 
              titleMaxLength="60"
              detectHeading="true"
              detectHeadingMinLength="15"
              detectHeadingMaxLength="60"
              sourceCharset="(character encoding)">
          </tagger>
        </postParseHandlers>
      </importer>

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>http://localhost:9200</nodes>
        <indexName>wmsearch</indexName>
        <typeName>doc</typeName>
        <commitBatchSize>100</commitBatchSize>
      </committer>

    </crawler>
  </crawlers>

</fscollector>

I am not sure what is causing this issue. Please advise. Thanks in advance.

jmrichardson commented 7 years ago

I have reproduced the error by first committing all of my documents with the filesystem committer, then running the ES committer against the queue directory created by the FS committer. It gives the same error as above, but I don't know how to trace where the problem is. I made sure I have the latest of everything. Here is the error log:

INFO  [AbstractCollectorConfig] Configuration loaded: id=Text Files; logsDir=c:\Elastic\ingest\norconex\workdir\logs; progressDir=c:\Elastic\ingest\norconex\workdir\progress
INFO  [JobSuite] JEF work directory is: c:\Elastic\ingest\norconex\workdir\progress
INFO  [JobSuite] JEF log manager is : FileLogManager
INFO  [JobSuite] JEF job status store is : FileJobStatusStore
INFO  [AbstractCollector] Suite of 1 crawler jobs created.
INFO  [JobSuite] Initialization...
INFO  [JobSuite] Previous execution detected.
INFO  [JobSuite] Backing up previous execution status and log files.
INFO  [JobSuite] Starting execution.
INFO  [AbstractCollector] Version: Norconex Filesystem Collector 2.7.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Collector Core 1.9.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex JEF 4.1.0 (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Core 2.1.2-SNAPSHOT (Norconex Inc.)
INFO  [AbstractCollector] Version: Norconex Committer Elasticsearch 4.0.0 (Norconex Inc.)
INFO  [JobSuite] Running WM Search Commit: BEGIN (Fri Sep 22 18:36:35 EDT 2017)
INFO  [FilesystemCrawler] 0 start paths identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] WM Search Commit: Crawling references...
INFO  [AbstractCrawler] WM Search Commit: Reprocessing any cached/orphan references...
INFO  [AbstractCrawler] WM Search Commit: Crawler finishing: committing documents.
INFO  [AbstractFileQueueCommitter] Committing 11224 files
INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [AbstractCrawler] WM Search Commit: Crawler executed in 7 seconds.
FATAL [JobSuite] Fatal error occured in job: WM Search Commit
INFO  [JobSuite] Running WM Search Commit: END (Fri Sep 22 18:36:35 EDT 2017)
FATAL [JobSuite] Job suite execution failed: WM Search Commit
java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.extractResponseErrors(ElasticsearchCommitter.java:493)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.handleResponse(ElasticsearchCommitter.java:469)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:442)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
        at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
        at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
        at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
        at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
        at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:273)
        at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:227)
        at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:183)
        at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
        at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
        at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
        at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
        at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
        at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
        at com.norconex.collector.fs.FilesystemCollector.main(FilesystemCollector.java:76)

Again, the above was from running just the committer, with the command and XML below:

collector-fs.bat -a start -c c:\Elastic\ingest\norconex\config\elastic.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>

<fscollector id="Text Files">

  <logsDir>c:\Elastic\ingest\norconex\workdir\logs</logsDir>
  <progressDir>c:\Elastic\ingest\norconex\workdir\progress</progressDir>

  <crawlers>
    <crawler id="WM Search Commit">

      <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
        <nodes>http://localhost:9200</nodes>
        <indexName>wmsearch</indexName>
        <queueDir>c:\commit</queueDir>
        <ignoreResponseErrors>true</ignoreResponseErrors>
        <typeName>doc</typeName>
        <queueSize>9999999</queueSize>
        <commitBatchSize>100</commitBatchSize>
      </committer>

    </crawler>
  </crawlers>

</fscollector>

essiembre commented 7 years ago

This was caused by a conflict between two library dependencies. It was fixed as part of #16. Please confirm.
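
For readers landing on the same error: a NoSuchMethodError at runtime usually means the org.json.JSONArray class actually loaded comes from a different copy of the org.json library than the one the committer was compiled against, one whose JSONArray does not expose iterator(). Below is a minimal sketch (not the committer's actual source; the class name is made up) of the pattern that breaks in that situation:

import org.json.JSONArray;
import org.json.JSONObject;

// Hypothetical class name, for illustration only.
public class BulkResponseIterationSketch {

    public static void main(String[] args) {
        // Simplified stand-in for the "items" array of an Elasticsearch bulk response.
        JSONArray items = new JSONArray(
                "[{\"index\":{\"status\":200}},{\"index\":{\"status\":429}}]");

        // This for-each compiles against a recent org.json, where JSONArray
        // implements Iterable<Object>, so the bytecode calls JSONArray.iterator().
        // If the class loader picks up a copy of org.json without that method,
        // the JVM fails at this exact call with:
        //   java.lang.NoSuchMethodError: org.json.JSONArray.iterator()Ljava/util/Iterator;
        for (Object item : items) {
            JSONObject entry = (JSONObject) item;
            System.out.println(entry.toString());
        }
    }
}

That would also explain why the error only shows up at commit time, when the bulk response is parsed, and not at startup.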

essiembre commented 7 years ago

Once you have installed the new committer snapshot, it is possible the faulty jar is still present within the Filesystem Collector. A new release of the Filesystem Collector should be made soon, but in the meantime, if you still have the issue after installing the committer, delete this file: json-20160810.jar.
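
If it is not obvious which jar is supplying org.json.JSONArray at runtime, a small diagnostic class like the one below (a hypothetical helper, not part of any Norconex distribution) can be compiled and run against the same set of jars the collector uses. It prints the jar the class loader actually picked and whether iterator() is present:

import org.json.JSONArray;

// Hypothetical diagnostic, for troubleshooting only.
public class JsonJarProbe {

    public static void main(String[] args) {
        // The jar (or directory) from which org.json.JSONArray was loaded.
        System.out.println("JSONArray loaded from: "
                + JSONArray.class.getProtectionDomain().getCodeSource().getLocation());
        try {
            // Present only in copies of org.json where JSONArray implements Iterable.
            JSONArray.class.getMethod("iterator");
            System.out.println("iterator() is available.");
        } catch (NoSuchMethodException e) {
            System.out.println("iterator() is missing; this copy of org.json is the one shadowing the newer classes.");
        }
    }
}

Whatever jar it reports as the source of JSONArray is the one hiding the newer org.json classes the committer expects.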

essiembre commented 7 years ago

FYI, a new snapshot of the Filesystem Collector was just released without that conflicting dependency.