Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

java.lang.NoSuchFieldError: LUCENE_3_6 #1

Closed yvesnyc closed 9 years ago

yvesnyc commented 9 years ago

I am having trouble with the Elasticsearch committer. The crawler works fine, but when it tries to send documents to Elasticsearch it gets a "java.lang.NoSuchFieldError: LUCENE_3_6". I've tried looking around for the source of this error but ran out of ideas.

Here is the Exception output of the crawler:

INFO  - AbstractCrawler            - Crawler #1: Crawling references...
INFO  - AbstractCrawler            - Crawler #1: Deleting orphan references (if any)...
INFO  - AbstractCrawler            - Crawler #1: Deleted 0 orphan URLs...
INFO  - AbstractCrawler            - Crawler #1: Crawler finishing: committing documents.
INFO  - AbstractFileQueueCommitter - Committing 2 files
INFO  - ElasticsearchCommitter     - Sending 2 operations to Elasticsearch.
INFO  - AbstractCrawler            - Crawler #1: Crawler executed in 0 second.
INFO  - MapDBCrawlDataStore        - Closing reference store: ./crawl-output/myapp/crawlstore/mapdb/Crawler_32__35_1/
FATAL - JobSuite                   - Fatal error occured in job: Crawler #1
INFO  - JobSuite                   - Running Crawler #1: END (Mon Jun 01 17:00:57 EDT 2015)
FATAL - JobSuite                   - Job suite execution failed: Crawler #1
java.lang.NoSuchFieldError: LUCENE_3_6
    at org.elasticsearch.Version.<clinit>(Version.java:43)
    at org.elasticsearch.node.internal.InternalNode.<init>(InternalNode.java:138)
    at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:159)
    at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:166)
    at com.norconex.committer.elasticsearch.DefaultClientFactory.createClient(DefaultClientFactory.java:44)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:182)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
    at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:232)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:245)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:206)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:352)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:302)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:172)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)

Any ideas?

Thanks

pascaldimassimo commented 9 years ago

Hi @yvesnyc !

Can you provide some details about your setup, please? For example, what versions of the HTTP Collector and the Elasticsearch server are you using? Also, if you could share your HTTP Collector configuration files, that would be helpful.

Thanks!

yvesnyc commented 9 years ago

Hi @pascaldimassimo ,

Thanks for working on this.

I am using norconex-collector-http-2.1.0 and norconex-committer-elasticsearch-2.0.1, and I'm running Elasticsearch 1.5.2. The configuration file is later modified in code to set the max depth, starting URL, and regex filter. Here is the config.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!--
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<httpcollector id="Complex Crawl">

    #set($http = "com.norconex.collector.http")
    #set($core = "com.norconex.collector.core")
    #set($urlNormalizer   = "${http}.url.impl.GenericURLNormalizer")
    #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
    #set($filterRegexRef  = "${core}.filter.impl.RegexReferenceFilter")

    <progressDir>${workdir}/progress</progressDir>
    <logsDir>${workdir}/logs</logsDir>

    <crawlerDefaults>

        <urlNormalizer class="$urlNormalizer" />
        <numThreads>4</numThreads>
        <maxDepth>2</maxDepth>
        <workDir>$workdir</workDir>
        <orphansStrategy>DELETE</orphansStrategy>
        <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
        <sitemap ignore="true" />

        <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
            <indexName>RepositoryRequestCrawl</indexName>
            <typeName>webdoc</typeName>
            <clusterName>es-crawler</clusterName>
        </committer>

        <referenceFilters>
            <!-- Note we ignore PowerPoint and other possible repository descriptors -->
            <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,sit,eps,wmf,ppt,mpg,xls,rpm,mov,exe,jpeg,bmp</filter>
            <filter class="$filterRegexRef" onMatch="include">http://([a-z0-9]+\.)*redis\.io(/.*)?</filter>
        </referenceFilters>

    </crawlerDefaults>

    <crawlers>

        <crawler id="Norconex Redis Crawl">
            <startURLs>
                <url>http://redis.io</url>
            </startURLs>
        </crawler>

    </crawlers>

</httpcollector>

This is the config.variables file:

workdir = ./crawl-output/mycrawler
pascaldimassimo commented 9 years ago

I think the issue is caused by a Lucene version mismatch. In norconex-committer-elasticsearch-2.0.1, we include lucene-core-4.10.4.jar, while in norconex-collector-http-2.1.0, we include lucene-core-5.0.0.jar. In Lucene 5, Version.LUCENE_3_6 no longer exists, hence the NoSuchFieldError. Try removing the Lucene 5 jar from your lib folder and using only the Lucene jars bundled with the committer to see if that fixes the issue.

@essiembre I see that the importer is now using Lucene 5. Are we using features exclusive to this version?

essiembre commented 9 years ago

It is only bundled with the Importer as a dependency of the TitleGeneratorTagger. If that tagger is not being used, one can remove the Lucene jar bundled with the Importer without issues. The TitleGeneratorTagger has quite a few dependencies; maybe it should be modified to rely on fewer of them, or to make them optional.

yvesnyc commented 9 years ago

Thanks @pascaldimassimo and @essiembre. Yes, the inclusion of the Lucene 5.0 jars was the problem.

Specifically, the solution was to remove the lucene-core-5.0.0 and lucene-analyzers-common-5.0.0 jars pulled in by the norconex-collector-http-2.1.0 dependency. This was done with exclusion directives in the dependency declaration. The norconex-committer-elasticsearch-2.0.1 dependency pulls in the 4.10.4 versions.
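For anyone hitting the same conflict, the exclusions described above might look roughly like this in a Maven POM (a sketch only: the `com.norconex.collectors` groupId is assumed from the project's published artifacts, and the Lucene coordinates are the standard `org.apache.lucene` ones):

```xml
<!-- Sketch: exclude the Lucene 5.x jars that the collector brings in
     transitively, so the committer's Lucene 4.10.4 jars are the only
     Lucene on the classpath. GroupIds are assumptions, not verified
     against this exact release. -->
<dependency>
  <groupId>com.norconex.collectors</groupId>
  <artifactId>norconex-collector-http</artifactId>
  <version>2.1.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-analyzers-common</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With the exclusions in place, `mvn dependency:tree` should show only the 4.10.4 Lucene artifacts coming in via the committer.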