Norconex / committer-elasticsearch

Implementation of Norconex Committer for Elasticsearch.
https://opensource.norconex.com/committers/elasticsearch/
Apache License 2.0

id is too long, must be no longer than 512 bytes but was: 520 #25

Closed. jacksonp2008 closed this issue 6 years ago.

jacksonp2008 commented 6 years ago

Running into an error committing to Elasticsearch. I assume this "_id" shown in Kibana is the URL of the page, i.e. the same as "document.reference".

(screenshot: the _id field as shown in Kibana, 2017-11-17)

INFO  [ElasticsearchCommitter] Sending 100 commit operations to Elasticsearch.
INFO  [ElasticsearchCommitter] Elasticsearch RestClient closed.
INFO  [AbstractCrawler] Anonymous Coward: Crawler executed in 12 seconds.
INFO  [SitemapStore] Anonymous Coward: Closing sitemap store...
ERROR [JobSuite] Execution failed for job: Anonymous Coward
com.norconex.committer.core.CommitterException: Could not commit JSON batch to Elasticsearch.
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:449)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commit(ElasticsearchCommitter.java:387)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:270)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:226)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: org.elasticsearch.client.ResponseException: POST http://10.80.99.54:9200/_bulk: HTTP/1.1 400 Bad Request
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 520;2: id is too long, must be no longer than 512 bytes but was: 520;3: id is too long, must be no longer than 512 bytes but was: 520;4: id is too long, must be no longer than 512 bytes but was: 520;5: id is too long, must be no longer than 512 bytes but was: 520;6: id is too long, must be no longer than 512 bytes but was: 520;7: id is too long, must be no longer than 512 bytes but was: 520;8: id is too long, must be no longer than 512 bytes but was: 520;9: id is too long, must be no longer than 512 bytes but was: 520;10: id is too long, must be no longer than 512 bytes but was: 558;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 520;2: id is too long, must be no longer than 512 bytes but was: 520;3: id is too long, must be no longer than 512 bytes but was: 520;4: id is too long, must be no longer than 512 bytes but was: 520;5: id is too long, must be no longer than 512 bytes but was: 520;6: id is too long, must be no longer than 512 bytes but was: 520;7: id is too long, must be no longer than 512 bytes but was: 520;8: id is too long, must be no longer than 512 bytes but was: 520;9: id is too long, must be no longer than 512 bytes but was: 520;10: id is too long, must be no longer than 512 bytes but was: 558;"},"status":400}
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:354)
    at org.elasticsearch.client.RestClient$1.completed(RestClient.java:343)
    at org.apache.http.concurrent.BasicFuture.completed(BasicFuture.java:119)
    at org.apache.http.impl.nio.client.DefaultClientExchangeHandlerImpl.responseCompleted(DefaultClientExchangeHandlerImpl.java:177)
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.processResponse(HttpAsyncRequestExecutor.java:436)
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.inputReady(HttpAsyncRequestExecutor.java:326)
    at org.apache.http.impl.nio.DefaultNHttpClientConnection.consumeInput(DefaultNHttpClientConnection.java:265)
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:81)
    at org.apache.http.impl.nio.client.InternalIODispatch.onInputReady(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady(AbstractIODispatch.java:114)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.readable(BaseIOReactor.java:162)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent(AbstractIOReactor.java:337)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents(AbstractIOReactor.java:315)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:276)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:588)
    at java.lang.Thread.run(Thread.java:748)
INFO  [JobSuite] Running Anonymous Coward: END (Fri Nov 17 14:21:41 PST 2017)
jacksonp2008 commented 6 years ago

I am seeing this error on several sites now. Do I need to do something to ensure the ID is not longer than 512 bytes (like a TruncateTagger)? This appears to be a hard limit in Elasticsearch.

essiembre commented 6 years ago

It is indeed a limit of Elasticsearch 5.x and higher. Using TruncateTagger is one solution. You can also use the UUIDTagger to generate a unique ID for each document. Both of these solutions can have a negative impact on detecting deletions, so you may instead want to use a regular expression with the GenericURLNormalizer to truncate the URLs.
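
For illustration, a replacement rule in the GenericURLNormalizer could look like the following. This is only a sketch: the "sessionid" parameter is a made-up example, so adapt the regular expression to whatever makes your URLs too long, and check the GenericURLNormalizer documentation for the exact syntax in your version.

    <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      <replacements>
        <!-- Hypothetical example: strip a verbose query parameter that
             pushes references past the 512-byte limit. -->
        <replace>
          <match>([?&amp;])sessionid=[^&amp;]*</match>
          <replacement>$1</replacement>
        </replace>
      </replacements>
    </urlNormalizer>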

I'll mark this as a feature request to have a flag to "fix" this at the Committer level, like it was done with the Norconex Amazon CloudSearch Committer.

jacksonp2008 commented 6 years ago

Thanks, that sounds good, Pascal.

I will also have a look at CloudSearch; if it is able to accept all my other (Logstash) inputs, it might work.

Note that I am seeing this error when running against your site as well. When I list the URLs with a DebugTagger on "document.reference", they do not appear to be long, which is puzzling.

./collector-http.sh -a start -c examples/minimum/minimum-config.xml | grep document.reference

With the attached config (assuming you have Elasticsearch somewhere you can easily reference).

Mine is:

{
  "name" : "WyVaQSH",
  "cluster_name" : "fsse",
  "cluster_uuid" : "chydLbdbT9iIuhdY0YHAFw",
  "version" : {
    "number" : "5.6.1",
    "build_hash" : "667b497",
    "build_date" : "2017-09-14T19:22:05.189Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}

Basic config: the normalizer is on, plus the debug tagger. Of course, it runs fine with the standard FileSystemCommitter.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

     <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.norconex.com/</url>
      </startURLs>

      <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
      </urlNormalizer>

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>1</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="100" />

      <!-- Document importing -->
      <importer>

        <preParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                  logFields="document.reference" logLevel="INFO" />
        </preParseHandlers>

        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

<!--      
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>
-->

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>http://10.80.99.54:9200</nodes>
    <indexName>xweb</indexName>
    <typeName>spiderdata</typeName>
</committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 6 years ago

The latest snapshot of the Elasticsearch Committer now has the new flag to "fix" the ids:

      <fixBadIds>true</fixBadIds>

It will truncate references longer than 512 bytes and append a hash code representing the truncated part.
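
Conceptually, the idea is something like the following. This is only a rough Java sketch of the approach, not the actual Committer code, and the 480-character cutoff and "!" separator are made up for illustration.

    import java.nio.charset.StandardCharsets;

    public class FixBadIdSketch {
        private static final int MAX_ID_BYTES = 512;

        // Keep the start of the reference and append a hash of the part that
        // was cut off, so two long URLs sharing the same prefix still get
        // distinct ids.
        static String fixBadId(String reference) {
            if (reference.getBytes(StandardCharsets.UTF_8).length <= MAX_ID_BYTES) {
                return reference;
            }
            // Truncating by character count is a simplification: multi-byte
            // characters would need extra care to stay under the byte limit.
            // 480 leaves room for the separator and hash suffix (ASCII assumed).
            int keep = Math.min(reference.length(), 480);
            String truncatedPart = reference.substring(keep);
            return reference.substring(0, keep)
                    + "!" + Integer.toHexString(truncatedPart.hashCode());
        }
    }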

While this will make sure your pages get indexed properly, you'll need to store the original URL somewhere else. It looks like you are already doing so by keeping the document.reference field.

Can you please give it a try and confirm?

Can you share a URL or two from our site that you say are too long? One thing that can be confusing is that the Elasticsearch limit is not on the number of characters but on the number of bytes. References are normally sent to the committers as UTF-8, which uses variable-length characters: ASCII characters fit in a single byte, but other characters often need more. That being said, I thought all our site URLs had only ASCII characters, so I am curious.
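
Just to illustrate the byte-vs-character difference, here is a quick way to check it in plain Java (the URLs are made-up examples):

    import java.nio.charset.StandardCharsets;

    public class IdByteLength {
        public static void main(String[] args) {
            String ascii   = "https://example.com/page";           // ASCII only
            String accents = "https://example.com/page-éléphant";  // each 'é' takes 2 bytes in UTF-8

            System.out.println(ascii.length());                                  // 24 characters
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);   // 24 bytes
            System.out.println(accents.length());                                // 33 characters
            System.out.println(accents.getBytes(StandardCharsets.UTF_8).length); // 35 bytes
        }
    }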

jacksonp2008 commented 6 years ago

Thank you, Pascal. I am very impressed with how fast you were able to get this fixed.

Using the 4.1.0 Elasticsearch Committer snapshot, and also the 2.8.0 Collector snapshot, I was able to test against three different sites (including yours) with no errors. I let one site process a few thousand URLs and they were all present in Kibana. Fantastic!

This is with the committer config:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>http://10.80.99.54:9200</nodes>
    <indexName>updates</indexName>
    <typeName>spiderdata</typeName>
    <targetContentField>update</targetContentField>
</committer>

If I actually add the tag <fixBadIds>true</fixBadIds>, it throws an error:

./collector-http.sh -k -c examples/minimum/fxxxxx.xml 

ERROR (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, targetContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, targetContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
INFO  [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./forescout-output/minimum/logs; progressDir=./forescout-output/minimum/progress
There were 1 XML configuration error(s).

But it works great without the tag!

essiembre commented 6 years ago

Glad you have it working, but you just confused me! :-) See, you should be getting the same error without the new flag. Also, the error is just a validation error, but you should not be getting it. Are you sure you are using the latest 4.1.0 snapshot? Did you run the "install" script? Maybe you have duplicate jars (different versions of the same jars).

jacksonp2008 commented 6 years ago

I'm thrilled that it's working!

Well, I upgraded the collector first:

root@spider1:~/norconex-collector-http-2.8.0-SNAPSHOT# pwd
/home/spollock/norconex-collector-http-2.8.0-SNAPSHOT

Then I upgraded the committer and told it to overwrite (option 4); you can see the jars below.

root@spider1:~# cd norconex-committer-elasticsearch-4.1.0-SNAPSHOT/
root@spider1:~/norconex-committer-elasticsearch-4.1.0-SNAPSHOT# ./install.sh

PLEASE READ CAREFULLY

To install this component and its dependencies into another product, please specify the target product directory where libraries (.jar files) can be found. This is often a "lib" directory. For example, to install this component into the Norconex HTTP Collector, specify the full path to the Collector "lib" directory, which may look somewhat like this:

  /myProject/norconex-collector-http-x.x.x/lib

If .jar duplicates are found, you will be asked how you wish to deal with them. It is recommended to try keep most recent versions upon encountering version conflicts. When in doubt, simply choose the default option.

Please enter a target directory:

/home/spollock/norconex-collector-http-2.8.0-SNAPSHOT/lib

25 duplicate jar(s) found. How do you want to handle duplicates? For each Jar...

  1) Copy source Jar only if greater or same version as target Jar after renaming target Jar (preferred option).
  2) Copy source Jar only if greater or same version as target Jar after deleting target Jar.
  3) Do not copy source Jar (leave target Jar as is).
  4) Copy source Jar regardless of target Jar (may overwrite or cause mixed versions).
  5) Let me choose for each files.

Your choice (default = 1): 4


Duplicate:

Source: elasticsearch-rest-client-sniffer-5.6.3.jar

Target: elasticsearch-rest-client-sniffer-5.6.3.jar

Copying "./lib/elasticsearch-rest-client-sniffer-5.6.3.jar".


Duplicate:

Source: commons-collections-3.2.2.jar

Target: commons-collections-3.2.2.jar

Copying "./lib/commons-collections-3.2.2.jar".


Duplicate:

Source: log4j-1.2.17.jar

Target: log4j-1.2.17.jar

Copying "./lib/log4j-1.2.17.jar".


Duplicate:

Source: commons-io-2.5.jar

Target: commons-io-2.5.jar

Copying "./lib/commons-io-2.5.jar".


Duplicate:

Source: httpcore-nio-4.4.5.jar

Target: httpcore-nio-4.4.5.jar

Copying "./lib/httpcore-nio-4.4.5.jar".


Duplicate:

Source: commons-configuration-1.10.jar

Target: commons-configuration-1.10.jar

Copying "./lib/commons-configuration-1.10.jar".


Duplicate:

Source: httpcore-4.4.5.jar

Target: httpcore-4.4.6.jar

Copying "./lib/httpcore-4.4.5.jar".


Duplicate:

Source: jackson-core-2.8.6.jar

Target: jackson-core-2.8.6.jar

Copying "./lib/jackson-core-2.8.6.jar".


Duplicate:

Source: norconex-committer-core-2.1.2-SNAPSHOT.jar

Target: norconex-committer-core-2.1.2-SNAPSHOT.jar

Copying "./lib/norconex-committer-core-2.1.2-SNAPSHOT.jar".


Duplicate:

Source: json-1.8.jar

Target: json-20160810.jar

Copying "./lib/json-1.8.jar".


Duplicate:

Source: xml-apis-1.4.01.jar

Target: xml-apis-1.4.01.jar

Copying "./lib/xml-apis-1.4.01.jar".


Duplicate:

Source: commons-logging-1.2.jar

Target: commons-logging-1.2.jar

Copying "./lib/commons-logging-1.2.jar".


Duplicate:

Source: commons-lang-2.6.jar

Target: commons-lang-2.6.jar

Copying "./lib/commons-lang-2.6.jar".


Duplicate:

Source: httpclient-4.5.2.jar

Target: httpclient-4.5.3.jar

Copying "./lib/httpclient-4.5.2.jar".


Duplicate:

Source: velocity-1.7.jar

Target: velocity-1.7.jar

Copying "./lib/velocity-1.7.jar".


Duplicate:

Source: org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar

Target: org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar

Copying "./lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar".


Duplicate:

Source: xercesImpl-xsd11-2.12-beta-r1667115.jar

Target: xercesImpl-xsd11-2.12-beta-r1667115.jar

Copying "./lib/xercesImpl-xsd11-2.12-beta-r1667115.jar".


Duplicate:

Source: httpasyncclient-4.1.2.jar

Target: httpasyncclient-4.1.2.jar

Copying "./lib/httpasyncclient-4.1.2.jar".


Duplicate:

Source: commons-text-1.1.jar

Target: commons-text-1.1.jar

Copying "./lib/commons-text-1.1.jar".


Duplicate:

Source: commons-collections4-4.1.jar

Target: commons-collections4-4.1.jar

Copying "./lib/commons-collections4-4.1.jar".


Duplicate:

Source: commons-codec-1.10.jar

Target: commons-codec-1.10.jar

Copying "./lib/commons-codec-1.10.jar".


Duplicate:

Source: norconex-commons-lang-1.14.0-SNAPSHOT.jar

Target: norconex-commons-lang-1.14.0-SNAPSHOT.jar

Copying "./lib/norconex-commons-lang-1.14.0-SNAPSHOT.jar".


Duplicate:

Source: elasticsearch-rest-client-5.6.3.jar

Target: elasticsearch-rest-client-5.6.3.jar

Copying "./lib/elasticsearch-rest-client-5.6.3.jar".


Duplicate:

Source: norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar

Target: norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar

Copying "./lib/norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar".


Duplicate:

Source: commons-lang3-3.6.jar

Target: commons-lang3-3.6.jar

Copying "./lib/commons-lang3-3.6.jar".


DONE

essiembre commented 6 years ago

Odd, not sure why you are getting the validation error then. Anyhow, since it works fine for you, I'll close, but do not hesitate to re-open if similar issues re-surface.

jacksonp2008 commented 6 years ago

OK, I ran into this error again, so I need to get the tag working.

Per the above installation, if I use a committer config like this:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>http://10.80.99.54:9200</nodes>
    <indexName>update</indexName>
    <typeName>spiderd</typeName>
    <targetContentField>upd_content</targetContentField>
    <fixBadIds>true</fixBadIds>
</committer>

it fails validation:

./collector-http.sh -k -c forescout/updates.xml 
Nov 23, 2017 2:25:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

ERROR (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
INFO  [AbstractCollectorConfig] Configuration loaded: id=Forescout HTTP Collector; logsDir=./forescout-output/logs; progressDir=./progress
There were 1 XML configuration error(s).

and if I run it anyway, it doesn't commit anything to Elasticsearch.

Thanks for your help

jacksonp2008 commented 6 years ago

OK, I installed everything again from scratch and now it accepts the tag. It must have been something I misconfigured the first time.

Krishna210414 commented 6 years ago

How can I fix this in version 2.7.1? Could someone please provide the snippets?

essiembre commented 5 years ago

@Krishna210414, the fix is highlighted higher up in this ticket: you need <fixBadIds>true</fixBadIds> in your Elasticsearch committer section. Have you tried it?

jinnabaalu commented 5 years ago

@essiembre I am facing this issue when reindexing data from 2.4.1 to 5.6.

How do I resolve this issue? Is there any customization needed at the index level?

essiembre commented 5 years ago

@JinnaBalu, have you tried the suggested fix ("fixBadIds")? If so, what errors do you have now? If the error is not related to IDs being too long, please open a new ticket (as this one is closed).