Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

id is too long, must be no longer than 512 bytes but was #489

Closed aleha84 closed 6 years ago

aleha84 commented 6 years ago

Using Norconex HTTP Collector + Elasticsearch Committer.

[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex HTTP Collector 2.7.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex Collector Core 1.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex Importer 2.8.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex JEF 4.1.0-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex Committer Core 2.0.6-SNAPSHOT (Norconex Inc.)
[non-job]: 2018-05-22 14:58:15 INFO - Version: Norconex Committer Elasticsearch 3.0.0-SNAPSHOT (Norconex Inc.)

I then began receiving this exception:

ERROR - Execution failed for job: moexCrawler
org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: id is too long, must be no longer than 512 bytes but was: 612;
    at org.elasticsearch.action.bulk.BulkRequest.validate(BulkRequest.java:543)
    at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:46)
    at org.elasticsearch.client.transport.TransportProxyClient.lambda$execute$0(TransportProxyClient.java:59)
    at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:231)
    at org.elasticsearch.client.transport.TransportProxyClient.execute(TransportProxyClient.java:59)
    at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:345)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:403)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:80)
    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:54)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.sendBulkToES(ElasticsearchCommitter.java:301)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.bulkAddedDocuments(ElasticsearchCommitter.java:257)
    at com.norconex.committer.elasticsearch.ElasticsearchCommitter.commitBatch(ElasticsearchCommitter.java:226)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:179)
    at com.norconex.committer.core.AbstractBatchCommitter.cacheOperationAndCommitIfReady(AbstractBatchCommitter.java:208)
    at com.norconex.committer.core.AbstractBatchCommitter.commitAddition(AbstractBatchCommitter.java:143)
    at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:222)
    at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:267)
    at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:228)
    at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:189)
    at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
    at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:355)
    at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:296)
    at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:168)
    at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:132)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:95)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

This is a hardcoded Elasticsearch 5.0 limitation. https://discuss.elastic.co/t/maximum-length-of-a-specified-document-id/4262/2

How can I ignore URLs longer than 512 bytes?

essiembre commented 6 years ago

On your Elasticsearch committer, have you tried setting <fixBadIds>true</fixBadIds> (available since Committer version 4.1.0)?

It will truncate what goes into the Elasticsearch ID field and append a hash code to keep it unique. You can store/copy the intact URL in another field beforehand (e.g., with the CopyTagger of the Importer module).
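For reference, a minimal sketch of what that could look like in the collector configuration (untested; class names are taken from the 2.x module docs, and the field name "fullUrl" is just an example):

```xml
<!-- Importer section: copy the original URL into a separate field
     before the committer truncates the Elasticsearch ID.
     "fullUrl" is a hypothetical field name. -->
<importer>
  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
      <copy fromField="document.reference" toField="fullUrl" overwrite="true" />
    </tagger>
  </preParseHandlers>
</importer>

<!-- Elasticsearch committer with ID fixing enabled -->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <indexName>mycrawl</indexName>
  <typeName>doc</typeName>
  <fixBadIds>true</fixBadIds>
</committer>
```

With this, documents keep their full URL in "fullUrl" while the Elasticsearch _id stays under the 512-byte limit.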

If you really want to eliminate them, you can look at using a RegexReferenceFilter and filter out long URLs with a regex like this (untested):

^.{512,}$
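If filtering is preferred, a sketch of that filter in the crawler configuration (untested; note the regex counts characters, so a URL with multi-byte characters could still exceed 512 bytes while staying under 512 characters):

```xml
<!-- Reject references 512 characters or longer before they are queued -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="exclude">^.{512,}$</filter>
</referenceFilters>
```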
aleha84 commented 6 years ago

^.{512,}$

Looks like it works.

essiembre commented 6 years ago

Thanks for confirming.