I am seeing this error on several sites now. Wondering if I need to be doing something to ensure the ID is not greater than 512 bytes (like TruncateTagger?). This appears to be a hard limit set by Elasticsearch.
It is indeed a limit of Elasticsearch 5.x or higher. Using TruncateTagger is one solution. You can also use the UUIDTagger to generate a unique ID for each document. Both of these solutions can have a negative impact on deletion detection, so you may instead want to use a regular expression with the GenericURLNormalizer to truncate such URLs.
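For instance, a replacement rule along these lines could strip query strings before they become document IDs (the pattern below is only an illustration; adjust it to whatever makes your URLs too long):

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <replacements>
    <!-- Illustrative only: drop everything from "?" onward so references
         stay under the 512-byte limit. -->
    <replace>
      <match>\?.*$</match>
      <replacement></replacement>
    </replace>
  </replacements>
</urlNormalizer>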
I'll mark this as a feature request to have a flag to "fix" this at the Committer level like it was done with the Norconex Amazon CloudSearch committer.
Thanks. That sounds good Pascal.
I will also have a look at CloudSearch; if it is able to accept all my other (Logstash) inputs, it might work.
Note, I am seeing this error when running against your site as well. When I list the URLs with a DebugTagger for "document.reference", they do not appear to be long, which is puzzling.
./collector-http.sh -a start -c examples/minimum/minimum-config.xml | grep document.reference
with the attached config (assuming you have Elasticsearch someplace you can easily reference).
Mine is:
{
"name" : "WyVaQSH",
"cluster_name" : "fsse",
"cluster_uuid" : "chydLbdbT9iIuhdY0YHAFw",
"version" : {
"number" : "5.6.1",
"build_hash" : "667b497",
"build_date" : "2017-09-14T19:22:05.189Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"
},
"tagline" : "You Know, for Search"
}
Basic config, with the normalizer on and the DebugTagger. Of course, it runs fine with the standard FileSystemCommitter.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<httpcollector id="Minimum Config HTTP Collector">
<!-- Decide where to store generated files. -->
<progressDir>./examples-output/minimum/progress</progressDir>
<logsDir>./examples-output/minimum/logs</logsDir>
<crawlers>
<crawler id="Norconex Minimum Test Page">
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<url>https://www.norconex.com/</url>
</startURLs>
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
</urlNormalizer>
<!-- Specify a crawler default directory where to generate files. -->
<workDir>./examples-output/minimum</workDir>
<!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
<maxDepth>1</maxDepth>
<!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
<sitemapResolverFactory ignore="true" />
<!-- Be as nice as you can to sites you crawl. -->
<delay default="100" />
<!-- Document importing -->
<importer>
<preParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
logFields="document.reference" logLevel="INFO" />
</preParseHandlers>
<postParseHandlers>
<!-- If your target repository does not support arbitrary fields,
make sure you only keep the fields you need. -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
<fields>title,keywords,description,document.reference</fields>
</tagger>
</postParseHandlers>
</importer>
<!--
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
<directory>./examples-output/minimum/crawledFiles</directory>
</committer>
-->
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
<nodes>http://10.80.99.54:9200</nodes>
<indexName>xweb</indexName>
<typeName>spiderdata</typeName>
</committer>
</crawler>
</crawlers>
</httpcollector>
The latest snapshot of the Elasticsearch Committer now has the new flag to "fix" the ids:
<fixBadIds>true</fixBadIds>
It will truncate references longer than 512 bytes and append a hash code representing the truncated part.
While this will make sure your pages get indexed properly, you'll need to store the original URL somewhere else. It looks like you are already doing so by keeping the document.reference field.
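For example, taking the committer section from your config above, it would look like this:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://10.80.99.54:9200</nodes>
  <indexName>xweb</indexName>
  <typeName>spiderdata</typeName>
  <fixBadIds>true</fixBadIds>
</committer>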
Can you please give it a try and confirm?
Can you share a URL or two you say are too long on our site? One thing that can be confusing is that the Elasticsearch limit is not on the number of characters but rather on the number of bytes. References are normally sent to the committers as UTF-8, which encodes characters with a variable length: ASCII characters take a single byte, but other characters often need more (for example, an accented character such as "é" takes two bytes). That being said, I thought all our site URLs only had ASCII characters, so I am curious.
Thank you, Pascal. I am very impressed with how fast you were able to get this fixed.
Using the 4.1.0 Elasticsearch snapshot and also the 2.8.0 collector snapshot, I was able to test against three different sites (including yours) with no errors. I let one site process a few thousand URLs and they were all present in Kibana. Fantastic!!
This is with the committer config:
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
<nodes>http://10.80.99.54:9200</nodes>
<indexName>updates</indexName>
<typeName>spiderdata</typeName>
<targetContentField>update</targetContentField>
</committer>
If I actually add the <fixBadIds>true</fixBadIds> tag, it throws an error:
./collector-http.sh -k -c examples/minimum/fxxxxx.xml
ERROR (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, targetContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, targetContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
INFO [AbstractCollectorConfig] Configuration loaded: id=Minimum Config HTTP Collector; logsDir=./forescout-output/minimum/logs; progressDir=./forescout-output/minimum/progress
There were 1 XML configuration error(s).
But works great without the tag!
Glad you have it working, but you just confused me! :-) See, you should be getting the same error (about IDs being too long) without the new flag. Also, the error is just a validation error, but you should not be getting it. Are you sure you are using the latest 4.1.0 snapshot? Did you run the "install" script? Maybe you have duplicate jars (different versions of the same jars).
I'm thrilled that it's working!
Well, I upgraded the collector first:
root@spider1:~/norconex-collector-http-2.8.0-SNAPSHOT# pwd
**/home/spollock/norconex-collector-http-2.8.0-SNAPSHOT**
then I upgraded the committer and told it to overwrite (option 4); you can see the jars below.
root@spider1:~# cd norconex-committer-elasticsearch-4.1.0-SNAPSHOT/
root@spider1:~/norconex-committer-elasticsearch-4.1.0-SNAPSHOT# ./install.sh
PLEASE READ CAREFULLY

To install this component and its dependencies into another product, please specify the target product directory where libraries (.jar files) can be found.

This is often a "lib" directory. For example, to install this component into the Norconex HTTP Collector, specify the full path to the Collector "lib" directory, which may look somewhat like this:

/myProject/norconex-collector-http-x.x.x/lib

If .jar duplicates are found, you will be asked how you wish to deal with them. It is recommended to try keep most recent versions upon encountering version conflicts. When in doubt, simply choose the default option.

Please enter a target directory:
/home/spollock/norconex-collector-http-2.8.0-SNAPSHOT/lib
25 duplicate jar(s) found.
How do you want to handle duplicates? For each Jar...
  1) Copy source Jar only if greater or same version as target Jar after renaming target Jar (preferred option).
  2) Copy source Jar only if greater or same version as target Jar after deleting target Jar.
  3) Do not copy source Jar (leave target Jar as is).
  4) Copy source Jar regardless of target Jar (may overwrite or cause mixed versions).
  5) Let me choose for each files.
Your choice (default = 1): 4
Duplicate:
Source: elasticsearch-rest-client-sniffer-5.6.3.jar
Target: elasticsearch-rest-client-sniffer-5.6.3.jar
Copying "./lib/elasticsearch-rest-client-sniffer-5.6.3.jar".
Duplicate:
Source: commons-collections-3.2.2.jar
Target: commons-collections-3.2.2.jar
Copying "./lib/commons-collections-3.2.2.jar".
Duplicate:
Source: log4j-1.2.17.jar
Target: log4j-1.2.17.jar
Copying "./lib/log4j-1.2.17.jar".
Duplicate:
Source: commons-io-2.5.jar
Target: commons-io-2.5.jar
Copying "./lib/commons-io-2.5.jar".
Duplicate:
Source: httpcore-nio-4.4.5.jar
Target: httpcore-nio-4.4.5.jar
Copying "./lib/httpcore-nio-4.4.5.jar".
Duplicate:
Source: commons-configuration-1.10.jar
Target: commons-configuration-1.10.jar
Copying "./lib/commons-configuration-1.10.jar".
Duplicate:
Source: httpcore-4.4.5.jar
Target: httpcore-4.4.6.jar
Copying "./lib/httpcore-4.4.5.jar".
Duplicate:
Source: jackson-core-2.8.6.jar
Target: jackson-core-2.8.6.jar
Copying "./lib/jackson-core-2.8.6.jar".
Duplicate:
Source: norconex-committer-core-2.1.2-SNAPSHOT.jar
Target: norconex-committer-core-2.1.2-SNAPSHOT.jar
Copying "./lib/norconex-committer-core-2.1.2-SNAPSHOT.jar".
Duplicate:
Source: json-1.8.jar
Target: json-20160810.jar
Copying "./lib/json-1.8.jar".
Duplicate:
Source: xml-apis-1.4.01.jar
Target: xml-apis-1.4.01.jar
Copying "./lib/xml-apis-1.4.01.jar".
Duplicate:
Source: commons-logging-1.2.jar
Target: commons-logging-1.2.jar
Copying "./lib/commons-logging-1.2.jar".
Duplicate:
Source: commons-lang-2.6.jar
Target: commons-lang-2.6.jar
Copying "./lib/commons-lang-2.6.jar".
Duplicate:
Source: httpclient-4.5.2.jar
Target: httpclient-4.5.3.jar
Copying "./lib/httpclient-4.5.2.jar".
Duplicate:
Source: velocity-1.7.jar
Target: velocity-1.7.jar
Copying "./lib/velocity-1.7.jar".
Duplicate:
Source: org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar
Target: org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar
Copying "./lib/org.eclipse.wst.xml.xpath2.processor-1.1.5-738bb7b85d.jar".
Duplicate:
Source: xercesImpl-xsd11-2.12-beta-r1667115.jar
Target: xercesImpl-xsd11-2.12-beta-r1667115.jar
Copying "./lib/xercesImpl-xsd11-2.12-beta-r1667115.jar".
Duplicate:
Source: httpasyncclient-4.1.2.jar
Target: httpasyncclient-4.1.2.jar
Copying "./lib/httpasyncclient-4.1.2.jar".
Duplicate:
Source: commons-text-1.1.jar
Target: commons-text-1.1.jar
Copying "./lib/commons-text-1.1.jar".
Duplicate:
Source: commons-collections4-4.1.jar
Target: commons-collections4-4.1.jar
Copying "./lib/commons-collections4-4.1.jar".
Duplicate:
Source: commons-codec-1.10.jar
Target: commons-codec-1.10.jar
Copying "./lib/commons-codec-1.10.jar".
Duplicate:
Source: norconex-commons-lang-1.14.0-SNAPSHOT.jar
Target: norconex-commons-lang-1.14.0-SNAPSHOT.jar
Copying "./lib/norconex-commons-lang-1.14.0-SNAPSHOT.jar".
Duplicate:
Source: elasticsearch-rest-client-5.6.3.jar
Target: elasticsearch-rest-client-5.6.3.jar
Copying "./lib/elasticsearch-rest-client-5.6.3.jar".
Duplicate:
Source: norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar
Target: norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar
Copying "./lib/norconex-committer-elasticsearch-4.1.0-SNAPSHOT.jar".
Duplicate:
Source: commons-lang3-3.6.jar
Target: commons-lang3-3.6.jar
Copying "./lib/commons-lang3-3.6.jar".
DONE
Odd, not sure why you are getting the validation error then. Anyhow, since it works fine for you, I'll close, but do not hesitate to re-open if similar issues re-surface.
Ok, I ran into this error again, so I need to get the tag working.
Per the above installation, if I use the following committer configuration:
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
<nodes>http://10.80.99.54:9200</nodes>
<indexName>update</indexName>
<typeName>spiderd</typeName>
<targetContentField>upd_content</targetContentField>
<fixBadIds>true</fixBadIds>
</committer>
it fails validation:
./collector-http.sh -k -c forescout/updates.xml
Nov 23, 2017 2:25:32 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
ERROR (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) ElasticsearchCommitter: cvc-complex-type.2.4.a: Invalid content was found starting with element 'fixBadIds'. One of '{sourceReferenceField, targetReferenceField, sourceContentField, commitBatchSize, queueDir, queueSize, maxRetries, maxRetryWait, ignoreResponseErrors, discoverNodes, dotReplacement, username, password, passwordKey, passwordKeySource}' is expected.
INFO [AbstractCollectorConfig] Configuration loaded: id=Forescout HTTP Collector; logsDir=./forescout-output/logs; progressDir=./progress
There were 1 XML configuration error(s).
and if I run it anyway, it doesn't commit anything to Elasticsearch.
Thanks for your help
Ok, I installed everything again from scratch and it likes the <fixBadIds> tag now.
How can I fix this in version 2.7.1? Could someone please provide the snippets.
@Krishna210414, the fix is highlighted higher in this ticket. You need <fixBadIds>true</fixBadIds> in your Elasticsearch committer section (see the snippet below). Have you tried it?
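As a reference, a minimal committer section with the flag would be something like this (the node URL and index/type names are placeholders; use your own). Note that, per the comments above, the flag only appeared as of the 4.1.0 committer snapshot, so you will likely need to upgrade the committer as well:

<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>http://localhost:9200</nodes>
  <indexName>myindex</indexName>
  <typeName>mytype</typeName>
  <fixBadIds>true</fixBadIds>
</committer>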
@essiembre I am facing this issue when reindexing data from 2.4.1 to 5.6.
How do I resolve this issue? Is there any customisation at the index level?
@JinnaBalu, have you tried the suggested fix ("fixBadIds")? If so, what errors do you have now? If the error is not related to IDs being too long, please open a new ticket (as this one is closed).
Running into an error committing to Elasticsearch. I assume this is the "_id" from Kibana, which appears to be the URL of the page and the same as "document.reference".