Closed: zgjonbalaj closed this issue 6 years ago.
I've managed to get the right files crawled, but I'm still getting this error for all PDFs when committing to a new index:
{
  "_index": "website-pdf",
  "_type": "HTML",
  "_id": "http://www.***.com/site/assets/files/1653/***.pdf",
  "error": {
    "reason": "mapper [Date] of different type, current_type [date], merged_type [text]",
    "type": "illegal_argument_exception"
  },
  "status": 400
},
This probably occurs because you have a field defined as a date in Elasticsearch, and a document is passing a string for that field that is not recognized as a date. Either create a new index where the date field is mapped as a string, or format your date fields to match a date pattern Elasticsearch expects. You can have a look at DateFormatTagger from the Importer module as an option.
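As a sketch of that second option, a post-parse handler along these lines should normalize HTTP-style header dates into a single ISO pattern. The attribute names follow the Importer 2.x DateFormatTagger syntax, but verify them against your Importer version, and make sure the `fromFormat` pattern actually matches the dates your crawl produces:

```xml
<importer>
  <postParseHandlers>
    <!-- Convert RFC-1123 style header dates, e.g.
         "Tue, 23 Jan 2018 01:23:07 GMT", to ISO-8601. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DateFormatTagger"
        fromField="Last-Modified" toField="Last-Modified" overwrite="true"
        fromFormat="EEE, dd MMM yyyy HH:mm:ss zzz"
        toFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
  </postParseHandlers>
</importer>
```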
@zgjonbalaj, did you resolve your date issue? Can we close?
Just to elaborate further in case anyone runs into this issue: create your index with explicit mappings before committing to Elasticsearch, like so:
curl -XPUT 'localhost:9200/index-name?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "Date": {
          "type": "text"
        },
        "Last-Modified": {
          "type": "text"
        }
      }
    }
  }
}
'
Where "my_type" is the "typeName" and "index-name" is the "indexName" part of your XML config.
I've faced the same issue when committing PDF files. I think it comes from the date differences below: the crawler collects only one date field from HTML pages, but there is an additional date field for PDFs. Because of this, Elasticsearch treats the date as an array. The strange part is that some PDFs go through the committer without any issues. Is there any way I can drop one of the values and keep only one?
Last-Modified = Tue, 23 Jan 2018 01:23:07 GMT
Last-Modified = 2017-12-12T17:50:08Z
...
Date = Tue, 23 Jan 2018 17:49:35 GMT
Date = 2017-12-12T17:50:08Z
@wolverline: Definitely, have a look at KeepOnlyTagger. Use it late among your post-parse handlers.
@essiembre: My question was how to remove one of the dates, not one of the tags. In Elasticsearch, the Date field looks like the following because of the differences. From PDFs:
"Date": [ "Thu, 01 Feb 2018 22:40:27 GMT", "2017-12-12T17:50:08Z" ],
From HTML pages:
"Date": "Thu, 01 Feb 2018 22:29:13 GMT",
I replaced the checksummer source with combined fields, but I'm not sure which option gives more accurate results:
<documentChecksummer class="$MD5Checksummer">
  <sourceFields>content, title, description</sourceFields>
  <!-- Or, this field?
  <sourceFields>Last-Modified</sourceFields>
  -->
</documentChecksummer>
@essiembre I apologize for the delay, yes the issue was resolved.
@wolverline, it looks like I gave you the wrong one then. Please try with: ForceSingleValueTagger.
You have two dates probably because there is one in your HTML and one in the HTTP Response headers.
As an alternative, you can also configure the metadataFetcher in your collector config to add a headers prefix. This will differentiate the fields coming from the HTTP headers from other ones discovered.
Please confirm.
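A minimal sketch of that alternative, assuming the 2.x GenericMetadataFetcher and a `headersPrefix` option (both the class name and the option should be verified against your collector version, and the `http.` prefix value is just an example):

```xml
<!-- Inside your <crawler> configuration -->
<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
  <!-- Prefix every field extracted from the HTTP response headers,
       so the header "Last-Modified" becomes "http.Last-Modified"
       and no longer collides with the date parsed from the document. -->
  <headersPrefix>http.</headersPrefix>
</metadataFetcher>
```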
@essiembre I was going to add my comment here about ForceSingleValueTagger before you answered. Yes, it resolved the issue. Thanks.
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
      <singleValue field="Date" action="keepFirst" />
      <singleValue field="Last-Modified" action="keepFirst" />
    </tagger>
  </postParseHandlers>
</importer>
Thanks for confirming.
I am having issues isolating different crawlers to different types of documents so I can commit each to Elasticsearch separately. I want to handle the different document types (PDF, XML, HTML, images, etc.): crawl all HTML pages and commit them under one index, crawl all PDF documents and store them under another, and so on. For some reason the following code is not working; can anyone help?
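For reference, one way to sketch that separation is one crawler per document type, each with its own reference filter and its own Elasticsearch committer index. This is an illustration only: the class names follow the 2.x Collector Core ExtensionReferenceFilter and the Norconex Elasticsearch Committer, the index/type names are placeholders, and the committer's connection settings (nodes, cluster) are omitted:

```xml
<crawlers>
  <crawler id="html-crawler">
    <!-- Skip PDFs here; they are handled by the other crawler. -->
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="exclude">pdf</filter>
    </referenceFilters>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>website-html</indexName>
      <typeName>HTML</typeName>
    </committer>
  </crawler>
  <crawler id="pdf-crawler">
    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
          onMatch="include">pdf</filter>
    </referenceFilters>
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>website-pdf</indexName>
      <typeName>PDF</typeName>
    </committer>
  </crawler>
</crawlers>
```

One caveat with this approach: reference filters apply before download, so a crawler that only includes `pdf` references cannot follow the HTML pages that link to the PDFs; in practice the PDF crawler may need a post-import document filter instead, or start URLs that point directly at the PDF locations.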