elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type #12366

Open samcday opened 9 years ago

samcday commented 9 years ago

Indexing a document with an object type on a field that has already been mapped as a string type causes MapperParsingException, even if index.mapping.ignore_malformed has been enabled.

Reproducible test case

On Elasticsearch 1.6.0:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}

$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}

Expected behaviour

Indexing a document with an object field where Elasticsearch expects a string should not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, the invalid object field should be ignored.

clintongormley commented 9 years ago

+1

andrestc commented 9 years ago

While working on this issue, I found that it fails for other types too, but for a different reason. For example, with an integer field:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index             ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch]     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch]     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch]     at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch]     ... 13 more

That's happening because, unlike in the string case, we do handle ignoreMalformed for numeric types, but when we throw the exception here we haven't parsed the field object through to XContentParser.Token.END_OBJECT, and that comes back to bite us later, here.

So, I think two things need to be done: (1) honour the ignoreMalformed setting in StringFieldMapper, which is not happening today (hence the originally reported issue), and (2) parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes, to prevent the exception I reported from happening. Or maybe just ignore this exception in innerParseDocument when ignoreMalformed is set?

Does this make sense, @clintongormley? I'll happily send a PR for this.

clintongormley commented 9 years ago

ah - i just realised that the original post refers to a string field, which doesn't support ignore_malformed...

@andrestc i agree with your second point, but i'm unsure about the first...

@rjernst what do you think?

rjernst commented 8 years ago

Sorry for the delayed response, I lost this one in email.

@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me that finding an object where a specific piece of data is expected constitutes "malformed" data.

@andrestc A PR would be great.

abulhol commented 8 years ago

I want to upvote this issue! I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a

 MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]

This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).
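
For reference, a minimal reproduction of this case looks roughly like the following (the index, type and field names here are made up for illustration, and the exact error message varies by version):

$ curl -XPUT localhost:9200/xmltest -d '{"settings":{"index.mapping.ignore_malformed": true},"mappings":{"doc":{"properties":{"xxx":{"properties":{"value":{"type":"string"}}}}}}}'
$ curl -XPOST localhost:9200/xmltest/doc -d '{"xxx":{"value":"some text"}}'   # indexed fine
$ curl -XPOST localhost:9200/xmltest/doc -d '{"xxx":""}'                      # rejected with the MapperParsingException above, despite ignore_malformed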

@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?

megastef commented 8 years ago

+1

I think much more could be done here, e.g. set the field to a default value and add an annotation to the document so users can see what went wrong. In my case, all documents from Apache logs having "-" in the size field (integer) got dropped. I could tell you 100 stories about why Elasticsearch doesn't accept documents from real data sources ... (just to mention one more: https://github.com/elastic/elasticsearch/issues/3714)

I think this problem could be handled much better:

  1. If a type error occurs, try to convert the value (as an optional server/index setting). Often a JSON document has numbers without quotes (correct), while some senders put numbers in quotes as strings. In this case the string could be converted to an integer.
  2. If the type does not fit, use a default value for that type (0, null), or ignore the field as is done today - though that is very bad if it is a larger object ...
  3. Add a comment field like "_es_error_report: MapperParsingException: ....". That way users can see that something went wrong; today, data just disappears when it fails to be indexed or the field is ignored. The sysadmin might see the error message in some logs, but users wonder why the data in Elasticsearch is incomplete and might have no access to the Elasticsearch logs. In my case I missed all Apache messages with status code 500 and size "-" instead of 0, which is really bad, and depends on the log parser ...

A good example is Logsene, which adds error annotations to failed documents together with the string version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and the original document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.

It would be nice if such improvements were available out of the box for all Elasticsearch users.

abulhol commented 8 years ago

any news here?

balooka commented 8 years ago

I wish to upvote this issue too. My understanding of ignore_malformed's purpose is to not lose events, even if you might lose some of their content. In my current situation, an issue similar to what has been described here is occurring. Although it has been identified and multiple mid-term approaches are being looked into (the issue in our case relates to multiple sources sending similar events, so options like splitting the events into separate mappings, or even cleaning up the events before they reach Elasticsearch, could be done), I would have liked a short-term approach similar to the ignore_malformed functionality to be in place to help in the meantime.

BeccaMG commented 8 years ago

Same problem with dates.

When adding an object with a field of type "date": in my DB, whenever that field is empty it is represented as "" (empty string), causing this error:

[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]

satazor commented 8 years ago

Same problem here. I'm using the ELK stack, in which people may use the same properties but with different types. I don't want those properties to be searchable, but I don't want to lose the entire event either. I thought ignore_malformed would do that, but apparently it is not working for all cases.

jarlelin commented 8 years ago

We are having issues with this same feature. We have documents that sometimes contain objects inside something that was intended to hold strings. We would like to not lose the whole document just because one of the nodes of data is malformed.

This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaud such a feature.

DaTebe commented 8 years ago

Hey, I have the same problem. Is there any solution (even if it is a bit hacky) out there?

goodfella1408 commented 8 years ago

Facing this in Elasticsearch 2.3.1. Before this bug is fixed, we should at least have a list of the bad fields inside the mapper_parsing_exception error so that the app can choose to remove them. Currently there is no standard field in the error through which these keys can be retrieved:

"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}

The app would have to parse the reason string and extract A.B.C.D, which will fail if the error format changes. Additionally, the mapper_parsing_exception error itself seems to use different formats for different parsing error scenarios, all of which would need to be handled by the app.

BeccaMG commented 8 years ago

I used a workaround for this matter following the recommendations from Elasticsearch forums and official documentation.

Declaring the mapping of the objects you want to index (if you know it) and setting ignore_malformed on dates and numbers should do the trick. Tricky fields that could hold either string or nested content can simply be declared as object.
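
A sketch of the kind of mapping this describes (the index, type and field names here are placeholders, not from any poster's actual data):

$ curl -XPUT localhost:9200/myindex -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "birthDate": { "type": "date",    "ignore_malformed": true },
        "size":      { "type": "integer", "ignore_malformed": true },
        "payload":   { "type": "object" }
      }
    }
  }
}'

Here ignore_malformed drops bad date and number values for those fields instead of rejecting the whole document.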

derEremit commented 8 years ago

For usage as a real log stash, I would say something like https://github.com/elastic/elasticsearch/issues/12366#issuecomment-175748358 is a must-have! I can get accustomed to losing indexed fields, but losing log entries is a no-go for ELK from my perspective.

micpotts commented 7 years ago

Bumping; this issue is preventing a number of my messages from being processed successfully, as a field mapped as an object is supplied as an empty string in rare cases.

patrick-oyst commented 7 years ago

Bump, this is proving to be an extremely tedious (non) feature to work around.

patrick-oyst commented 7 years ago

I've found a way around this, but it comes at a cost. It could be worth it for those like me who want to avoid, in the short term, intervening directly in the data flow (like checking and fixing the log line yourself before sending it to ES). Set the enabled setting of your field to false. This will make the field non-searchable, though. That isn't too big of an issue in my context, because the unpredictability of this field is the very reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyway, and you still have access to the data when you find the document via another field. Incidentally, this solves both situations: writing an object to a non-object field and vice versa.

Hope this helps. It certainly saved me a lot of trouble...
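
For anyone wanting to try this, a rough sketch of the mapping change (the index, type and field names are placeholders):

$ curl -XPUT localhost:9200/logs -d '{
  "mappings": {
    "event": {
      "properties": {
        "unpredictable_field": { "type": "object", "enabled": false }
      }
    }
  }
}'

With enabled set to false the field's contents are kept in _source but are not parsed or indexed, which is why (as described above) it tolerates both objects and concrete values; the trade-off is that the field can no longer be searched or aggregated on.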

jarlelin commented 7 years ago

That's a good trick. I'll try that out.

robinjha commented 7 years ago

+1

senseysensor commented 7 years ago

+1

marcovdkuur commented 7 years ago

+1

jonesn commented 7 years ago

+1

rholloway commented 7 years ago

Also an issue on ES 5.2.1. Very frustrating when dealing with some unexpected input that may possibly be malformed.

hartfordfive commented 7 years ago

👍 It would definitely be great to enable the ignore_malformed property for object fields. I've had many cases of mapping errors because someone tried to index a string where a nested object should be, and vice versa.

sorinescu commented 7 years ago

👍

aivis commented 7 years ago

👍

EamonHetherton commented 7 years ago

+1

fkoclas commented 7 years ago

👍

flashfm commented 7 years ago

👍

subhashb commented 7 years ago

I had a use case similar to @patrick-oyst and found enabled=false helps me avoid the issue for now.

One additional observation: the ignore_malformed setting worked fine until I did a snapshot/restore on my ES instance a day ago. After the restore, no matter what I did (delete the index, clear caches, refresh index patterns, etc.), ES just keeps comparing the old and new types.

mvleandro commented 7 years ago

+1

lizhongz commented 7 years ago

enabled=false works for me.

kislyuk commented 6 years ago

:+1: relates to #10070

roman-parkhunovskyi commented 6 years ago

Quite a useful feature that has lacked a good implementation for too long. And the official documentation is incomplete and misleading.

Bharathkumarraju commented 6 years ago

/tmp/elastic_dev/filebeat/current/filebeat -c /tmp/elastic_dev/filebeat/config/filebeat.yml -e
2017/11/23 08:05:06.633737 beat.go:426: INFO Home path: [/tmp/elastic_dev/filebeat/current] Config path: [/tmp/elastic_dev/filebeat/current] Data path: [/tmp/elastic_dev/filebeat/current/data] Logs path: [/tmp/elastic_dev/filebeat/current/logs]
2017/11/23 08:05:06.633916 beat.go:433: INFO Beat UUID: ca5704f8-9b1a-4c94-8766-1dc76b119230
2017/11/23 08:05:06.633952 beat.go:192: INFO Setup Beat: filebeat; Version: 6.0.0
2017/11/23 08:05:06.634604 metrics.go:23: INFO Metrics logging every 30s
2017/11/23 08:05:06.635838 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9200
2017/11/23 08:05:06.636048 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9220
2017/11/23 08:05:06.636161 client.go:123: INFO Elasticsearch url: https://sample.test.raju.com:9230
2017/11/23 08:05:06.636812 module.go:80: INFO Beat name: 10.20.175.66
2017/11/23 08:05:06.641468 beat.go:260: INFO filebeat start running.
2017/11/23 08:05:06.642313 registrar.go:88: INFO Registry file set to: /tmp/elastic_dev/filebeat/current/data/registry
2017/11/23 08:05:06.642475 registrar.go:108: INFO Loading registrar data from /tmp/elastic_dev/filebeat/current/data/registry
2017/11/23 08:05:06.643372 registrar.go:119: INFO States Loaded from registrar: 4
2017/11/23 08:05:06.643439 crawler.go:44: INFO Loading Prospectors: 2
2017/11/23 08:05:06.643746 registrar.go:150: INFO Starting Registrar
2017/11/23 08:05:06.644503 prospector.go:103: INFO Starting prospector of type: log; id: 9119168733948319376
2017/11/23 08:05:06.645260 harvester.go:207: INFO Harvester started for file: /opt/hello1/test_ServiceAudit.log
2017/11/23 08:05:06.645842 prospector.go:103: INFO Starting prospector of type: log; id: 17106901312407876564
2017/11/23 08:05:06.645874 crawler.go:78: INFO Loading and starting Prospectors completed. Enabled prospectors: 2
2017/11/23 08:05:06.648357 harvester.go:207: INFO Harvester started for file: /opt/hello2/test_ProtocolAudit.log
2017/11/23 08:05:07.697281 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:07.700284 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:07.704069 client.go:651: INFO Connected to Elasticsearch version 6.0.0
2017/11/23 08:05:08.722058 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [service-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
2017/11/23 08:05:08.722107 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [protocol-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}

Bharathkumarraju commented 6 years ago

I am getting the error below:


2017/11/23 08:05:08.722058 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [service-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
2017/11/23 08:05:08.722107 client.go:465: WARN Can not index event (status=400): {"type":"illegal_argument_exception","reason":"Rejecting mapping update to [protocol-audit-2017.11.23] as the final mapping would have more than 1 type: [log, doc]"}
iamejboy commented 6 years ago

+1

dmabuada commented 6 years ago

+1

piotrkochan commented 6 years ago

+1

Jymit commented 6 years ago

+1

jochia commented 6 years ago

+1

fursich commented 6 years ago

+1

jpountz commented 6 years ago

I don't like the ignore_malformed option; it is a bit like silent data loss, since indexed documents cannot be retrieved based on the malformed value. Say a document has all of its values malformed, for instance: only a match_all query would match it, and nothing is going to warn the user about that. Even an exists query would not match, which might be surprising to some users. I suspect a subsequent feature request would be the ability to tag malformed documents so that they can be found later, which I don't like either, since the time we index a document is, in my opinion, too late to deal with this.

I am semi-ok with the current ignore_malformed option because figuring out whether a date, a geo-point or a number is well-formed on the client side is not easy. But I don't like the idea of making ignore_malformed silently ignore objects, which it doesn't do today, even on field types that support the ignore_malformed option. In my opinion we should not expand the scope of this option.

To me the need for this option arises from the lack of a cleanup process earlier in the ingestion pipeline. A better option would be to have an ingest processor that cleans up malformed values and adds a malformed: true flag to the document so that all malformed documents can be found later.
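
A rough sketch of that idea, using an ingest pipeline with a script processor (the pipeline name, the field test and the malformed flag are made up for illustration; this is not an existing feature):

$ curl -XPUT localhost:9200/_ingest/pipeline/cleanup-malformed -H 'Content-Type: application/json' -d '{
  "processors": [
    {
      "script": {
        "source": "if (ctx.test instanceof Map) { ctx.remove(\"test\"); ctx.malformed = true }"
      }
    }
  ]
}'

Indexing with ?pipeline=cleanup-malformed would then strip an object-valued test field and flag the document, so all such documents could later be found with a query on malformed: true.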

EricMCornelius commented 6 years ago

Isn't that the point though? Users are explicitly opting in to this functionality, generally in cases where we don't control/know our schema. Happens all the time in security use cases. Certainly we'd love to clean up and verify in advance, but that's not always possible.

Often, the desire is to index a partially dynamic schema which also contains well defined meta/header fields, to enable the best possible level of "schemaless" analysis and data integration. Think structured event log payloads. Right now, these are a liability.

I have written tooling that retrieves the ES dynamic mappings and runs a JSON schema validator to strip out mismatched fields, to work around this limitation, but that's a substantial amount of work to achieve what would be handled much more simply by consistently dropping non-indexable fields with this setting :(

Essentially, right now there's no safe way to use "dynamic" mapping templates with even partially schemaless data, which seems very unfortunate.

clintongormley commented 6 years ago

I agree with @EricMCornelius. It's great to use ingest to clean things up if you know that your troublesome fields are limited to a few names. The problem is that many users don't have that kind of control over their data - they have to deal with what is thrown their way. Using ingest pipelines for that would be like playing whack-a-mole.

I don't like the ignore_malformed option, it is a bit like silent data loss since indexed documents cannot be retrieved based on the malformed value.

I understand what you mean, but I don't think it is silent. The user has to opt in to ignore_malformed, at which point all bets are off. It's best effort only. But it is a get-out-of-jail-free card that a significant number of users need in the real world.

jpountz commented 6 years ago

This still sounds like a dangerous game to me. What if a malformed document is the first to introduce a new field? It makes this field totally unusable. Furthermore, it's not like we are otherwise happy when mappings are not under control; for instance, we enforce a maximum number of fields, a maximum depth of objects and a maximum number of nested fields.

In the worst-case scenario, where you don't know anything about how your documents are formatted, you could still use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts. There wouldn't be any data loss, and I don't think this would be like playing whack-a-mole.
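
For a single conflicting field, that approach might look something like this (the pipeline name, the field foo and the suffixes are made up for illustration):

$ curl -XPUT localhost:9200/_ingest/pipeline/suffix-by-type -H 'Content-Type: application/json' -d '{
  "processors": [
    {
      "script": {
        "source": "if (ctx.foo instanceof Map) { ctx.foo_obj = ctx.remove(\"foo\") } else if (ctx.foo != null) { ctx.foo_val = ctx.remove(\"foo\") }"
      }
    }
  ]
}'

Object-valued foo ends up under foo_obj and scalar foo under foo_val, so the two shapes never collide in the mapping and nothing is dropped.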

clintongormley commented 6 years ago

In the worst-case scenario that you don't know anything about how your documents are formatted, you could still use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts.

In other words, rewrite our dynamic mapping rules in Painless? Imagine somebody who runs a cluster for users in the org who want to log data. The sysadmin has no control of the data coming in. Somebody sends foo: true then foo.bar: false. This may even be a really minor field, compared to the other fields in the document, but now the whole document is rejected unless this poor sysadmin finds the user responsible and gets them to change their app, or tries to write ingest processors (whack-a-mole style) to cover all these issues.

It would be exceptionally difficult to build an ingest processor that checks for all the types of malformed data that might cause an exception in ES, and by the time the document is rejected it is too late. Also, the default action taken in the presence of malformed data will be to simply delete the field, which is essentially what ignore_malformed does. I can imagine users writing ingest processors in special cases where (a) there is a common malformation limited to one or a few fields, and (b) there is something specific you could do to correct the malformation, but this will be the exception, not the rule.

We already support ignore_malformed, users find it a very useful tool, nobody complains about it being dangerous. The only complaint is that it is not supported by all fields, or the implementation on supported fields is sometimes incomplete.

Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too.

jpountz commented 6 years ago

Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too.

I disagree. Indexing data and cleaning up data should be separate concerns. Your arguments are based on the assumption that only a minority of documents are malformed and that not indexing some fields is harmless. I'm not willing to trade away the predictability of Elasticsearch; it's too important, not only for users, but also for those who are in the business of supporting Elasticsearch, like us.

You mentioned the poor sysadmin who has to identify the user who sent a malformed document, what about not being able to investigate a production issue because the field that you need has not been indexed for the last 2 days because of a schema change that got silently ignored due to this ignore_malformed leniency?

clintongormley commented 6 years ago

what about not being able to investigate a production issue because the field that you need has not been indexed for the last 2 days because of a schema change that got silently ignored due to this ignore_malformed leniency?

sure, agreed. Like I said, once you opt in to ignore malformed, you take it with all its issues. But think about this common example: Twitter frequently sends illegal geoshapes. If we didn't have ignore malformed support on that field, the user would either have to:

All this for a field which is nice to have, but not required.

This is why I think that ignore malformed is an important tool for users. And, from this issue, it appears that the number of users who have similar problems with object vs non-object is high. Sure, this is an easier one to fix with ingest than geoshapes, but why wouldn't we just extend this feature that we already have to cover all field types?