elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type #12366

Open samcday opened 9 years ago

samcday commented 9 years ago

Indexing a document that supplies an object for a field that has already been mapped as a string causes a MapperParsingException, even if index.mapping.ignore_malformed has been enabled.

Reproducible test case

On Elasticsearch 1.6.0:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}

$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}

Expected behaviour

Indexing a document that supplies an object where Elasticsearch expects a string should not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, Elasticsearch should ignore the invalid object field.

jpountz commented 6 years ago

Yes, this is why I haven't complained too much about the existing ignore_malformed option. But I still think this option is dangerous and support for it should be carefully considered. It could cause confusion due to bad scores, slow queries, documents that cannot be found or suspicious evolution of disk usage depending on how many documents are malformed.

I am OK with discussing adding such an option on a per-field basis. For instance, I agree this could make sense on geo shapes, as an opt-in. However, adding the ignore_malformed option to all fields would significantly increase the surface area of our API, and at the same time I'm not convinced it is required. The object vs. non-object case, which has been used a couple of times as an example here, can be worked around by adopting naming conventions that ensure objects and regular fields cannot share the same field name.

clintongormley commented 6 years ago

We've had a long discussion about this ticket in our Fix It Friday session without reaching a conclusion. If you are a user who would like this feature, your feedback would be greatly appreciated.

There are two camps:

The first says that ignore_malformed should be restricted only to use cases where it can be difficult to check validity client-side, eg a string which should be a date, or a malformed geoshape. This doesn't apply to the case where the same field name may be used with both an object and (eg) a string. That case would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.

This first camp says that it is a bad thing to use ignore_malformed, because you may base business decisions on the results of aggregations, without realising that only 10% of your documents have valid values.

The second camp says that some users are doing data exploration, and all they want to do is to get as much messy data loaded as possible so that they can explore, which allows them to build cleanup rules which will get them as close as possible to 100% clean data, possibly without ever reaching the goal of 100%.

What is your story? Why would you like (or not like) ignore_malformed to support the difference between object and string fields? Why might the ingest pipeline solution not be sufficient? The more detail you can provide, the better.

EricMCornelius commented 6 years ago

It would be exceptionally difficult to build an ingest processor that checks for all the types of malformed data that might cause an exception in ES, and by the time the document is rejected it is too late.

Therein lies the rub for me. Most security use-cases involve a large effort to integrate external data sources and write message payload parsers. Formats frequently change during application upgrades and configuration changes. It's a continuous battle to get data clean and keep it clean, one that is never won. Having an option to index partially malformed documents makes detecting said partially erroneous documents significantly easier, because controlled invariants (e.g. standard injected metadata fields) can still be relied upon.

Of course, there's nothing here that can't be achieved with (significant) external code that protects ES during indexing, discovers these errors, and deletes the fields from the documents before retrying.

I confess I haven't fully researched how a pipeline transformation would work, so stop me if I'm missing something obvious, but using an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts is not a viable solution. It presumes a controlled schema to begin with, which is not useful for integrating ad hoc external sources that don't follow such conventions.

So, I would cast a vote for the pragmatic over the idealized choice here, having suffered through the garbage that manages to sneak into logs in the real world, and knowing firsthand how much effort is necessary to work around it.

At the end of the day, it's still an explicit opt-in feature; users can blow their own foot off in any number of other ways as well.

jcmcken commented 6 years ago

Is anyone else seeing any difference in any of the ES metrics when this occurs? We graph all of the ES node metrics via Grafana, and I just can't see any difference when this is happening vs. not.

stevedodson commented 6 years ago

TL;DR If we have dynamic field mappings, we should have ignore_malformed as an advanced opt-in option

Elasticsearch shouldn't only work with perfect data; it should do the best it can with imperfect data too.

Based on experience with real-world data sets I agree with the comments that it is often a constant challenge to maintain consistent schemas given application and configuration changes.

Elasticsearch can clearly be configured to be strict, but its ability to ingest unclean data seems useful and popular.

As mentioned above this could be done on the client side. But if options such as dynamic field mapping and ignore_malformed are available and are used correctly, they can help users understand data quality and consistency (at scale) via post-index exploration rather than exception logging in client code. For example, if I have 1TB of JSON, I would prefer to ingest it into a system where I can explore and report on inconsistencies, rather than decipher client exceptions (whack-a-mole cleaning). With the partially indexed data, I can at least get an understanding of the data and hopefully some useful insights.

However, I also agree with @jpountz that there are dangers if these features are used without fully understanding their liabilities. Clearly, ignore_malformed can be dangerous and confusing as it can result in silently unindexed fields that are visible via match_all but unsearchable and presented as empty strings in Kibana. But it is an advanced opt-in feature.
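
As a concrete illustration of that behaviour, here is a minimal sketch (index and field names are illustrative, using current single-type syntax):

PUT demo
{
  "mappings": {
    "properties": {
      "ts": { "type": "date", "ignore_malformed": true }
    }
  }
}

POST demo/_doc
{
  "ts": "not-a-date"
}

The document is accepted and visible via match_all, but the malformed ts value is only kept in _source; no range query on ts will ever find the document.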

I'd therefore like to help enhance the documentation and examples in this area to show the benefits and liabilities. For example, it would be great to show, side by side, examples of strict configuration, dynamic field mapping behaviours, ignore_malformed behaviours, adding copy_to to index everything as text (even malformed values), and an example ingest processor that suffixes types on conflict!

Why would you like (or not like) ignore_malformed to support the difference between object and string fields?

Finally, my 2¢ is that if we accept that ignore_malformed is useful, it should be consistent. Accepting all types (including, for example, arrays) but not objects seems inconsistent. But as objects are different, I don't have a strong view on this feature.

jcmcken commented 6 years ago

@patrick-oyst @subhash @lizhongz For people using the enabled=false workaround, are you finding that this works for nested objects? I'm setting it on a nested object value, but when I reindex using the new mapping, I'm still seeing ES trying to read all the data types of the subfields in that object.
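
(For context, a minimal sketch of the enabled: false workaround being discussed, with illustrative index and field names:)

PUT my-index
{
  "mappings": {
    "properties": {
      "payload": {
        "type": "object",
        "enabled": false
      }
    }
  }
}

With enabled: false, the whole payload object is kept in _source but never parsed, so conflicting subfield types inside it cannot reject a document; the trade-off is that nothing under payload is indexed or searchable.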

javanna commented 6 years ago

@elastic/es-search-aggs

blfrantz commented 6 years ago

Wanted to add my voice/usecase.

We use Elasticsearch to ingest somewhat unpredictable data stored in MongoDB. It's not uncommon for a single field to store different types across objects, especially a mixture of objects and strings. Because our data is unpredictable (and can have lots of fields defined by client systems we don't control), we rely heavily on Elasticsearch's dynamic mapping capability. Of course, anytime an object tries to save a value to a field with an incompatible type, it errors.

This is the correct default behavior, to be sure. And I generally do go back and clean up these inconsistencies once I find them. But as we use Elasticsearch/Kibana as our primary view into the data, and we don't control the creation of this data, we typically don't know about one of these conflicts until after the mapping has been created. At that point, I "fix" the data in our MongoDB to have consistent types. To fix Elasticsearch, however, I have two options if the initial mapping wasn't correct:

  1. Start over from MongoDB. This is expensive as our dataset is fairly large.
  2. Reindex Elasticsearch into an index with the correct mapping specified. But this fails once it hits an object with the wrong type in the old index.

For option 2, I'd like to be able to use ignore_malformed to get the reindex to complete, and then I can separately reprocess the objects that had the wrong type in the old index (which I can find using a search on that index). But since ignore_malformed doesn't work in most of my cases (like objects/strings), this doesn't work for me.

So in short, I agree with those who think ignore_malformed should work consistently for all types. And it should be clearly documented that this is a dangerous feature that can cause silent data loss (though an awesome improvement would be for Elasticsearch to add a "_warnings" field or something that lists all the issues it suppressed for a given object, so we'd have an easy way to identify incomplete objects).

I am of the opinion that one of Elasticsearch's biggest selling points is how well it works with unclean/unpredictable data. Making ignore_malformed work as described would be a big help.

EricMCornelius commented 6 years ago

Now that #29494 has been merged, is the intention to pick this issue back up for application to multiple types (specifically object vs. primitive mismatches)?

Curious what the direction is.

jpountz commented 6 years ago

I think so. I think I was the main person who voiced concerns about increasing the scope of ignore_malformed, which are now mitigated with #29494.
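
(For context: #29494 added the _ignored metadata field, so documents that had values dropped by ignore_malformed can now be found after the fact. A minimal sketch, with an illustrative index name:)

GET my-index/_search
{
  "query": {
    "exists": { "field": "_ignored" }
  }
}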

schourode commented 6 years ago

@clintongormley In one of your comments, you mention

[…] the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.

Can I ask you to elaborate a bit on how you would detect and fix this in an ingest pipeline? I cannot seem to find any processor that allows me to test if a field is an object or a string. Would I have to resort to scripting?

clintongormley commented 6 years ago

@schourode Yes you would have to use scripting.
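
A minimal sketch of such a script processor, assuming the conflicting field is called my_field and wrapping string values into an object so they land under my_field.string (all names illustrative):

PUT _ingest/pipeline/wrap-string-fields
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.my_field instanceof String) { def v = ctx.remove('my_field'); ctx.my_field = ['string': v]; }"
      }
    }
  ]
}

A document sent as {"my_field": "text"} through this pipeline is indexed as {"my_field": {"string": "text"}}, which is compatible with an object mapping for my_field.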

cafuego commented 5 years ago

In case anyone else runs into this issue as well, here is what I used. If the field json.error (supposed to be an object) is a text string, it's copied to the errormessage field and then dropped. If it's an object, it remains unchanged.

curl -H 'Content-type: application/json' -s --write-out "%{response_code}" -o reindex.txt -XPOST -u user:pass http://127.0.0.1:9200/_reindex -d '{
  "conflicts": "proceed",
  "source": {
    "index": "terrible-rubbish"
  },
  "dest": {
    "index": "so-shiny"
  },
  "script": {
    "source": "if (ctx._source.json != null && ctx._source.json.error != null && ctx._source.json.error instanceof String) { ctx._source.errormessage = ctx._source.json.remove(\"error\"); }",
    "lang": "painless"
  }
}'

You need to check whether each parent object element is null, or you'll get an error if you hit an index entry where it is absent.

Mekk commented 5 years ago

I was referred here after raising #41372

Please, please, consider which options actual users have.

If "dirty" data is allowed to enter ES (and preferably flagged somehow), I can inspect it, analyze it, test with it, and count it. And I can see that it exists. In particular, the full power of Kibana is at my disposal.

If "dirty" data is rejected, I must visit the ES logs, with those horrible Java stack traces, to find cryptic error messages about bulk POST rejects. In most cases I don't even have a clue which data caused the problem or what the problem really is (see my #41372 for an example error; good luck guessing why it happened).

Regarding data loss: you fear business decisions made on data with a field missing? I can end up making those business decisions based on a database which is missing 20% of its records entirely, because they were rejected (perhaps due to a minor field that is irrelevant in most cases). And unless I am the ES sysadmin, I won't even know (with dirty data I have a good chance of noticing problematic records while exploring, and I can even run sanity queries).

From ELK's own stack: Logstash does a very good thing with its _grokparsefailure tags (which can be further improved to differentiate between rules with custom tags). Something is wrong? I see records with those tags, can inspect them, count them, and analyze the situation.

graphaelli commented 5 years ago

One issue to consider during implementation if this does get addressed: dynamic templates currently accept this setting even though it is rejected when mapping a field directly.

This is rejected with Mapping definition for [ignore_bool] has unsupported parameters: [ignore_malformed : true]:

PUT test
{
  "mappings": {
    "properties": {
      "ignore_bool": {
        "type": "boolean",
        "ignore_malformed": true
      }
    }
  }
}

ok:

PUT test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "ignore_bools.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "boolean",
            "ignore_malformed": true
          }
        }
      }
    ]
  }
}

The resulting fields are also created without issue.

tinarooot commented 3 years ago

I'm using Spring Cloud Alibaba integrated with Elasticsearch and get the following error when creating an index.

Versions: ES 7.6.2, Spring Boot 2.3.4, Spring Cloud Alibaba 2.2.3, Spring Cloud Hoxton.SR8

ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Invalid UTF-8 start byte 0xb5 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@762d957b; line: 1, column: 86]]];

tmenier commented 2 years ago

@jpountz Are you the decision maker on this? You've argued against it repeatedly, which I disagree with, but if it is what it is then maybe this issue should be closed as a won't-fix? Or am I wrong and it's really still under consideration?

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

felixbarny commented 1 year ago

In Elastic Observability, we're working on making log ingestion more resilient. In that context, we've discussed how to deal with object/scalar conflicts more gracefully and whether it makes sense to prioritize this issue or whether there are other alternatives.

A little while ago, Elasticsearch introduced the subobjects mapping parameter. When that parameter is set to false, it allows documents to have, for example, both a foo and a foo.bar field. Instead of ignoring one of these fields (as proposed in this issue), both fields can be indexed successfully.

Therefore, we're considering making subobjects: false the default in the built-in index template for logs-*-* in the future. After #97972 has been implemented, this will be a backwards-compatible change, as Elasticsearch will then be able to accept both nested and flattened keys. One caveat of this mapping parameter, however, is that it doesn't support the nested field type.
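
For illustration, a minimal sketch of a mapping with subobjects disabled (index and field names are illustrative; requires a version that supports the subobjects parameter):

PUT logs-demo
{
  "mappings": {
    "subobjects": false,
    "properties": {
      "foo": { "type": "keyword" },
      "foo.bar": { "type": "keyword" }
    }
  }
}

POST logs-demo/_doc
{
  "foo": "scalar value",
  "foo.bar": "leaf value"
}

Both foo and foo.bar are mapped as leaf fields, so there is no object/scalar conflict to ignore.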

With that in mind, are there still use cases for supporting ignore_malformed for objects?

EricMCornelius commented 1 year ago

@felixbarny - what's the implication there for any existing painless scripts etc which currently rely on iteration over sub-objects?

I hadn't noticed the new setting before, so I'm just taking a cursory glance, but are we now expecting all source documents to be flattened everywhere?

That sounds like a massive efficiency hit when you need to do selective source document filtering, not to mention a fair amount of data bloat on the wire with deeply nested prefixes being repeated?

felixbarny commented 1 year ago

Hey Eric,

After https://github.com/elastic/elasticsearch/pull/97972 has been implemented, the _source does not need to change. It's just about how the documents are mapped internally. However, as explained in the docs for the object field type, at the Lucene level all fields are stored as a flat key/value mapping anyway. In summary, subobjects: false affects neither the _source of the documents nor how they're stored; it's just a different way of mapping and parsing them.

If you're unsure whether your source documents contain nested or flattened fields, you can use the field API in Painless scripts, which is able to access fields in either notation. We're also working on adding support for accessing dotted fields in ingest processors: https://github.com/elastic/elasticsearch/issues/96648.
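
For example, a minimal sketch of the field API used from a runtime field at search time (index and field names are illustrative; assumes a version where the Painless field API is available):

GET my-index/_search
{
  "runtime_mappings": {
    "bar_copy": {
      "type": "keyword",
      "script": "emit(field('foo.bar').get(''))"
    }
  },
  "fields": [ "bar_copy" ]
}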

But again, you don't need to change the structure of your documents when sending them to Elasticsearch. The idea is that dotted and nested fields are treated equally in all places.

Having said that, in OpenTelemetry, all attributes are by definition flat key/value pairs. As we continue to improve support for OpenTelemetry, we may map OTel data with flattened keys.

selective source document filtering

I'd assume that source filtering using wildcards would still work as expected.

fair amount of data bloat on the wire with deeply nested prefixes being repeated

That's fair. But I'd expect compression to mostly take care of that anyway?

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)
