samcday opened this issue 9 years ago
Yes, this is why I haven't complained too much about the existing ignore_malformed option. But I still think this option is dangerous and support for it should be carefully considered. It could cause confusion due to bad scores, slow queries, documents that cannot be found, or surprising growth in disk usage, depending on how many documents are malformed.
I am OK with discussing adding such an option on a per-field basis. For instance, I agree this could make sense on geo shapes, as an opt-in. However, adding the ignore_malformed option to all fields would significantly increase the surface of our API, and at the same time I'm not convinced it is required. The object vs. non-object case, which has been used a couple of times as an example here, can be worked around by adopting naming conventions that make sure that objects and regular fields cannot have the same field name.
We've had a long discussion about this ticket in our Fix It Friday session without reaching a conclusion. If you are a user who would like this feature, your feedback would be greatly appreciated.
There are two camps:
The first says that ignore_malformed should be restricted to use cases where it can be difficult to check validity client-side, e.g. a string which should be a date, or a malformed geoshape. This doesn't apply to the case where the same field name may be used with an object and (e.g.) a string. This would be easy to detect in (e.g.) an ingest pipeline and easy to fix by renaming (e.g.) the string field to my_field.string or similar, which would be compatible with the object mapping.
This first camp says that it is a bad thing to use ignore_malformed, because you may base business decisions on the results of aggregations, without realising that only 10% of your documents have valid values.
The second camp says that some users are doing data exploration, and all they want to do is to get as much messy data loaded as possible so that they can explore, which allows them to build cleanup rules which will get them as close as possible to 100% clean data, possibly without ever reaching the goal of 100%.
What is your story? Why would you like (or not like) ignore_malformed to support the difference between object and string fields? Why might the ingest pipeline solution not be sufficient? The more detail you can provide, the better.
It would be exceptionally difficult to build an ingest processor that checks for all the types of malformed data that might cause an exception in ES, and by the time the document is rejected it is too late.
Therein lies the rub for me. Most security use-cases involve a large effort to integrate external data sources and write message payload parsers. Formats frequently change during application upgrades and configuration changes. It's a continuous battle to get data clean and keep it clean, one that is never won. Having an option to index partially malformed documents makes detecting said partially erroneous documents significantly easier, because controlled invariants (e.g. standard injected metadata fields) can still be relied upon.
Of course, there is nothing here that can't be achieved with (significant) external code that protects ES during indexing, discovers these errors, and deletes the offending fields from the documents before retrying.
I confess I haven't fully researched how a pipeline transformation would work, so stop me if I'm missing something obvious, but "use an ingest processor to enforce prefixes or suffixes for field names based on their actual type in order to avoid conflicts" is not a viable solution. It presumes a controlled schema to begin with, which is not useful for integrating ad hoc external sources that don't follow such conventions.
So, I would cast a vote for the pragmatic over the idealized choice here, having suffered through the garbage that manages to sneak into logs in the real world, and knowing first-hand how much effort is necessary to work around it.
At the end of the day, it's still an explicit opt-in feature, users can blow their own foot off any number of other ways as well.
Is anyone else seeing any difference in any of the ES metrics when this occurs? We graph all of the ES node metrics via Grafana, and I just can't see any difference when this is happening vs. not.
TL;DR: If we have dynamic field mappings, we should have ignore_malformed as an advanced opt-in option.
Elasticsearch shouldn't only work with perfect data, it should do the best it can with imperfect data too.
Based on experience with real-world data sets I agree with the comments that it is often a constant challenge to maintain consistent schemas given application and configuration changes.
Elasticsearch clearly can be configured to be strict, but its ability to ingest unclean data seems useful and popular.
As mentioned above, this could be done on the client side. But if options such as dynamic field mapping and ignore_malformed are available and are used correctly, they can help users understand data quality and consistency (at scale) via post-index exploration rather than exception logging in client code. For example, if I have 1TB of JSON, I would prefer to ingest it into a system where I can explore and report on inconsistencies, rather than decipher client exceptions (whack-a-mole cleaning). With the partially indexed data, I can at least get an understanding of the data and hopefully some useful insights.
However, I also agree with @jpountz that there are dangers if these features are used without fully understanding their liabilities. Clearly, ignore_malformed can be dangerous and confusing, as it can result in silently unindexed fields that are visible via match_all but unsearchable, and presented as empty strings in Kibana. But it is an advanced opt-in feature.
I'd therefore like to help enhance the documentation and examples in this area to show benefits and liabilities. For example, it would be great to show, side by side, examples of strict configuration, dynamic field mapping behaviours, ignore_malformed behaviours, adding copy_to to index everything as text (even malformed values), and an example ingest processor that can suffix types on conflict!
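To make the copy_to part of that concrete, here is a rough sketch of the kind of mapping I mean (index and field names are invented; whether copy_to still receives a value that the typed field dropped as malformed is exactly the behaviour such documentation would need to spell out):

PUT messy-logs
{
  "mappings": {
    "properties": {
      "status_code": {
        "type": "integer",
        "ignore_malformed": true,
        "copy_to": "all_text"
      },
      "all_text": {
        "type": "text"
      }
    }
  }
}

With this mapping, a document whose status_code is not a valid integer should still be indexed, and the intent is that its raw value stays searchable as free text via all_text (again, behaviour the example would need to verify).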
Why would you like (or not like) ignore_malformed to support the difference between object and string fields?
Finally, my 2¢ is that if we accept that ignore_malformed is useful, it should be consistent. Accepting all types (including, for example, array) but not object seems inconsistent. But as objects are different, I don't have a strong view on this feature.
@patrick-oyst @subhash @lizhongz For people using the enabled=false workaround, are you finding that this works for nested objects? I'm setting it on a nested object value, but when I reindex using the new mapping, I'm still seeing ES trying to read all the data types of the subfields in that object.
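For reference, this is the shape of the enabled=false workaround under discussion (index and field names here are just for illustration); with enabled set to false, Elasticsearch keeps the object in _source but does not parse or index anything beneath it:

PUT my-index
{
  "mappings": {
    "properties": {
      "payload": {
        "type": "object",
        "enabled": false
      }
    }
  }
}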
@elastic/es-search-aggs
Wanted to add my voice/use case.
We use Elasticsearch to ingest somewhat unpredictable data stored in MongoDB. It's not uncommon for a single field to store different types across objects, especially a mixture of objects and strings. Because our data is unpredictable (and can have lots of fields defined by client systems we don't control), we rely heavily on Elasticsearch's dynamic mapping capability. Of course, any time an object tries to save a value to a field with an incompatible type, it errors.
This is the correct default behavior, to be sure. And I generally do go back and clean up these inconsistencies once I find them. But as we use Elasticsearch/Kibana as our primary view into the data, and we don't control the creation of this data, we typically don't know about one of these conflicts until after the mapping has been created. At that point, I "fix" the data in our MongoDB to have consistent types. To fix Elasticsearch, however, I have two options if the initial mapping wasn't correct:
For option 2, I'd like to be able to use ignore_malformed to get the reindex to complete, and then I can separately reprocess the objects that had the wrong type in the old index (which I can find using a search on that index). But since ignore_malformed doesn't work in most of my cases (like objects/strings), this doesn't work for me.
So in short, I agree with those who think ignore_malformed should work consistently for all types. And it should be clearly documented that this is a dangerous feature that can cause silent data loss (though an awesome improvement would be for Elasticsearch to add a "_warnings" field or something that lists all the issues it suppressed for a given object, so we'd have an easy way to identify incomplete objects).
I am of the opinion that one of Elasticsearch's biggest selling points is how well it works with unclean/unpredictable data. Making ignore_malformed work as described would be a big help.
Now that #29494 has been merged, is the intention to pick this issue back up for application to multiple types (specifically object vs. primitive mismatches)?
Curious what the direction is.
I think so. I think I was the main person who voiced concerns about increasing the scope of ignore_malformed, which are now mitigated by #29494.
@clintongormley In one of your comments, you mention
[…] the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.
Can I ask you to elaborate a bit on how you would detect and fix this in an ingest pipeline? I cannot seem to find any processor that allows me to test whether a field is an object or a string. Would I have to resort to scripting?
@schourode Yes you would have to use scripting.
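For anyone looking for a concrete starting point, a minimal sketch of that scripting approach could look like the following (pipeline id and field names are made up for illustration; treat it as a sketch, not an authoritative recipe). A script processor checks whether the field is a String and, if so, moves it under a sub-key so it no longer conflicts with the object mapping:

PUT _ingest/pipeline/fix-error-type-conflict
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.json != null && ctx.json.error instanceof String) { String v = ctx.json.remove('error'); ctx.json.error = ['string': v]; }"
      }
    }
  ]
}

Documents indexed through this pipeline end up with json.error.string when the source sent a plain string, which stays compatible with json.error being mapped as an object.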
In case anyone else runs into this issue as well, here is what I used. If the field json.error (supposed to be an object) is a text string, it's copied to the errormessage field and then dropped. If it's an object, it remains unchanged.
curl -H 'Content-type: application/json' -s --write-out "%{response_code}" -o reindex.txt -XPOST -u user:pass http://127.0.0.1:9200/_reindex -d '{
  "conflicts": "proceed",
  "source": {
    "index": "terrible-rubbish"
  },
  "dest": {
    "index": "so-shiny"
  },
  "script": {
    "source": "if (ctx._source.json != null && ctx._source.json.error != null && ctx._source.json.error instanceof String) { ctx._source.errormessage = ctx._source.json.remove(\"error\"); }",
    "lang": "painless"
  }
}'
You need to check whether each parent object element is null, or you'll get an error if you hit an index entry where it is absent.
I was referred here after raising #41372
Please, please, consider which options actual users have.
If "dirty" data is allowed to enter ES (and preferably flagged somehow), I can inspect it, I can analyze it, I can find it to test with, I can count it. And I can see that it exists. In particular, I have the full power of Kibana at my disposal.
If "dirty" data is rejected, I must visit the ES logs, with their horrible Java stack traces, to find a cryptic error message about bulk post rejects. In most cases I don't even have a clue which data caused the problem or what the problem really is (see my #41372 for an example error; good luck guessing why it happened).
Regarding data loss: you fear business decisions made on data with a field missing? I can just as well make those business decisions on a database that is missing 20% of its records entirely because they were rejected (perhaps due to a minor field that is irrelevant in most cases). And unless I am the ES sysadmin, I won't even know (with dirty data I have a good chance of noticing problematic records while exploring, and I can even run sanity queries).
From the ELK world itself: Logstash does a very good thing with its _grokparsefailure tags (which can be further improved to differentiate between rules using custom tags). Something is wrong? I see records with those tags and can inspect them, count them, and analyze the situation.
One issue to consider during implementation if this does get addressed: dynamic templates currently allow this setting even though it is rejected when directly mapping a field.
Rejected with "Mapping definition for [ignore_bool] has unsupported parameters: [ignore_malformed : true]":
PUT test
{
  "mappings": {
    "properties": {
      "ignore_bool": {
        "type": "boolean",
        "ignore_malformed": true
      }
    }
  }
}
OK:
PUT test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "ignore_bools.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "boolean",
            "ignore_malformed": true
          }
        }
      }
    ]
  }
}
The resulting fields are also created without issue.
I'm using Spring Cloud Alibaba integrated with ES and get the following error when creating an index.
ES version 7.6.2, Spring Boot 2.3.4, Spring Cloud Alibaba 2.2.3, Spring Cloud Hoxton.SR8:
ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Invalid UTF-8 start byte 0xb5 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@762d957b; line: 1, column: 86]]];
@jpountz Are you the decision maker on this? You've argued against it repeatedly, which I disagree with, but if it is what it is then maybe this issue should be closed as a won't-fix? Or am I wrong and it's really still under consideration?
Pinging @elastic/es-search (Team:Search)
In Elastic Observability, we're working on making log ingestion more resilient. In that context, we've discussed how to deal with object/scalar conflicts more gracefully and whether it makes sense to prioritize this issue or whether there are other alternatives.
A little while ago, Elasticsearch introduced the subobjects mapping parameter. When that parameter is set to false, it allows documents to have, for example, both a foo and a foo.bar field. Instead of ignoring one of these fields (what's proposed in this issue), both fields can be indexed successfully.
Therefore, we're considering making subobjects: false the default in the built-in index template for logs-*-* in the future. After #97972 has been implemented, this will be a backwards-compatible change, as Elasticsearch will then be able to accept both nested and flattened keys. One caveat of this mapping parameter, however, is that it doesn't support the nested field type.
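For illustration, a minimal sketch of what subobjects: false allows (the index name is invented; this assumes a version where the parameter is available):

PUT flat-logs
{
  "mappings": {
    "subobjects": false
  }
}

PUT flat-logs/_doc/1
{
  "foo": "a scalar value",
  "foo.bar": "a dotted key that would normally conflict with the scalar foo"
}

Both foo and foo.bar are mapped as leaf fields rather than one of them being rejected or ignored.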
With that in mind, are there still use cases for supporting ignore_malformed for objects?
@felixbarny - what's the implication there for any existing Painless scripts etc. which currently rely on iteration over sub-objects?
I've not noticed the new setting before so just taking a cursory glance, but are we now expecting all source documents to be flattened everywhere?
That sounds like a massive efficiency hit when you need to do selective source document filtering, not to mention a fair amount of data bloat on the wire with deeply nested prefixes being repeated?
Hey Eric,
After https://github.com/elastic/elasticsearch/pull/97972 has been implemented, the _source does not need to change. It's just about how the documents are mapped internally. However, as explained in the docs for the object field type, on the Lucene level all fields are stored as a flat key/value mapping anyway. In summary, subobjects: false doesn't affect the _source of the documents, nor how they're stored. It's just a different way of mapping and parsing them.
If you're unsure if your source documents contain nested or flattened fields, you can use the field API in painless scripts which is able to access fields in either notation. We're also working on adding support for accessing dotted fields in ingest processors: https://github.com/elastic/elasticsearch/issues/96648.
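As a small, hypothetical illustration of the field API (index, field, and runtime-field names are invented, and foo.bar is assumed to be mapped as a keyword so it has doc values to read), field('foo.bar') resolves the value regardless of whether the document was sent as a nested object or with a dotted key:

PUT my-logs/_mapping
{
  "runtime": {
    "bar_or_default": {
      "type": "keyword",
      "script": {
        "source": "emit(field('foo.bar').get('unknown'))"
      }
    }
  }
}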
But again, you don't need to change the structure of your documents when sending them to Elasticsearch. The idea is that dotted and nested fields are treated equally in all places.
Having said that, in OpenTelemetry, all attributes are by definition flat key/value pairs. As we continue to improve support for OpenTelemetry, we may map OTel data with flattened keys.
selective source document filtering
I'd assume that source filtering using wildcards would still work as expected.
fair amount of data bloat on the wire with deeply nested prefixes being repeated
That's fair. But I'd expect compression to mostly take care of that anyway?
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Indexing a document with an object value on a field that has already been mapped as a string type causes a MapperParsingException, even if index.mapping.ignore_malformed has been enabled.
Reproducible test case
On Elasticsearch 1.6.0:
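The original reproduction steps are not preserved here, but a minimal sketch of the scenario described above (index, type, and field names are made up for illustration) would be:

curl -XPUT 'http://localhost:9200/test' -d '{ "settings": { "index.mapping.ignore_malformed": true } }'
# First document maps "message" as a string
curl -XPUT 'http://localhost:9200/test/doc/1' -d '{ "message": "plain text" }'
# Second document sends an object for the same field and fails with
# MapperParsingException, despite index.mapping.ignore_malformed being enabled
curl -XPUT 'http://localhost:9200/test/doc/2' -d '{ "message": { "code": 42 } }'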
Expected behaviour
Indexing a document with an object field where Elasticsearch expected a string field should not fail the whole document when index.mapping.ignore_malformed is enabled. Instead, it should ignore the invalid object field.