elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Should we ignore empty values on keyword fields #34484

Open jimczi opened 5 years ago

jimczi commented 5 years ago

Today, when an empty value is passed to a keyword field, we store an empty value in the doc_values but not in the inverted index. To quote @rjernst here:

this seems inconsistent because the doc values should be the inverse of indexed values

It would be weird to index an empty value, since there is nothing to normalize or analyze in this case, but we could ignore empty values entirely instead.
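To make the mismatch concrete, here is a toy Python model of the behavior described above. It is only a sketch and not Elasticsearch code: the `ToyKeywordField` class and its methods are hypothetical names, and the rule that an empty string produces no term is taken from the description in this comment.

```python
from collections import defaultdict


class ToyKeywordField:
    """Toy model of one keyword field; not Elasticsearch code."""

    def __init__(self):
        self.inverted = defaultdict(set)  # term -> ids of docs containing it
        self.doc_values = {}              # doc id -> stored value

    def index(self, doc_id, value):
        if value is None:
            return  # field not present: nothing is stored anywhere
        # Doc values keep whatever was passed, including "".
        self.doc_values[doc_id] = value
        # Assumption from the description above: "" yields no term.
        if value != "":
            self.inverted[value].add(doc_id)

    def docs_in_inverted_index(self):
        ids = set()
        for posting in self.inverted.values():
            ids |= posting
        return ids

    def docs_in_doc_values(self):
        return set(self.doc_values)


field = ToyKeywordField()
field.index(1, "error")
field.index(2, "")    # empty string
field.index(3, None)  # field not present

# Docs visible through doc values but not through the inverted index.
only_in_doc_values = field.docs_in_doc_values() - field.docs_in_inverted_index()
print(only_in_doc_values)  # prints {2}
```

In this model, document 2 appears in one structure but not the other, which is the inconsistency in question. Ignoring empty values entirely would correspond to returning early for `""` as well as for `None`.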

elasticmachine commented 5 years ago

Pinging @elastic/es-search-aggs

polyfractal commented 5 years ago

We chatted about this internally. The first action item is to document the current behavior so that users are aware of it, regardless of what we decide.

The general sentiment is that we should be consistent here and not store empty strings in doc values (DV) either. There were some concerns about conflating null and empty, since other systems treat them separately. It was unclear how much users rely on empty strings, rather than just tolerating their existence because that's how DV (and other systems) work today.

If we decide to make DV consistent with the inverted index, there are several moving parts that need addressing:

droberts195 commented 3 years ago

/cc @elastic/ml-core because whatever is decided here probably affects ML more than most.

For anomaly detection, #60141 is related. We actually have extra complexity within the ML C++ code to carefully treat "not present" and "empty string" the same way. We were thinking of removing this because it dates back to when we received data in CSV format, and CSV cannot distinguish not present from empty string (for a field that exists in at least one record in the file). JSON can distinguish not present from empty string, but this issue is proposing that Elasticsearch then mask that difference. There will still be an inconsistency, though, if people decide to look at _source.
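As a small illustration of the CSV versus JSON point, here is a sketch using Python's standard `csv` and `json` modules. The `host` and `status` field names and the `field_state` helper are made up for this example and have nothing to do with the ML code itself.

```python
import csv
import io
import json


def field_state(record, name):
    """Classify a field as 'absent', 'empty', or 'value'."""
    if name not in record or record[name] is None:
        return "absent"
    return "empty" if record[name] == "" else "value"


# CSV: once "status" is a column, a row can only leave the cell empty.
csv_text = "host,status\nweb-1,\nweb-2,ok\n"
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: a document can omit the field or set it to "".
json_docs = [json.loads(line) for line in (
    '{"host": "web-1"}',
    '{"host": "web-1", "status": ""}',
)]

csv_states = [field_state(row, "status") for row in csv_rows]
json_states = [field_state(doc, "status") for doc in json_docs]
print(csv_states)   # ['empty', 'value']
print(json_states)  # ['absent', 'empty']
```

The CSV rows never report "absent" for a column that exists in the header, while the JSON documents can report both "absent" and "empty".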

Unlike anomaly detection, data frame analytics is more recent functionality that has no special workarounds to make not present and empty string the same. We might need to change something there if this proposal is implemented.

Obviously the situations where these inconsistencies occur are edge cases and won't affect that many users. But when they do occur we need to make sure they are not causing results that lead to hard-to-diagnose support cases.

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)