elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support for a fully numeric flattened field #61550

Open jimczi opened 4 years ago

jimczi commented 4 years ago

This issue is a spinoff of #43805 that focuses on a specific use case: supporting numeric fields in the flattened field. We've discussed this internally and agreed that it is something that we'd like to provide. This new field could be considered as the numeric version of the flattened field where all values should be parseable as numbers. The details of the implementation are still unclear but multiple ideas were shared internally:

This issue is a placeholder to provide feedback and updates on the overall plan (supporting a fully numeric flattened field).

elasticmachine commented 4 years ago

Pinging @elastic/es-search (:Search/Mapping)

jpountz commented 4 years ago

Once we have this field, I guess the next question will be how to deal with objects that have a mix of strings and numbers. This makes me wonder whether we should try to fold this functionality into the existing flattened field, or start thinking about a sort of wrapper that could redirect fields to either flattened or its numeric variant at both index and search time, e.g. something like this:

{
  "foo": {
    "type": "flattened",
    "numeric_field_pattern": [ "*.count" ]
  }
}

so that an object like

{
  "foo": {
    "tags": [ "x", "y" ],
    "count": 42
  },
  "bar": {
    "tags": [ "x" ],
    "count": 100
  }
}

would have its foo.tags/bar.tags fields indexed and searched with flattened while the foo.count/bar.count fields would be indexed and searched with the numeric variant.
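
A minimal Python sketch of the routing idea described above, for illustration only. It is not Elasticsearch code: `numeric_field_pattern` is the hypothetical option from the proposed mapping, and the helper names (`flatten`, `route_leaves`) and the glob semantics borrowed from `fnmatch` are assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical option taken from the proposed mapping; not an existing setting.
NUMERIC_FIELD_PATTERNS = ["*.count"]

def flatten(obj, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON-like object."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                yield from flatten(item, path)
            else:
                yield path, item

def route_leaves(doc, patterns=NUMERIC_FIELD_PATTERNS):
    """Split leaves into keyword-style and numeric-style buckets by path pattern."""
    keyword, numeric = [], []
    for path, value in flatten(doc):
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            numeric.append((path, float(value)))  # would go to the numeric variant
        else:
            keyword.append((path, str(value)))    # would go to flattened
    return keyword, numeric

doc = {
    "foo": {"tags": ["x", "y"], "count": 42},
    "bar": {"tags": ["x"], "count": 100},
}
keyword, numeric = route_leaves(doc)
```

Here `numeric` holds the `foo.count` and `bar.count` leaves and `keyword` holds the tag values, mirroring the split described above. The same pattern check would have to run at search time to pick the right field for a query.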

jtibshirani commented 3 years ago

@polyfractal brought up the good point that in some telemetry use cases, all values represent counts. This type of data is similar to a histogram, but with labeled buckets. For example, we could be tracking the usage of every aggregation:

{
  "agg_usage": {
    "terms": 101,
    "date_histogram": 2450,
    ...
  }
}

It would be natural to perform a histogram-like aggregation on agg_usage to sum up the counts for each entry terms, date_histogram, etc. When designing the feature, it'd be good to keep this case in mind -- for example, it could affect whether we want to distinguish long counts vs. arbitrary numerics.
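
As a rough illustration of the aggregation described above, the following Python sketch sums labeled counts across documents on the client side. The function name `sum_labeled_counts` and the second document are made up for the example; it says nothing about how such an aggregation would be implemented in Elasticsearch.

```python
from collections import Counter

def sum_labeled_counts(docs, field="agg_usage"):
    """Sum the count stored under each label of `field` across all documents."""
    totals = Counter()
    for doc in docs:
        for label, count in doc.get(field, {}).items():
            totals[label] += int(count)  # assumes every value is a whole-number count
    return totals

docs = [
    {"agg_usage": {"terms": 101, "date_histogram": 2450}},
    {"agg_usage": {"terms": 9, "avg": 3}},  # made-up second document
]
totals = sum_labeled_counts(docs)
```

The `int(count)` conversion is where the long-vs-arbitrary-numeric question shows up: summing counts is exact with integers, while mixed floating-point values would need different handling.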

hendrikmuhs commented 3 years ago

it could affect whether we want to distinguish long counts vs. arbitrary numerics

In a similar fashion, this feature might be useful for ML use cases. It seems to me that being able to specify the sub-type (long, float, double, ...) would be good. For ML these vectors can become huge, but on the other hand they don't necessarily require a double. Being able to define the sub-type (e.g. float) would be a way to choose between precision and space.
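
The precision-versus-space trade-off can be seen with plain in-memory arrays. This Python sketch only compares 32-bit and 64-bit IEEE 754 values using the standard `array` module; the sample values are arbitrary, and actual on-disk sizes in Elasticsearch depend on encoding and compression, which this does not model.

```python
from array import array

values = [0.1, 1234.5678, 16777217.0]  # arbitrary sample values

as_float = array("f", values)   # 32-bit single precision
as_double = array("d", values)  # 64-bit double precision

float_bytes = as_float.itemsize * len(as_float)
double_bytes = as_double.itemsize * len(as_double)

# 16777217 (2**24 + 1) is not representable in single precision,
# so the float array stores the nearest representable value instead.
lost_precision = as_float[2] != as_double[2]
```

The float array takes half the bytes of the double array, at the cost of rounding values that need more than 24 bits of mantissa.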

axw commented 3 years ago

Does this issue cover support for histogram and aggregate_metric_double fields? For the APM/Metrics use-case of https://github.com/elastic/elasticsearch/issues/63530, we will need to store basic numbers, histograms, and at some point probably aggregate metrics.

egalpin commented 3 years ago

+1, following. This feature will unblock the ability to remove nested fields in a use case I have 😁

baybatu commented 3 years ago

+1, following. I need numeric (float) flattened fields so that I can use field_value_factor functions across thousands of unique field names. Currently, I have had to increase the default mapping field limit, but the docs describe that as bad practice.

patodevilla commented 3 years ago

+1, following!

yshyshkin commented 3 years ago

+1. It would really help in storing lots of financial information without a mapping explosion.

Fgerthoffert commented 2 years ago

+1

koenbouwmans commented 1 year ago

+1

vchhabra commented 1 year ago

While this is being worked on, I have been able to work around the lack of numeric range queries on the flattened type by using runtime fields at query time ('query time' because in my case the numeric field names are not known in advance).

Example:

Index Mappings

{
  "flattened_test": {
    "mappings": {
      "properties": {
        "host": {
          "type": "flattened"
        }
      }
    }
  }
}

Sample documents

"host": {
  "hostname": "bionic_1",
  "name": "bionic_1",
  "num_one": 1323
}
---
"host": {
  "hostname": "bionic_2",
  "name": "bionic_2",
  "num_one": 2323
}
---
"host": {
  "hostname": "bionic_3",
  "name": "bionic_3",
  "num_one": 3323
}

Sample Range Query

GET flattened_test/_search
{
  "runtime_mappings": {
    "host.num_one": {
      "type": "long"
    }
  },
  },

  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "host.num_one": {
              "gte": 4000,
              "lte": 7000
            }
          }
        }
      ]
    }
  }
}

This serves me well for the use-case at hand. And I understand the performance implications of query time runtime fields and the trade-off is acceptable in my case.

However, being new to ES, I wanted to check here: am I overlooking anything obvious, or is there any other feedback?

Thanks,
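
For context on why a numeric view of the field matters here: leaf values of a flattened field are indexed as keywords, so a range query applied directly to a flattened key compares strings rather than numbers. The Python sketch below only imitates the two comparison modes; the helper names are made up, and the values 50 and 45000 are added to the sample data to expose the difference.

```python
def keyword_range(values, gte, lte):
    """Range filter the way a keyword field applies it: string comparison."""
    return [v for v in values if str(gte) <= str(v) <= str(lte)]

def numeric_range(values, gte, lte):
    """Range filter the way a numeric (e.g. long) field applies it."""
    return [v for v in values if gte <= v <= lte]

# 1323, 2323, 3323 come from the sample documents; 50 and 45000 are made up.
values = [1323, 2323, 3323, 50, 45000]

keyword_hits = keyword_range(values, 4000, 7000)
numeric_hits = numeric_range(values, 4000, 7000)
```

With string comparison, 50 and 45000 fall inside the 4000 to 7000 range even though neither is numerically between the bounds. Values that all have the same number of digits happen to sort the same way in both modes, which can hide the problem in small tests.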

felixbarny commented 1 year ago

To follow up and update on the use case in Elastic APM (https://github.com/elastic/elasticsearch/issues/61550#issuecomment-772117004):

We're not planning to use flattened. Instead, we'll use subobjects: false at the root of the metric mappings. This will allow ingesting metrics such as connections and connections.idle in the same index, without causing a mapping conflict. Currently, this requires all incoming documents to be flat but the ES team is working on also supporting nested object notations in documents where subobjects are disabled in the mapping: #97972. This makes adding the subobjects: false flag backwards compatible.
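
To illustrate the conflict mentioned above, here is a small Python sketch, assuming the default behaviour where dotted names are expanded into nested objects. `expand_dotted` is a made-up helper and the metric values are invented; it models the naming conflict only, not the actual mapping code.

```python
def expand_dotted(fields):
    """Expand dotted names into nested objects; fail if a name is both leaf and object."""
    tree = {}
    for name, value in fields.items():
        node = tree
        *parents, leaf = name.split(".")
        for part in parents:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"'{part}' is both a value and an object")
            node = child
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"'{leaf}' is both a value and an object")
        node[leaf] = value
    return tree

metrics = {"connections": 12, "connections.idle": 4}  # invented values

try:
    expand_dotted(metrics)
    conflict = False
except ValueError:
    conflict = True

# With names kept flat (the idea behind subobjects: false), both leaves coexist.
flat_fields = dict(metrics)
```

Expanding the names fails because `connections` would have to be a number and an object at once, while the flat form keeps `connections` and `connections.idle` as two independent leaf fields.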

I'm sure there are other valid use cases for numeric flattened fields, though, such as avoiding field explosions.

Having said that, we're also working on a new way of dealing with field explosions by ignoring fields that exceed the limit instead of rejecting documents: https://github.com/elastic/elasticsearch/pull/96235

leehaotan commented 6 months ago

+1 need numeric fields in flattened types to be fully supported for range queries

elasticsearchmachine commented 2 months ago

Pinging @elastic/es-search-foundations (Team:Search Foundations)