elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Use metadata for data_stream_auto_routing #13528

Open · smalenfant opened this issue 2 years ago

smalenfant commented 2 years ago

While configuring Logstash for data streams, I noticed that the following would get indexed:

    "data_stream": {
      "dataset": "ds",
      "type": "metrics",
      "namespace": "cdn"
    },

These fields are now duplicated and take up a lot of space. This should use @metadata instead.

I tried to work around this, but the data stream settings don't accept variables:

output {
  elasticsearch {
    # This setting must be one of ["logs", "metrics", "synthetics"]:
    # Expected one of ["logs", "metrics", "synthetics"], got ["%{[@data_stream][type]}"]
    data_stream_type => "%{[@data_stream][type]}"
    ...
  }
}

I then tried to work around it using the following:

filter {
  mutate {
    add_field => {
      "[@data_stream][type]" => "metrics"
      "[@data_stream][dataset]" => "ds"
      "[@data_stream][namespace]" => "%{[tags][cdn]}"
    }
  }
}

output {

  stdout { }
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    http_compression => true
    sniffing => false
    data_stream => true
    data_stream_auto_routing => false
    data_stream_dataset => "%{[@data_stream][dataset]}"
    data_stream_namespace => "%{[@data_stream][namespace]}"
  }
}

But that didn't work: the variable expansion didn't happen, and the resulting index was literally logs-%{[@data_stream][dataset]}-%{[@data_stream][namespace]}.

Maybe I'm doing something wrong. Please let me know if there is a workaround.

kares commented 2 years ago

> These fields are now duplicated and take up a lot of space. This should use @metadata instead.

The data_stream.type and related fields are of the constant_keyword type (unless the mapping has been changed). These take up no extra space: the value is stored once in the mapping rather than in every document.
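
For reference, the built-in data stream index templates declare roughly the following; a custom template would need something similar (a sketch using the standard data stream field names):

{
  "mappings": {
    "properties": {
      "data_stream": {
        "properties": {
          "type":      { "type": "constant_keyword" },
          "dataset":   { "type": "constant_keyword" },
          "namespace": { "type": "constant_keyword" }
        }
      }
    }
  }
}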

The data_stream_type option and its related options in the ES output are meant to be static (for now), for the case of writing to a known data stream (type). The dynamic (multi-DS) case is supposed to be handled by auto-routing on the data_stream fields the event itself contains; effectively, the static options are a substitute for when the event does not carry DS routing information.
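
In other words, something like the following should route dynamically without any sprintf support (a minimal, untested sketch; the field values are illustrative):

filter {
  mutate {
    # Set the event's own data_stream fields; with auto-routing
    # enabled (the default), the elasticsearch output uses them to
    # build the target name: <type>-<dataset>-<namespace>.
    add_field => {
      "[data_stream][type]"      => "metrics"
      "[data_stream][dataset]"   => "ds"
      "[data_stream][namespace]" => "cdn"
    }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    data_stream => true
    data_stream_auto_routing => true  # the default, shown for clarity
  }
}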

Not sure there are other compelling reasons to support sprintf on these, apart from the mistaken idea of saving space by dropping the data_stream.type, data_stream.namespace, and data_stream.dataset keyword fields...

smalenfant commented 2 years ago

That was the key info here... constant_keyword. But when using your own mapping, these need to be configured. I would still love to not see them in the index output, if possible.

Is there any way to remove certain fields by default from Kibana queries?

michaelhyatt commented 2 years ago

+1, this could be super handy for dynamically creating data streams. I would like to dispatch my events dynamically into different data streams by specifying a variable value for the data_stream_dataset field. Example:

    elasticsearch {
        hosts => "https://10.1.1.45:9200"
        user => "YYY"
        password => "XXX"
        data_stream => "true"
        data_stream_type => "metrics"
        data_stream_dataset => "%{dataset}"
        data_stream_namespace => "snmpwalk"
    }

sbocquet commented 1 year ago

+1, this could be a very nice upgrade: https://discuss.elastic.co/t/dynamic-naming-of-elasticsearch-data-streams/325278/3

MatheusGelinskiPires commented 4 months ago

+1, as it's a very interesting way to avoid a lot of 'if' conditions in a Logstash/ingest pipeline (see the sketch below).
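
For illustration, this is the kind of per-dataset branching that variable support would remove (a hypothetical sketch; the dataset names are made up):

output {
  # Without variable support, every dataset needs its own branch.
  if [dataset] == "cpu" {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      data_stream => true
      data_stream_type => "metrics"
      data_stream_dataset => "cpu"
      data_stream_namespace => "snmpwalk"
    }
  } else if [dataset] == "memory" {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      data_stream => true
      data_stream_type => "metrics"
      data_stream_dataset => "memory"
      data_stream_namespace => "snmpwalk"
    }
  }
}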

BTW, using a variable to define the dataset is already possible with the Reroute processor in ingest pipelines.
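
For example (a sketch; the reroute processor requires a recent Elasticsearch version, and the {{field}} references here are illustrative):

PUT _ingest/pipeline/route-by-dataset
{
  "processors": [
    {
      "reroute": {
        "dataset": "{{service.name}}",
        "namespace": "default"
      }
    }
  ]
}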