elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Validation and sanitization of dataset and namespace fields #3946

Open ruflin opened 10 months ago

ruflin commented 10 months ago

The fields in the data stream naming scheme have rules about which characters are valid. One of the goals is to make sure that a pattern like logs-*-* always applies to the right dataset and namespace. Recently, some configs in Helm charts were found that look as follows:

        streams:
          - id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
            data_stream:
              dataset: ${kubernetes.labels.app.kubernetes.io/name|'container_logs'}.log
              type: logs

This is a valid configuration. But as it turns out, the name above can be, for example, foo-bar, which means the value for data_stream.dataset becomes foo-bar. For processors like reroute that match on the data stream name with the pattern logs-*-*, foo is then assumed to be the dataset, and bar plus everything after it the namespace.
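To illustrate the misattribution (a hypothetical sketch, not code from Elastic Agent or Elasticsearch): matching a name against logs-*-* effectively splits on the first two - separators, so a dashed dataset spills into the namespace:

```go
package main

import (
	"fmt"
	"strings"
)

// parse splits a data stream name the way a logs-*-* pattern is matched:
// type up to the first '-', dataset up to the next '-', and everything
// remaining as the namespace.
func parse(name string) (typ, dataset, namespace string) {
	parts := strings.SplitN(name, "-", 3)
	if len(parts) < 3 {
		return name, "", ""
	}
	return parts[0], parts[1], parts[2]
}

func main() {
	// The intended dataset was "foo-bar.log" and the namespace "default",
	// but the parse yields dataset "foo" and namespace "bar.log-default".
	typ, dataset, namespace := parse("logs-foo-bar.log-default")
	fmt.Println(typ, dataset, namespace) // logs foo bar.log-default
}
```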

In the context of the reroute processor, these fields are sanitized. Unfortunately, Elastic Agent (Beats) doesn't do this today, but it should.

Unfortunately, turning on sanitization directly could be considered a breaking change, as foo-bar would suddenly become foo_bar. At the same time, it is important for the system to function that the values are always the same.

One idea for migrating users is to at least start logging errors (once) to inform users when this happens, and later turn sanitization on by default (with an opt-out option).

@axw This might also be relevant in the context of ingesting otel data.

elasticmachine commented 10 months ago

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

cmacknz commented 10 months ago

Users aren't going to name their pods with the data stream naming convention in mind, so this is unfortunate. I wonder how common this is; since - is extremely common as a separator in k8s resource names, it is perhaps quite common.

Similar to https://github.com/elastic/elastic-agent/issues/3934#issuecomment-1864990034 we could use feature flags to deal with the breaking nature of this problem:

https://github.com/elastic/elastic-agent/blob/a76f4ba6debee80f6e9eb4b201ba824fe60b5566/elastic-agent.reference.yml#L194-L197

This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

The other option is for things like the reroute processor to just deal with the fact that not everything is going to adhere perfectly to our conventions, since this data is already out there (and you can of course write to ES without using Elastic Agent).

Given logs-foo-bar-default, rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default, could we consider treating only the first and last - characters as the separators? This would allow - in the dataset name, and you'd correctly parse foo-bar as the dataset. You'd still have problems with this approach if someone put - in the namespace, though (and I suspect someone has used, or will use, a k8s namespace as a data stream namespace).
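The first-and-last-separator idea, and its remaining failure mode, can be sketched as follows (a hypothetical illustration, not existing Elasticsearch behavior):

```go
package main

import (
	"fmt"
	"strings"
)

// parseFirstLast treats only the first and last '-' as separators, so
// dashes are permitted inside the dataset but not the namespace.
func parseFirstLast(name string) (typ, dataset, namespace string) {
	first := strings.Index(name, "-")
	last := strings.LastIndex(name, "-")
	if first == -1 || first == last {
		return name, "", ""
	}
	return name[:first], name[first+1 : last], name[last+1:]
}

func main() {
	// Dashed dataset now parses correctly:
	fmt.Println(parseFirstLast("logs-foo-bar-default")) // logs foo-bar default

	// But a dashed namespace is still misparsed: the intended dataset was
	// "nginx" with namespace "my-k8s-namespace".
	fmt.Println(parseFirstLast("logs-nginx-my-k8s-namespace")) // logs nginx-my-k8s namespace
}
```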

joshdover commented 9 months ago

> This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.

+1 - we're generally OK with making breaking changes that are actually bug fixes. We can run this by the breaking-change committee if we're still nervous about it, but I think the fix warrants the potential negative impact, which should only affect advanced users who have handcrafted a config like this.

> Given logs-foo-bar-default, rather than logs-*-* parsing this with the dataset as foo and the namespace as bar-default, could we consider treating only the first and last - characters as the separators? This would allow - in the dataset name, and you'd correctly parse foo-bar as the dataset. You'd still have problems with this approach if someone put - in the namespace, though (and I suspect someone has used, or will use, a k8s namespace as a data stream namespace).

I think changing the parsing rules here on the ES side is likely to be a much more impactful breaking change, even if scoped only to the DSNS. It's also not exactly clear which piece we should prefer supporting dashes in. IMO the current behavior may actually be preferable to ensure that you can send logs to logs-kubernetes.container_logs-my-k8s-namespace and get to reuse the existing index template for the dataset.

I think a better solution may be to focus on getting away from shippers specifying the dataset and namespace via the index name, and instead sending data to logs-generic-default (or even better, just logs) and leveraging the structured data_stream.* fields to route data via the reroute processor. This side-steps any parsing prioritization decisions, lets the user do anything they need to do on the collection side, and has the backend make sure the data is always valid for the Stack.

This almost works OOTB today with the default templates, but requires that a reroute processor be added to the logs@custom pipeline:

PUT _ingest/pipeline/logs@custom
{
  "processors": [
    {
      "reroute": {
        "ignore_failure": true
      }
    }
  ]
}
POST /logs-generic-default/_bulk
{ "create": {} }
{ "data_stream": { "dataset": "bar", "namespace": "test-me" }, "message": "foo 3"}
{ "create": {} }
{ "data_stream": { "dataset": "with-dashes", "namespace": "test-me" }, "message": "foo 2"}
{ "create": {} }
{ "data_stream": { "dataset": "foo" }, "message": "foo 1"}
ruflin commented 9 months ago

Taking what Josh has above, the data stream naming scheme would basically be extended with these 2 rules:

What I think this all comes back to is that whenever the data stream naming scheme is used, the data_stream.* fields are set, so anything downstream can use these values to make decisions (trying to find the GitHub issue to link).

felixbarny commented 9 months ago

I believe this is the issue you're looking for:

leehinman commented 1 month ago

If we want to keep the existing data stream naming, I think we could do the following:

1. Have Fleet report an error if any integration is configured to send to a data stream name that doesn't comply with the rules.
2. Have an option in the variable expansion so that it encodes - as something else; for example, you could use the HTML entity for -.

That should catch static config errors (fleet check) and dynamic config errors (variable expansion).

ycombinator commented 1 month ago

> Have an option in the variable expansion so that it encodes - as something else; for example, you could use the HTML entity for -.

Shouldn't we be consistent with the reroute processor and "encode" the - as a _?

> In the context of the reroute processor, these fields are sanitized. Unfortunately, Elastic Agent (Beats) doesn't do this today, but it should.

axw commented 1 month ago

@ycombinator +1. FWIW, APM Server and MIS have logic to sanitise data stream names, where we also use "_" as the replacement. It may be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209
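A rough sketch of that style of sanitisation (the allowed character set here is illustrative; the linked apm-data code is the authoritative implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeDataset lowercases the input and replaces any character that
// is not valid in a data_stream.dataset value with '_'. The allowed set
// (lowercase letters, digits, '.', '_') is an assumption for this sketch.
func sanitizeDataset(s string) string {
	return strings.Map(func(r rune) rune {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '.' || r == '_' {
			return r
		}
		return '_'
	}, strings.ToLower(s))
}

func main() {
	fmt.Println(sanitizeDataset("foo-bar.log")) // foo_bar.log
}
```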

leehinman commented 1 month ago

> @ycombinator +1. FWIW, APM Server and MIS have logic to sanitise data stream names, where we also use "_" as the replacement. It may be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209

Consistency is important, but I'd like to point out that this is a sub-optimal way to "sanitize": you can't go backwards from it, and you can get accidental name collisions. For example:

        a-b_c => a_b_c
        a_b-c => a_b_c

Depending on variable substitution, you could end up writing very different things to the same data stream, and you will make it harder for users to determine where the data stream name came from.