Open ruflin opened 10 months ago
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Users aren't going to be naming their pods while keeping the data stream naming convention in mind, so this is unfortunate. I wonder how common this is, since -
is extremely common as the separator in k8s resource names. Perhaps quite common.
Similar to https://github.com/elastic/elastic-agent/issues/3934#issuecomment-1864990034 we could use feature flags to deal with the breaking nature of this problem:
This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.
The other option is for things like the reroute processor to just deal with the fact that not everything is going to adhere perfectly to our conventions since this is already out there (and you can of course write to ES without using Elastic agent).
Given logs-foo-bar-default
rather than logs-*-*
parsing this with the dataset as foo
and the namespace as bar-default
could we consider parsing only the first and last -
characters as the separators? This would allow -
in the dataset name and you'd correctly parse foo-bar
as the dataset. You'd still have problems if someone put -
in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).
This feels more like something that we really should be doing by default though if it is going to cause problems interacting with other parts of the stack. So perhaps in this case we default to sanitization with the option to disable it.
+1 - we're generally ok to make breaking changes that are actually bug fixes. We can run this by the breaking change committee if we're still nervous about it but I think the fix warrants the potential negative impact, which should only effect advanced users who have handcrafted a config like this.
Given
logs-foo-bar-default
rather thanlogs-*-*
parsing this with the dataset asfoo
and the namespace asbar-default
could we consider parsing only the first and last-
characters as the separators? This would allow-
in the dataset name and you'd correctly parsefoo-bar
as the dataset. You'd still have problems if someone put-
in the namespace with this approach though (and I suspect someone has or will use a k8s namespace as as datastream namespace).
I think changing the parsing rules here on the ES side is likely to be a much more impactful breaking change, even if scoped only to the DSNS. It's also not exactly clear which piece we should prefer supporting dashes in. IMO the current behavior may actually be preferable to ensure that you can send logs to logs-kubernetes.container_logs-my-k8s-namespace
and get to reuse the existing index template for the dataset.
I think a better solution may be to focus on is getting away from shippers specifying the dataset and namespace via the index name and instead sending data to logs-generic-default
(or even better, just logs
) and leveraging the structured data_stream.*
fields to route data via the reroute processor. This allows us to side-step any parsing prioritization decisions and allows the user to do anything they need to do on the collection side, and have the backend make sure the data is always valid for the Stack.
This almost works today OOTB with the default templates, but requires that a reroute processor is added to the logs@custom pipeline:
PUT _ingest/pipeline/logs@custom
{
"processors": [
{
"reroute": {
"ignore_failure": true
}
}
]
}
POST /logs-generic-default/_bulk
{ "create": {} }
{ "data_stream": { "dataset": "bar", "namespace": "test-me" }, "message": "foo 3"}
{ "create": {} }
{ "data_stream": { "dataset": "with-dashes", "namespace": "test-me" }, "message": "foo 2"}
{ "create": {} }
{ "data_stream": { "dataset": "foo" }, "message": "foo 1"}
Taking what Josh has above, the data stream naming scheme would basically be extended with these 2 rules:
dataset
value is not set, auto detection of the dataset
value is happening with the logs-*-*
pattern and will be setdataset
is set, this is the value to be used. So it can contain -
and other chars.What I think this all comes back to is that whenever the data stream naming scheme is used, the data_stream.*
fields are set so anything down stream can use these values to make decisions (trying to find the Github issue for linking).
I believe this is the issue you're looking for:
If we want to keep existing data stream naming I think we could do the following.
1) Have fleet report an error if any integration is configured to send to a data stream name that doesn't comply with the rules
2) Have an option to the variable expansion so that it will encode -
as something else, for example you could use the HTML entity -
That should catch static config errors (fleet check) and dynamic config errors (variable expansion).
Have an option to the variable expansion so that it will encode - as something else, for example you could use the HTML entity -
We should probably be consistent with the reroute processor and "encode" the -
as a _
?
In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.
@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209
@ycombinator +1. FWIW APM Server and MIS has logic to sanitise data stream names, where we also use "_" as the replacement. May be worth copying for consistency: https://github.com/elastic/apm-data/blob/c5848755c65e676325c80780c24a14a941e5f176/model/modelprocessor/datastream.go#L197C6-L209
consistency is important, but I'd like to point out that this is a sub-optimal way to "sanitize". You can't go backwards from this, and you could have accidental name collisions. For example:
a-b_c => a_b_c
and
a_b-c => a_b_c
depending on variable substitution you could end up writing very different things to the same data_stream, and you will be making it harder for users to determine where the data_stream name came from.
The fields for the data stream naming scheme have rules around what chars are valid. One of the goals it so make sure a pattern like
logs-*-*
always applies to the right dataset and namespace. Recently, some configs in helm charts were found that look as following:This is a valid configuration. But as it turns out, the
name
above can be for examplefoo-bar
. This means the value fordata_stream.dataset
becomesfoo-bar
. But for processors like reroute that match on the data stream name with the patternlogs-*-*
,foo
is assumed to be the dataset,bar
and everything afterwards the namespace.In the context of the reroute processor, these fields are sanitized. Unfortunately Elastic Agent (Beats) doesn't do this today but should.
Unfortunately turning on sanitization directly could be considered a breaking change as now suddenly
foo-bar
would becomefoo_bar
. At the same time, it is important for the system to function that the values are always the same.Some ideas for migration users is to at least start logging errors (once) and inform users if this happens and later turn it on by default (with an opt out option).
@axw This might also be relevant in the context of ingesting otel data.