Closed kpollich closed 10 months ago
Pinging @elastic/fleet (Team:Fleet)
Our priorities for this are as follows right now:
I'm updating the task list in the description to reflect these next steps.
I pulled all of the APM datasets from the integration source:
logs-apm.app
logs-apm.error
metrics-apm.app
metrics-apm.internal
metrics-apm.service_destination.10m
metrics-apm.service_destination.1m
metrics-apm.service_destination.60m
metrics-apm.service_summary.10m
metrics-apm.service_summary.1m
metrics-apm.service_summary.60m
metrics-apm.service_transaction.10m
metrics-apm.service_transaction.1m
metrics-apm.service_transaction.60m
metrics-apm.transaction.10m
metrics-apm.transaction.1m
metrics-apm.transaction.60m
traces-apm
traces-apm.rum
traces-apm.sampled
I believe the only collisions possible here are
traces-apm
traces-apm.rum
traces-apm.sampled
As traces-apm
is its own data stream, we'll have the collision case described in the description above.
Next, I expanded my search to all integration data streams defined in https://github.com/elastic/integrations. Here's the full set of all integration data streams:
From what I understand, the only way a collision like this is possible is when an integration data stream defines a custom dataset
value. Otherwise, the data stream receives a default name of the form ${type}-${packageName}.${dataStreamDirectory}
which prevents collisions by nature of directory names being unique.
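The two naming schemes described above can be checked mechanically. Here's an illustrative sketch in Python (the helper names are hypothetical, not Fleet's actual code):

```python
def default_dataset(package: str, directory: str) -> str:
    # Default dataset name when a data stream does not set `dataset` explicitly:
    # ${packageName}.${dataStreamDirectory}
    return f"{package}.{directory}"

def package_custom_pipeline(dtype: str, package: str) -> str:
    # 8.12.0 package-level extension point: ${type}-${package}@custom
    return f"{dtype}-{package}@custom"

def dataset_custom_pipeline(dtype: str, dataset: str) -> str:
    # Pre-existing dataset-level extension point: ${type}-${dataset}@custom
    return f"{dtype}-{dataset}@custom"

def collides(dtype: str, package: str, dataset: str) -> bool:
    # A collision means both extension points resolve to the same pipeline
    # name, which happens exactly when the dataset equals the package name.
    return package_custom_pipeline(dtype, package) == dataset_custom_pipeline(dtype, dataset)

# With a default dataset, the directory name keeps the two names distinct:
assert not collides("logs", "nginx", default_dataset("nginx", "error"))
# APM's custom `apm` dataset collides with the package-level pipeline:
assert collides("traces", "apm", "apm")
```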
The list of data streams that explicitly define dataset
in their manifest.yml
is a bit shorter, e.g.:
A quick manual glance through these data streams reveals some additional cases where there's a collision issue:
# Synthetics
synthetics-browser
synthetics-browser.network
synthetics-browser.screenshot
# Elastic agent
logs-elastic_agent
logs-elastic_agent.apm_server
logs-elastic_agent.auditbeat
logs-elastic_agent.cloud_defend
logs-elastic_agent.cloudbeat
logs-elastic_agent.endpoint_security
logs-elastic_agent.filebeat
logs-elastic_agent.filebeat_input
logs-elastic_agent.fleet_server
logs-elastic_agent.heartbeat
logs-elastic_agent.metricbeat
logs-elastic_agent.osquerybeat
logs-elastic_agent.packetbeat
logs-elastic_agent.pf_elastic_collector
logs-elastic_agent.pf_elastic_symbolizer
logs-elastic_agent.pf_host_agent
@kpollich I did a similar thing: I tried to install all the packages and logged every package whose dataset is the same as the package name. I got the following list:
{package}:{dataset}
awsfirehose:awsfirehose
apm:apm
cribl:cribl
elastic_agent:elastic_agent
I think the ideal solution here would have been to have something like this for dataset (but we introduced the package @custom
after the dataset one, and we probably did not want to have a breaking change at that point):
name: `${pipeline.dataStream.type}-${pipeline.dataStream.package}-${pipeline.dataStream.dataset}@custom`,
Thanks, Nicolas - that list aligns with my findings, but now I realize that only cases where the dataset begins with the package name will result in collisions. So I think the actual conflicting data sets are limited to the APM traces data streams and the Elastic Agent logs data streams above.
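Both failure modes can be filtered mechanically. A quick sketch (hypothetical Python, using a subset of the (package, dataset) pairs listed above):

```python
# (package, dataset) pairs taken from the lists above (subset for illustration)
streams = [
    ("apm", "apm"), ("apm", "apm.rum"), ("apm", "apm.sampled"),
    ("elastic_agent", "elastic_agent"), ("elastic_agent", "elastic_agent.filebeat"),
    ("synthetics", "browser"),
]

# dataset == package: the ${type}-${package}@custom processor duplicates the
# pre-existing ${type}-${dataset}@custom processor in the same pipeline.
duplicates = [ds for pkg, ds in streams if ds == pkg]

# dataset begins with "<package>.": the pre-existing dataset-level pipeline
# for `<package>` now also fires for these sibling data streams.
affected = [ds for pkg, ds in streams if ds.startswith(pkg + ".")]
```

Running this yields `duplicates = ["apm", "elastic_agent"]` and `affected = ["apm.rum", "apm.sampled", "elastic_agent.filebeat"]`, matching the APM and Elastic Agent cases called out above.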
For instance, here's what the synthetics ingest pipelines in question look like on 8.12.0
// synthetics-browser-1.1.1
[
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-browser@custom",
"ignore_missing_pipeline": true
}
}
]
// synthetics-browser.network-1.1.1
[
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-browser.network@custom",
"ignore_missing_pipeline": true
}
}
]
// synthetics-browser.screenshot-1.1.1
[
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-synthetics@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "synthetics-browser.screenshot@custom",
"ignore_missing_pipeline": true
}
}
]
There are no collisions here, as synthetics-synthetics
is not a pre-existing dataset the way traces-apm
is. This is clear in the ingest pipelines view of the Stack Management UI:
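That check (does the package-level `${type}-${package}@custom` name coincide with any existing dataset-level pipeline?) can be sketched like this; the helper is hypothetical, for illustration only:

```python
def package_pipeline_is_safe(dtype: str, package: str, datasets: list) -> bool:
    # The package-level pipeline name is safe if no dataset-level pipeline
    # `${type}-${dataset}@custom` already resolves to the same name.
    package_pipeline = f"{dtype}-{package}@custom"
    return all(f"{dtype}-{ds}@custom" != package_pipeline for ds in datasets)

# synthetics: no dataset named `synthetics` exists, so `synthetics-synthetics@custom` is new
assert package_pipeline_is_safe("synthetics", "synthetics",
                                ["browser", "browser.network", "browser.screenshot"])
# apm traces: the `apm` dataset pre-exists, so `traces-apm@custom` collides
assert not package_pipeline_is_safe("traces", "apm", ["apm", "apm.rum", "apm.sampled"])
```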
However, I think the elastic_agent
datasets do have collisions, e.g.
// logs-elastic_agent-1.16.0
[
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs-elastic_agent@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs-elastic_agent@custom", <-------------
"ignore_missing_pipeline": true
}
}
]
// logs-elastic_agent.filebeat-1.16.0
[
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs-elastic_agent@custom", <-------------
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "logs-elastic_agent.filebeat@custom",
"ignore_missing_pipeline": true
}
}
]
If a user had added a custom ingest pipeline for logs-elastic_agent@custom
in 8.11, after upgrading to 8.12 they would find that events in the logs-elastic_agent.filebeat
data stream (and all other logs-elastic_agent.* data streams) would also be running through that pipeline unexpectedly.
I think the ideal solution here would have been to have something like this for dataset (but we introduced the package @custom
after the dataset one, and we probably did not want to have a breaking change at that point):
name: `${pipeline.dataStream.type}-${pipeline.dataStream.package}-${pipeline.dataStream.dataset}@custom`,
This makes sense to me, so we'd have this for the APM data streams in question instead of what we have now:
traces-apm-apm@custom
traces-apm-apm.rum@custom
traces-apm-apm.sampled@custom
I think it's a little confusing just because the naming is not super clear here on the integration side, but perhaps we can correct that by adding a dynamic description
value to each of these processors, e.g.
// traces-apm.rum-8.12.0
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "Call a global custom pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "Call a custom pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm@custom",
"ignore_missing_pipeline": true,
"description": "Call a custom pipeline for all data streams of type `traces` defined by the `apm` integration
}
},
{
"pipeline": {
"name": "traces-apm-apm.rum@custom",
"ignore_missing_pipeline": true,
"description": "Call a custom pipeline for only the `apm.rum` dataset"
}
}
Adding the package name to the dataset custom will be a breaking change, so we may want to have a way to opt-in, with a config flag?
That's a good point, @nchaulet. What if we leave the traces-apm.rum@custom
dataset-level processors in place, add a description that marks them as deprecated, then update the names of the new, more granular ones introduced in 8.12.0.
Technically there is still room for the same kind of breaking change between 8.12.0 and 8.12.1 if we take this path, but I think the scope is narrow enough that it would be okay. The remediation would be to just rename your pipelines, which should be manageable for users, I think.
That's a good point, @nchaulet. What if we leave the traces-apm.rum@custom dataset-level processors in place, add a description that marks them as deprecated, then update the names of the new, more granular ones introduced in 8.12.0.
Yes I think it could work
so you will have for apm.rum
traces-apm@custom
traces-apm.rum@custom // deprecated
traces-apm-apm.rum@custom
and for apm (we still need to have some deduplication implemented, right?)
traces-apm@custom
traces-apm-apm@custom
Playing around with an implementation for a fix here and making good progress. The naming is a bit wonky and sort of diverges from the data stream naming convention, which I feel is not totally ideal, e.g.
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "logs@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `logs`"
}
},
{
"pipeline": {
"name": "logs-nginx@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `logs` defined by the `nginx` integration"
}
},
{
"pipeline": {
"name": "logs-nginx-nginx.error@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `logs-nginx.error` dataset"
}
},
{
"pipeline": {
"name": "logs-nginx.error@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] (deprecated) Use the `logs-nginx-nginx.error` pipeline instead"
}
}
Or, for some more prudent APM ingest pipelines:
// apm-traces
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm-apm@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `traces-apm` dataset"
}
}
// apm-traces.rum
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm-apm.rum@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `traces-apm.rum` dataset"
}
},
{
"pipeline": {
"name": "traces-apm.rum@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] (deprecated) Use the `traces-apm-apm.rum` pipeline instead"
}
}
I think the naming is a little clunky, but hopefully it's not too confusing with the description in place. Adding the package name as part of the expected pipeline name seems to be our only path to preventing collisions.
Yes, the name is a little off versus the naming discussion that happened here, and a little different from what we have for component templates too (discussion here); it may be confusing for users (maybe worth getting @felixbarny's thoughts here)
Thinking out loud here: could we have a suffix like .*
for the package one instead:
apm.rum
traces-apm.*@custom // pkg
traces-apm@custom // pkg one deprecated
traces-apm.rum@custom
apm
traces-apm.*@custom // pkg
traces-apm@custom
Not sure this could happen without a breaking change
from https://github.com/elastic/kibana/issues/175254#issuecomment-1906638732
traces-apm@custom
traces-apm.rum@custom // deprecated
traces-apm-apm.rum@custom
@kpollich @nchaulet why would you deprecate traces-apm.rum@custom
? This is not the conflicting pipeline - the conflicting one is that traces-apm@custom
is applied to the traces-apm.rum-<namespace>
datastream - in the shared example, this would still be the case.
The naming is a bit wonky and sort of diverges from the data stream naming convention which I feel is not totally ideal
Yes, the name is a little off versus the naming discussion that happened in https://github.com/elastic/elasticsearch/issues/96267, and a little different from what we have for component templates too (discussion in https://github.com/elastic/elasticsearch/issues/97664); it may be confusing for users (maybe worth getting @felixbarny's thoughts here)
Thinking out loud here: could we have a suffix like .* for the package one instead:
I agree that this wouldn't comply with the new naming conventions we've established in https://github.com/elastic/elasticsearch/issues/96267 and it'll probably also be confusing to the user as to which data streams these custom pipelines apply to. Therefore, and because they have been around for longer, I'd bias towards not renaming the custom pipelines for a dataset.
We could declare the names of the new extension points bogus and rename them in a breaking manner.
For example:
traces-apm.package@custom
traces-apm.rum@custom
traces-apm.package@custom
traces-apm@custom
I don't think that a suffix like .*
is a good idea because it may be perceived as "applies to all data streams that match traces-apm.*".
Or we can keep around, as deprecated, the ones that don't cause a conflict. There's also a new deprecated
flag for ingest pipelines that we can leverage: https://www.elastic.co/guide/en/elasticsearch/reference/current/put-pipeline-api.html#put-pipeline-api-request-body
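For reference, that flag is set in the pipeline definition body itself. A sketch of what a PUT `_ingest/pipeline` request body could look like (the description text here is illustrative, not what Fleet actually ships):

```json
{
  "description": "(deprecated) Kept for backwards compatibility; see release notes",
  "deprecated": true,
  "processors": []
}
```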
why would you deprecate traces-apm.rum@custom ? This is not the conflicting pipeline - the conflicting one is that traces-apm@custom is applied to the traces-apm.rum-
datastream - in the shared example, this would still be the case.
@simitt - I don't think I agree with this assessment as far as which pipeline pattern is intended to appear here.
We want to support a pattern like traces-apm@custom
that allows users to customize all documents ingested to datastreams of type traces
in the APM
integration. This is the intent laid out in https://github.com/elastic/kibana/issues/168019 by the requested {type}-{integration}@custom
pattern, e.g. "type" = traces
and "integration" = apm.
So, as far as the Fleet implementation is concerned, the expected behavior is that traces-apm@custom
appears on any datastream with type traces
defined by the APM integration.
There are real world use cases for this customization, e.g. decorating all logs
produced by a given integration (regardless of dataset) with a custom field or deriving a custom metric for another integration.
To be clear, the list of pipeline processors that appears for all integrations as of 8.12.0, aligned with their "patterns", is as follows:
global@custom
${type}@custom, e.g. traces@custom
${type}-${integration}@custom, e.g. traces-apm@custom
${type}-${dataset}@custom (pre-existing), e.g. traces-apm.rum@custom
We would deprecate traces-apm.rum
in this example because that's the pattern we'd need to rename to something that can never collide with the ${type}-${integration}@custom
pattern. The root issue here is that the integration
part of this pattern is the same as the dataset
part of this pattern for the traces-apm
datastream. That's why we see the duplication in the traces-apm
ingest pipeline as well.
So, the fix we're proposing above is to deprecate the ${type}-${integration}-${dataset}@custom
pattern and replace it with something that will never collide with the ${type}-${integration}@custom
pattern.
However, @felixbarny's suggestion is more feasible, e.g. this point rings true:
Therefore, and because they have been around for longer, I'd bias towards not renaming the custom pipelines for a dataset.
I'm in agreement with this, so a path forward would be to rename the newer ${type}-${integration}@custom
pattern instead. It's also more defensible to me to simply do away with this new pattern entirely in 8.12.1 with a breaking change notice in the release notes, as they'll only have been available for a few weeks anyway and will likely have near-zero adoption. Therefore, the impact of the breaking change will be massively lower compared to renaming + deprecating the dataset-level custom pipelines.
With this in mind, I'm proposing we do the following:
- Replace the ${type}-${integration}@custom
pattern with ${type}-${integration}.package@custom
- Remove the original ${type}-${integration}@custom
pattern entirely
- Ensure description
values are set on all pipeline processors that convey the intent

The deprecated
flag on processors is great to know about, but because the collision case here can be potentially highly impactful and less-than-obvious, I'm in favor of just shipping a breaking change to correct the collision instead. Leaving the colliding processor behind as deprecated
doesn't actually fix anything for impacted users.
There's still technically an edge case with the new ${type}-${integration}.package@custom
pattern: a dataset named <package>.package (i.e. the dataset starts with the package name and happens to end in .package) still collides,
e.g. given a data stream defined as follows:
# my_integration/data_streams/foo/package.yml
type: logs
We'd have a datastream pattern of logs-my_integration.package-*
and the pipeline patterns would look like this
logs-my_integration.package@custom = ${type}-${integration}.package@custom
logs-my_integration.package@custom = ${type}-${dataset}@custom
You could also footgun yourself by providing a custom dataset
for any data stream that forces the collision case, e.g.
# my_integration/data_streams/foo/bar.yml
type: logs
dataset: my_integration.package
So maybe we have no choice but to add a restriction on dataset naming to the package spec? The collision case here is, I think, less likely than the current implementation but it still exists.
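The residual edge case can be expressed the same way as before. A sketch (hypothetical helper names) of the collision that survives the proposed `.package` suffix:

```python
def package_level(dtype: str, package: str) -> str:
    # Proposed package-level extension point: ${type}-${integration}.package@custom
    return f"{dtype}-{package}.package@custom"

def dataset_level(dtype: str, dataset: str) -> str:
    # Pre-existing dataset-level extension point: ${type}-${dataset}@custom
    return f"{dtype}-{dataset}@custom"

# A dataset named `<package>.package` still collides under the new suffix:
assert package_level("logs", "my_integration") == dataset_level("logs", "my_integration.package")
# Ordinary datasets do not:
assert package_level("logs", "nginx") != dataset_level("logs", "nginx.error")
```

This is why a package-spec restriction on dataset naming is the only complete fix: any suffix scheme can be forced into a collision by a suitably chosen custom dataset.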
So maybe we have no choice but to add a restriction on dataset naming to the package spec?
That sounds reasonable to me.
I filed https://github.com/elastic/package-spec/issues/699 to capture the package spec change proposed above.
We want to support a pattern like traces-apm@custom that allows users to customize all documents ingested to datastreams of type traces in the APM integration. This is the intent laid out in https://github.com/elastic/kibana/issues/168019 by the requested {type}-{integration}@custom pattern, e.g. "type" = traces and "integration" = apm.
So, as far as the Fleet implementation is concerned, the expected behavior is that traces-apm@custom appears on any datastream with type traces defined by the APM integration.
That is exactly what I see as the problem and what is breaking the apm use case; {type}-{integration}
was introduced with these pipeline changes, not in alignment with the previously agreed-on and already established data stream (and derived ingest pipeline) naming pattern of {type}-{dataset}-{namespace}.
+1 on finding a solution where no deprecation of pre 8.12.0
pipelines would be necessary.
Regarding the proposed solution by @felixbarny and @kpollich, can you clarify how that would look for the apm case?
Replace the ${type}-${integration}@custom pattern with ${type}-${integration}.package@custom
Would that ultimately lead to the following?
// apm-traces
{
"pipeline": {
"name": "global@custom", //newly introduced
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom", //newly introduced
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.package@custom", //newly introduced
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm@custom", // as it pre-existed before 8.12.0
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `traces-apm` dataset"
}
}
and
// apm-traces.rum
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom", //newly introduced
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.rum.package@custom", //newly introduced
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `traces-apm.rum` dataset"
}
},
{
"pipeline": {
"name": "traces-apm.rum@custom", // as it pre-existed before 8.12.0
"ignore_missing_pipeline": true,
"description": "[Fleet] (deprecated) Use the `traces-apm-apm.rum` pipeline instead"
}
}
@simitt - This is close, but the traces-apm.rum.package@custom
you have in your example would actually be traces-apm.package@custom
. The intent is to allow users to customize all documents of type traces
ingested by the apm
package, regardless of dataset.
Here's what the pipeline processors on these data streams look like on my PR branch - https://github.com/elastic/kibana/pull/175448
// traces-apm
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.package@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `apm` dataset"
}
}
// traces-apm.rum
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.package@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm.rum@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `apm.rum` dataset"
}
}
// traces-apm.sampled
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.package@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm.sampled@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `apm.sampled` dataset"
}
}
FYI with the .package
suffix we do incur one new collision with the system_audit
integration 😞: https://github.com/elastic/package-spec/issues/699#issuecomment-1908696989.
We could use .integration
instead; it presents the same opportunity for collision in principle, but there are no actual collision cases today. Maybe that's best, to just unblock.
https://github.com/elastic/kibana/pull/175448 has been updated to use .integration
as a suffix instead of .package
. See https://github.com/elastic/kibana/pull/175448#issuecomment-1908719348 for a copy/paste of relevant pipelines.
@simitt - I'll hold off on merging until you can take a look at the above and verify this is acceptable from the APM side.
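The final chain of extension points, as shipped with the `.integration` suffix, can be sketched like this (illustrative Python reconstruction, not Fleet's actual implementation):

```python
def custom_pipeline_processors(dtype: str, package: str, dataset: str) -> list:
    # Order of the @custom extension points per the 8.12.1 fix (illustrative):
    return [
        "global@custom",                          # every data stream
        f"{dtype}@custom",                        # every data stream of this type
        f"{dtype}-{package}.integration@custom",  # every stream of this type in this package
        f"{dtype}-{dataset}@custom",              # this dataset only (pre-existing name)
    ]

# Matches the traces-apm.sampled-8.12.1 pipeline verified later in this thread:
assert custom_pipeline_processors("traces", "apm", "apm.sampled") == [
    "global@custom",
    "traces@custom",
    "traces-apm.integration@custom",
    "traces-apm.sampled@custom",
]
```

Note that the dataset-level name keeps its pre-8.12.0 form, so existing user customizations are untouched; only the short-lived package-level name changes.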
In the new apm-data Elasticsearch plugin we have the following logic: https://github.com/elastic/elasticsearch/blob/9b4647cfc6d39987cc3fd4f44514bca403d4808f/x-pack/plugin/apm-data/src/main/resources/ingest-pipelines/apm%40default-pipeline.yaml#L35-L56
That is, we invoke:
global@custom
{data_stream.type}@custom
{data_stream.type}-apm@custom
{data_stream.type}-{data_stream.dataset}@custom
(with an exception for service-specific data streams, where we exclude the service name suffix from the dataset).

So IIUC we should replace that third one with {data_stream.type}-apm.integration@custom
to be consistent. Is that right?
So IIUC we should replace that third one with {data_stream.type}-apm.integration@custom to be consistent. Is that right?
@simitt and I discussed this just now, and rather than making it consistent I'm going to remove that custom pipeline from the apm-data plugin. The reason is that "integrations" and "packages" no longer make sense, conceptually, when taking Fleet or integrations out of the picture.
@kpollich your proposal looks good from an apm perspective - thanks for finding a non-breaking solution.
Going forward, when moving to the apm plugin we will simply not make use of the {data_stream.type}-apm.integration@custom
pipeline for apm.
@kilfoyle - FYI now that this has landed, I'm going to open a docs issue later today with a draft of what we should include under the breaking change section of the 8.12.1 release notes.
@kpollich Sounds good. Thanks so much for writing that up!
@amolnater-qasource - FYI we updated the names of these pipelines. Not sure if this impacts existing test cases but I wanted to flag this issue to you. See relevant PR + documentation issue above as well.
Hi @kpollich
Thank you for the update.
We have updated 2 test cases for this feature under TestRail at the links:
We have validated this issue on the 8.13.0-SNAPSHOT Kibana build and had the below observations:
Observations:
.integration
suffix is added to {data_stream.type}-{integration}@custom in processors under Ingest pipelines

Build details: VERSION: 8.13.0 SNAPSHOT BUILD: 71179 COMMIT: b4d93fc145c3c09eb1096c610b7cd736f19f6a3a
Screen-Cast:
Nginx
APM
Further we will revalidate this once latest 8.12.1 BC build is available.
Please let us know if we are missing anything here. Thanks!
Hi Team,
We have revalidated these changes on latest 8.12.1 BC1 kibana cloud environment and found it working fine now.
Observations:
.integration
suffix is added to {data_stream.type}-{integration}@custom
in processors under Ingest pipelines.

Screenshot:
Build details: VERSION: 8.12.1 BC1 BUILD: 70228 COMMIT: 3457f326b763887d154c9da00bd4e489221a2ff3
Hence we are marking this as QA:Validated.
Please let us know if anything else is required from our end. Thanks!
The 8.12.1 fix is working as expected. I confirm the following ingest pipelines no longer exhibit the bug that was present in 8.12.0.
traces-apm.sampled-8.12.0
ingest pipeline
[
{
"rename": {
"field": "observer.id",
"target_field": "agent.ephemeral_id",
"ignore_missing": true
}
},
{
"date": {
"field": "_ingest.timestamp",
"formats": [
"ISO8601"
],
"ignore_failure": true,
"output_format": "date_time_no_millis",
"target_field": "event.ingested"
}
},
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "traces-apm@custom",
"ignore_missing_pipeline": true
}
},
{
"pipeline": {
"name": "traces-apm.sampled@custom",
"ignore_missing_pipeline": true
}
}
]
traces-apm.sampled-8.12.1
ingest pipeline
[
{
"rename": {
"field": "observer.id",
"target_field": "agent.ephemeral_id",
"ignore_missing": true
}
},
{
"date": {
"field": "_ingest.timestamp",
"formats": [
"ISO8601"
],
"ignore_failure": true,
"output_format": "date_time_no_millis",
"target_field": "event.ingested"
}
},
{
"pipeline": {
"name": "global@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Global pipeline for all data streams"
}
},
{
"pipeline": {
"name": "traces@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces`"
}
},
{
"pipeline": {
"name": "traces-apm.integration@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for all data streams of type `traces` defined by the `apm` integration"
}
},
{
"pipeline": {
"name": "traces-apm.sampled@custom",
"ignore_missing_pipeline": true,
"description": "[Fleet] Pipeline for the `apm.sampled` dataset"
}
}
]
Summary

In 8.12.0, Fleet introduced new extension points for ingest pipeline customization in the form of additional pipeline processors in Fleet-managed ingest pipelines:

global@custom
${type}@custom, e.g. logs@custom
${type}-${package}@custom, e.g. logs-nginx@custom

These new extension points allow for more granular customization of ingestion for various use cases, for instance applying global processing across all logs data streams.

The existing extension point of the pattern ${type}-${dataset}@custom, e.g. logs-apache.logs-my_namespace@custom, is preserved, and is called as the last pipeline processor in each Fleet-managed ingest pipeline.

Problem 1 - Duplicate pipeline processors
APM defines a traces-apm data stream here. Because the package name apm is the same as the dataset apm, Fleet creates a duplicate pipeline processor in the final ingest pipeline for this data stream. In the resulting pipeline, the first traces-apm@custom processor is of the form ${type}-${package}@custom while the second is of the form ${type}-${dataset}@custom. This duplication should be avoided.

Problem 2 - Breaking change for the traces-apm.sampled data stream

APM also defines a traces-apm.sampled data stream here. Because this data stream's name extends the traces-apm data stream's name, Fleet's customization hooks introduce a breaking change to its processing scheme.

For example, prior to 8.12.0, the traces-apm.sampled-X.Y.Z ingest pipeline would have only the dataset-level traces-apm.sampled@custom pipeline processor defined. Following 8.12.0, this pipeline also has the new global, type-level, and package-level processors defined.
The problem is that the traces-apm@custom processor, which is intended to be of the form ${type}-${package}@custom, overlaps with the traces-apm@custom pipeline defined for the traces-apm data stream above, which is intended to be of the form ${type}-${dataset}@custom. This is technically the same problem as Problem 1 above, but it manifests in a potential breaking change for APM users who have customized their ingest scheme.

If an APM user is relying on customizations they made to the traces-apm@custom ingest pipeline (which was set up by default in releases prior to 8.12.0), they will now unexpectedly see that pipeline firing on data ingested to the traces-apm.sampled data stream. This is a breaking change and should be communicated as such.

With both problems above, we likely need some kind of additional specificity to avoid the case where a dataset name overlaps with a package name, as is the case with APM. It'd be great to query the integrations repo to see if we can detect other places where this may be the case and alert those teams.
In the immediate term, we need to communicate this as a known issue + breaking change to our users by adding documentation and updating our 8.12.0 release notes. Following that, let's try to come to a decision quickly on how we can fix the root issue with the duplication/lack of specificity.
cc @simitt @lucabelluccini @nchaulet @kilfoyle