Closed: @joshdover closed this issue 1 year ago
Pinging @elastic/fleet (Team:Fleet)
How will we manage the dependency web between packages during installation?
We are thinking of it as a dependency because it is the same ingest pipeline for the first implementation. But ideally, there would be a routing API and integrations would just add their bits to it, not necessarily creating a dependency. What if we build this API in Fleet for now that manages the rules and creates the pipeline out of it? What does it mean for the package spec? I'm hoping we can mostly stay away from having dependencies.
Yeah to be clear, I didn't mean that packages would declare dependencies on one another. But there will be things to consider to ensure that every time a package is installed, all the appropriate pipelines are updated.
I synced on this today with Observability and discussed what the next steps should be. We came to the conclusion that we should work on a design document that includes the following:
We need to allow packages to define routing rules for how data should be routed from other integration data streams to the integration defining the rule. For example, the Nginx integration should be able to define a rule for the Docker integration on how to detect a document that contains an nginx log and route it to the nginx.access data stream.
While the underlying implementation of the routing rule will be part of the integration's ingest pipeline, we want Fleet/EPM to have control over the order of when routing decisions happen. For these 2 reasons, routing rules should not be defined as part of an integration data stream's regular ingest pipeline. Instead they need to be defined in a separate file in the package.
We should also not abstract away too much of the underlying `reroute` processor's options. Integrations should be able to use the full power of that processor's capabilities, including using painless to define the rule's condition.
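For illustration, a rule like the nginx-on-container example discussed above could boil down to a single `reroute` processor whose condition is written in painless; the exact field and condition here are illustrative assumptions, not a spec:

```json
{
  "reroute": {
    "tag": "nginx",
    "if": "ctx?.container?.image?.name != null && ctx.container.image.name.contains('nginx')",
    "dataset": "nginx.access"
  }
}
```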
We need to specify the order in which things will be executed in the pipeline, e.g. routing that happens when `data_stream.dataset` in the document doesn't match the current data stream. We need to strike a balance between allowing flexibility and preventing user customization from conflicting with processing and routing defined by integrations.
This design needs to be tested against real use cases (to be provided by Observability).
Part of this should also include the naming convention used for the different pipelines and sub-pipelines (e.g. `@routing`).
Users need to be able to add their own routing rules. We may want to offer an API in Fleet/EPM to make managing this simpler. Later a UI may be built on top of this.
The design of this API should include how ingest pipelines are updated to apply the new rules or remove ones that are deleted
We need to define how Fleet will internally track rules defined in packages and by end users, and how those rules will be installed into ingest pipelines with 0 disruption to ingestion.
This needs to include how package installation will work when new rules are added, how package upgrades will work, and how package uninstallation will work.
@joshdover I'd be interested in what we see the user experience being here.
As a user with an nginx docker container in kubernetes, I might be tempted to first look for the nginx integration for example, but the standard flow would then prompt me for the log paths etc which is unnecessary.
or if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?
It's almost like we need a dedicated kubernetes onboarding which guides the user to select the kinds of containers they will be deploying and installs the matching integrations first then the kubernetes integration, that way the user starts re-routing data straight away.
As a user I would also want to see if my data is successfully being re-routed, we could consider adding a field when we re-route the data to track the origin of the data and we could then aggregate on it to give the user a summary somewhere of data that is being re-routed.
Good points @hop-dev. I'd hope we can come up with a solution that doesn't require the user to manually install integrations. Instead, we'd either pre-install or auto-install the integration when data flows to a corresponding data stream. IIRC, Josh did a spacetime on this.
Ah yes I was involved in a spacetime project in this area: https://github.com/elastic/observability-dev/issues/2100
Slightly different to this use case, we used annotations on the containers themselves to prompt the installation of the integration. then configured the integration to consume the data directly
Actually @hop-dev did the auto-install spacetime :) I did a related one to suggest packages to install based on signals in the common data. Both could be potential options here.
I suspect this is important as well, and ideally we don't build a solution that only works for Kubernetes sources.
I wonder if we could do some 80/20 solution where we pre-install just the ES assets for integrations that define routing rules for another integration. We can then watch for data showing up in those data streams and install the dashboards/other Kibana objects at that time.
> pre-install just the ES assets for integrations that define routing rules for another integration. We can then watch for data showing up in those data streams and install the dashboards/other Kibana objects at that time.
This makes a lot of sense to me
> I wonder if we could do some 80/20 solution where we pre-install just the ES assets for integrations that define routing rules for another integration.
Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?
For @hop-dev's user journey question that would imply that the routing configuration is tied to the lifecycle of the source integration (such as k8s).
> Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?
I think we're saying the same thing? So if a user installs the k8s integration, Fleet would install the ES assets for all (or most popular) integrations that have routing rules specified for k8s container logs. I imagine that the registry APIs would need to expose a list of other integrations that have routing rules for a particular integration.
> For @hop-dev's user journey question that would imply that the routing configuration is tied to the lifecycle of the source integration (such as k8s).
Good point, this would be one of the caveats. We could probably come up with ways to refresh this periodically, but it's a little less than ideal.
I do suspect that users want more control over this - and we need a way to show them more integrations that may have routing rules that are relevant for their use case. This is where I think the design around auto-suggestion of integrations to install may be helpful, where routing only happens once the user has decided to install an integration to enhance the data that is extracted.
Relevant spacetime: https://github.com/elastic/observability-dev/issues/2132
Synced with @kpollich about this today.
We discussed the general concept and UX flow from Discover:
- User starts in Discover on `logs-*` and chooses the `kubernetes.container_logs` data set
- User filters on `container.image.name: foo` and finds a pattern in `message` that only appears in logs from the `foo` image
- User chooses to create a new dataset for those documents, e.g. `logs-foo-default`
- Fleet creates a `logs-foo-*` index template and ingest pipeline
- Fleet adds a `reroute` processor to the `logs-kubernetes.container_logs` ingest pipeline to route matching logs to `logs-foo`
@grabowskit @ruflin can you confirm our understanding of the UX and the APIs we're discussing?
> - We expect users to be able to use filters in Discover to define a set of data and then choose to create a new dataset/datastream for the documents that match the filter.
Using KQL to create routing rules sounds like a cool feature but also adds complexity. Could we start with a workflow where users manually create a painless condition?
> - Discover will need a Fleet API to call when the user chooses to create this new dataset. At a high level, this API should allow Discover to:
>   - Create a new index template & ingest pipeline for the new dataset (if the target dataset does not exist)
Do we need to already create a new index template at that point? We could just rely on the default index template for `logs-*-*`.

One challenge is that routing rules create a variable number of datasets. For example, when routing on `{{service.name}}`, we don't know all datasets beforehand. By relying on the `logs-*-*` template, we don't need to know the concrete datasets beforehand.
>   - Persist the routing rule in a Kibana Saved Object
>   - Update the source dataset's ingest pipeline with the `reroute` processor to route logs to the new dataset
Why are we persisting routing rules as saved objects and not just as a processor in the routing pipeline? Not having a single source of truth could lead to the reroute processors and saved objects to not be in sync.
> - On package upgrades of the base package, the "forked" datastream(s) for the new dataset should have its base index template upgraded. What is the base index template? Based on the source dataset + `@custom`?

I was thinking that we could also just rely on the `logs-*-*` index template for that.
> - Which saved object are routing rules saved to?
>   - We agreed that rules should be stored on the target dataset/package/package policy, not the source or "sink"
I'm not sure if that's possible. The reroute processor needs to be added to the pipeline of the sink. Also, a single routing rule can result in creating multiple datasets that are unknown at the time the routing rule is created.
> - Should the ingest pipeline be reused from the base datastream? Yes, logs should always be parsed the same regardless of if they're routed through the "sink" data stream or directly to the specialized data stream.
Hm, I see where you're coming from. But if both the sink and the destination have the same pipeline, we would do the same transformation twice and not all transformations are idempotent.
@felixbarny One thing that is clear from the discussion so far is that the first priority use case is not defined. I'm not sure if we first want to target the simpler use case of "route these specific logs to a new data stream" or the more complex "route all of the logs to multiple data streams based on field X". Let's call these:
Eventually we'll need to support both, but which are we starting with?
> Do we need to already create a new index template at that point? We could just rely on the default index template for `logs-*-*`.
>
> One challenge is that routing rules create a variable number of datasets. For example when routing on `{{service.name}}`, we don't know all datasets beforehand. By relying on the `logs-*-*` template, we don't need to know the concrete datasets beforehand.
You're right, I was focusing on use case (1). If there's not a specific target data stream for a routing rule, I agree that the default template can be relied on.
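For that dynamic case, the `reroute` processor can take its `dataset` and `namespace` from document fields, with fallback values tried in order, which is why no per-dataset template needs to exist up front. A sketch (the field choices are illustrative):

```json
{
  "reroute": {
    "dataset": ["{{service.name}}", "generic"],
    "namespace": ["{{data_stream.namespace}}", "default"]
  }
}
```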
> Why are we persisting routing rules as saved objects and not just as a processor in the routing pipeline? Not having a single source of truth could lead to the reroute processors and saved objects to not be in sync.

> I'm not sure if that's possible. The reroute processor needs to be added to the pipeline of the sink. Also, a single routing rule can result in creating multiple datasets that are unknown at the time the routing rule is created.
I'm thinking ahead to this "export as a package" aspect. For that to work, I think a specific package needs to own the routing rule, and for use case (1), the destination package/data stream should be the one to own the rule, not the sink. We could use the routing pipeline as the source of truth, but for use case (1) there still needs to be some link between the processor and the destination data stream. Maybe the contents of the reroute processor itself is good enough.
You are right that the reroute processor itself would always be in the "sink" data stream's ingest pipeline, even if it's owned by another package/data stream.
I'm not sure how to think of "export a package" for more dynamic routing rules that fan out to multiple data streams (use case (2)). Do you think these would be a better fit as `@custom` extensions to the sink data stream, or should there also be a backing package for these?
> Hm, I see where you're coming from. But if both the sink and the destination have the same pipeline, we would do the same transformation twice and not all transformations are idempotent.
Yeah I thought about this more last night and I agree, this will be too hard to support.
> - Single target data stream
> - Multiple target data streams, based on a dynamic field
>
> Eventually we'll need to support both, but which are we starting with?
The reroute processor will support both use cases. I think we'll also want to update some integrations to make use of 2. relatively soon. For example, changing the syslog integration to route by app_name, or the k8s integration to route by `service.name`, which is inferred via `app.kubernetes.io/name` or the container name.
It will also be possible for users to just start ingesting data into a custom dataset that's not known beforehand. I don't think we'd want to require them to manually create an integration before they can start ingesting data.
I suppose we'll need to be able to "lazily" create an integration. For example, at the time the user wants to add a pipeline to their custom dataset.
> I'm thinking ahead to this "export as a package" aspect. [...] a specific package needs to own the routing rule, and for use case (1), the destination package/data stream should be the one to own the rule, not the sink.
Ah, I see. That makes sense. It would be similar to the built-in Nginx integration adding a routing rule to the `kubernetes.container_logs` dataset. The Nginx integration owns that rule but it's added to the k8s routing pipeline.
@joshdover happy to see that your results are pretty similar to what I naively had in mind :tada:
> Discover will need a Fleet API to call when the user chooses to create this new dataset. At a high level, this API should allow Discover to:
> - Create a new index template & ingest pipeline for the new dataset (if the target dataset does not exist)
> - Persist the routing rule in a Kibana Saved Object
> - Update the source dataset's ingest pipeline with the `reroute` processor to route logs to the new dataset
I wonder if we ever want these assets to fly around without an owning integration. Would this be the place where a new integration is created that owns them?
Hi all. I'm starting to work on some technical definition for this work around Fleet and the Package Spec. I'd like to walk through a basic example to make sure I understand what we need to support here.
Let's say we support routing rules at the package spec level, e.g.
```yaml
# In the nginx integration, which is our "destination", let's say a `routing_rules.yml` file
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_data_stream: logs-kubernetes.router
    destination_data_stream: logs-nginx.access
    if: >
      ctx?.container?.image?.name == 'nginx'
```
❓ I'm basing this `logs-kubernetes.router` data stream off of what I see in Felix's doc regarding designated "sink" data streams for routing purposes. Is this still accurate here?
When this integration is installed, Fleet will parse out this `routing_rules.yml` file and add a corresponding `reroute` processor to the Kubernetes integration's `logs-kubernetes.router-{version}` ingest pipeline, e.g.
```json
{
  "processors": [
    {
      "reroute": {
        "tag": "nginx",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "dataset": "nginx.access"
      }
    }
  ]
}
```
❓ If the Kubernetes integration isn't installed, would Fleet need to install it at this time? I understand the inverse case, where we'll need to "preinstall" all related component templates + ingest pipelines for "destination" packages as @joshdover mentioned in https://github.com/elastic/kibana/issues/151898#issuecomment-1451760052.
I understand there are other pieces here like
...but I'd just like to make sure I'm on the right path with the above example. Thanks!
Spent some more time with this. I think I've boiled our needs here down into three main feature sets:
I've got some notes on the first 2 points here, but I'm still thinking through number 3.
Here's a napkin sketch overview of what I've been brainstorming:
Goal: allow integrations to ship with routing rules that will generate corresponding `reroute` processors on the appropriate ingest pipelines.

- Packages define a `routing_rules.yml` file that includes a list of routing rules
- Generated `reroute` processors include a `tag` property with the package's name and version

Package spec: a `routing_rules.yml` file in "destination integrations", e.g. for `nginx`:
```yaml
# In the nginx integration
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_dataset: kubernetes.router
    destination_dataset: nginx.access
    if: >
      ctx?.container?.image?.name == 'nginx'
```
Supporting routing rules in the package spec would be a separate chunk of work from Fleet support, which I'll get into next. We'll need to make sure the spec + EPR endpoints all fully support routing rules as a first class resource to support our other features here.
Goal: When an integration is installed, Fleet must also install index/component templates + ingest pipelines for any datasets to which the integration might potentially route data.
Fleet needs a means of fetching all integrations to which the "data sink" integration currently being installed might route data. EPR should provide a `/routing_rules` API that returns all routing rules defined across all packages in a denormalized format like:
```json
[
  {
    "integration": "nginx",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "nginx.access",
    "if": "ctx?.container?.image?.name == 'nginx'"
  },
  {
    "integration": "apache",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "apache.access",
    "if": "ctx?.container?.image?.name == 'apache'"
  }
]
```
This would allow Fleet to perform a lookup for any datasets where `source_dataset` matches the current integration, and create the corresponding templates + pipeline for that dataset, as defined by the destination integration.
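To illustrate that lookup, here is a small sketch over the denormalized response shown above (Python is used only for illustration; the function and variable names are hypothetical, not actual Fleet code):

```python
# Hypothetical helper: given the denormalized /routing_rules payload and the
# dataset of the integration being installed, find the rules that route *from* it.
def rules_for_source(routing_rules, source_dataset):
    """Return the routing rules whose source matches the given dataset."""
    return [r for r in routing_rules if r["source_dataset"] == source_dataset]

routing_rules = [
    {
        "integration": "nginx",
        "source_dataset": "kubernetes.router",
        "destination_dataset": "nginx.access",
        "if": "ctx?.container?.image?.name == 'nginx'",
    },
    {
        "integration": "apache",
        "source_dataset": "kubernetes.router",
        "destination_dataset": "apache.access",
        "if": "ctx?.container?.image?.name == 'apache'",
    },
]

# Installing the kubernetes integration: both rules match, so Fleet would
# create templates + pipelines for nginx.access and apache.access.
matches = rules_for_source(routing_rules, "kubernetes.router")
destinations = [r["destination_dataset"] for r in matches]
```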
❓ - Is limiting the assets installed for destination datasets necessary, or should we just install everything including Kibana assets? Could we just perform a "standard" installation as part of that process, or is the performance hit of doing that too substantial?
I've got another napkin sketch for this part as well:
I think with these two chunks of work, we'd be able to call package-based routing support done as far as Fleet is concerned. We'd be able to ensure the `reroute` processor and destination index template + ingest pipeline exist such that the routed data is mapped and processed correctly.

I'll spend some more time with the customization API requirements here, but feel free to chime in with any feedback if you have it before I come back with more.
Spent some time thinking through the CRUD API needs here, which I'll summarize below.
Fleet will provide an API endpoint for persisting "custom logging integrations" which include:
An example API request might look like this:
```
POST /api/fleet/custom_logging_integrations
{
  "name": "my_custom_integration",
  "title": "My custom integration",
  "description": "Created by {user.name} on {date}",
  "dataset": "my.application",
  "mappings": [], // ❓
  "routing_rules": [
    {
      "description": "Route Kubernetes container logs for our custom app containers to the custom integration dataset",
      "source_dataset": "kubernetes.router",
      "if": "ctx?.container?.image?.name == 'acme.co/my-application'"
    }
  ]
}
```
❓ One thing I'm not sure on is the mappings provided. Does the user go through and define fields or select mapping types for the fields detected in their custom logs?
Am I right in thinking these customization APIs are a fairly separate effort from the package-level routing rules work above? I think we could pretty realistically get started on supporting document-based routing rules defined by packages without much more definition than what I've done here, but these customization APIs seem like they're still a bit in flux. Persisting these custom integrations (or "exporting" as it's been referred to a few times above) seems like a follow-up to the package-level support.
Maybe I'm missing something but I get stuck on the following statements repeated several times above:
> So if a user installs the k8s integration, Fleet would install the ES assets for all (or most popular) integrations that have routing rules specified for k8s container logs.

> [...] we'll need to "preinstall" all related component templates + ingest pipelines for "destination" packages [...]

> When an integration is installed, Fleet must also install index/component templates + ingest pipelines for any datasets to which the integration might potentially route data.
Why do we need to install the assets of all potential destination packages when installing the source package? Isn't the explicit installation of the destination package what inserts the routing rule? Then why do we need the destination assets earlier already when there's no routing to it in place?
> Isn't the explicit installation of the destination package what inserts the routing rule? Then why do we need the destination assets earlier already when there's no routing to it in place?
I wasn't 100% clear on this. My assumption was that if we install the `kubernetes` data-sink integration, we'd want all Kubernetes-related routing rules to also come along with it. It sounds like that may be incorrect. I think I was misunderstanding the "auto installation of integrations" called out in the design doc for this feature.
So, it sounds like the "Optimistic installation of destination dataset assets" isn't necessary here. Only when a user explicitly installs a "destination" integration will any "source" integrations' ingest pipelines be updated.
My main source of confusion, I think, came from this prior comment
> Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?
(apologies for the spam)
@hop-dev's comment above is the source of the above conversation:
> or if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?
I think this point is still valid. Should we expect users to install the Nginx integration manually because it's a routing destination for Kubernetes container logs? How will we surface that knowledge and make it clear to users in an instance like this? It makes sense technically for us to defer the creation of `reroute` processors until a manual installation for some destination integration, but users won't have any discoverability into those routing rules in our current design, right?
Here's how I understand this would work without the optimistic installation process:
> ❓ I'm basing this `logs-kubernetes.router` data stream off of what I see in Felix's doc regarding designated "sink" data streams for routing purposes. Is this still accurate here?
Yes, that's still the plan. But we haven't implemented that, yet.
> ❓ If the Kubernetes integration isn't installed, would Fleet need to install it at this time?
No, I don't think so. But if the k8s integration is installed at a later time, it should include the rule from the Nginx package if it has already been installed.
> So, it sounds like the "Optimistic installation of destination dataset assets" isn't necessary here. Only when a user explicitly installs a "destination" integration will any "source" integrations' ingest pipelines be updated.
Yes, I think we should start with that.
> if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?
Good question. I think there's no need for the user to install the integration; we can just store the raw logs, and users might be fine with that. Maybe we could suggest that the user install the Nginx integration when we detect that they're looking at Nginx logs.
Another question I have is whether we should install a destination integration when data is sent to a data stream that this integration manages. Also, would there be a way for users to set a label on their containers to give a hint about installing the integration?
Overall, I think the most important routing workflow that we should support for now is not necessarily a destination integration to register a routing rule in a source integration but to do default routing in the source integrations on the service or container name and to enable users to add custom routing rules in source integrations.
> Another question I have is whether we should install a destination integration when data is sent to a data stream that this integration manages. Also, would there be a way for users to set a label on their containers to give a hint about installing the integration?
I would decouple this completely from routing and labels. This is something we should have in general. If data is shipped to a dataset and we have an existing integration for it, we should recommend it to be installed.
> Should we expect users to install the Nginx integration manually because it's a routing destination for Kubernetes container logs? How will we surface that knowledge and make it clear to users in an instance like this?
That's a good question. While not explicitly mentioned anywhere, I think this could be one of the workflows started from the log explorer. When the user has selected the "unrefined" k8s logs data stream, we could offer a starting point to jump into the workflow that installs a specialized integration (e.g. "nginx"). From then on, the nginx docs would not show up in the k8s logs anymore, but new nginx-related data streams would show up in the data stream selector.
I wonder if we would want to start off the workflow in a generic manner or if there would be an efficient way to query which integrations have routing rules for k8s. Then we could directly offer a selection of routing-aware integrations at the start of the workflow.
> we could offer a starting point to jump into the workflow that installs a specialized integration (e.g. "nginx")
I really like this idea. On a broader scope, this recommendation could show up in Discover for the log lines, for example as an action.
I want to recenter the discussion in this issue based on the dependencies called out in https://github.com/elastic/ingest-dev/issues/1670, which we're using to track all the work that falls on ingest for Logs+.
This issue touches on two pieces of the "Phase 1" dependencies called out in the linked issue:
- Package-defined routing rules, implemented as `reroute` processors in ingest pipelines

We're also touching on a "Later phases" piece here, which is the on-the-fly creation of "custom integrations" that will wrap these user customizations, assets, etc. into an integration persisted as an `epm-package` saved object. This is, however, out of scope for the actual implementation plan in this issue. I'd suggest we move further discussions around that feature over to https://github.com/elastic/ingest-dev/issues/1670 or https://github.com/elastic/logs-dev/issues/61.
I'll be updating the description here with some links to child issues in appropriate repos for each piece of work.
Catching up here after being out for a couple weeks. This is looking great, thank you @kpollich for the diagrams.
I definitely agree that we don't need to do any implicit package installation and should only install the reroute processors if both the source and destination packages are installed. We can later enhance the UX to suggest to users to install packages as needed/desired as others suggested above.
I think it would be helpful to classify these different sources of routing rules into 3 categories, rather than 2. This is also sorted by what I believe to be the priority, highest-to-lowest:
(1) and (2) can likely share the same data model (the yaml file in the package) and should be put into the ingest pipeline in the same (managed) location. (3) can probably be represented simply as `reroute` processors in the `@custom` pipeline we already support, to avoid multiple sources of truth.
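If user-defined rules live in the `@custom` pipeline, a user rule might look something like this sketch (the pipeline name follows the existing `logs-<dataset>@custom` convention; the tag, condition, and target dataset are made up for illustration):

```
PUT _ingest/pipeline/logs-kubernetes.router@custom
{
  "processors": [
    {
      "reroute": {
        "tag": "user-rule-1",
        "if": "ctx?.container?.image?.name == 'my-app'",
        "dataset": "my_app.logs"
      }
    }
  ]
}
```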
One more aspect of this that we need to define before we begin implementation of https://github.com/elastic/kibana/issues/155910 is the ordering of the ingest pipeline. We need to know what processing should be done in which order between package-defined data processing, package-defined routing, user-defined routing, and user-defined processing. This is something that I hope @felixbarny and @ruflin can provide input on based on any experimentation being done with the pieces we landed in 8.8.
Thanks @joshdover - I think distinguishing routing rules for custom packages vs existing packages is helpful. I'll try to make this clear in the dataset customization tasks here.
@juliaElastic - This is another issue it'd be good to catch up on in preparation for your time as interim tech lead in the coming months. Let's chat about this at some point, but please catch up here when you have some spare time 🙏
@yaauie , does the logstash elasticsearch pipeline runner support the reroute processor (or plan to)? I could see this affecting the work I saw in your presentation at the all hands.
We keep talking about package-level routing rules. My assumption is that all routing rules are at the dataset level: either the dataset is the source or the target. Would all the configs happen in the dataset manifest?
I think we're just using "package level" to mean "routing rules defined somewhere in a package manifest" - not necessarily that they operate globally on a package.
We can certainly include routing rules as part of the `data_streams/foo/manifest.yml` file. My initial proposal was a top-level `routing_rules.yml` file for each package, but I think putting this at the dataset level would also make sense. Something like this for example:
```yaml
# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
streams:
  - input: logfile
    vars:
      - ...
    title: Nginx access logs
    description: Collect Nginx access logs
    routing_rules:
      - source_dataset: kubernetes.router
        if: >
          ctx?.container?.image?.name == 'nginx'
  - input: httpjson
    vars:
      - ...
```
The reason I propose to set this at the dataset level is that there is always a source and destination dataset involved; it doesn't really happen at the package level. And if no destination dataset can be found, the destination is the source dataset itself.
```mermaid
graph LR;
  A[Source Dataset];
  Doc((Doc));
  B[Destination Dataset];
  ES[(Elasticsearch)]
  Doc-->A;
  A-->B;
  A-->A;
  B-->ES;
```
In the proposal you have above, you nest the routing rules under the streams / input configs. I don't think the routing rules are related to the inputs / streams in any way. A package can define routing rules without ever having specified an input. Somewhat related: https://github.com/elastic/kibana/issues/155999
Let's go through some example use cases with the configs. I would split up the use cases slightly differently from Josh's proposal:

This can be done today as a package developer can extend their ingest pipeline. But it would be nice to have a separate definition for it so Fleet is better aware of the routing rules. As a use case, let's take k8s routing logs based on the container name. The pipeline definition could look as follows:
```yaml
reroute:
  tag: logs-k8s.router
  dataset: "{{container.image.name}}"
  namespace:
    - "{{labels.data_stream.namespace}}"
    - default
```
This pipeline would have to be installed for `logs-k8s.router-*`. The destination dataset is dynamic; we can't specify it.
These routing rules can be extended in 2 ways, for example a package like `nginx` installs a special rule, or a user adds their own. In both use cases, it can be assumed the target dataset exists as part of an integration. An exception to this is if a user adds a pattern with a variable as destination.
The use case here is that a package like `nginx` wants to accept data shipped to `logs-nginx-*` and route it based on file path or stdout/stderr to the correct dataset. In this scenario, all datasets are internal to the package. It simplifies data shipping and makes sure new nginx data is also picked up, and `nginx` can be used as a fallback.
This scenario is also possible today by just using the ingest pipeline definition, but it would be nice for package devs to have a simpler yaml definition for it. Users are not expected to add their routing rules to it.
Taking the above use cases, there are 2 different cases to cover. I suggest taking `source` as the default, as I think this is the more common scenario and the simpler one. This could lead to a config like the following:
```yaml
# k8s/data_stream/router/manifest.yml
title: K8s router for logs
type: logs
routing_rules:
  - dataset: "{{container.image.name}}"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-k8s.router-?}}
```
The "source" dataset is automatically the k8s.router
dataset. For the nginx use case where it installs it in a different dataset, we can reuse what Kyle proposed above:
```yaml
# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-k8s.router-?}}
    # This value will be automatically filled in by Fleet
    #dataset: {{dataset.name}}
```
If `source_dataset` is used, `dataset` cannot be configured differently than the dataset name. If `dataset` is configured, `source_dataset` cannot be used. It is possible that in the routing rules array both exist. Here's an example:
```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
routing_rules:
  # Routing rule for k8s
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}
    # This value will be automatically filled in by Fleet
    #dataset: {{dataset.name}}
  # Routing nginx error logs
  - if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    dataset: nginx.error
    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}
  # Routing nginx access logs
  - if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    dataset: nginx.access
    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}
```
We need to figure out how the tags are generated. I was also tempted to potentially split up the ones with source and destinations into 2 separate arrays. Something like:
```yaml
routing_rules:
  # Routing rules for the current dataset. I don't like that it requires a name.
  nginx:
    - ...
  # Routing rules for other datasets
  k8s.router:
    - ...
```
Thanks, @ruflin . I think I understand everything we're working through here, but I'm going to try and summarize the comment above with my own questions/comments appended as we go along. Forgive my long-windedness 😅
We have two categories of package-defined routing rules
Packages can define routing rules for dynamic datasets + namespaces, which are then installed under an index template that matches on the index pattern logs-{{integration}}.{{dataset}}-*
. See example below.
```yaml
# kubernetes/data_stream/router/manifest.yml
title: K8s router for logs
type: logs
routing_rules:
  - dataset: "{{container.image.name}}" # The `reroute` processor supports dynamic values from document properties
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default # This is a fallback if the document doesn't include the above property
```
This package-defined routing rule should result in the following ingest pipeline being created by Fleet:
```json
// Ingest pipeline - logs-kubernetes.router-1.2.3
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-k8s.router",
        "dataset": "{{container.image.name}}",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ]
      }
    }
  ]
}
```
Package installation will also generate an Index Template with a pattern of `logs-kubernetes.router-*`, which - if I understand correctly - is how Fleet currently works. e.g.
```json
// Index template - logs-kubernetes.router-* (truncated for brevity)
{
  "template": {
    "settings": {
      "final_pipeline": ".fleet-final-pipeline-1",
      "default_pipeline": "logs-kubernetes.router-1.2.3"
    }
  }
}
```
For this case, I have a few questions:
- Dynamic routing rules on "source" datasets like `kubernetes.router` won't have any `if` conditions, correct? They work by variable substitution based on document fields - so this pipeline will always fire on every document and attempt to route it.
- What happens when a document doesn't contain `{{container.image.name}}` in the above example? Does the processor fail or does it just not fire, resulting in unrouted documents in the `kubernetes.router` dataset?
- In Ruflin's example above, we have `tag: {{logs-k8s.router-?}}` as a "proposed" value that Fleet would intelligently fill in. What would be the expectation for that value? As far as I understand, the tag needs to be written at ingest pipeline creation time, so we can't dynamically add a more specific tag than the datastream type + dataset values. Might be missing something there, as we're discussing tag generation as an unknown in general.

The "dynamic routing rules" use case also includes routing rules defined on "destination" datasets, like Nginx, e.g.
```yaml
# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
```
The above example would result in a pipeline as follows:
```json
// Ingest pipeline - logs-kubernetes.router. "Destination" routing rules alter the "source" pipeline
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-k8s.router-nginx", // In this case, we could actually generate a dynamic tag that links back to the destination package
        "dataset": "nginx.access",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ],
        "if": "ctx?.container?.image?.name == 'nginx'"
      }
    }
  ]
}
```
I think the "destination-defined" case here is a little more straightforward than the "source-defined" case which utilizes more variables.
These would be routing rules that are "local" to a given integration. The example of routing arbitrary documents from logs-nginx-*
to a more specific data stream like logs-nginx.access
based on some condition is a good one.
So, if we defined a routing rule such as
```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
# I assume a "routing" datastream would utilize these features - https://github.com/elastic/kibana/pull/154732
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  - if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    dataset: nginx.error
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  - if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    dataset: nginx.access
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
```
It'd result in an ingest pipeline, e.g.
```json
// Ingest pipeline - logs-nginx.nginx
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-nginx.nginx",
        "dataset": "nginx.error",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ],
        "if": "ctx?.file?.path?.contains('/var/log/nginx/error')"
      }
    },
    {
      "reroute": {
        "tag": "logs-nginx.nginx",
        "dataset": "nginx.access",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ],
        "if": "ctx?.file?.path?.contains('/var/log/nginx/access')"
      }
    }
  ]
}
```
My questions/comments here are as follows:
- Is the `logs-nginx.nginx` type/dataset expected here? If we're working with a "router" data stream for this integration and putting it in `data_stream/nginx/manifest.yml` then I believe this is going to be the resolved dataset. The Kubernetes example uses `kubernetes.router` for the dataset. Maybe that's a pattern we can adapt for all data stream manifests that define routing rules?
- I'd propose we use `destination_dataset` to be explicit in routing rule definition rather than `dataset`, which aligns with what eventually lands in the generated `reroute` processor(s). Since `source_dataset` and `dataset` can appear in the same array of routing rules, I think erring on the side of explicitness would be helpful in maintainers developing an accurate mental model.

For completeness, I'll also reiterate Ruflin's final examples above and map them to the generated ingest pipelines. I've removed any proposals/speculations and stuck with concrete values based on the above context for clarity here.
```yaml
# kubernetes/data_stream/router/manifest.yml
title: K8s router for logs
type: logs
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  - dataset: "{{container.image.name}}"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
```
```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  # Route K8s container logs to the Nginx catch-all dataset
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  # Route error logs to the nginx.error dataset
  - destination_dataset: nginx.error
    if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  # Route access logs to the nginx.access dataset
  - destination_dataset: nginx.access
    if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
```
```yaml
# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
  # Route K8s container logs to the Nginx access logs data stream
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
```
```json
// Ingest pipeline - logs-kubernetes.router-1.2.3
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-k8s.router",
        "dataset": "{{container.image.name}}",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ]
      }
    },
    {
      "reroute": {
        "tag": "logs-nginx.access",
        "dataset": "nginx.access",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ]
      }
    }
  ]
}
```
```json
// Ingest pipeline - logs-nginx.nginx
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-nginx.nginx",
        "dataset": "nginx.error",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ],
        "if": "ctx?.file?.path?.contains('/var/log/nginx/error')"
      }
    },
    {
      "reroute": {
        "tag": "logs-nginx.nginx",
        "dataset": "nginx.access",
        "namespace": [
          "{{labels.data_stream.namespace}}",
          "default"
        ],
        "if": "ctx?.file?.path?.contains('/var/log/nginx/access')"
      }
    }
  ]
}
```
Does that seem accurate based on everything we've discussed above?
One thing I'm wondering is if the specific Kubernetes container logs routing rule that routes logs to nginx.access
is even necessary in this case. It seems like relying on labels/document fields in a "source" data stream will be a more common use case than defining rules in a "destination" data stream as mentioned above.
Another thing that's come to mind as I'm working on other tasks: we're going to need to be careful with {{}}
syntax in pipeline definitions. We're looking to add handlebars support to ingest pipelines in https://github.com/elastic/package-spec/issues/517 (issue still WIP) as part of the data fidelity project. Fleet will need to tolerate template variables that don't resolve in pipeline definitions for cases where data fidelity functionality is present alongside the dynamic syntax in reroute
processors.
I think since reroute processors will be generated on-the-fly by Fleet, this should be okay, but it's still worth pointing out that we'll have a syntax collision here as proposed.
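To make the collision concrete, here is a hypothetical pipeline fragment (the file name and field values are illustrative, not from any real package) where the two templating layers meet: one `{{...}}` expression should be resolved by Fleet at install time, while the other must survive untouched so Elasticsearch can resolve it per document at ingest time.

```yaml
# Hypothetical: nginx/elasticsearch/ingest_pipeline/default.yml
processors:
  - set:
      field: event.module
      # Handlebars-style variable: would be resolved by Fleet at install time
      value: "{{package.name}}"
  - reroute:
      # Reroute field reference: must be left as-is so Elasticsearch
      # resolves it per document at ingest time
      dataset: "{{container.image.name}}"
```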
- Dynamic routing rules on "source" dataset like
kubernetes.router
won't have anyif
conditions, correct? They work by variable substitution based on document fields - so this pipeline will always fire on every document and attempt to route it.
I think that will mostly be the case but there may also be situations where you have both an if
condition and a field reference in the target dataset
or namespace
. For example "route to {{data_stream.dataset}}
, unless it has the same value as the current dataset".
- What happens when a document doesn't contain
{{container.image.name}}
in the above example? Does the processor fail or does it just not fire, resulting in unrouted documents in thekubernetes.router
dataset?
It would use the current dataset as a default (kubernetes.router
)
See also https://www.elastic.co/guide/en/elasticsearch/reference/master/reroute-processor.html
- In Ruflin's example above, we have
tag: {{logs-k8s.router-?}}
as a "proposed" value that Fleet would intelligently fill in. What would be the expectation for that value? As far as I understand, the tag needs to be written at ingest pipeline creation time, so we can't dynamically add a more specific tag than the datastream type + dataset values. Might be missing something there, as we're discussing tag generation as an unknown in general.
Good question. I'm not sure if we can auto-assign a tag. I was thinking that we might have an id
field that the package dev needs to set that's required to be unique. We could use that as the value for tag
.
// Ingest pipeline - logs-kubernetes.router-1.2.3
The ordering of the processors is not right here. The first reroute processor doesn't have a condition - therefore it's always going to be executed and will short-circuit the rest of the pipeline.
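For reference, a corrected sketch of that pipeline would move the conditional nginx rule ahead of the conditionless catch-all (same processors, reordered):

```json
// Ingest pipeline - logs-kubernetes.router-1.2.3 (corrected ordering sketch)
{
  "processors": [
    {
      "reroute": {
        "tag": "logs-nginx.access",
        "dataset": "nginx.access",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "namespace": ["{{labels.data_stream.namespace}}", "default"]
      }
    },
    {
      "reroute": {
        "tag": "logs-k8s.router",
        "dataset": "{{container.image.name}}",
        "namespace": ["{{labels.data_stream.namespace}}", "default"]
      }
    }
  ]
}
```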
I keep stumbling over the `source`/`destination` definition, especially that the syntax is not exactly the same as in the pipeline, which I think will cause problems down the line. Here's an alternative idea:
```yaml
# Routing rules defined for THIS dataset
routing_rules:
  # Route error logs to the nginx.error dataset
  - dataset: nginx.error
    if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  # Route access logs to the nginx.access dataset
  - dataset: nginx.access
    if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

# Routing rules defined for a different source dataset
source_routing_rules:
  # Routing rules for k8s
  k8s.router:
    # Route K8s container logs to the Nginx catch-all dataset
    # Dataset must be validated to be the same as the current dataset
    - dataset: "nginx"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Made up example, ignore that it looks like k8s, it is to make the point
  # that multiple datasets can be specified
  syslog:
    # Dataset must be validated to be the same as the current dataset
    - dataset: "nginx"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
```
I consider the source
use case the more complex one and the one that is less often used.
Packages can define routing rules for dynamic datasets + namespaces, which are then installed under an index template that matches on the index pattern
logs-{{integration}}.{{dataset}}-*
.
Integrations should also be able to route to logs-{{dataset}}-*
and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-*
index pattern. See also https://github.com/elastic/elasticsearch/issues/95538
But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*
. That's because the reroute
processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}"
, for example. Do you think that would be required?
- Is the
logs-nginx.nginx
type/dataset expected here? If we're working with a "router" data stream for this integration and putting it indata_stream/nginx/manifest.yml
then I believe this is going to be the resolved dataset.
The nginx.nginx
dataset looks a little strange tbh. I'd expect either just nginx
or nginx.router
.
The Kubernetes example uses
kubernetes.router
for the dataset. Maybe that's a pattern we can adapt for all data stream manifests that define routing rules?
I'm going back-and-forth on that. I think it makes sense as a convention for datasets that aren't expected to contain any data and just do routing. However, some datasets, such as syslog
may not contain any default routing rules but users may choose to add some.
- I'd propose we use
destination_dataset
to be explicit in routing rule definition rather thandataset
which aligns with what eventually lands in the generatedreroute
processor(s). Sincesource_dataset
anddataset
can appear in the same array of routing rules, I think erring on the side of explicitness would be helpful in maintainers developing an accurate mental model.
I think we should not have them both appear in the same array. I'd even split these into different files: one file that has all the routing rules that go to the routing pipeline of the current dataset, and another file that lets you add routing rules to other datasets. To me, that's the main distinction between the two different use cases rather than dynamic vs static rules: routing rules for the current dataset and rules that are injected into other datasets.
For rules that are injected into other datasets, we'll need to add a priority concept so that they're sorted accordingly. The ordering also necessitates having an identity for routing rules. The injected rules should also always go before any routing rules that the source dataset has defined itself. We probably also want to have a dedicated pipeline for the injected routing rules.
To keep things simple for now, I think we should focus on the routing rules that are just added to the same dataset and not spend too much time on implementing the rule injection.
I think that will mostly be the case but there may also be situations where you have both an if condition and a field reference in the target dataset or namespace. For example "route to {{data_stream.dataset}}, unless it has the same value as the current dataset".
👍
It would use the current dataset as a default (kubernetes.router) See also https://www.elastic.co/guide/en/elasticsearch/reference/master/reroute-processor.html
Perfect. Users will just need to be aware that documents that fail to route will remain in the "sink" data set. No real concerns here on my end.
Good question. I'm not sure if we can auto-assign a tag. I was thinking that we might have an id field that the package dev needs to set that's required to be unique. We could use that as the value for tag.
The dataset value should be guaranteed unique by package validation, e.g. nginx.access
can't appear in multiple packages. If we append a unique routing rule name/ID to that I think it'd be the most valuable option. It's something that might be annoying to integration maintainers though - coming up with a unique name for all their rules is a bit of a burden.
The ordering of the processors is not right here. The first reroute processor doesn't have a condition - therefore it's always going to be executed and will short-circuit the rest of the pipeline.
Good catch. I wasn't sure on how we'd want to guarantee order here. Should the order be based on the order of processors as they appear in the YAML, with conditionless processors pushed to the end of the list? Part of me just wants to honor the order as they appear in the integration, but again it's more burden on the maintainers to understand the implementation details of reroute processors.
I keep stumbling over the source/destination definition, especially that the syntax is not exactly the same as in the pipeline which I think will cause problems down the line. Here an alternative idea:
I'm not 100% sure about aligning package spec fields exactly with Elasticsearch APIs, fields, etc. It's not something we've been consistent about, but maybe that should change here.
I do like the example of splitting these rules into different arrays rather than trying to reason about a mixture of use cases in a single list. Then, like you mentioned, we don't have to introduce new names for the existing concept of dataset
on routing rules - it always means the same thing it does in the reroute processor docs.
If `dataset` values under `source_routing_rules` are always guaranteed to be the current dataset via validation, does it make sense to even include that field, or can this be something we document in the spec and prevent user input entirely for that field?
Integrations should also be able to route to logs-{{dataset}}-* and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-* index pattern. See also https://github.com/elastic/elasticsearch/issues/95538
Integration assets generated by EPM are prefixed in most cases with the integration name. Would this mean Fleet needs to create an index template with a different pattern
for some cases like this?
But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*. That's because the reroute processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}", for example. Do you think that would be required?
I don't think I completely follow this. Could you provide an example of what this routing rule setup would look like or a use case?
The nginx.nginx dataset looks a little strange tbh. I'd expect either just nginx or nginx.router.
The plain nginx
value would require a special case implemented in the package spec. nginx.router
will work as expected with no additional implementation, so I'm in favor of that.
I'm going back-and-forth on that. I think it makes sense as a convention for datasets that aren't expected to contain any data and just do routing. However, some datasets, such as syslog may not contain any default routing rules but users may choose to add some.
Hmm it actually might make sense that we need to support a dataset that's only the integration name if we have routing rules with dataset: {{container.image.name}}
now that I think about it. We'll have to have a dataset of just nginx
or apache
in order to route logs based on that value.
I think we should not have them both appear in the same array. I'd even split these to different files. One file that has all the routing rules that go do the routing pipeline of the current dataset and another file that lets you add routing rules to other datasets. To me, that's the main distinction between the two different use cases rather than dynamic vs static rules: Routing rules for the current dataset and rules that are injected into other datasets.
Yeah I'm +1 on splitting these rules into two distinct lists.
For rules that are injected into other datasets, we'll need to add a priority concept so that they're sorted accordingly. The ordering also necessitates having an identity for routing rules. The injected rules should also always go before any routing rules that the source dataset has defined itself. We probably also want to have a dedicated pipeline for the injected routing rules.
To keep things simple for now, I think we should focus on the routing rules that are just added to the same dataset and not spend too much time on implementing the rule injection.
Fair enough. Using Ruflin's example above we'd focus first on supporting routing_rules
and then move on to supporting source_routing_rules
as a second pass. I'm fine with that approach and it helps us narrow the scope for the initial implementation here.
The plain nginx value would require a special case implemented in the package spec. nginx.router will work as expected with no additional implementation, so I'm in favor of that.
This is possible today, you have to set dataset: nginx
in the manifest.yml
.
I took a pass at updating https://github.com/elastic/package-spec/issues/514 based on the conversation above and a quick offline chat I had with @ruflin. I think the key part of this is the example manifest.yml
file, which I'll copy/paste here for reference
```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
# This is a catch-all "sink" data stream that routes documents to
# other datasets based on conditions or variables
dataset: nginx
# Ensures agents have permissions to write data to `logs-nginx.*-*`
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  # Route error logs to `nginx.error` when they're sourced from an error logfile
  - dataset: nginx.error
    if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  # Route access logs to `nginx.access` when they're sourced from an access logfile
  - dataset: nginx.access
    if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
injected_routing_rules:
  # Route K8s container logs to this catch-all dataset for further routing
  k8s.router:
    - dataset: nginx # Note: this _always_ has to be the current dataset - maybe we can infer this?
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Route syslog entries tagged with nginx to this catch-all dataset
  syslog:
    - dataset: nginx
      if: "ctx?.tags?.contains('nginx')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
```
Integrations should also be able to route to logs-{{dataset}}-* and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-* index pattern. See also https://github.com/elastic/elasticsearch/issues/95538
Integration assets generated by EPM are prefixed in most cases with the integration name. Would this mean Fleet needs to create an index template with a different pattern for some cases like this?
I hope that's not what it means. I was thinking that we'd just rely on the `logs-*-*` index template that's embedded in ES rather than setting up a more specific index template with Fleet. But that means there'll be a difference between data streams that are set up via Fleet and the ones that just use the default index template in Elasticsearch.
Maybe that's ok. If it's not, we'll need to think about how we could prefix the dataset with the integration name or how to add features to ES that would allow us to rely on the built-in logs-*-*
index template. I guess the main thing that we need to do is to mirror the component template and ingest pipeline extension points.
But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*. That's because the reroute processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}", for example. Do you think that would be required?
I don't think I completely follow this. Could you provide an example of what this routing rule setup would look like or a use case?
Let's take the following reroute processor as an example:
```yaml
- reroute:
    dataset: "{{service.name}}"
```
The resulting data stream would look like logs-{{service.name}}-default
. We can't set up index templates for that in Fleet as we have no control over the service.name
field that's sent via the documents.
The reroute processor doesn't support something like this:
```yaml
- reroute:
    dataset: "foo.{{service.name}}"
```
The example `manifest.yml` looks good to me. Out of personal preference, I'd create dedicated files for the `routing_rules` and `injected_routing_rules` sections, as I find that more consistent with what we're doing for ingest pipeline definitions. But whatever feels more intuitive to developers that will actually use these features in anger is fine with me.
Good catch. I wasn't sure on how we'd want to guarantee order here. Should the order be based on the order of processors as they appear in the YAML, with conditionless processors pushed to the end of the list? Part of me just wants to honor the order as they appear in the integration, but again it's more burden on the maintainers to understand the implementation details of reroute processors
I think for the routing_rules
section, the answer is relatively simple: they should just be used in the same order as they're specified. Any injected routing rule should be executed before the integration's own routing_rules
. That's because the routing_rules
often include a catch-all rule that always gets executed. If multiple integrations want to inject routing rules into the same routing dataset (for example, both the nginx and the apache integrations want to inject rules into k8s), we might need to expose a way for package developers to define a precedence. However, most of these injected rules should be mutually exclusive, so the ordering shouldn't matter. But it may have performance implications which rules are executed first vs. last. I'm inclined to not add this to the initial scope for `injected_routing_rules`
and see if a random or alphabetical order is good enough.
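The ordering described above could be sketched as follows in Python (the function name and rule shapes are illustrative, not Fleet's actual implementation): injected rules run first, using alphabetical order by injecting package as a simple tiebreaker, followed by the dataset's own rules in declared order so a catch-all rule stays last.

```python
def build_reroute_processors(local_rules, injected_rules):
    """Assemble reroute processors for a routing dataset's pipeline.

    local_rules: rules from the dataset's own `routing_rules`,
        kept in declared order (a catch-all usually comes last).
    injected_rules: {injecting_package: [rules]} from other packages'
        `injected_routing_rules`; executed before local rules, ordered
        alphabetically by package name as a simple tiebreaker.
    """
    processors = []
    for package in sorted(injected_rules):
        for rule in injected_rules[package]:
            processors.append({"reroute": rule})
    for rule in local_rules:
        processors.append({"reroute": rule})
    return processors

# Example: nginx injects a conditional rule into the k8s router dataset,
# which itself only has a conditionless catch-all rule.
local = [{"dataset": "{{container.image.name}}"}]
injected = {"nginx": [{"dataset": "nginx", "if": "ctx?.container?.image?.name == 'nginx'"}]}
pipeline = build_reroute_processors(local, injected)
# The injected nginx rule precedes the catch-all, so it gets a chance to match.
```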
I like how the two routing rule concepts have been narrowed down. But I wonder if there even is a need for the routing_rules
in the package manifest. Would it make sense to instead only have "injected" rules? In this case it would mean that the nginx.access
manifest can specify to inject rules into nginx
the same way it can inject them into k8s
. Then what do we need the other rules "direction" for? What am I missing?
What I haven't been able to find in the description so far is whether the installation of the routing rules always happens or if the user gets a choice of which of the available routing rules they want to inject into the "source integration".
If we didn't make installing the rules opt-in, the user couldn't easily install the k8s integration in parallel to the nginx integration without them influencing each other. Wouldn't that be a valid use-case too?
Would it make sense to instead only have "injected" rules?
For the scenario in that particular example I think you're right. But it's needed for use cases like these:
```yaml
type: logs
dataset: k8s
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  - dataset: "{{kubernetes.container.name}}"
```
Hmm, good point about making routing rule injection opt-in. I guess that's another reason why we'd want to have both ways: injected and local routing rules as we can rely on local rules to always be installed. So while the nginx.access
and nginx.error
data sets could inject routing rules to nginx
, if we make injection optional, we can't rely on the rules being installed.
Would it make sense to instead only have "injected" rules? In this case it would mean that the nginx.access manifest can specify to inject rules into nginx the same way it can inject them into k8s.
The reason I like it in the manifest is because routing rules as ingest pipelines are more an implementation detail, and I would prefer that package devs do not have to think through where to put the rules in ingest pipelines. Having it separate will also allow us to "manage" these rules and show them to our users without having to read ingest pipelines.
making routing rule injection opt-in
This is a more generic feature I would like to see in the package manager: users have an option to remove some of the assets or not install them - for example, dashboards that are not needed, or routing rules. If needed later, they can be added.
But it's needed for use cases like these: [...]
dataset: {{kubernetes.container.name}}
[...]
Isn't it only needed because there is no k8s.container
(or similar) dataset in the k8s package that could inject the rule into the k8s
dataset's pipeline?
in the manifest is because routing rules as ingest pipeline is more an implementation detail
I agree, and I'm not making an argument for adding it to the ingest pipeline directly. I was suggesting that we might get by with just the "injected" rules if we define them in the manifest of the "leaf" data streams instead of the package.
The downside would be that we'd need to add some datasets only for the purpose of routing, but on the upside we'd only have a single way to write rules.
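To illustrate that idea (purely hypothetical - this manifest and its rules don't exist in any package), a leaf data stream could inject a rule into its own package's catch-all dataset the same way it injects into another package's router:

```yaml
# Hypothetical: nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
injected_routing_rules:
  # Inject into this package's own catch-all dataset...
  nginx:
    - dataset: nginx.access
      if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
  # ...the same way rules are injected into another package's router
  k8s.router:
    - dataset: nginx.access
      if: "ctx?.container?.image?.name == 'nginx'"
```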
We are moving forward with a solution for "document-based routing" based on a new ingest pipeline processor called the `reroute` processor. Fleet will be responsible for managing routing rules defined by packages and the end user, and for updating ingest pipelines to include the new `reroute` processor to apply the rules.

Links

Overview diagram

Integration-defined routing rules

Integrations will be able to define routing rules about how data from other integrations or data streams should be routed to their own data stream. For example, the `nginx` package may define a routing rule for the `logs-kubernetes.container_logs` data stream to route logs to `logs-nginx.access` whenever `container.image.name == "nginx"`. Similarly, when the `kubernetes` package is installed and `nginx` was also previously installed, we'll need to ensure the `logs-kubernetes.router-{version}` ingest pipeline includes a `reroute` processor for each routing rule defined on the `nginx` integration.

To support this, we'll need to add a concept of routing rules to the package spec and add support for them in Fleet.
Supporting tasks
We'll need to do a few things in support of these changes as well, namely around API key permissions.
cc @ruflin @felixbarny