[Fleet] Support for document-based routing via ingest pipelines

elastic / kibana

[Fleet] Support for document-based routing via ingest pipelines #151898

Closed joshdover closed 1 year ago

joshdover commented 1 year ago

We are moving forward with a solution for "document-based routing" based on a new ingest pipeline processor, called the reroute processor. Fleet will be responsible for managing routing rules defined by packages and the end user, and updating ingest pipelines to include the new reroute processor to apply the rules.

Overview diagram

Integration-defined routing rules

Integrations will be able to define routing rules about how data from other integrations or data streams should be routed to their own data stream. For example, the nginx package may define a routing rule for the logs-kubernetes.container_logs data stream to route logs to logs-nginx.access whenever container.image.name == "nginx". Similarly, when the kubernetes package is installed and the nginx package was previously installed, we'll need to ensure the logs-kubernetes.router-{version} ingest pipeline includes a reroute processor for each routing rule defined on the nginx integration.

To support this, we'll need to add a concept of routing rules to the package spec and add support for them in Fleet.

- [ ] https://github.com/elastic/package-spec/issues/514
- [ ] https://github.com/elastic/kibana/issues/155910
- [ ] https://github.com/elastic/kibana/issues/157422
- [ ] https://github.com/elastic/ingest-docs/issues/276

Supporting tasks

We'll need to do a few things in support of these changes as well, namely around API key permissions.

- [ ] https://github.com/elastic/kibana/issues/134971
- [ ] https://github.com/elastic/package-spec/issues/315

cc @ruflin @felixbarny

elasticmachine commented 1 year ago

Pinging @elastic/fleet (Team:Fleet)

ruflin commented 1 year ago

How will we manage the dependency web between packages during installation?

We are thinking of it as a dependency because it is the same ingest pipeline for the first implementation. But ideally, there would be a routing API and integrations would just add their bits to it, not necessarily creating a dependency. What if we build this API in Fleet for now that manages the rules and creates the pipeline out of it? What does it mean for the package spec? I'm hoping we can mostly stay away from having dependencies.

joshdover commented 1 year ago

Yeah to be clear, I didn't mean that packages would declare dependencies on one another. But there will be things to consider to ensure that every time a package is installed, all the appropriate pipelines are updated.

joshdover commented 1 year ago

I synced on this today with Observability and discussed what the next steps should be. We came to the conclusion that we should work on a design document that includes the following

Package spec for routing rules

We need to allow packages to define routing rules for how data should be routed from other integration data streams to the integration defining the rule. For example, the Nginx integration should be able to define a rule for the Docker integration on how to detect a document that contains an nginx log and route it to the nginx.access data stream.

While the underlying implementation of the routing rule will be part of the integration's ingest pipeline, we want Fleet/EPM to have control over the order of when routing decisions happen. For these 2 reasons, routing rules should not be defined as part of an integration data stream's regular ingest pipeline. Instead they need to be defined in a separate file in the package.

We should also not abstract away too much of the underlying reroute processor's options. Integrations should be able to use the full power of that processor's capabilities, including using painless to define the rule's condition.

Ingest pipeline design

We need to specify the order in which the following things will be executed in the pipeline:

The integration-defined processing logic
The integration-defined routing rules
User-defined processing logic
User-defined routing rules
Default rerouting that should happen for all data streams (maybe)
- This is essentially re-routing a document to the right data stream if the data_stream.dataset in the document doesn't match the current data stream.
- This is to handle bugs in shippers and may not be strictly necessary.
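
The ordering question above can be pictured as a function that assembles the top-level pipeline from named parts. This is a minimal sketch that assumes the order exactly as listed, which is still open for discussion. The sub-pipeline suffixes (`@routing`, `@custom`, `@custom-routing`) and the `buildPipelineProcessors` helper are hypothetical names, not an agreed convention; only the `pipeline` and `reroute` processors themselves are existing Elasticsearch processors.

```typescript
// Hypothetical helper types; processor bodies are kept loosely typed on purpose.
type Processor = Record<string, Record<string, unknown>>;

// Call a sub-pipeline, tolerating the case where it has not been created yet.
function callSubPipeline(name: string): Processor {
  return { pipeline: { name, ignore_missing_pipeline: true } };
}

function buildPipelineProcessors(
  dataStream: string,
  integrationProcessors: Processor[],
  includeDefaultReroute: boolean
): Processor[] {
  const processors: Processor[] = [
    ...integrationProcessors,                        // integration-defined processing
    callSubPipeline(`${dataStream}@routing`),        // integration-defined routing rules
    callSubPipeline(`${dataStream}@custom`),         // user-defined processing
    callSubPipeline(`${dataStream}@custom-routing`), // user-defined routing rules
  ];
  if (includeDefaultReroute) {
    // A reroute processor without options routes on the document's own data_stream.* fields.
    processors.push({ reroute: {} });
  }
  return processors;
}
```

Keeping each stage in its own sub-pipeline would let Fleet rewrite one stage without touching the others, at the cost of more pipelines to name and track.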

We need to strike a balance between allowing flexibility and preventing user customization from conflicting with processing and routing defined by integrations.

This design needs to be tested against real use cases (to be provided by Observability).

Part of this should also include the naming convention used for the different pipelines and sub-pipelines (eg @routing).

Fleet API for adding user-defined routing rules

Users need to be able to add their own routing rules. We may want to offer an API in Fleet/EPM to make managing this simpler. Later a UI may be built on top of this.

The design of this API should include how ingest pipelines are updated to apply new rules or remove deleted ones.

Internal tracking of rules and ingest pipeline updating procedure

We need to define how Fleet will internally track rules defined in packages and by end users, and how those rules will be installed into ingest pipelines with 0 disruption to ingestion.

This needs to include how package installation will work when new rules are added, how package upgrades will work, and how package uninstallation will work.
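
One possible shape for this tracking is to record an owner on every rule and to rebuild the full set of reroute processors for a source dataset whenever a package is installed, upgraded, or removed. The sketch below is only an illustration under that assumption; `TrackedRule`, `rebuildRouting`, and the `"user"` owner value are hypothetical and not part of Fleet.

```typescript
// Hypothetical internal model: every rule records which package (or user) owns it.
interface TrackedRule {
  owner: string;         // package name, or "user" for end-user rules
  sourceDataset: string; // dataset whose pipeline receives the reroute processor
  targetDataset: string; // dataset the matching documents are routed to
  condition: string;     // painless condition
}

type RerouteProcessor = {
  reroute: { tag: string; if: string; dataset: string };
};

// Rebuild all reroute processors for one source dataset from the rules
// whose owners are currently installed.
function rebuildRouting(
  rules: TrackedRule[],
  installedOwners: string[],
  sourceDataset: string
): RerouteProcessor[] {
  return rules
    .filter((r) => r.sourceDataset === sourceDataset && installedOwners.indexOf(r.owner) !== -1)
    .map((r) => ({
      reroute: { tag: r.owner, if: r.condition, dataset: r.targetDataset },
    }));
}
```

Because the list is always derived from the tracked rules, uninstalling a package amounts to dropping its name from `installedOwners` and writing the rebuilt pipeline in a single update.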

hop-dev commented 1 year ago

@joshdover I'd be interested in what we see the user experience being here.

As a user with an nginx docker container in kubernetes, I might be tempted to first look for the nginx integration for example, but the standard flow would then prompt me for the log paths etc which is unnecessary.

or if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?

It's almost like we need a dedicated kubernetes onboarding which guides the user to select the kinds of containers they will be deploying and installs the matching integrations first then the kubernetes integration, that way the user starts re-routing data straight away.

As a user I would also want to see if my data is successfully being re-routed, we could consider adding a field when we re-route the data to track the origin of the data and we could then aggregate on it to give the user a summary somewhere of data that is being re-routed.
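
The origin-tracking idea could look roughly like the sketch below: a `set` processor copies the current dataset into a marker field just before the `reroute` processor runs, and a terms aggregation summarises rerouted documents by origin. The field name `labels.rerouted_from` and both helper names are made up for illustration, and the aggregation assumes the field ends up mapped as a keyword.

```typescript
type OriginProcessor = Record<string, Record<string, unknown>>;

// Hypothetical marker field; not an agreed or existing field name.
const ORIGIN_FIELD = "labels.rerouted_from";

// Build the pair of processors for one routing rule.
function rerouteWithOrigin(condition: string, targetDataset: string): OriginProcessor[] {
  return [
    // Record the dataset the document arrived in, guarded by the same condition.
    { set: { if: condition, field: ORIGIN_FIELD, copy_from: "data_stream.dataset" } },
    // Then hand the document over to the target dataset.
    { reroute: { if: condition, dataset: targetDataset } },
  ];
}

// Search request body that counts rerouted documents per origin dataset.
const reroutedSummaryQuery = {
  size: 0,
  aggs: { by_origin: { terms: { field: ORIGIN_FIELD } } },
};
```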

felixbarny commented 1 year ago

Good points @hop-dev. I'd hope we can come up with a solution that doesn't require the user to manually install integrations. Instead, we'd either pre-install or auto-install the integration when data flows to a corresponding data stream. IIRC, Josh did a spacetime on this.

hop-dev commented 1 year ago

Ah yes I was involved in a spacetime project in this area: https://github.com/elastic/observability-dev/issues/2100

Slightly different to this use case, we used annotations on the containers themselves to prompt the installation of the integration, then configured the integration to consume the data directly.

joshdover commented 1 year ago

Actually @hop-dev did the auto-install spacetime :) I did a related one to suggest packages to install based on signals in the common data. Both could be potential options here.

I suspect this is important as well, and ideally we don't build a solution that only works for Kubernetes sources.

I wonder if we could do some 80/20 solution where we pre-install just the ES assets for integrations that define routing rules for another integration. We can then watch for data showing up in those data streams and install the dashboards/other Kibana objects at that time.

felixbarny commented 1 year ago

pre-install just the ES assets for integrations that define routing rules for another integration. We can then watch for data showing up in those data streams and install the dashboards/other Kibana objects at that time.

This makes a lot of sense to me

weltenwort commented 1 year ago

I wonder if we could do some 80/20 solution where we pre-install just the ES assets for integrations that define routing rules for another integration.

Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?

For @hop-dev's user journey question that would imply that the routing configuration is tied to the lifecycle of the source integration (such as k8s).

joshdover commented 1 year ago

Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?

I think we're saying the same thing? So if a user installs the k8s integration, Fleet would install the ES assets for all (or most popular) integrations that have routing rules specified for k8s container logs. I imagine that the registry APIs would need to expose a list of other integrations that have routing rules for a particular integration.

For @hop-dev's user journey question that would imply that the routing configuration is tied to the lifecycle of the source integration (such as k8s).

Good point, this would be one of the caveats. We could probably come up with ways to refresh this periodically, but it's a little less than ideal.

I do suspect that users want more control over this - and we need a way to show them more integrations that may have routing rules that are relevant for their use case. This is where I think the design around auto-suggestion of integrations to install may be helpful, where routing only happens once the user has decided to install an integration to enhance the data that is extracted.

Relevant spacetime: https://github.com/elastic/observability-dev/issues/2132

joshdover commented 1 year ago

Synced with @kpollich about this today.

User flow

We discussed the general concept and UX flow from Discover:

We expect users to be able to use filters in Discover to define a set of data and then choose to create a new dataset/datastream for the documents that match the filter.
Discover will need a Fleet API to call when the user chooses to create this new dataset. At a high level, this API should allow Discover to:
- Create a new index template & ingest pipeline for the new dataset (if the target dataset does not exist)
- Persist the routing rule in a Kibana Saved Object
- Update the source dataset's ingest pipeline with the reroute processor to route logs to the new dataset
On package upgrades of the base package, the "forked" datastream(s) for the new dataset should have their base index template upgraded.
You should be able to route to datasets that already exist

Example

You start in Discover, choose logs-*, choose kubernetes.container_logs data set
Filter for logs on container.image.name: foo
You need to parse a specific field from the log message that is only in logs from the foo image
Write a new grok processor to extract the field from message
Save the new processor in a new dataset as logs-foo-default
Discover calls Fleet API to:
- Create a new logs-foo-* index template and ingest pipeline
- Add a reroute processor to the logs-kubernetes.container_logs ingest pipeline to route logs to logs-foo

Questions & some potential answers

How do we describe the routing rule in the Fleet API? Painless vs. high-level option like KQL
- Supporting raw painless would make the conversion from Discover filter to painless the responsibility of the client of the Fleet API.
- Using a high-level option like KQL would bring that responsibility to the Fleet side, and potentially limit the power of the routing rule API.
What is the base index template?
- Based on the source dataset + @custom
Which saved object are routing rules saved to?
- We agreed that rules should be stored on the target dataset/package/package policy, not the source or "sink"
- We want to use a common model for package-defined and user-defined routing rules. Making user-defined rules defined on "custom" packages would be one way to achieve this.
- We may also consider short-term options like saving customizations in a particular component template as the source of truth for now. We expect "custom packages" to touch many pieces of the code base.
Should the ingest pipeline be reused from the base datastream?
- ~~Yes, logs should always be parsed the same regardless of if they're routed through the "sink" data stream or directly to the specialized data stream.~~
- No, not all transformations are idempotent and we also don't want to pay the processing penalty twice.
What should the APIs be for adding new routing rules and customizing datasets? We probably need at least:
- CRUD for routing rules
- CRUD for datasets
The requirements for "structure all logs" will also require similar APIs for (out of scope of this issue):
- CRUD for dataset mappings
- CRUD for dataset pipeline processors
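
On the painless vs. KQL question above, the simplest case (a single `field: value` equality filter) is easy to translate on either side of the API. The following is a rough sketch of that one case only, not a KQL parser; `equalityFilterToPainless` is a hypothetical helper, and it assumes the document stores the field as nested objects rather than as a single dotted key.

```typescript
// Convert a simple `field: value` equality filter into a null-safe painless condition.
function equalityFilterToPainless(field: string, value: string): string {
  // "container.image.name" -> "container?.image?.name"
  const path = field.split(".").join("?.");
  // Escape backslashes first, then single quotes, so the value stays one painless string literal.
  const escaped = value.replace(/\\/g, "\\\\").replace(/'/g, "\\'");
  return `ctx?.${path} == '${escaped}'`;
}
```

For example, `equalityFilterToPainless("container.image.name", "nginx")` yields `ctx?.container?.image?.name == 'nginx'`, the same condition used in the nginx examples in this thread. Anything beyond equality (ranges, wildcards, boolean combinations) would need a real translation layer, which is where the trade-off between the two options starts to matter.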

joshdover commented 1 year ago

@grabowskit @ruflin can you confirm our understanding of the UX and the APIs we're discussing?

felixbarny commented 1 year ago

We expect users to be able to use filters in Discover to define a set of data and then choose to create a new dataset/datastream for the documents that match the filter.

Using KQL to create routing rules sounds like a cool feature but also adds complexity. Could we start with a workflow where users manually create a painless condition?

Discover will need a Fleet API to call when the user chooses to create this new dataset. At a high level, this API should allow Discover to:

Create a new index template & ingest pipeline for the new dataset (if the target dataset does not exist)

Do we need to already create a new index template at that point? We could just rely on the default index template for logs-*-*.

One challenge is that a routing rule can create a variable number of datasets. For example, when routing on {{service.name}}, we don't know all datasets beforehand.

By relying on the logs-*-* template, we don't need to know the concrete datasets beforehand.
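
To make the dynamic case concrete, here is a sketch of a reroute processor that routes on `{{service.name}}` with a static fallback, together with a small local approximation of how the target dataset would be picked. The `resolveDataset` function is purely illustrative and is not Elasticsearch's implementation; in particular it does not model any sanitising of invalid characters in the resolved name, and the fallback value `generic` is an assumption.

```typescript
// Processor definition: try service.name first, fall back to a static dataset.
const routeByServiceName = {
  reroute: {
    tag: "route-by-service-name",
    dataset: ["{{service.name}}", "generic"],
  },
};

type Doc = { [key: string]: unknown };

// Read a dotted path such as "service.name" from a nested document.
function getField(doc: Doc, path: string): unknown {
  let current: unknown = doc;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) return undefined;
    current = (current as Doc)[key];
  }
  return current;
}

// Simplified approximation: return the first candidate that resolves to a value.
function resolveDataset(doc: Doc, candidates: string[]): string | undefined {
  for (const candidate of candidates) {
    const match = /^\{\{(.+)\}\}$/.exec(candidate);
    if (match === null) return candidate; // static value
    const value = getField(doc, match[1]);
    if (typeof value === "string" && value.length > 0) return value;
  }
  return undefined;
}
```

A document with `service.name: "checkout"` would resolve to the dataset `checkout`, and so to a data stream such as logs-checkout-default that only the logs-*-* pattern can match ahead of time.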

Persist the routing rule in a Kibana Saved Object

Update the source dataset's ingest pipeline with the reroute processor to route logs to the new dataset

Why are we persisting routing rules as saved objects and not just as a processor in the routing pipeline? Not having a single source of truth could lead to the reroute processors and saved objects getting out of sync.

On package upgrades of the base package, the "forked" datastream(s) for the new dataset should have its base index template upgraded. What is the base index template? Based on the source dataset + @custom

I was thinking that we could also just rely on the logs-*-* index template for that.

Which saved object are routing rules saved to?

We agreed that rules should be stored on the target dataset/package/package policy, not the source or "sink"

I'm not sure if that's possible. The reroute processor needs to be added to the pipeline of the sink. Also, a single routing rule can result in creating multiple datasets that are unknown at the time the routing rule is created.

Should the ingest pipeline be reused from the base datastream? Yes, logs should always be parsed the same regardless of if they're routed through the "sink" data stream or directly to the specialized data stream.

Hm, I see where you're coming from. But if both the sink and the destination have the same pipeline, we would do the same transformation twice and not all transformations are idempotent.

joshdover commented 1 year ago

@felixbarny One thing that is clear from the discussion so far is that the first priority use case is not defined. I'm not sure if we first want to target the simpler use case of "route these specific logs to a new data stream" or the more complex "route all of the logs to multiple data streams based on field X". Let's call these:

1. Single target data stream
2. Multiple target data streams, based on a dynamic field

Eventually we'll need to support both, but which are we starting with?

Do we need to already create a new index template at that point? We could just rely on the default index template for logs-*-*.

One challenge is that a routing rule can create a variable number of datasets. For example, when routing on {{service.name}}, we don't know all datasets beforehand.

By relying on the logs-*-* template, we don't need to know the concrete datasets beforehand.

You're right, I was focusing on use case (1). If there's not a specific target data stream for a routing rule, I agree that the default template can be relied on.

Why are we persisting routing rules as saved objects and not just as a processor in the routing pipeline? Not having a single source of truth could lead to the reroute processors and saved objects getting out of sync.

I'm not sure if that's possible. The reroute processor needs to be added to the pipeline of the sink. Also, a single routing rule can result in creating multiple datasets that are unknown at the time the routing rule is created.

I'm thinking ahead to this "export as a package" aspect. For that to work, I think a specific package needs to own the routing rule, and for use case (1), the destination package/data stream should be the one to own the rule, not the sink. We could use the routing pipeline as the source of truth, but for use case (1) there still needs to be some link between the processor and the destination data stream. Maybe the contents of the reroute processor itself is good enough.

You are right that the reroute processor itself would always be in the "sink" data stream's ingest pipeline, even if it's owned by another package/data stream.

I'm not sure how to think of "export a package" for more dynamic routing rules that fan out to multiple data streams (use case (2)). Do you think these would be a better fit as @custom extensions to the sink data stream or should there also be a backing package for these?

Hm, I see where you're coming from. But if both the sink and the destination have the same pipeline, we would do the same transformation twice and not all transformations are idempotent.

Yeah I thought about this more last night and I agree, this will be too hard to support.

felixbarny commented 1 year ago

Single target data stream

Multiple target data streams, based on a dynamic field

Eventually we'll need to support both, but which are we starting with?

The reroute processor will support both use cases. I think we'll also want to update some integrations to make use of 2. relatively soon. For example, changing the syslog integration to route by app_name or the k8s integration to route by service.name which is inferred via app.kubernetes.io/name or the container name.

It will also be possible for users to just start ingesting data into a custom dataset that's not known beforehand. I don't think we'd want to require them to manually create an integration before they can start ingesting data.

I suppose we'll need to be able to "lazily" create an integration. For example, at the time the user wants to add a pipeline to their custom dataset.

I'm thinking ahead to this "export as a package" aspect. a specific package needs to own the routing rule, and for use case (1), the destination package/data stream should be the one to own the rule, not the sink.

Ah, I see. That makes sense. It would be similar to the built-in Nginx integration adding a routing rule to the kubernetes.container_logs dataset. The Nginx integration owns that rule but it's added to the k8s routing pipeline.

weltenwort commented 1 year ago

@joshdover happy to see that your results are pretty similar to what I naively had in mind :tada:

Discover will need a Fleet API to call when the user chooses to create this new dataset. At a high level, this API should allow Discover to:

Create a new index template & ingest pipeline for the new dataset (if the target dataset does not exist)

Persist the routing rule in a Kibana Saved Object

Update the source dataset's ingest pipeline with the reroute processor to route logs to the new dataset

I wonder if we ever want these assets to fly around without an owning integration. Would this be the place where a new integration is created that owns them?

kpollich commented 1 year ago

Hi all. I'm starting to work on some technical definition for this work around Fleet and the Package Spec. I'd like to walk through a basic example to make sure I understand what we need to support here.

Let's say we support routing rules at the package spec level, e.g.

# In the nginx integration, which is our "destination", let's say a `routing_rules.yml` file
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_data_stream: logs-kubernetes.router
    destination_data_stream: logs-nginx.access
    if: > 
      ctx?.container?.image?.name == 'nginx'

❓ I'm basing this logs-kubernetes.router data stream off of what I see in Felix's doc regarding designated "sink" data streams for routing purposes. Is this still accurate here?

When this integration is installed, Fleet will parse this routing_rules.yml file and add a corresponding reroute processor to the Kubernetes integration's logs-kubernetes.router-{version} ingest pipeline, e.g.

{  
  "processors":
  [
    {
      "reroute": {
        "tag": "nginx",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "dataset": "nginx.access"
      }
    }
  ]
}
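
The translation from a rule in routing_rules.yml to the processor above could be as small as the sketch below. It assumes the rule shape from this example (which is only a proposal at this point), and `ruleToRerouteProcessor` is a hypothetical helper rather than existing Fleet code.

```typescript
// Rule shape taken from the proposed routing_rules.yml example; not a final spec.
interface PackageRoutingRule {
  description?: string;
  source_data_stream: string;      // e.g. "logs-kubernetes.router"
  destination_data_stream: string; // e.g. "logs-nginx.access"
  if: string;                      // painless condition
}

function ruleToRerouteProcessor(rule: PackageRoutingRule, packageName: string) {
  // "logs-nginx.access" -> dataset "nginx.access" (everything after the type prefix)
  const separator = rule.destination_data_stream.indexOf("-");
  if (separator < 0) {
    throw new Error(`Expected "<type>-<dataset>", got: ${rule.destination_data_stream}`);
  }
  const dataset = rule.destination_data_stream.slice(separator + 1);
  return {
    reroute: {
      tag: packageName,
      // A YAML folded scalar (`if: >`) ends with a newline, so trim it.
      if: rule.if.trim(),
      dataset,
    },
  };
}
```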

❓ If the Kubernetes integration isn't installed, would Fleet need to install it at this time? I understand the inverse case, where we'll need to "preinstall" all related component templates + ingest pipelines for "destination" packages as @joshdover mentioned in https://github.com/elastic/kibana/issues/151898#issuecomment-1451760052.

I understand there are other pieces here like

A user-facing API for creating custom routing rules on the fly
The "pre-install assets" workflow mentioned above, and likely the EPR API to support it

...but I'd just like to make sure I'm on the right path with the above example. Thanks!

kpollich commented 1 year ago

Spent some more time with this. I think I've boiled our needs here down into three main feature sets:

1. Package spec support for routing rules
2. Optimistic installation of assets for "destination" integrations
3. Fleet API support for CRUD around routing rules and dataset customizations

I've got some notes on the first 2 points here, but I'm still thinking through number 3.

Here's a napkin sketch overview of what I've been brainstorming:

Package Spec support for routing rules

Goal: allow integrations to ship with routing rules that will generate corresponding reroute processors on the appropriate ingest pipelines

Packages can provide a routing_rules.yml file that includes a list of routing rules
Each routing rule defines
- A source dataset - Where is data coming from?
- A destination dataset - Where should the data end up after routing finishes?
- A condition - How do we know if a document should be routed?
- Additional metadata like name/description for debugging
  - Any generated routing rule from a package should include a tag property with the package's name and version

Package spec: routing_rules.yml file in "destination integrations" e.g. for nginx:

# In the nginx integration
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_dataset: kubernetes.router
    destination_dataset: nginx.access
    if: > 
      ctx?.container?.image?.name == 'nginx'

Supporting routing rules in the package spec would be a separate chunk of work from Fleet support, which I'll get into next. We'll need to make sure the spec + EPR endpoints all fully support routing rules as a first class resource to support our other features here.

Optimistic installation of destination dataset assets

Goal: When an integration is installed, Fleet must also install index/component templates + ingest pipelines for any datasets to which the integration might potentially route data.

Fleet needs a means of fetching all integrations to which the "data sink" integration currently being installed might route data. EPR should provide a /routing_rules API that returns all routing rules defined across all packages with a denormalized format like

[
  {
    "integration": "nginx",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "nginx.access",
    "if": "ctx?.container?.image?.name == 'nginx'"
  },
  {
    "integration": "apache",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "apache.access",
    "if": "ctx?.container?.image?.name == 'apache'"
  }
]

This would allow Fleet to perform a lookup for any datasets where source_dataset matches the current integration, and create the corresponding templates + pipeline for that dataset, as defined by the destination integration.
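
The lookup described here can be sketched as a simple filter-and-group over the denormalized response. The /routing_rules endpoint and its response shape are the proposal from this comment, not an existing EPR API, and `destinationsFor` is a hypothetical helper.

```typescript
// One entry of the proposed (not yet existing) EPR /routing_rules response.
interface RegistryRoutingRule {
  integration: string;
  source_dataset: string;
  destination_dataset: string;
  if: string;
}

// Group destination datasets by owning integration, for the given source datasets.
function destinationsFor(
  rules: RegistryRoutingRule[],
  sourceDatasets: string[]
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  for (const rule of rules) {
    if (sourceDatasets.indexOf(rule.source_dataset) === -1) continue;
    const datasets = result[rule.integration] || [];
    if (datasets.indexOf(rule.destination_dataset) === -1) {
      datasets.push(rule.destination_dataset);
    }
    result[rule.integration] = datasets;
  }
  return result;
}
```

With the two example rules above and `["kubernetes.router"]` as the source datasets, the result maps `nginx` to `["nginx.access"]` and `apache` to `["apache.access"]`, which is the list of templates and pipelines Fleet would need to create.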

❓ - Is limiting the assets installed for destination datasets necessary, or should we just install everything including Kibana assets? Could we just perform a "standard" installation as part of that process, or is the performance hit of doing that too substantial?

I've got another napkin sketch for this part as well:

I think with these two chunks of work, we'd be able to call package-based routing support done as far as Fleet is concerned. We'd be able to

Define a routing rule in a given package
Install a "source" package that ingests data matching the routing rule above
Resolve the "dependency" between these two packages during installation, and ensure that the reroute processor and destination index template + ingest pipeline exist such that the routed data is mapped and processed correctly

I'll spend some more time with the customization API requirements here, but feel free to chime in with any feedback if you have it before I come back with more.

kpollich commented 1 year ago

Spent some time thinking through the CRUD API needs here, which I'll summarize below.

Fleet will provide an API endpoint for persisting "custom logging integrations" which include:

A dataset
Field mappings ❓
Routing rules
Package metadata like name, title, description, etc

An example API request might look like this:

POST /api/fleet/custom_logging_integrations
{
    "name": "my_custom_integration",
    "title": "My custom integration",
    "description": "Created by {user.name} on {date}",
    "dataset": "my.application",
    "mappings": [], // ❓
    "routing_rules": [
      {
          "description": "Route Kubernetes container logs for our custom app containers to the custom integration dataset",
          "source_dataset": "kubernetes.router",
          "if": "ctx?.container?.image?.name == 'acme.co/my-application'"
      }
    ]
}
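
As a rough illustration of what Fleet might derive from such a request, the sketch below expands it into an index template and one reroute processor per rule. It assumes that each rule's destination is the integration's own `dataset` (the request has no explicit destination field) and that the data stream type is `logs`; `planCustomIntegrationAssets` and the returned shape are hypothetical.

```typescript
// Subset of the proposed request body that matters for asset planning.
interface CustomLoggingIntegration {
  name: string;
  dataset: string;
  routing_rules: Array<{ description?: string; source_dataset: string; if: string }>;
}

function planCustomIntegrationAssets(request: CustomLoggingIntegration) {
  return {
    // Template covering every namespace of the new dataset.
    indexTemplate: {
      name: `logs-${request.dataset}`,
      index_patterns: [`logs-${request.dataset}-*`],
    },
    // One reroute processor per rule, paired with the source dataset whose pipeline must change.
    reroutes: request.routing_rules.map((rule) => ({
      sourceDataset: rule.source_dataset,
      processor: {
        reroute: { tag: request.name, if: rule.if, dataset: request.dataset },
      },
    })),
  };
}
```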

❓ One thing I'm not sure on is the mappings provided. Does the user go through and define fields or select mapping types for the fields detected in their custom logs?

Am I right in thinking these customization APIs are a fairly separate effort from the package-level routing rules work above? I think we could pretty realistically get started on supporting document-based routing rules defined by packages without much more definition than what I've done here, but these customization APIs seem like they're still a bit in flux. Persisting these custom integrations (or "exporting" as it's been referred to a few times above) seems like a follow-up to the package-level support.

weltenwort commented 1 year ago

Maybe I'm missing something but I get stuck on the following statements repeated several times above:

So if a user installs the k8s integration, Fleet would install the ES assets for all (or most popular) integrations that have routing rules specified for k8s container logs.

[...] we'll need to "preinstall" all related component templates + ingest pipelines for "destination" packages [...]

When an integration is installed, Fleet must also install index/component templates + ingest pipelines for any datasets to which the integration might potentially route data.

Why do we need to install the assets of all potential destination packages when installing the source package? Isn't the explicit installation of the destination package what inserts the routing rule? Then why do we need the destination assets earlier already when there's no routing to it in place?

kpollich commented 1 year ago

Isn't the explicit installation of the destination package what inserts the routing rule? Then why do we need the destination assets earlier already when there's no routing to it in place?

I wasn't 100% clear on this. My assumption was that if we install the kubernetes data-sink integration, we'd want all Kubernetes-related routing rules to also come along with it. It sounds like that may be incorrect. I think I was misunderstanding the "auto installation of integrations" called out in the design doc for this feature.

So, it sounds like the "Optimistic installation of destination dataset assets" isn't necessary here. Only when a user explicitly installs a "destination" integration will any "source" integrations' ingest pipelines be updated.

kpollich commented 1 year ago

My main source of confusion, I think, came from this prior comment

Wouldn't it be rather to pre-install the ES assets for integrations that documents are being re-routed to? How else would Kibana know which integration to install?

kpollich commented 1 year ago

(apologies for the spam)

@hop-dev's comment above is the source of the above conversation:

or if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?

I think this point is still valid. Should we expect users to install the Nginx integration manually because it's a routing destination for Kubernetes container logs? How will we surface that knowledge and make it clear to users in an instance like this? It makes sense technically for us to defer the creation of reroute processors until a manual installation for some destination integration, but users won't have any discoverability into those routing rules in our current design, right?

Here's how I understand this would work without the optimistic installation process:

felixbarny commented 1 year ago

❓ I'm basing this logs-kubernetes.router data stream off of what I see in Felix's doc regarding designated "sink" data streams for routing purposes. Is this still accurate here?

Yes, that's still the plan. But we haven't implemented that, yet.

❓ If the Kubernetes integration isn't installed, would Fleet need to install it at this time?

No, I don't think so. But if the k8s integration is installed at a later time, it should include the rule from the Nginx package if it has already been installed.

So, it sounds like the "Optimistic installation of destination dataset assets" isn't necessary here. Only when a user explicitly installs a "destination" integration will any "source" integrations' ingest pipelines be updated.

Yes, I think we should start with that.

if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?

Good question. I think there's no need for the user to install the integration: we can just store the raw logs, and users might be fine with that. Maybe we could suggest that the user install the Nginx integration when we detect that they are looking at Nginx logs.

Another question I have is whether we should install a destination integration when data is sent to a data stream that this integration manages? Also, would there be a way for users to set a label in their containers to give a hint about installing the integration?

Overall, I think the most important routing workflow that we should support for now is not necessarily a destination integration to register a routing rule in a source integration but to do default routing in the source integrations on the service or container name and to enable users to add custom routing rules in source integrations.

ruflin commented 1 year ago

Another question I have is whether we should install a destination integration when data is sent to a data stream that this integration manages. Also, would there be a way for users to set a label on their containers to give a hint about installing the integration?

I would decouple this completely from routing and labels. This is something we should have in general. If data is shipped to a dataset and we have an existing integration for it, we should recommend it to be installed.

weltenwort commented 1 year ago

Should we expect users to install the Nginx integration manually because it's a routing destination for Kubernetes container logs? How will we surface that knowledge and make it clear to users in an instance like this?

That's a good question. While not explicitly mentioned anywhere, I think this could be one of the workflows started from the log explorer. When the user has selected the "unrefined" k8s logs data stream we could offer a starting point to jump into the workflow that installs a specialized integration (e.g. "nginx"). From then on the nginx docs would not show up in the k8s logs anymore but new nginx-related data streams would show up in the data stream selector.

I wonder if we would want to start off the workflow in a generic manner or if there would be an efficient way to query which integrations have routing rules for k8s. Then we could directly offer a selection of routing-aware integrations at the start of the workflow.

ruflin commented 1 year ago

we could offer a starting point to jump into the workflow that installs a specialized integration (e.g. "nginx")

I really like this idea. On a broader scope, this recommendation could show up in Discover for the log lines, for example as an action.

kpollich commented 1 year ago

I want to recenter the discussion in this issue based on the dependencies called out in https://github.com/elastic/ingest-dev/issues/1670, which we're using to track all the work that falls on ingest for Logs+.

This issue touches on two pieces of the "Phase 1" dependencies called out in the linked issue:

Document-based routing, and Fleet support for reroute processors in ingest pipelines
Fleet-provided CRUD APIs for dataset customizations, including index/component template creation w/ custom mappings and user-defined routing rules

We're also touching on a "Later phases" piece here, which is the on-the-fly creation of "custom integrations" that will wrap these user customizations, assets, etc. into an integration that's persisted as an epm-package saved object. This is, however, out of scope for the actual implementation plan in this issue. I'd suggest we move further discussions around that feature over to https://github.com/elastic/ingest-dev/issues/1670 or https://github.com/elastic/logs-dev/issues/61.

I'll be updating the description here with some links to child issues in appropriate repos for each piece of work.

joshdover commented 1 year ago

Catching up here after being out for a couple weeks. This is looking great, thank you @kpollich for the diagrams.

I definitely agree that we don't need to do any implicit package installation and should only install the reroute processors if both the source and destination packages are installed. We can later enhance the UX to suggest to users to install packages as needed/desired as others suggested above.
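
A minimal sketch of the "no implicit installation" rule, assuming a hypothetical `RoutingRule` shape (not the package-spec schema) and made-up package names for the second rule. It only keeps rules whose source and destination packages are both installed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    """Hypothetical shape of a package-defined rule; not the package-spec schema."""
    source_package: str       # package that owns the pipeline being modified
    destination_package: str  # package that declared the rule
    destination_dataset: str
    condition: str


def rules_to_install(rules, installed_packages):
    """Keep only rules whose source and destination packages are both installed."""
    installed = set(installed_packages)
    return [
        rule
        for rule in rules
        if rule.source_package in installed and rule.destination_package in installed
    ]


rules = [
    RoutingRule("kubernetes", "nginx", "nginx.access",
                "ctx?.container?.image?.name == 'nginx'"),
    # Illustrative second rule; the apache values are assumptions.
    RoutingRule("kubernetes", "apache", "apache.access",
                "ctx?.container?.image?.name == 'httpd'"),
]

# Only kubernetes and nginx are installed, so the apache rule is skipped
# and nothing gets installed implicitly.
active = rules_to_install(rules, ["kubernetes", "nginx"])
print([rule.destination_dataset for rule in active])
```

With this shape, installing a package later is just a re-evaluation of the same filter against the new set of installed packages.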

I think it would be helpful to classify these different sources of routing rules into 3 categories, rather than 2. They are sorted by what I believe to be the priority, highest to lowest:

1. Package-defined rules
2. User-defined rules for custom packages
3. User-defined custom rules for existing packages

(1) and (2) can likely share the same data model (the yaml file in the package) and should be put into the ingest pipeline in the same (managed) location. (3) can probably be represented simply as reroute processors in the @custom pipeline we already support to avoid multiple sources of truth.

One more aspect of this that we need to define before we begin implementation of https://github.com/elastic/kibana/issues/155910 is the ordering of the ingest pipeline. We need to know what processing should be done in which order between package-defined data processing, package-defined routing, user-defined routing, and user-defined processing. This is something that I hope @felixbarny and @ruflin can provide input on based on any experimentation being done with the pieces we landed in 8.8.

kpollich commented 1 year ago

Thanks @joshdover - I think distinguishing routing rules for custom packages vs existing packages is helpful. I'll try to make this clear in the dataset customization tasks here.

@juliaElastic - This is another issue it'd be good to catch up on in preparation for your time as interim tech lead in the coming months. Let's chat about this at some point, but please catch up here when you have some spare time 🙏

hop-dev commented 1 year ago

@yaauie , does the logstash elasticsearch pipeline runner support the reroute processor (or plan to)? I could see this affecting the work I saw in your presentation at the all hands.

ruflin commented 1 year ago

We keep talking about package level routing rules. My assumption is, all routing rules are on the dataset level. Either the dataset is the source or the target. All the configs would happen in the dataset manifest?
Setting up the target datasets is out of scope for routing but a general problem we need to address -> create integration. Maybe even Elasticsearch could take this over @felixbarny
3: "User-defined custom rules for existing packages": I expect this is the case of input packages in most scenarios

kpollich commented 1 year ago

We keep talking about package level routing rules. My assumption is, all routing rules are on the dataset level. Either the dataset is the source or the target. All the configs would happen in the dataset manifest?

I think we're just using "package level" to mean "routing rules defined somewhere in a package manifest" - not necessarily that they operate globally on a package.

We can certainly include routing rules as part of the data_stream/foo/manifest.yml file. My initial proposal was a top-level routing_rules.yml file for each package, but I think putting this at the dataset level would also make sense. Something like this for example:

# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
streams:
  - input: logfile
    vars:
      - ...
    title: Nginx access logs
    description: Collect Nginx access logs
    routing_rules:
      - source_dataset: kubernetes.router
        if: >
          ctx?.container?.image?.name == 'nginx'
  - input: httpjson
    vars:
      - ...

ruflin commented 1 year ago

The reason I propose setting this at the dataset level is that there is always a source and destination dataset involved; it doesn't really happen on the package level. And if no destination dataset can be found, the destination is the source dataset itself.

graph LR;
    A[Source Dataset];
    Doc((Doc));
    B[Destination Dataset];
    ES[(Elasticsearch)]

    Doc--> A;
    A-->B;
    A-->A;
    B-->ES;

In the proposal you have above, you nest the routing rules under the streams / input configs. I don't think the routing rules are related to the inputs / streams in any way. A package can define routing rules without ever having specified an input. Somewhat related: https://github.com/elastic/kibana/issues/155999

Use cases

Let's go through some example use cases with the configs. I would split up the use cases slightly differently from Josh's proposal:

Dynamic package defined routing rules, likely part of an input package
Static package defined routing rules

Dynamic package defined routing rules

This can be done today, as a package developer can extend their ingest pipeline. But it would be nice to have a separate definition for it so Fleet is better aware of the routing rules. As a use case, let's take k8s routing logs based on the container name. The pipeline definition could look as follows:

reroute:
  tag: logs-k8s.router
  dataset: "{{container.image.name}}"
  namespace:
    - "{{labels.data_stream.namespace}}"
    - default

This pipeline would have to be installed for logs-k8s.router-*. The destination dataset is dynamic, we can't specify it.

These routing rules can be extended in 2 ways

A package like nginx installs a special rule
A user manually adds a rule for their destination dataset

In both use cases, it can be assumed the target dataset exists as part of an integration. An exception to this is if a user adds a pattern with a variable as the destination.

Static package defined routing rules

The use case here is that a package like nginx wants to accept data shipped to logs-nginx-* and route it, based on file path or stdout/stderr, to the correct dataset. In this scenario, all datasets are internal to the package. It simplifies data shipping, makes sure new nginx data is also picked up, and lets nginx be used as a fallback.

This scenario is also possible today by just using the ingest pipeline definition, but it would be nice for package devs to have a simpler yaml definition for it. Users are not expected to add their routing rules to it.

Conclusion

Taking the above use cases, there are 2 different cases to cover:

The manifest.yml of the dataset the dev is working on is the source
The manifest.yml of the dataset the dev is working on is the destination

I suggest taking source as the default, as I think this is the more common and simpler scenario. This could lead to a config like the following:

# k8s/data_stream/router/manifest.yml
title: K8s router for logs
type: logs
routing_rules:
  - dataset: "{{container.image.name}}"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-k8s.router-?}}

The "source" dataset is automatically the k8s.router dataset. For the nginx use case where it installs it in a different dataset, we can reuse what Kyle proposed above:

# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-k8s.router-?}}

    # This value will be automatically filled in by Fleet
    #dataset: {{dataset.name}}

If source_dataset is used, dataset cannot be configured differently than the dataset name. If dataset is configured, source_dataset cannot be used. It is possible that both exist in the routing rules array. Here is an example:

# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
routing_rules:

  # Routing rule for k8s
  - source_dataset: "k8s.router"
    if: "ctx?.container?.image?.name == 'nginx'"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}

    # This value will be automatically filled in by Fleet
    #dataset: {{dataset.name}}

  # Routing nginx error logs
  - if: "ctx?.file?.path?.contains('/var/log/nginx/error')"

    # This value will be automatically filled in by Fleet
    dataset: nginx.error

    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}

  # Routing nginx access logs
  - if: "ctx?.file?.path?.contains('/var/log/nginx/access')"

    # This value will be automatically filled in by Fleet
    dataset: nginx.access

    # I expect the tag to be generated dynamically by Fleet in a smart way
    #tag: {{logs-nginx-?}}
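
The exclusivity constraint stated before this example could be checked at package validation time. A minimal sketch, assuming rules are already parsed into plain dicts; `validate_rule` is a hypothetical helper, and it reads the constraint as "a `dataset` that differs from the current dataset cannot be combined with `source_dataset`":

```python
def validate_rule(rule, current_dataset):
    """Return a list of problems for one routing rule (empty list means valid)."""
    problems = []
    has_source = "source_dataset" in rule
    dataset = rule.get("dataset")
    # With source_dataset set, dataset may be omitted (filled in later)
    # or must equal the dataset whose manifest declares the rule.
    if has_source and dataset is not None and dataset != current_dataset:
        problems.append(
            f"dataset must be {current_dataset!r} when source_dataset is set, "
            f"got {dataset!r}"
        )
    return problems


rules = [
    {"source_dataset": "k8s.router", "if": "ctx?.container?.image?.name == 'nginx'"},
    {"dataset": "nginx.error", "if": "ctx?.file?.path?.contains('/var/log/nginx/error')"},
    {"source_dataset": "k8s.router", "dataset": "nginx.error"},  # invalid mix
]

for rule in rules:
    print(validate_rule(rule, current_dataset="nginx"))
```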

We need to figure out how the tags are generated. I was also tempted to potentially split up the ones with source and destinations into 2 separate arrays. Something like:

routing_rules:
  # Routing rules for the current dataset. I don't like that it requires a name.
  nginx:
    - 
  # Routing rules for other datasets
  k8s.router:
    -

kpollich commented 1 year ago

Thanks, @ruflin . I think I understand everything we're working through here, but I'm going to try and summarize the comment above with my own questions/comments appended as we go along. Forgive my long-windedness 😅

We have two categories of package-defined routing rules

Dynamic routing rules
Static routing rules

Dynamic routing rules

Packages can define routing rules for dynamic datasets + namespaces, which are then installed under an index template that matches on the index pattern logs-{{integration}}.{{dataset}}-*. See example below.

# kubernetes/data_stream/router/manifest.yml
title: K8s router for logs
type: logs
routing_rules:
    - dataset: "{{container.image.name}}" # The `reroute` processor supports dynamic values from document properties
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default # This is a fallback if the document doesn't include the above property

This package-defined routing rule should result in the following ingest pipeline being created by Fleet:

// Ingest pipeline - logs-kubernetes.router-1.2.3
{
    "processors": [
        {
            "reroute": {
                "tag": "logs-k8s.router",
                "dataset": "{{container.image.name}}",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ]
            }
        }
    ]
}
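
For illustration, the manifest-to-pipeline mapping above can be sketched as a small transformation. This assumes the rules have already been parsed from YAML into dicts; `build_pipeline` and the way the tag is passed in are hypothetical, since tag generation is still an open question in this thread:

```python
import json


def build_pipeline(rules, tag):
    """Turn manifest-style routing rules (plain dicts) into reroute processors."""
    processors = []
    for rule in rules:
        reroute = {
            "tag": tag,
            "dataset": rule["dataset"],
            "namespace": rule["namespace"],
        }
        if "if" in rule:  # conditions are optional on source-defined rules
            reroute["if"] = rule["if"]
        processors.append({"reroute": reroute})
    return {"processors": processors}


manifest_rules = [
    {
        "dataset": "{{container.image.name}}",
        "namespace": ["{{labels.data_stream.namespace}}", "default"],
    }
]

pipeline = build_pipeline(manifest_rules, tag="logs-k8s.router")
print(json.dumps(pipeline, indent=4))
```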

Package installation will also generate an Index Template with a pattern of logs-kubernetes.router-* which - if I understand correctly - is how Fleet currently works. e.g.

// Index template - logs-kubernetes.router-* (truncated for brevity)
{
  "template": {
    "settings": {
      "final_pipeline": ".fleet-final-pipeline-1",
      "default_pipeline": "logs-kubernetes.router-1.2.3"
    }
  }
}

For this case, I have a few questions:

Dynamic routing rules on "source" dataset like kubernetes.router won't have any if conditions, correct? They work by variable substitution based on document fields - so this pipeline will always fire on every document and attempt to route it.
What happens when a document doesn't contain {{container.image.name}} in the above example? Does the processor fail or does it just not fire, resulting in unrouted documents in the kubernetes.router dataset?
In Ruflin's example above, we have tag: {{logs-k8s.router-?}} as a "proposed" value that Fleet would intelligently fill in. What would be the expectation for that value? As far as I understand, the tag needs to be written at ingest pipeline creation time, so we can't dynamically add a more specific tag than the datastream type + dataset values. Might be missing something there, as we're discussing tag generation as an unknown in general.

The "dynamic routing rules" use case also includes routing rules defined on "destination" datasets, like Nginx, e.g.

# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
    - source_dataset: "k8s.router"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default

The above example would result in a pipeline as follows:

// Ingest pipeline - logs-kubernetes.router. "Destination" routing rules alter the "source" pipeline
{
    "processors": [
        {
            "reroute": {
                "tag": "logs-k8s.router-nginx", // In this case, we could actually generate a dynamic tag that links back to the destination package
                "dataset": "nginx.access",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ],
                "if": "ctx?.container?.image?.name == 'nginx'"
            }
        }
    ]
}

I think the "destination-defined" case here is a little more straightforward than the "source-defined" case which utilizes more variables.
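
A rough sketch of how destination-defined rules could be collected and attached to the pipeline of their source dataset. The input shape (destination dataset name mapped to its parsed rules), the `group_injected_rules` helper, and the tag scheme are all assumptions for illustration:

```python
from collections import defaultdict


def group_injected_rules(manifests):
    """Map each source dataset to the reroute processors other datasets inject into it.

    `manifests` maps a destination dataset name to its list of routing rules.
    """
    by_source = defaultdict(list)
    for destination, rules in manifests.items():
        for rule in rules:
            source = rule.get("source_dataset")
            if source is None:
                continue  # rule belongs to the dataset's own pipeline
            by_source[source].append({
                "reroute": {
                    "tag": f"logs-{source}-{destination}",  # hypothetical tag scheme
                    "dataset": destination,
                    "namespace": rule["namespace"],
                    "if": rule["if"],
                }
            })
    return dict(by_source)


manifests = {
    "nginx.access": [
        {
            "source_dataset": "k8s.router",
            "if": "ctx?.container?.image?.name == 'nginx'",
            "namespace": ["{{labels.data_stream.namespace}}", "default"],
        }
    ]
}

injected = group_injected_rules(manifests)
print(injected["k8s.router"][0]["reroute"]["dataset"])
```

The destination dataset is filled in from the manifest that declares the rule, which matches the idea that `dataset` does not need to be written by hand when `source_dataset` is used.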

Static routing rules

These would be routing rules that are "local" to a given integration. The example of routing arbitrary documents from logs-nginx-* to a more specific data stream like logs-nginx.access based on some condition is a good one.

So, if we defined a routing rule such as

# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs

# I assume a "routing" datastream would utilize these features - https://github.com/elastic/kibana/pull/154732
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true

routing_rules:
    - if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
      dataset: nginx.error
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
    - if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
      dataset: nginx.access
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default

It'd result in an ingest pipeline, e.g.

// Ingest pipeline - logs-nginx.nginx
{
    "processors": [
        {
            "reroute": {
                "tag": "logs-nginx.nginx",
                "dataset": "nginx.error",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ],
                "if": "ctx?.file?.path?.contains('/var/log/nginx/error')"
            }
        },
        {
            "reroute": {
                "tag": "logs-nginx.nginx",
                "dataset": "nginx.access",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ],
                "if": "ctx?.file?.path?.contains('/var/log/nginx/access')"
            }
        }
    ]
}

My questions/comments here are as follows:

Is the logs-nginx.nginx type/dataset expected here? If we're working with a "router" data stream for this integration and putting it in data_stream/nginx/manifest.yml then I believe this is going to be the resolved dataset. The Kubernetes example uses kubernetes.router for the dataset. Maybe that's a pattern we can adopt for all data stream manifests that define routing rules?
I'd propose we use destination_dataset to be explicit in routing rule definition rather than dataset which aligns with what eventually lands in the generated reroute processor(s). Since source_dataset and dataset can appear in the same array of routing rules, I think erring on the side of explicitness would be helpful in maintainers developing an accurate mental model.

For completeness, I'll also reiterate Ruflin's final examples above and map them to the generated ingest pipelines. I've removed any proposals/speculations and stuck with concrete values based on the above context for clarity here.

# kubernetes/data_stream/router/manifest.yml
title: K8s router for logs
type: logs

elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true

routing_rules:
    - dataset: "{{container.image.name}}"
      namespace:
          - "{{labels.data_stream.namespace}}"
          - default

# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs

elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true

routing_rules:
    # Route K8s container logs to the Nginx catch-all dataset
    - source_dataset: "k8s.router"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
          - "{{labels.data_stream.namespace}}"
          - default

    # Route error logs to the nginx.error dataset
    - destination_dataset: nginx.error
      if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
      namespace: 
          - "{{labels.data_stream.namespace}}"
          - default

    # Route access logs to the nginx.access dataset  
    - destination_dataset: nginx.access
      if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
      namespace: 
          - "{{labels.data_stream.namespace}}"
          - default

# nginx/data_stream/access/manifest.yml
title: Nginx access logs
type: logs
routing_rules:
    # Route K8s container logs to the Nginx access logs data stream
    - source_dataset: "k8s.router"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
          - "{{labels.data_stream.namespace}}"
          - default

// Ingest pipeline - logs-kubernetes.router-1.2.3
{
    "processors": [
        {
            "reroute": {
                "tag": "logs-k8s.router",
                "dataset": "{{container.image.name}}",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ]
            }
        },
        {
            "reroute": {
                "tag": "logs-nginx.access",
                "dataset": "nginx.access",
                "if": "ctx?.container?.image?.name == 'nginx'",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ]
            }
        }
    ]
}

// Ingest pipeline - logs-nginx.nginx
{
    "processors": [
        {
            "reroute": {
                "tag": "logs-nginx.nginx",
                "dataset": "nginx.error",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ],
                "if": "ctx?.file?.path?.contains('/var/log/nginx/error')"
            }
        },
        {
            "reroute": {
                "tag": "logs-nginx.nginx",
                "dataset": "nginx.access",
                "namespace": [
                    "{{labels.data_stream.namespace}}",
                    "default"
                ],
                "if": "ctx?.file?.path?.contains('/var/log/nginx/access')"
            }
        }
    ]
}

Does that seem accurate based on everything we've discussed above?

One thing I'm wondering is if the specific Kubernetes container logs routing rule that routes logs to nginx.access is even necessary in this case. It seems like relying on labels/document fields in a "source" data stream will be a more common use case than defining rules in a "destination" data stream as mentioned above.

kpollich commented 1 year ago

Another thing that's come to mind as I'm working on other tasks: we're going to need to be careful with {{}} syntax in pipeline definitions. We're looking to add handlebars support to ingest pipelines in https://github.com/elastic/package-spec/issues/517 (issue still WIP) as part of the data fidelity project. Fleet will need to tolerate template variables that don't resolve in pipeline definitions for cases where data fidelity functionality is present alongside the dynamic syntax in reroute processors.

I think since reroute processors will be generated on-the-fly by Fleet, this should be okay, but it's still worth pointing out that we'll have a syntax collision here as proposed.

felixbarny commented 1 year ago

Dynamic routing rules on "source" dataset like kubernetes.router won't have any if conditions, correct? They work by variable substitution based on document fields - so this pipeline will always fire on every document and attempt to route it.

I think that will mostly be the case but there may also be situations where you have both an if condition and a field reference in the target dataset or namespace. For example "route to {{data_stream.dataset}}, unless it has the same value as the current dataset".

What happens when a document doesn't contain {{container.image.name}} in the above example? Does the processor fail or does it just not fire, resulting in unrouted documents in the kubernetes.router dataset?

It would use the current dataset as a default (kubernetes.router) See also https://www.elastic.co/guide/en/elasticsearch/reference/master/reroute-processor.html
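
The fallback behaviour described here can be sketched roughly as follows. This is a simplified model, not the Elasticsearch implementation: `resolve` and `get_path` are hypothetical helpers, and details such as character sanitization are left out:

```python
import re

# Matches a value that consists of exactly one field reference, e.g. "{{a.b}}".
FIELD_REFERENCE = re.compile(r"^\{\{\s*(.+?)\s*\}\}$")


def get_path(document, dotted_path):
    """Look up a dotted path such as 'container.image.name' in nested dicts."""
    value = document
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def resolve(candidates, document, current_value):
    """Return the first candidate that resolves, else the current value."""
    if isinstance(candidates, str):
        candidates = [candidates]
    for candidate in candidates:
        match = FIELD_REFERENCE.match(candidate)
        if match is None:
            return candidate  # a static value always resolves
        value = get_path(document, match.group(1))
        if value is not None:
            return str(value)
    return current_value  # nothing resolved: stay in the current dataset


doc_with_image = {"container": {"image": {"name": "nginx"}}}
doc_without_image = {"message": "hello"}

print(resolve("{{container.image.name}}", doc_with_image, "kubernetes.router"))
print(resolve("{{container.image.name}}", doc_without_image, "kubernetes.router"))
```

In this model a document without `container.image.name` is neither rejected nor dropped; it simply keeps the `kubernetes.router` dataset.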

In Ruflin's example above, we have tag: {{logs-k8s.router-?}} as a "proposed" value that Fleet would intelligently fill in. What would be the expectation for that value? As far as I understand, the tag needs to be written at ingest pipeline creation time, so we can't dynamically add a more specific tag than the datastream type + dataset values. Might be missing something there, as we're discussing tag generation as an unknown in general.

Good question. I'm not sure if we can auto-assign a tag. I was thinking that we might have an id field that the package dev needs to set that's required to be unique. We could use that as the value for tag.

// Ingest pipeline - logs-kubernetes.router-1.2.3

The ordering of the processors is not right here. The first reroute processor doesn't have a condition - therefore it's always going to be executed and will short-circuit the rest of the pipeline.
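
To make the short-circuit concrete, here is a toy model of the pipeline, assuming the first reroute processor whose condition matches ends processing. Painless conditions are replaced by plain Python callables, and `first_matching_tag` is a hypothetical helper:

```python
def first_matching_tag(processors, document):
    """Return the tag of the first reroute processor whose condition matches."""
    for processor in processors:
        condition = processor.get("if")
        if condition is None or condition(document):
            return processor["tag"]  # reroute skips the rest of the pipeline
    return None


def is_nginx(doc):
    return doc.get("container", {}).get("image", {}).get("name") == "nginx"


catch_all = {"tag": "logs-k8s.router"}                    # no condition
nginx_rule = {"tag": "logs-nginx.access", "if": is_nginx}
doc = {"container": {"image": {"name": "nginx"}}}

# Order from the example above: the catch-all always wins.
print(first_matching_tag([catch_all, nginx_rule], doc))
# Conditional rule first: the nginx rule gets a chance to match.
print(first_matching_tag([nginx_rule, catch_all], doc))
```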

ruflin commented 1 year ago

I keep stumbling over the source/destination definition, especially that the syntax is not exactly the same as in the pipeline which I think will cause problems down the line. Here is an alternative idea:

# Routing rules defined for THIS dataset
routing_rules:
    # Route error logs to the nginx.error dataset
    - dataset: nginx.error
      if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
      namespace: 
          - "{{labels.data_stream.namespace}}"
          - default

    # Route access logs to the nginx.access dataset  
    - dataset: nginx.access
      if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
      namespace: 
          - "{{labels.data_stream.namespace}}"
          - default

# Routing rules defined for a different source dataset         
source_routing_rules:
  # Routing rules for k8s
  k8s.router:
    # Route K8s container logs to the Nginx catch-all dataset
    - # Dataset must be validated to be the same as the current dataset
      dataset: "nginx"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
          - "{{labels.data_stream.namespace}}"
          - default

  # Made-up example: ignore that it looks like k8s, it is only here to show
  # that multiple datasets can be specified
  syslog:
    - # Dataset must be validated to be the same as the current dataset
      dataset: "nginx"
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
          - "{{labels.data_stream.namespace}}"
          - default

I consider the source use case the more complex one and the one that is less often used.

felixbarny commented 1 year ago

Packages can define routing rules for dynamic datasets + namespaces, which are then installed under an index template that matches on the index pattern logs-{{integration}}.{{dataset}}-*.

Integrations should also be able to route to logs-{{dataset}}-* and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-* index pattern. See also https://github.com/elastic/elasticsearch/issues/95538

But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*. That's because the reroute processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}", for example. Do you think that would be required?
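
The restriction described above (a static value or a single whole-value field reference, but no mixing) could be checked with something like the sketch below. `is_supported_value` is a hypothetical helper that mirrors only this restriction, not the processor's full parsing rules:

```python
import re

# Exactly one field reference spanning the whole value, e.g. "{{labels.dataset}}".
SINGLE_REFERENCE = re.compile(r"^\{\{[^{}]+\}\}$")


def is_supported_value(value):
    """True for a static string or exactly one whole-value field reference."""
    if "{{" not in value and "}}" not in value:
        return True  # static value
    return SINGLE_REFERENCE.match(value) is not None


for candidate in ["nginx.access", "{{container.image.name}}", "k8s.{{labels.dataset}}"]:
    print(candidate, is_supported_value(candidate))
```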

Is the logs-nginx.nginx type/dataset expected here? If we're working with a "router" data stream for this integration and putting it in data_stream/nginx/manifest.yml then I believe this is going to be the resolved dataset.

The nginx.nginx dataset looks a little strange tbh. I'd expect either just nginx or nginx.router.

The Kubernetes example uses kubernetes.router for the dataset. Maybe that's a pattern we can adopt for all data stream manifests that define routing rules?

I'm going back-and-forth on that. I think it makes sense as a convention for datasets that aren't expected to contain any data and just do routing. However, some datasets, such as syslog may not contain any default routing rules but users may choose to add some.

I'd propose we use destination_dataset to be explicit in routing rule definition rather than dataset which aligns with what eventually lands in the generated reroute processor(s). Since source_dataset and dataset can appear in the same array of routing rules, I think erring on the side of explicitness would be helpful in maintainers developing an accurate mental model.

I think we should not have them both appear in the same array. I'd even split these to different files. One file that has all the routing rules that go to the routing pipeline of the current dataset and another file that lets you add routing rules to other datasets. To me, that's the main distinction between the two different use cases rather than dynamic vs static rules: Routing rules for the current dataset and rules that are injected into other datasets.

For rules that are injected into other datasets, we'll need to add a priority concept so that they're sorted accordingly. The ordering also necessitates having an identity for routing rules. The injected rules should also always go before any routing rules that the source dataset has defined itself. We probably also want to have a dedicated pipeline for the injected routing rules.
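
A sketch of that assembly order, assuming each injected rule carries hypothetical `priority` and `id` fields and that a lower priority number means "runs earlier" (both are assumptions, nothing here is decided yet):

```python
def assemble_processors(own_rules, injected_rules):
    """Injected rules first (by priority, then id), then the dataset's own rules."""
    ordered_injected = sorted(
        injected_rules, key=lambda rule: (rule["priority"], rule["id"])
    )
    return ordered_injected + list(own_rules)


own = [{"id": "k8s-default", "dataset": "{{container.image.name}}"}]
injected = [
    {"id": "nginx", "priority": 200, "dataset": "nginx.access"},
    {"id": "apache", "priority": 100, "dataset": "apache.access"},
]

print([rule["id"] for rule in assemble_processors(own, injected)])
```

Using the id as a tie-breaker keeps the generated pipeline deterministic when two packages pick the same priority, which is one reason the rules need an identity.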

To keep things simple for now, I think we should focus on the routing rules that are just added to the same dataset and not spend too much time on implementing the rule injection.

kpollich commented 1 year ago

I think that will mostly be the case but there may also be situations where you have both an if condition and a field reference in the target dataset or namespace. For example "route to {{data_stream.dataset}}, unless it has the same value as the current dataset".

👍

It would use the current dataset as a default (kubernetes.router) See also https://www.elastic.co/guide/en/elasticsearch/reference/master/reroute-processor.html

Perfect. Users will just need to be aware that documents that fail to route will remain in the "sink" data set. No real concerns here on my end.

Good question. I'm not sure if we can auto-assign a tag. I was thinking that we might have an id field that the package dev needs to set that's required to be unique. We could use that as the value for tag.

The dataset value should be guaranteed unique by package validation, e.g. nginx.access can't appear in multiple packages. If we append a unique routing rule name/ID to that I think it'd be the most valuable option. It's something that might be annoying to integration maintainers though - coming up with a unique name for all their rules is a bit of a burden.
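
One way this could look, as a sketch: combine the data stream type, the dataset, and a maintainer-provided rule id into the tag, and reject duplicate ids within a dataset. The `make_tag` format and both helper names are hypothetical:

```python
def make_tag(stream_type, dataset, rule_id):
    """Build a tag from values that are unique together (hypothetical format)."""
    return f"{stream_type}-{dataset}-{rule_id}"


def assign_tags(stream_type, dataset, rule_ids):
    """Return one tag per rule id; raise ValueError on duplicate ids."""
    seen = set()
    tags = []
    for rule_id in rule_ids:
        if rule_id in seen:
            raise ValueError(f"duplicate routing rule id {rule_id!r} in {dataset}")
        seen.add(rule_id)
        tags.append(make_tag(stream_type, dataset, rule_id))
    return tags


print(assign_tags("logs", "nginx.access", ["from-k8s", "from-syslog"]))
```

Since the dataset is already unique across packages, the rule id only has to be unique within one manifest, which keeps the naming burden on maintainers small.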

The ordering of the processors is not right here. The first reroute processor doesn't have a condition - therefore it's always going to be executed and will short-circuit the rest of the pipeline.

Good catch. I wasn't sure on how we'd want to guarantee order here. Should the order be based on the order of processors as they appear in the YAML, with conditionless processors pushed to the end of the list? Part of me just wants to honor the order as they appear in the integration, but again it's more burden on the maintainers to understand the implementation details of reroute processors.
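
The "manifest order, with conditionless processors pushed to the end" option is cheap to implement with a stable sort. A minimal sketch, where `order_rules` is a hypothetical helper and the apache rule is made up for illustration:

```python
def order_rules(rules):
    """Keep manifest order, but move rules without an `if` to the end.

    sorted() is stable, so rules that share the same key keep their
    relative order; False (has a condition) sorts before True.
    """
    return sorted(rules, key=lambda rule: "if" not in rule)


rules = [
    {"dataset": "{{container.image.name}}"},  # conditionless catch-all
    {"dataset": "nginx.access", "if": "ctx?.container?.image?.name == 'nginx'"},
    {"dataset": "apache.access", "if": "ctx?.container?.image?.name == 'httpd'"},
]

print([rule["dataset"] for rule in order_rules(rules)])
```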

I keep stumbling over the source/destination definition, especially that the syntax is not exactly the same as in the pipeline which I think will cause problems down the line. Here is an alternative idea:

I'm not 100% sure about aligning package spec fields exactly with Elasticsearch APIs, fields, etc. It's not something we've been consistent about, but maybe that should change here.

I do like the example of splitting these rules into different arrays rather than trying to reason about a mixture of use cases in a single list. Then, like you mentioned, we don't have to introduce new names for the existing concept of dataset on routing rules - it always means the same thing it does in the reroute processor docs.

If dataset values under source_routing_rules are always guaranteed to be the current dataset via validation, does it make sense to even include that field, or can this be something we document in the spec and prevent user input entirely for that field?

Integrations should also be able to route to logs-{{dataset}}-* and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-* index pattern. See also https://github.com/elastic/elasticsearch/issues/95538

Integration assets generated by EPM are prefixed in most cases with the integration name. Would this mean Fleet needs to create an index template with a different pattern for some cases like this?

But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*. That's because the reroute processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}", for example. Do you think that would be required?

I don't think I completely follow this. Could you provide an example of what this routing rule setup would look like or a use case?

The nginx.nginx dataset looks a little strange tbh. I'd expect either just nginx or nginx.router.

The plain nginx value would require a special case implemented in the package spec. nginx.router will work as expected with no additional implementation, so I'm in favor of that.

I'm going back-and-forth on that. I think it makes sense as a convention for datasets that aren't expected to contain any data and just do routing. However, some datasets, such as syslog may not contain any default routing rules but users may choose to add some.

Hmm it actually might make sense that we need to support a dataset that's only the integration name if we have routing rules with dataset: {{container.image.name}} now that I think about it. We'll have to have a dataset of just nginx or apache in order to route logs based on that value.

I think we should not have them both appear in the same array. I'd even split these to different files. One file that has all the routing rules that go to the routing pipeline of the current dataset and another file that lets you add routing rules to other datasets. To me, that's the main distinction between the two different use cases rather than dynamic vs static rules: Routing rules for the current dataset and rules that are injected into other datasets.

Yeah I'm +1 on splitting these rules into two distinct lists.

For rules that are injected into other datasets, we'll need to add a priority concept so that they're sorted accordingly. The ordering also necessitates having an identity for routing rules. The injected rules should also always go before any routing rules that the source dataset has defined itself. We probably also want to have a dedicated pipeline for the injected routing rules.

To keep things simple for now, I think we should focus on the routing rules that are just added to the same dataset and not spend too much time on implementing the rule injection.

Fair enough. Using Ruflin's example above we'd focus first on supporting routing_rules and then move on to supporting source_routing_rules as a second pass. I'm fine with that approach and it helps us narrow the scope for the initial implementation here.

ruflin commented 1 year ago

The plain nginx value would require a special case implemented in the package spec. nginx.router will work as expected with no additional implementation, so I'm in favor of that.

This is possible today: you have to set dataset: nginx in the manifest.yml.

kpollich commented 1 year ago

I took a pass at updating https://github.com/elastic/package-spec/issues/514 based on the conversation above and a quick offline chat I had with @ruflin. I think the key part of this is the example manifest.yml file, which I'll copy/paste here for reference:

# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs

# This is a catch-all "sink" data stream that routes documents to 
# other datasets based on conditions or variables
dataset: nginx

# Ensures agents have permissions to write data to `logs-nginx.*-*`
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true

routing_rules:
  # Route error logs to `nginx.error` when they're sourced from an error logfile
  - dataset: nginx.error
    if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

  # Route access logs to `nginx.access` when they're sourced from an access logfile
  - dataset: nginx.access
    if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default

injected_routing_rules:
  # Route K8s container logs to this catch-all dataset for further routing
  k8s.router: 
    - dataset: nginx # Note: this _always_ has to be the current dataset - maybe we can infer this?
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default

  # Route syslog entries tagged with nginx to this catch-all dataset
  syslog:
    - dataset: nginx
      if: "ctx?.tags?.contains('nginx')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
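
To make the manifest above more concrete, here is a rough sketch of how its rules could be turned into reroute processor definitions. It assumes the rule fields map one-to-one onto the processor's `dataset`, `namespace`, and `if` options; the interfaces and function names are hypothetical, and how Fleet actually generates pipelines is not settled in this thread.

```typescript
// Hypothetical shapes mirroring the example manifest above.
interface ManifestRoutingRule {
  dataset: string;
  if?: string;
  namespace?: string[];
}

interface RerouteProcessor {
  reroute: { dataset: string; if?: string; namespace?: string[] };
}

// Copy a manifest rule into a reroute processor definition, leaving out
// options the rule did not set.
function toRerouteProcessor(rule: ManifestRoutingRule): RerouteProcessor {
  const reroute: RerouteProcessor["reroute"] = { dataset: rule.dataset };
  if (rule.if !== undefined) {
    reroute.if = rule.if;
  }
  if (rule.namespace !== undefined) {
    reroute.namespace = rule.namespace.slice();
  }
  return { reroute };
}

// Convert `injected_routing_rules`, keyed by the dataset whose pipeline
// the rules should be added to (e.g. "k8s.router" or "syslog").
function toInjectedProcessors(
  injected: { [targetDataset: string]: ManifestRoutingRule[] }
): { [targetDataset: string]: RerouteProcessor[] } {
  const result: { [targetDataset: string]: RerouteProcessor[] } = {};
  for (const target of Object.keys(injected)) {
    result[target] = (injected[target] || []).map(toRerouteProcessor);
  }
  return result;
}
```

Under these assumptions, `routing_rules` would end up in the pipeline of the `nginx` dataset itself, while each key of `injected_routing_rules` names another dataset's pipeline that receives the generated processors.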

felixbarny commented 1 year ago

Integrations should also be able to route to logs-{{dataset}}-* and just rely on the default logs index template instead of setting one up on their own. This is why we want to add dynamic ECS templates to the logs-*-* index pattern. See also https://github.com/elastic/elasticsearch/issues/95538

Integration assets generated by EPM are prefixed in most cases with the integration name. Would this mean Fleet needs to create an index template with a different pattern for some cases like this?

I hope that's not what it means. I was thinking that we'd just rely on the logs-*-* index template that's embedded in ES rather than setting up a more specific index template with Fleet. But that means there'll be a difference between data streams that are set up via Fleet and the ones that just use the default index template in Elasticsearch.

Maybe that's ok. If it's not, we'll need to think about how we could prefix the dataset with the integration name or how to add features to ES that would allow us to rely on the built-in logs-*-* index template. I guess the main thing that we need to do is to mirror the component template and ingest pipeline extension points.

But you bring up a good point. I don't think it's currently possible to route to logs-{{integration}}.{{dataset}}-*. That's because the reroute processor doesn't support the full mustache syntax, so you can't set dataset: "k8s.{{labels.dataset}}", for example. Do you think that would be required?

I don't think I completely follow this. Could you provide an example of what this routing rule setup would look like or a use case?

Let's take the following reroute processor as an example:

- reroute:
    dataset: "{{service.name}}"

The resulting data stream would look like logs-{{service.name}}-default. We can't set up index templates for that in Fleet as we have no control over the service.name field that's sent via the documents.
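
To illustrate why Fleet cannot know the target name in advance, here is a small sketch of how a `type-dataset-namespace` data stream name could be derived from a document. It assumes candidates are tried in order and the first one that resolves wins, which is how the namespace lists in the example manifest read. All names are hypothetical, and the sketch does no validation or sanitizing of the resolved values.

```typescript
type SourceDoc = { [key: string]: unknown };

// Look up a dotted path such as "service.name" in a nested document.
function getField(doc: SourceDoc, path: string): string | undefined {
  let current: unknown = doc;
  for (const key of path.split(".")) {
    if (typeof current !== "object" || current === null) {
      return undefined;
    }
    current = (current as SourceDoc)[key];
  }
  return typeof current === "string" ? current : undefined;
}

// A candidate is either a static value or exactly one "{{field}}" reference.
function resolveCandidate(doc: SourceDoc, candidate: string): string | undefined {
  const match = /^\{\{([^{}]+)\}\}$/.exec(candidate);
  const path = match ? match[1] : undefined;
  return path === undefined ? candidate : getField(doc, path);
}

// Try candidates in order and return the first one that resolves.
function resolveFirst(doc: SourceDoc, candidates: string[]): string | undefined {
  for (const candidate of candidates) {
    const value = resolveCandidate(doc, candidate);
    if (value !== undefined) {
      return value;
    }
  }
  return undefined;
}

// Build the data stream name, or undefined if any part cannot be resolved.
function targetDataStream(
  type: string,
  doc: SourceDoc,
  datasets: string[],
  namespaces: string[]
): string | undefined {
  const dataset = resolveFirst(doc, datasets);
  const namespace = resolveFirst(doc, namespaces);
  if (dataset === undefined || namespace === undefined) {
    return undefined;
  }
  return `${type}-${dataset}-${namespace}`;
}
```

For a document with `service.name` set to `checkout` and no namespace label, the candidates `["{{service.name}}"]` and `["{{labels.data_stream.namespace}}", "default"]` resolve to `logs-checkout-default`, a name that only exists once such a document arrives.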

The reroute processor doesn't support something like this:

- reroute:
    dataset: "foo.{{service.name}}"
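
The limitation described here could be checked at package validation time. The following is a sketch only, assuming a value is acceptable when it is either fully static or consists of exactly one field reference; the constant and function names are hypothetical.

```typescript
// Matches a value that is exactly one field reference, e.g. "{{service.name}}".
const SINGLE_FIELD_REFERENCE = /^\{\{[^{}]+\}\}$/;

// Accept static values and single field references; reject anything that
// mixes literal text with a reference, such as "foo.{{service.name}}".
function isSupportedRerouteValue(value: string): boolean {
  const hasBraces = value.indexOf("{{") !== -1 || value.indexOf("}}") !== -1;
  if (!hasBraces) {
    return true; // static value such as "nginx.access"
  }
  return SINGLE_FIELD_REFERENCE.test(value);
}
```

A check like this would let the package spec reject prefixed templates early instead of leaving the failure to pipeline creation or ingest time.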

The example manifest.yml looks good to me. Out of a personal preference, I'd create dedicated files for the routing_rules and injected_routing_rules sections as I find that more consistent with what we're doing for ingest pipeline definitions. But whatever feels more intuitive to developers that will actually use these features in anger is fine with me.

felixbarny commented 1 year ago

Good catch. I wasn't sure on how we'd want to guarantee order here. Should the order be based on the order of processors as they appear in the YAML, with conditionless processors pushed to the end of the list? Part of me just wants to honor the order as they appear in the integration, but again it's more burden on the maintainers to understand the implementation details of reroute processors

I think for the routing_rules section, the answer is relatively simple: they should just be used in the same order as they're specified. Any injected routing rule should be executed before the integration's own routing_rules. That's because the routing_rules often include a catch-all rule that always gets executed. If multiple integrations want to inject routing rules into the same routing dataset (for example, both the nginx and the apache integration want to inject rules to k8s), we might need to expose a way for package developers to define a precedence. However, most of these injected rules should be mutually exclusive, so the ordering shouldn't matter for correctness. But which rules are executed first vs. last may have performance implications. I'm inclined to not add this to the initial scope for injected_routing_rules and see if a random or alphabetical order is good enough.
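
The ordering proposed here can be sketched as follows: injected rules first, sorted alphabetically by the package that contributed them, followed by the dataset's own rules in their declared order. The `fromPackage` field and the function name are hypothetical, and alphabetical order is just one of the options mentioned above.

```typescript
interface OrderedRule {
  dataset: string;
  if?: string;
}

// Hypothetical: an injected rule remembers which package contributed it.
interface InjectedRule extends OrderedRule {
  fromPackage: string;
}

// Injected rules run first so that a catch-all rule among the local rules
// cannot shadow them; local rules keep the order from the manifest.
function assembleRules(injected: InjectedRule[], local: OrderedRule[]): OrderedRule[] {
  // Sort a copy so the caller's array is left untouched. In ES2019+ engines
  // the sort is stable, so rules from one package keep their relative order.
  const sortedInjected: OrderedRule[] = injected.slice().sort((a, b) => {
    if (a.fromPackage < b.fromPackage) return -1;
    if (a.fromPackage > b.fromPackage) return 1;
    return 0;
  });
  return sortedInjected.concat(local);
}
```

If explicit precedence turns out to be necessary, the comparator is the only part that would need to change, for example to compare a numeric priority before falling back to the package name.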

weltenwort commented 1 year ago

I like how the two routing rule concepts have been narrowed down. But I wonder if there even is a need for the routing_rules in the package manifest. Would it make sense to instead only have "injected" rules? In this case it would mean that the nginx.access manifest can specify to inject rules into nginx the same way it can inject them into k8s. Then what do we need the other rules "direction" for? What am I missing?

weltenwort commented 1 year ago

What I haven't been able to find in the description so far is whether the installation of the routing rules always happens or if the user gets a choice of which of the available routing rules they want to inject into the "source integration".

If we didn't make installing the rules opt-in, the user couldn't easily install the k8s integration in parallel to the nginx integration without them influencing each other. Wouldn't that be a valid use-case too?

felixbarny commented 1 year ago

Would it make sense to instead only have "injected" rules?

For the scenario in that particular example I think you're right. But it's needed for use cases like these:

type: logs
dataset: k8s
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:  
  - dataset: "{{kubernetes.container.name}}"

Hmm, good point about making routing rule injection opt-in. I guess that's another reason why we'd want to have both ways, injected and local routing rules, as we can rely on local rules to always be installed. So while the nginx.access and nginx.error datasets could inject routing rules to nginx, if we make injection optional, we can't rely on the rules being installed.

ruflin commented 1 year ago

Would it make sense to instead only have "injected" rules? In this case it would mean that the nginx.access manifest can specify to inject rules into nginx the same way it can inject them into k8s.

The reason I like it in the manifest is because routing rules as ingest pipeline is more an implementation detail and I would prefer that package devs do not have to think through where to put the rules in ingest pipelines. Having it separate will also allow us to "manage" these rules and show them to our users without having to read ingest pipelines.

making routing rule injection opt-in

This is a more generic feature I would like to see in the package manager: users have an option to remove some of the assets or not install them, for example dashboards or routing rules that are not needed. If needed later, they can be added.

weltenwort commented 1 year ago

But it's needed for use cases like these: [...] dataset: {{kubernetes.container.name}} [...]

Isn't it only needed because there is no k8s.container (or similar) dataset in the k8s package that could inject the rule into the k8s dataset's pipeline?

in the manifest is because routing rules as ingest pipeline is more an implementation detail

I agree, and I'm not making an argument for adding it to the ingest pipeline directly. I was suggesting that we might get by with just the "injected" rules if we define them in the manifest of the "leaf" data streams instead of the package.

The downside would be that we'd need to add some datasets only for the purpose of routing, but on the upside we'd only have a single way to write rules.
