[Fleet] Ingest pipelines of old package version are deleted on upgrade

juliaElastic commented 1 year ago

Kibana version: 8.3.1 and likely 8.4, 8.2 and others

Describe the bug: After upgrading a package (integration), the ingest pipelines (and likely other assets) with the old version are deleted, and recreated with the new version suffix. This causes an error for those agents that are not upgraded yet, and still trying to use the old pipeline.

Steps to reproduce:

Force install windows 1.12.1 version package
Add integration policy to an agent policy that has running agents
Upgrade package to 1.12.4 or latest
The ingest pipelines for 1.12.1 are deleted e.g. logs-windows.sysmon_operational-1.12.1
Those agents that are not yet upgraded, are failing to index the data as the pipeline is not found

Expected behavior: The ingest pipeline for the old version should not be deleted as there are agents using it.

Provide logs and/or server output (if relevant):

{
  "log.level": "warn",  "@timestamp": "2022-08-25T14:06:25.916-0500",
  "log.logger": "elasticsearch", "log.origin": { "file.name": "elasticsearch/client.go", "file.line": 429 },
  "message": "Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.August, 25, 19, 5, 53, 934172600, time.UTC), Meta:{\"raw_index\":\"logs-windows.sysmon_operational-datacenter_msexchange\"}, Fields:{\"agent\":{\"ephemeral_id\":\"10adb7bf-3805-4f1c-9b58-839c7151fa9f\",\"id\":\"88844081-50a2-4660-8687-c4831e58ca9d\",\"name\":\"NLR-A-EXC05\",
  \"type\":\"filebeat\",\"version\":\"8.3.1\"},\"data_stream\":{\"dataset\":\"windows.sysmon_operational\",\"namespace\":\"datacenter_msexchange\",\"type\":\"logs\"},\"ecs\":{\"version\":\"8.0.0\"},\"elastic_agent\":{\"id.
...
ProcessTerminate)\",\"user\":{\"domain\":\"NT AUTHORITY\",\"identifier\":\"S-1-5-18\",\"name\":\"SYSTEM\",\"type\":\"User\"},\"version\":3}}, Private:(*cursor.updateOp)(0xc007a5c480), TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): 
{\"type\":\"illegal_argument_exception\",\"reason\":\"pipeline with id [logs-windows.sysmon_operational-1.12.1] does not exist\"}, dropping event!", "service.name": "filebeat", "ecs.version": "1.6.0"
}

elasticmachine commented 1 year ago

Pinging @elastic/fleet (Team:Fleet)

jen-huang commented 1 year ago

@kpollich Curious to know what you think about this. Could a similar problem happen for other types of assets other than ingest pipelines? A solution that "waits until all relevant agents are upgraded" does not seem trivial (or even viable..).

jdixon-86 commented 1 year ago

@jen-huang Customer here that brought this case to support. What I have noticed in my environment is I may have multiple "Windows" (or other) integration policies. Before upgrading the integration policies I test them on one to make sure everything works correctly but doing this seems to remove important assets such as ingest pipelines. When this happens it breaks all the integration policies that have not been upgraded yet.

I think a simple solution would be when a integration policy is upgraded it performs a check to verify it is the last integration policy to upgrade before removing the older assets.

kpollich commented 1 year ago

@kpollich Curious to know what you think about this. Could a similar problem happen for other types of assets other than ingest pipelines? A solution that "waits until all relevant agents are upgraded" does not seem trivial (or even viable..).

"Waiting until all agents are upgraded" during the policy upgrade process is indeed not very feasible here.

It's probably a better idea to query for any agents that are still running on the previous version of an integration before we initiate the cleanup process after a successful upgrade. Right now, this cleanup process is destructive and isn't aware of non-upgraded agents that may still be deployed.

I think a simple solution would be when a integration policy is upgraded it performs a check to verify it is the last integration policy to upgrade before removing the older assets.

This sounds good for the policies themselves, but the larger problem at hand is that non-upgraded agents will fail to ingest. I think a check for "are there still agent policies deployed that point at a policy containing the outdated version of this integration?" before the cleanup process fires after a single policy is upgraded will fix both cases.

@juliaElastic - One thing I'm worried about here is "eventual consistency". If we forego the asset cleanup process while there are still agents or policies active that reference an outdated version of a package, we'd need some way to make sure cleanup still happens after the bulk operation completes. Even if we say "check for agent/policy references to the outdated integration version before cleaning up assets", we process bulk actions async, so there's no guarantee that the last agent to upgrade will hit the cleanup logic at a point where all other agents have finished upgrading.

Do we have any concept of callbacks/hooks to accomplish this in the current bulk actions architecture?

joshdover commented 1 year ago

5. Those agents that are not yet upgraded, are failing to index the data as the pipeline is not found

I'm a little concerned on how this is happening in the first place. The integration policy / agent policy should not reference an ingest pipeline directly at all, it should be solely controlled by the default_pipeline setting on the data stream and the integration should just be pushing data to a specific data stream name. This way the integration policy is less coupled to specific assets in Elasticsearch and can be upgraded without synchronization.

Do we have integrations that are specifying an explicit pipeline?

joshdover commented 1 year ago

Took a look at this integration and it doesn't have an explicit reference to a specific pipeline. It's still not clear to me how this is happening then. There should be no dependency between the agent policy and a specific ingest pipeline version existing. This definitely needs further investigation.

lucabelluccini commented 1 year ago

As a side note, I've been informed Fleet is attempting to do a update index.default_pipeline on all the "past" backing indices, which can obviously fail as some indices might be read-only (e.g. searchable snapshots).

lucabelluccini commented 1 year ago

The best approach here would be not never remove anything and never change the existing indices settings. We should purely rely on updating index templates (and associated resources) and then trigger a rollover.

We might offer a purge of unused resources/assets.

criamico commented 1 year ago

I reproduced this behavior locally on 8.7.0-snapshot:

Installed an older version of windows (windows-1.12.1) and added to an agent policy which has running agents. The related ingest pipelines visible under stack management are as follows:

Then I upgraded to the latest version (1.17.0). The ingest pipelines are now all updated to the newer version:

The code responsible for the removal of the pipelines can be found here:
https://github.com/elastic/kibana/blob/52d9dbffaa27aa14e869b105fe42e6c22c10fff6/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L245-L262

This code has been nearly unchanged for long time, probably since 7.x, so I think it requires some discussion on what's the best way of dealing with it. @kpollich

kpollich commented 1 year ago

@criamico - Was this reproduced with running agents and a rollout of the upgrade? The key issue here is that as soon as we roll out the first agent policy upgrade, any agents w/ an outdated policy immediately begin failing to ingest because the cleanup has already run:

Those agents that are not yet upgraded, are failing to index the data as the pipeline is not found

My comment here details what I think is a valid path forward to resolving this:

I think a check for "are there still agent policies deployed that point at a policy containing the outdated version of this integration?" before the cleanup process fires after a single policy is upgraded will fix both cases.

However, it's not clear why agents with outdated policy revisions immediately fail to ingest when their associated pipeline is removed. As Josh said here:

There should be no dependency between the agent policy and a specific ingest pipeline version existing.

We need to figure out why exactly when a pipeline is deleted, agents begin failing to send data. The fact that the newer-versioned pipeline exists should allow those agents to continue ingesting, as ingest pipelines generated by integrations are guaranteed to be backwards compatible.

It'd be good to audit the index/component templates generated here, as well as the ingest pipeline definitions to see what we might be missing. When an upgrade happens, a new index template should be created with default_pipeline set to the new ingest pipeline. The old agents should still be writing to the same data stream pattern, so the index template resolution should work in such a way that any writes to that data stream immediately start using the settings defined by the new index template. If this isn't working I think it's a regression. Could something in Elasticsearch have changed that invalidates our expectations here?

Edit: Add diagram

lucabelluccini commented 1 year ago

I am commenting here just to provide some experience about this.

We need to figure out why exactly when a pipeline is deleted, agents begin failing to send data. The fact that the newer-versioned pipeline exists should allow those agents to continue ingesting, as ingest pipelines generated by integrations are guaranteed to be backwards compatible.

Users PRIOR 8.6 who were cloning the index template to customize their ILM policies per namespace were ending up having as index.default_pipeline: <type>-<package>....-<integration version> with the old integration version. The cloned Index Templates will stay with the "old referenced pipeline". There's no one taking care of them. If those pipelines are deleted, the ingestion will fail. Upgrading, those Index Templates will always "win" over the new default Index Template as they have a more specific index pattern and priority. So they will always end up matching against the newly installed Index Template.

E.g. Nginx integration 1.6.0

User cloned metrics-nginx.stubstatus into metrics-nginx.stubstatus-mynamespace, updated the priority and the index_pattern. Due to cloning, they also have index.default_pipeline set to metrics-nginx.stubstatus-1.6.0 at Index Template level.

User updates Nginx integration to 1.7.0, metrics-nginx.stubstatus gets overwritten but will never be used as the Index Template metrics-nginx.stubstatus-mynamespace will always win over the default one and it will still reference metrics-nginx.stubstatus-1.6.0. If the stack version is 8.6+, even if metrics-nginx.stubstatus@package component template references metrics-nginx.stubstatus-1.7.0, it is "overridden" by the Index Template level default_pipeline.

Also, for users which had a custom ILM policy per namespace on 8.N (N < 6) upgrading to 8.6, it also triggers a breaking change as the Index Template will still reference to the OLD index.default_pipeline. Even if index.default_pipeline is defined at package component template level, the one at Index Template level wins. (I've described it on the KB 65e78ded, reach me out if you cannot find it).

When an upgrade happens, a new index template should be created with default_pipeline set to the new ingest pipeline.

We are actually overwriting the Index Templates & assets, which is something already different from what we did with Filebeat (where Index Templates had a version in their name). Overwriting is not necessarily a problem.

The old agents should still be writing to the same data stream pattern, so the index template resolution should work in such a way that any writes to that data stream immediately start using the settings defined by the new index template.

No, the data stream will pick up the new index template on the next rollover (manual or automatic).

Example

Let's imagine we are on 8.6.x and we're using nginx.stubstatus metrics only in the Nginx integration. We upgrade it from version 1.7.0 to an hypothetical 1.9.0

We have an index template metrics-nginx.stubstatus:

{
  "index_templates": [
    {
      "name": "metrics-nginx.stubstatus",
      "index_template": {
        "index_patterns": [
          "metrics-nginx.stubstatus-*"
        ],
        "template": {
          "settings": {},
          "mappings": {
            "_meta": {
              "package": {
                "name": "nginx"
              },
              "managed_by": "fleet",
              "managed": true
            }
          }
        },
        "composed_of": [
          "metrics-nginx.stubstatus@package",
          "metrics-nginx.stubstatus@custom",
          ".fleet_globals-1",
          ".fleet_agent_id_verification-1"
        ],
        "priority": 200,
        "_meta": {
          "package": {
            "name": "nginx"
          },
          "managed_by": "fleet",
          "managed": true
        },
        "data_stream": {
          "hidden": false,
          "allow_custom_routing": false
        }
      }
    }
  ]
}

The metrics-nginx.stubstatus@package:

{
  "component_templates": [
    {
      "name": "metrics-nginx.stubstatus@package",
      "component_template": {
        "template": {
          "settings": {
            "index": {
              "lifecycle": {
                "name": "metrics"
              },
              "codec": "best_compression",
              "default_pipeline": "metrics-nginx.stubstatus-1.7.0", <<<<
              "mapping": {
                "total_fields": {
                  "limit": "10000"
                }
              },

If we upgrade, we're going to overwrite those assets, but the CURRENT backing index will still be using metrics-nginx.stubstatus-1.7.0. It will be in use until we force a rollover (or a rollover naturally occurs).

It is also crucial that at any time we NEVER try to touch the existing old backing indices of the data stream and we do not attempt to do PUT ..../_settings on them. Let's never do PUT metrics-nginx.stubstatus/_settings as it would affect ALL backing indices (it doesn't scale + indices might be in searchable snapshots). In some situation we got hints something like that is being done by Fleet and it shouldn't. Can we confirm we do not do anything like that?

An upgrade flow might be:

Overwrite/add new assets
Force a rollover on each data stream (this is still risky as it might trigger a massive number of indices/shards being created. We might need to throttle this or should be discussed with the Elasticsearch team)

For old assets, I would propose to not touch them at all or something in the UI to identify unused resources. Also keep in mind those resources might be needed for reindexing purposes (e.g. the user realizes the indexing went wrong 2 months after a problem occurred and they want to reindex the data using the old pipeline).

kpollich commented 1 year ago

Many, many thanks for the extremely thoughtful and detailed response @lucabelluccini - this is incredibly helpful and I appreciate your work here immensely.

I want to touch on rollover logic around package upgrades, as it's brought up a few times in your writeup. My previous understanding was we force a rollover of data streams on package policy update. Looking through the Fleet codebase, this is not the case. We'll only force a rollover in the case where the new template's mappings are incompatible with the old one, resulting in an error. See this old PR: https://github.com/elastic/kibana/pull/69180

It makes sense that we don't do this, as you've mentioned that a forced rollover can be very risky and can incur massive performance costs.

So, because we don't immediately rollover data streams upon deleting the old index/component templates, it makes sense that we'll have ingest failures until data streams roll over. At least if I'm understanding things correctly that's what I think is going on here. I do think the suggestion of leaving assets behind with some sort of metadata and a way to opt-in to cleanup for users is probably a good path forward, but I need to think about it more.

It is also crucial that at any time we NEVER try to touch the existing old backing indices of the data stream and we do not attempt to do PUT ..../_settings on them. Let's never do PUT metrics-nginx.stubstatus/_settings as it would affect ALL backing indices (it doesn't scale + indices might be in searchable snapshots). In some situation we got hints something like that is being done by Fleet and it shouldn't. Can we confirm we do not do anything like that?

As far as I can tell, no - Fleet does not PUT /_settings to any backing indices at any time. The only direct PUT to _settings Fleet ever makes is related to the uninstall process for packages in the case a plain index is included in a package's assets. This can happen for packages with ML jobs or transforms that rely on plain destination indices that are managed by Fleet. We don't touch backing indices directly at all in Fleet's codebase as far as I can tell.

For old assets, I would propose to not touch them at all or something in the UI to identify unused resources. Also keep in mind those resources might be needed for reindexing purposes (e.g. the user realizes the indexing went wrong 2 months after a problem occurred and they want to reindex the data using the old pipeline).

Circling back to this, I do think it's a good path forward. However, I'm not sure of the "signal to noise ratio" problem here when a user upgrades a package across many versions. If we indefinitely keep versioned index/component templates + ingest pipelines between all releases, I could see the stack management UI becoming quite difficult to reason about. However, maybe the reindexing use case is very compelling so I'd like to consider it.

An easy path forward would be to simply omit index/component templates and ingest pipelines from the automatic cleanup process that fires on package installation. Currently, these two cleanup processes are located here in code:

When Fleet "prepares" the templates as part of package installation process, we'll delete previously referenced templates here:

https://github.com/elastic/kibana/blob/fe5ec462336d7e81ec3ad9418b9ff1327be815bc/x-pack/plugins/fleet/server/services/epm/elasticsearch/template/install.ts#L80-L84

When Fleet installs a package, we'll delete any ingest pipelines previously referenced by the package, or any left over from a fail upgrade (e.g. same version being installed now):

https://github.com/elastic/kibana/blob/fe5ec462336d7e81ec3ad9418b9ff1327be815bc/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L248-L274

I think if removed these two pieces of cleanup and replaced them with a manual action available to users, it might fix a lot of the issues we're seeing around ingest failures immediately following upgrades and before the next data stream rollover occurs.

@joshdover - cc'ing you as I think you'd find this conversation interesting, and perhaps you have something to offer.

joshdover commented 1 year ago

As far as I can tell, no - Fleet does not PUT /_settings to any backing indices at any time. The only direct PUT to _settings Fleet ever makes is related to the uninstall process for packages in the case a plain index is included in a package's assets. This can happen for packages with ML jobs or transforms that rely on plain destination indices that are managed by Fleet. We don't touch backing indices directly at all in Fleet's codebase as far as I can tell.

I think this key detail is not quite right. Here's what I see and I believe has been the case since Fleet GA:

After the new index templates and ingest pipelines we update the existing data streams here: https://github.com/elastic/kibana/blob/3ae7d70c803dbc091cafc5d561625893ae1afacc/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L238
This will first attempt to update mappings, if that fails due to incompatibility the data stream will be rolled over. Following this, we also update the default_pipeline setting on the current write indices with a PUT settings call: https://github.com/elastic/kibana/blob/3ae7d70c803dbc091cafc5d561625893ae1afacc/x-pack/plugins/fleet/server/services/epm/elasticsearch/template/template.ts#L680
Only after this do we delete the old pipelines: https://github.com/elastic/kibana/blob/3ae7d70c803dbc091cafc5d561625893ae1afacc/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L254

If there is a bug in 8.6, I suspect it could be from this PR where we made some changes to this logic, though nothing is jumping out at me as problematic: https://github.com/elastic/kibana/pull/124013/files#diff-448f4120f496fd32751c5d798c23eb666a2e0063c1b1607f3c2ef1ec232aa2b3R492. @hop-dev any thoughts?

lucabelluccini commented 1 year ago

After the new index templates and ingest pipelines we update the existing data streams here This will first attempt to update mappings, if that fails due to incompatibility the data stream will be rolled over. Following this, we also update the default_pipeline setting on the current write indices with a PUT settings call

From the code it seems we:

do the put mappings only on the write index (thanks to write_index_only: true), which is great
do the put settings on all the backing indices (we only exclude CCR'd indices) of the data streams - I think the put settings might be ok on any kind of index (I'm not sure about searchable snapshots).

Only after this do we delete the old pipelines

Indeed, otherwise we would get pipeline [...] cannot be deleted because it is the default pipeline for N index(es) including [...].

In order to do delete the old ingest pipeline, we would need to be able to detect any backing index of data streams making use of a given ingest pipeline.

How do we detect the association between datastream <> mapping/pipeline to use?

Are we relying on the ...@package component template or on the index template? If we use the index template, we likely miss all the cloned index templates.

Keeping aside the use case of reindexing, I think this is a point worth discussing with the Elasticsearch team. Possible ideas (thinking out loud):

Elasticsearch to remove the ingest pipeline references after rollover, if the index belongs to a data stream (once rolled over, it should no more receive index requests). This is likely a breaking change on ES APIs.
Elasticsearch to offer a request parameter for the delete ingest pipeline API which prevents you from removing the ingest pipeline only if the pipeline is referenced by a currently written index. E.g. DELETE _ingest/pipeline/somepipeline?allow_delete_if_no_write_index=true.

Especially the point (2) would not be a breaking change (it's opt-in) and would allow us to do the PUT <DATA STREAM>/_settings?write_index_only=true, followed by a DELETE _ingest/pipeline/somepipeline?allow_delete_if_no_write_index=true.

At the end, we allow users to set an unexisting pipeline:

PUT theindex/_settings
{
  "index.default_pipeline": "this_pipeline_do_not_exist"
}

kpollich commented 1 year ago

How do we detect the association between datastream <> mapping/pipeline to use?

The @package component template contains the default_pipeline settings, so this is considered the source of truth here. However, this wasn't always true historically, as we previously set default_pipeline on the index template.

Elasticsearch to offer a request parameter for the delete ingest pipeline API which prevents you from removing the ingest pipeline only if the pipeline is referenced by a currently written index. E.g. DELETE _ingest/pipeline/somepipeline?allow_delete_if_no_write_index=true.

I do like this idea as it would allow us to more safely clean up ingest pipelines while accommodating users who may have introduced custom ingest workflows built around package assets. The alternative is a lot of "tracking down" on Fleet's side by referencing _meta properties to determine what can and can't be safely cleaned up.

Do you know how we could the conversation started with the ES team about adding this functionality?

lucabelluccini commented 1 year ago

Hello @kpollich

For exploring the possibility of having DELETE _ingest/pipeline/somepipeline?allow_delete_if_no_write_index=true, I think the area is Elasticsearch Data Management (both for Data Stream & Ingest). The handle should be @elastic/es-data-management

I think the mapping between datastream <> mapping/pipeline can anyway be derived searching for the main component package. It is the best way to keep track of a user who cloned the "default" index template to customize the namespace. Even prior 8.6 (when the default_pipeline was moved to the package), the @package component template was anyway there. Taking the example of Nginx again, we can have:

metrics-nginx.stubstatus Index template
The user might have cloned it into metrics-nginx.stubstatus-prod and metrics-nginx.stubstatus-uat

Fleet can still know all indices associated to Nginx package are those using the metrics-nginx.stubstatus@package.

lucabelluccini commented 1 year ago

I have a follow up question. It can definitely happen to have 2 versions of the same integration running (or which could become misaligned):

I think we cannot blindly remove old assets if other policies are using other versions of the same package.

joshdover commented 1 year ago

This is just the version of the collection part of the integration - both versions will still ingest against the same data streams and use the default pipeline of those which only be the latest version. Integration ingest pipelines are supposed to be designed with backwards compatibility so that older versions of the collection code can ship data to the newer pipeline.

elastic / kibana

[Fleet] Ingest pipelines of old package version are deleted on upgrade #141347

Example