elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

Possible to not ACK upgrade action #760

Closed: blakerouse closed this issue 2 years ago

blakerouse commented 2 years ago

Seems it is possible that an upgrade action can result in the Elastic Agent successfully upgrading, but never fully ACKing the action.

It seems to happen when an Elastic Agent is having connectivity issues with Fleet Server during a successful upgrade. I have seen this issue reported in an SDH and worked around it, but we need to track down how and why this happens. The upgrade process needs to be foolproof, and this seems to be an issue that needs to be tracked down and fixed.


joshdover commented 2 years ago

Agreed we should identify the root cause. I also opened this PR which could serve as a stop-gap to mark the agent as upgraded when we see that the version number changed: https://github.com/elastic/fleet-server/pull/1731

joshdover commented 2 years ago

@scunningham and I took a look at this today and produced this summary of how Elastic Agent's and Horde's implementations of action acking work, as well as some potential Fleet Server issues.

There's potentially an issue where Fleet Server:

joshdover commented 2 years ago

@blakerouse I think this is where we could use the most help from you right now. I'm working with @pjbertels to get a cluster where this has been reproduced in a scale test with debug logs enabled. In those cases we've been seeing agents POSTing a successful 200 ack, but the ES document isn't actually updated correctly.

joshdover commented 2 years ago

@blakerouse does it make sense to close this issue now, or do we suspect there are other causes based on https://github.com/elastic/fleet-server/issues/1894?

blakerouse commented 2 years ago

@joshdover I am looking to keep this open until I can get all the 75k agents to upgrade successfully in scale testing. Once it has been confirmed it is completely fixed then I will close this.

I don't want to take an early victory just yet. Other than adding the version conflict retries and the proper return of errors, no other fix has been implemented. The change might expose other errors that need to be handled/fixed.

pjbertels commented 2 years ago

Had agents stuck in updating again in the latest 8.5 SNAPSHOT (07ad0494). We upgraded 10021 agents and 849 were stuck in updating. All 10021 had the upgraded version, 8.2.1.

[16:08:59]           trend=online is ascending                                                                                                                                     perf_lib.py:164
                     status=FleetAgentStatus(total=10021, inactive=10021, online=9172, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                         
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
                     prev_status=FleetAgentStatus(total=9459, inactive=10021, online=8610, error=0, offline=0, updating=849, other=10870, events=0, doc_id=None, run_id=None,                     
                     timestamp=None, kuery='local_metadata.elastic.agent.version : 8.2.1', cluster_name=None)                                                                                     
[16:14:20] ERROR    poll_agents_status reached max_iterations 
blakerouse commented 2 years ago

The snapshot used in @pjbertels previous comment did not include the fix in https://github.com/elastic/fleet-server/issues/1894.

The latest snapshot build does contain the fix, but BC2 and the snapshot for 8.5 are currently broken because of https://github.com/elastic/ingest-dev/issues/1282. This is preventing us from actually testing the fix made in https://github.com/elastic/fleet-server/issues/1894.

Hopefully a quick fix to https://github.com/elastic/ingest-dev/issues/1282 is near, so we can get back to testing upgrades.

blakerouse commented 2 years ago

@pjbertels Was able to successfully get the latest 8.5 snapshot running and perform an upgrade. It resulted in ~5k agents still stuck in upgrading. Digging into the logs now.

blakerouse commented 2 years ago

I was unable to find any errors being reported by Fleet Server. From the logs, all upgrade actions were dispatched to the Elastic Agents and all were ack'd.

The issue is that the update request to Elasticsearch to update the document seems to be updating only one of the fields.

    now := time.Now().UTC().Format(time.RFC3339)
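    // Partial update applied when an agent ACKs its upgrade: clear
    // upgrade_started_at, stamp upgraded_at, and mark the status completed.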
    doc := bulk.UpdateFields{
        dl.FieldUpgradeStartedAt: nil,
        dl.FieldUpgradedAt:       now,
        dl.FieldUpgradeStatus:    "completed",
    }

    body, err := doc.Marshal()
    if err != nil {
        return errors.Wrap(err, "handleUpgrade marshal")
    }

    if err = ack.bulk.Update(ctx, dl.FleetAgents, agent.Id, body, bulk.WithRefresh(), bulk.WithRetryOnConflict(3)); err != nil {
        return errors.Wrap(err, "handleUpgrade update")
    }

Only the upgraded_at field is set (I have correlated the log message for the ACK to match the timestamp that gets set in this field). The other 2 fields upgrade_started_at and upgrade_status are not updated on the document.

At the moment I am in need of help from the Elasticsearch team, because I believe this could be an issue internal to Elasticsearch.

jlind23 commented 2 years ago

Hey @dakrone, we definitely need the Elasticsearch team's help here. If not you, could you please help us find someone to look at the issue above ^, which seems to potentially be an Elasticsearch bug?

dakrone commented 2 years ago

I'm having a hard time parsing exactly what this is doing (not being at all familiar with the agent's internals), but from what I can guess, it's doing something like this:

PUT /myindex/_doc/1
{
  "foo":"bar"
}

POST /myindex/_update/1
{
  "doc": {
    "upgrade_started_at": null,
    "upgraded_at": 1234,
    "upgrade_status": "completed"
  }
}

GET /myindex/_doc/1

Which works fine for Elasticsearch.

If that's incorrect, can someone translate this into an Elasticsearch reproduction so I can try this out?

joshdover commented 2 years ago

I was unable to find any errors being reported by Fleet Server. From the logs, all upgrade actions were dispatched to the Elastic Agents and all were ack'd.

What's the confidence level that we're not getting any errors returned by ES? My understanding is that we may still be dropping logs in Cloud, but that may not matter to prove this depending on the logging.

+1, a simple repro would be helpful. Could we use two tight loops of concurrent updates, similar to what is happening with the upgrade acks and the 30s checkin heartbeat, to emulate this?
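
To make that concrete, here is a minimal repro sketch along those lines. It is not Fleet Server code: it assumes a local, unauthenticated Elasticsearch at localhost:9200 and a scratch index named myindex (both assumptions for illustration), and it hammers the same document from two goroutines, one emulating the upgrade ack write and one emulating the checkin heartbeat.

    // repro.go - two tight loops of concurrent partial updates against the same
    // document, emulating the upgrade ack write and the checkin heartbeat
    // (sped up). Illustrative only; endpoint and index name are assumptions.
    package main

    import (
        "bytes"
        "fmt"
        "net/http"
        "sync"
        "time"
    )

    func update(body string) {
        req, err := http.NewRequest(http.MethodPost,
            "http://localhost:9200/myindex/_update/1?retry_on_conflict=3",
            bytes.NewBufferString(body))
        if err != nil {
            panic(err)
        }
        req.Header.Set("Content-Type", "application/json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println("update error:", err)
            return
        }
        resp.Body.Close()
    }

    func main() {
        // Seed the document so both loops update an existing doc.
        update(`{"doc": {"upgrade_status": "started"}, "doc_as_upsert": true}`)

        var wg sync.WaitGroup
        wg.Add(2)

        // Loop 1: the "upgrade ack" writer.
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                update(`{"doc": {"upgrade_started_at": null, "upgraded_at": "2022-01-01T00:00:00Z", "upgrade_status": "completed"}}`)
            }
        }()

        // Loop 2: the "checkin heartbeat" writer, touching a different field.
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                update(fmt.Sprintf(`{"doc": {"last_checkin": %q}}`, time.Now().UTC().Format(time.RFC3339)))
            }
        }()

        wg.Wait()
        // Afterwards, GET /myindex/_doc/1 and check whether all three upgrade
        // fields were persisted.
    }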

blakerouse commented 2 years ago

@dakrone That is not what Fleet Server is doing. It is using _bulk to perform actions in bulk through its bulker system. So what actually would be happening is the following:

POST _bulk
{ "update" : {"_id" : "023e83e4-e3c7-46f6-9424-adb03e7ba980", "retry_on_conflict": "3", "_index" : ".fleet-agents"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

This bulk update can be batched with any other actions Fleet Server needs to perform; the bulker flushes at an interval of 250ms or when the bulk buffer is full.

I am seeing that sometimes all the fields get updated, and in other cases not all of them are.
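
For reference, the flush behavior described above (every 250ms or when the buffer fills) is the usual batch-or-timeout pattern. A minimal, hypothetical sketch of that pattern follows; it is not the actual fleet-server bulker and the names are made up.

    // Hypothetical flush-on-interval-or-when-full batcher, illustrating the
    // behavior described above. Not the fleet-server bulker implementation.
    package bulker

    import "time"

    type op struct{ body []byte }

    func run(queue <-chan op, flush func([]op)) {
        const maxBuffer = 1024
        ticker := time.NewTicker(250 * time.Millisecond)
        defer ticker.Stop()

        buf := make([]op, 0, maxBuffer)
        for {
            select {
            case o, ok := <-queue:
                if !ok {
                    flush(buf) // producer closed: flush what is left and stop
                    return
                }
                buf = append(buf, o)
                if len(buf) == maxBuffer {
                    flush(buf) // buffer full: flush immediately
                    buf = buf[:0]
                }
            case <-ticker.C:
                if len(buf) > 0 {
                    flush(buf) // interval elapsed: flush pending ops
                    buf = buf[:0]
                }
            }
        }
    }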

dakrone commented 2 years ago

Okay, even doing something like this, which creates and then updates the document:

DELETE /myindex

GET /myindex/_doc/1

POST _bulk
{ "index" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "upgraded_at": 1234 }
{ "update" : {"_id" : "2", "retry_on_conflict": "3", "_index" : "myindex"} }
{ "doc" : {"upgrade_started_at": null,"upgraded_at": 1234,"upgrade_status": "completed"} }

GET /myindex/_doc/2

Succeeds locally. Perhaps a reproduction with the tight loops would be necessary to see if this can be reproduced (if it can, please open an issue in the Elasticsearch repository).

nicpenning commented 2 years ago

Hey y'all, I stumbled across this issue because today in production I have 1 agent that is stuck "Updating". (It has been like this for a little while. This device likes to roam on and off the network very frequently)

Rebooting, stopping/starting service or assigning to other policies will not get the status away from Updating.

Furthermore, the Agent is at 8.4.0 (Kibana and .\elastic-agent.exe status say so) and I can't upgrade it: the option is grayed out in some spots, and in the spots where it does let me try, the upgrade fails immediately. So I can't tell the health of this agent via the UI; however, .\elastic-agent.exe status reveals everything is healthy.

(screenshots attached)

I will be happy to share anything that might help the situation. I am sure removing and re-enrolling the agent will do the trick but I very much do not like doing that to fix these issues.

nicpenning commented 2 years ago

Additional detail: 8.4.1 on-premise stack, virtualized on modern Windows Server OS.

nicpenning commented 2 years ago

Further info:

When doing a search against .fleet-agents for the specific endpoint, here is the redacted result:

{
        "_index": ".fleet-agents-7",
        "_id": "redacted-b019-4c98-8e99-d99b5a76cb93",
        "_score": 5.936952,
        "_source": {
          "access_api_key_id": "redacted",
          "action_seq_no": [
            623
          ],
          "active": true,
          "agent": {
            "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
            "version": "8.4.0"
          },
          "enrolled_at": "2022-08-19T20:37:19Z",
          "local_metadata": {
            "elastic": {
              "agent": {
                "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
                "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
                "log_level": "info",
                "snapshot": false,
                "upgradeable": true,
                "version": "8.4.0"
              }
            },
            "host": {
              "architecture": "x86_64",
              "hostname": "redacted",
              "id": "redacted-5f69-411e-b954-e603dc586f26",
              "ip": [
                "169.254.53.207/16",
                "172.31.32.1/20",
                "169.254.73.42/16",
                "169.254.124.209/16",
                "192.168.209.1/24",
                "192.168.30.1/24",
                "redacted/25",
                "::1/128",
                "127.0.0.1/8",
                "172.25.240.1/20",
                "172.27.32.1/20",
                "172.29.208.1/20"
              ],
              "mac": [
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted",
                "redacted"
              ],
              "name": "redacted"
            },
            "os": {
              "family": "windows",
              "full": "Windows 10 Enterprise(10.0)",
              "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
              "name": "Windows 10 Enterprise",
              "platform": "windows",
              "version": "10.0"
            }
          },
          "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
          "type": "PERMANENT",
          "default_api_key_id": "redacted",
          "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
          "default_api_key": "redacted:redacted",
          "updated_at": "2022-09-28T13:17:42Z",
          "last_checkin_status": "online",
          "last_checkin": "2022-09-28T13:17:40Z",
          "policy_revision_idx": 31,
          "policy_coordinator_idx": 1,
          "upgrade_started_at": "2022-09-02T14:45:21.497Z",
          "upgraded_at": null
        }
}

Two things in this output stick out to me (however, I have no clue how these should read). The first is that the device is Windows 11 but the data shows Windows 10. The second, which I think might be the issue, is that the upgraded_at is set to null, even though it started the upgrade on 9/2. Perhaps the null in the upgraded_at is why it is shown stuck at updating?

Please let me know if this is helpful info or just annoying and I can buzz off :)

joshdover commented 2 years ago

Hi there @nicpenning, you do indeed seem to be encountering this same issue. I recommend trying this as a workaround; you can run it from Dev Tools in Kibana:

POST /api/fleet/agents/<id>/upgrade
{"force": true, "version": "8.4.1"}
nicpenning commented 2 years ago

So, I went ahead and updated upgraded_at to 2022-09-02T14:55:21.497Z and it fixed the stuck "Updating" status.

POST .fleet-agents/_update/3e9b94cb-b019-4c98-8e99-d99b5a76cb93
{
  "doc": {
    "access_api_key_id": "redacted",
    "action_seq_no": [
      623
    ],
    "active": true,
    "agent": {
      "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
      "version": "8.4.0"
    },
    "enrolled_at": "2022-08-19T20:37:19Z",
    "local_metadata": {
      "elastic": {
        "agent": {
          "build.original": "8.4.0 (build: 56a002dc5c96d3a79decbf28bb7994cf49f3a4da at 2022-08-19 22:35:28 +0000 UTC)",
          "id": "redacted-b019-4c98-8e99-d99b5a76cb93",
          "log_level": "info",
          "snapshot": false,
          "upgradeable": true,
          "version": "8.4.0"
        }
      },
      "host": {
        "architecture": "x86_64",
        "hostname": "redacted",
        "id": "redacted-5f69-411e-b954-e603dc586f26",
        "ip": [
          "169.254.53.207/16",
          "172.31.32.1/20",
          "169.254.73.42/16",
          "169.254.124.209/16",
          "192.168.209.1/24",
          "192.168.30.1/24",
          "redacted",
          "::1/128",
          "127.0.0.1/8",
          "172.25.240.1/20",
          "172.27.32.1/20",
          "172.29.208.1/20"
        ],
        "mac": [
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted",
          "redacted"
        ],
        "name": "redacted"
      },
      "os": {
        "family": "windows",
        "full": "Windows 10 Enterprise(10.0)",
        "kernel": "10.0.22000.978 (WinBuild.160101.0800)",
        "name": "Windows 10 Enterprise",
        "platform": "windows",
        "version": "10.0"
      }
    },
    "policy_id": "eb2a4e10-e2f3-11ea-8e88-e52736baa840",
    "type": "PERMANENT",
    "default_api_key_id": "redacted",
    "policy_output_permissions_hash": "5af4296cdba1c61fb98221f7996c8f80f6745cf00d521d8c3ecada41cefcc6c2",
    "default_api_key": "redacted:redacted",
    "updated_at": "2022-09-28T13:23:32Z",
    "last_checkin_status": "online",
    "last_checkin": "2022-09-28T13:23:31Z",
    "policy_revision_idx": 31,
    "policy_coordinator_idx": 1,
    "upgrade_started_at": "2022-09-02T14:45:21.497Z",
    "upgraded_at": "2022-09-02T14:55:21.497Z"
  }
}

(screenshot attached)

I will try the upgrade again.

nicpenning commented 2 years ago

I got the Upgrading Agent notification after I pushed the upgrade so that single field seemed to have been the culprit.

Furthermore, the upgrade went on to work. Not sure if any of this info helps your case, but I figured I would share. You have helped me take a look under the hood to see what could be broken, and a simple field update was a workaround. Hopefully this will be addressed to help improve the reliability of such awesome capabilities.

(screenshot attached)

jlind23 commented 2 years ago

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - https://github.com/elastic/kibana/issues/139704

joshdover commented 2 years ago

@joshdover @nicpenning then I assume that the following issue might resolve the UI display - elastic/kibana#139704

No, I don't think that will fix the issue. The issue is that Fleet Server (or maybe ES) is losing the writes that mark the upgrade as completed. This is why updating the upgraded_at field makes the agent healthy again in the UI.

We still need to get to the bottom of where and why these updates are not being persisted in Elasticsearch when the agent acks its upgrade.

amitkanfer commented 2 years ago

@nicpenning @blakerouse, I had a short Slack conversation with @tomcallahan and he suggested enabling audit logging in ES while this is happening. Did we try that?

dakrone commented 2 years ago

Better than audit logging would be the indexing slowlog: https://www.elastic.co/guide/en/elasticsearch/reference/8.4/index-modules-slowlog.html#index-slow-log

PUT /<your-index>/_settings
{
  "index.indexing.slowlog.threshold.index.warn": "0ms",
  "index.indexing.slowlog.source": true
}
joshdover commented 2 years ago

@pjbertels Could we try running that command on the .fleet-agents index before you run your next scale test? If you run into any privilege issues, let me know; we may need to create a dedicated user with system index access.

blakerouse commented 2 years ago

I spent the last 2 days building and running a test that creates 75k documents in the .fleet-agents index and simulates the upgrade ACK write into Elasticsearch using the Fleet Server bulker, while also running the checkin monitor for the same 75k agent documents. I tested with a local Elasticsearch, as well as a multi-node Elasticsearch in the cloud, using the .fleet-agents index directly to ensure that it got the same shard and replica settings that are used in production.

With all of that built and many, many runs of the test, I could never get it to fail. The agent documents were always updated correctly with no issue, even with checkin running at the same time. Even after adjusting the intervals to very low values to increase the write rate, I still never hit an issue.

While I was doing this work @pjbertels was doing more runs and keeping me up to date on the results. What was noticed is that with the change made in https://github.com/elastic/fleet-server/issues/1894 we get far fewer agents that fail to get their upgrade status set correctly. Most recently it was only 145 out of the 75k, which is a large improvement over the 5k we used to get. That change did have an effect, but we still have another issue that cannot be found in Fleet Server.

Looking at the agent documents from the latest deployment that @pjbertels had done, I did notice something very interesting: upgrade_status: completed was now being set, whereas before it would be stuck at upgrade_status: started. I was able to identify why this is now working: it is because of this change here - https://github.com/elastic/kibana/pull/141957

That change removed the writing of upgrade_status: started by Kibana, which means that Fleet Server set the value and it stayed. With that change it is now obvious what the issue is: Kibana is overwriting the document that Fleet Server is writing those values into.

The issue is not Fleet Server, the issue is Kibana. The order in which it writes to the .fleet-agents index and the .fleet-actions index is incorrect. Looking at the code it is clear that Kibana writes the action into .fleet-actions first, before updating all the .fleet-agents documents with upgrade_started_at: ${timestamp} and upgraded_at: null. Because of this ordering, Fleet Server starts sending the action as soon as it is written. It is so fast that the action is sent and quickly ACK'd, at which point Fleet Server sets the fields, but then Kibana comes back afterwards and overwrites them as if the action had not even started, when it had already completed.

https://github.com/elastic/kibana/blob/1289226cac76b55eaf839a502f4588c8790e190b/x-pack/plugins/fleet/server/services/agents/upgrade_action_runner.ts#L43

The upgradeBatch function is the culprit: you can see that createAgentAction comes before bulkUpdateAgents. It should be the other way around, so that the agent document is updated before it is even possible for the action to be sent by Fleet Server.
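
To illustrate the ordering problem concretely, here is a small deterministic sketch (not Kibana or Fleet Server code) that reproduces the effect against a scratch index, assuming a local Elasticsearch at localhost:9200: the ack-style write lands first, then the late Kibana-style write clobbers it, leaving the document looking stuck in updating.

    // race.go - deterministic illustration of the overwrite described above.
    // Index name, endpoint, and timestamps are assumptions for illustration.
    package main

    import (
        "bytes"
        "net/http"
    )

    func update(id, body string) error {
        req, err := http.NewRequest(http.MethodPost,
            "http://localhost:9200/myindex/_update/"+id+"?refresh=true",
            bytes.NewBufferString(body))
        if err != nil {
            return err
        }
        req.Header.Set("Content-Type", "application/json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return err
        }
        return resp.Body.Close()
    }

    func main() {
        // 1. The "Fleet Server" side handles the agent's ack and marks the upgrade complete.
        _ = update("1", `{"doc": {"upgrade_started_at": null, "upgraded_at": "2022-09-28T00:00:00Z", "upgrade_status": "completed"}, "doc_as_upsert": true}`)

        // 2. The "Kibana" side's bulk agent update arrives afterwards and resets
        //    the fields as if the upgrade had not started yet (the pre-fix behavior).
        _ = update("1", `{"doc": {"upgrade_started_at": "2022-09-28T00:00:01Z", "upgraded_at": null, "upgrade_status": "started"}}`)

        // GET /myindex/_doc/1 now shows upgraded_at: null, i.e. the agent
        // appears stuck in updating even though the upgrade completed.
    }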

joshdover commented 2 years ago

Big thanks, Blake.

@juliaElastic @nchaulet This sounds like the last bit to fix; it should be pretty low effort on the Kibana side to create the actions after updating the agents.

It makes sense that this would be more likely to happen in scale tests too, since in that case horde isn't even trying to download the artifact; it just does a HEAD to make sure it exists and then acks, so it can return really fast.

juliaElastic commented 2 years ago

Good catch! This PR changes the order: https://github.com/elastic/kibana/pull/142299

joshdover commented 2 years ago

In summary this issue will be resolved by:

joshdover commented 2 years ago

@juliaElastic It'd be good to get an isolated backport of that change to 8.4.x too in case there's another patch release

scunningham commented 2 years ago

Interesting things to think about: 1) What if a customer tries to upgrade the same set of agents at the same time from two different Kibanas. 2) What if Kibana crashes after setting some or all of the status fields, but before rolling out the action.

blakerouse commented 2 years ago

Going to keep this open until we get confirmation that all 75k upgrade during scale testing.

joshdover commented 2 years ago
  1. What if a customer tries to upgrade the same set of agents at the same time from two different Kibanas.

  2. What if Kibana crashes after setting some or all of the status fields, but before rolling out the action.

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index. Ideally, only a single Fleet Server instance writes to each agent document at a time, while the agent is connected to it. Kibana could interact with the agents by pushing new documents to the actions index, which Fleet Server then consumes and uses to update the agent state.

We would need a way to easily query agents based on their pending, delivered, and ack'd actions so we could determine if an agent has a pending upgrade and display this in the UI.

This wouldn't completely solve the problem though if you consider actions like unenrollment and policy reassignments. There are still potentially multiple actors that would want to update a single .fleet-agents document at the same time.

juliaElastic commented 2 years ago

It'd be good to get an isolated backport of that change to 8.4.x too in case there's another patch release

Raised https://github.com/elastic/kibana/pull/142364

AndersonQ commented 2 years ago

We may need to move away from a model where both Kibana and Fleet Server are writing to the .fleet-agents index.

I agree 100%, please let's avoid more than one application writing to the same document.

amitkanfer commented 2 years ago

please let's avoid more than one application writing to the same document.

+1

joshdover commented 2 years ago

FYI I'm working on a draft proposal that will address these issues.

pjbertels commented 2 years ago

Tests at 50K and 75K confirm we no longer have agents stuck in 'updating' at the end of an upgrade. Tested with image_tag:8.5.0-58624073-SNAPSHOT.

joshdover commented 2 years ago

Great news. @blakerouse I think we should go ahead and backport https://github.com/elastic/fleet-server/pull/1896 to 8.4 in case we have an additional 8.4 release.

amitkanfer commented 2 years ago

@nicpenning can you please email me at kanfer@elastic.co ?

nicpenning commented 2 years ago

FYI: Upgrading from 8.4.1 to 8.4.3 - out of the 15 we upgraded, we had 1 asset that got stuck updating and had a null value for upgraded_at.

Will this be resolved in 8.5.0 or was it supposed to be resolved in 8.4.3?

nicpenning commented 2 years ago

POST /api/fleet/agents/<id>/upgrade {"force": true, "version": "8.4.1"}

I tried this (except using 8.4.3), but I got back a no handler found error: { "error": "no handler found for uri [/api/fleet/agents/redacted/upgrade?pretty=true] and method [POST]" }

jlind23 commented 2 years ago

@nicpenning this will be resolved in 8.5.0 as it was not backported to 8.4.3

dmgeurts commented 1 year ago

I am still having the same issue upgrading agents from 8.6.0 to 8.6.1. Initially, most agents were stuck in updating; then, after a policy update, most returned to healthy, but some are still stubbornly showing updating even though the version listed is 8.6.1.

nicpenning commented 1 year ago

Confirmed we saw this in 8.6.0 -> 8.6.1.

2 out of 24 agents successfully upgraded; all others were "stuck" updating. I will note that this only occurred with scheduled upgrades, where we set the start time an hour into the future and the time frame to over an hour. Any upgrades set to execute immediately worked without any hesitation.

In the end, we had to force the upgrades via the API to get them upgraded. I believe, then, that there is an issue with the scheduled aspect of the upgrades.

dmgeurts commented 1 year ago

I updated the agents manually from the CLI, but that didn't resolve the updating status either. I'm not sure how to reset the status now; the agents appear happy.

How did you force things via the API?

nicpenning commented 1 year ago

https://github.com/elastic/kibana/issues/135539#issue-1290026809

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }
dmgeurts commented 1 year ago

elastic/kibana#135539 (comment)

POST kbn:/api/fleet/agents/<agent id>/upgrade
{ "force": true, "version": "8.6.1" }

This changed the status to unhealthy until I restarted the agent and then it went back to updating. I should stress that our agents are already upgraded to 8.6.1.

Agent log:

12:55:52.674 elastic_agent [elastic_agent][warn] Skipping addition to action-queue, issue gathering start time from action id af1e004c-cf91-4de1-96ed-477d74c4cc53: action has no start time
12:55:52.689 elastic_agent [elastic_agent][info] Upgrading agent
12:55:52.689 elastic_agent [elastic_agent][error] Unable to clean downloads before update
12:55:55.698 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz completed in 2 seconds @ 191.9MBps
12:55:55.705 elastic_agent [elastic_agent][info] download from https://internal.server.com/elk/beats/elastic-agent/elastic-agent-8.6.1-linux-x86_64.tar.gz.sha512 completed in Less than a second @ +InfYBps
12:55:57.770 elastic_agent [elastic_agent][error] Failed to unpack upgrade artifact
Showing entries until Feb 3, 12:55:57
nicpenning commented 1 year ago

Okay, the difference with my agents is that they were still reporting 8.5.3 and stuck in updating. After the force upgrade with the API, they upgraded with no issues and got out of the Updating state.

dmgeurts commented 1 year ago

I probably should have been a bit more patient, and not jumped the gun on manually updating the agents.

For the agents reporting to the fleet server I can unenroll and redo the installation, but having to do this for an increasing number of agents each time we upgrade is far from ideal. And as for the Fleet Server itself, this is not so easy; it's also stuck in updating.