cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

BOSH reports unresponsive agents after director restart and sends meltdown alerts #2527

Closed: max-soe closed this issue 3 months ago

max-soe commented 4 months ago

Hey everyone,

Describe the bug From time to time we see meltdown alerts in our dashboards. They only occur during updates of BOSH and are then sent out for nearly all deployments, but only for some of our BOSH directors and not in every update. After some research we think this depends on the restart time. After a restart, our dashboard shows that BOSH reports a lot of unresponsive agents. So our assumption is that this is the default state for all instances when a new BOSH director VM is deployed. Is this correct? If the health-monitor job then starts fast enough, it sees the unresponsive agent state for more and more instances and sends out a meltdown alert. What I am currently unsure about: is BOSH actually sending these deployments into meltdown, or is it only sending the alert?

Expected behavior Is it correct that instances have the unresponsive agent state right after startup of the BOSH director? And if we cannot change that, is it possible to start the health monitoring later to avoid such alerts?
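(A purely illustrative aside on the "start the health monitoring later" idea: the `alert_grace_period` knob in the sketch below is hypothetical and not an existing BOSH Health Monitor setting; it only shows the kind of startup grace period being asked for.)

```ruby
# Hypothetical sketch only: suppress "timed out" alerts for a short grace
# period after the health monitor boots, so agents can deliver a first
# heartbeat. `alert_grace_period` is NOT a real BOSH HM property.
require 'time'

class GracePeriodFilter
  def initialize(started_at:, alert_grace_period: 120)
    @started_at = started_at   # when the health monitor came up
    @grace      = alert_grace_period
  end

  def alert?(agent_timed_out, now: Time.now)
    agent_timed_out && (now - @started_at) > @grace
  end
end

filter = GracePeriodFilter.new(started_at: Time.parse('2024-06-04T00:28:59Z'))
# 90 seconds after startup is still inside the 120s grace window => no alert
puts filter.alert?(true, now: Time.parse('2024-06-04T00:30:29Z'))  # => false
```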

Logs

I, [2024-06-04T00:30:29.407472 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: cf_cells_1 has instances with timed out agents
I, [2024-06-04T00:30:29.407576 #7]  INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: cf_cells_1 has instances with timed out agents
I, [2024-06-04T00:30:29.408078 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 1: Skipping resurrection for instances: diego-cell/e3caa226-8935-418e-bc53-3db307b1f814, diego-cell/032e2dcb-3e6e-411e-9a1d-7006806267df, diego-cell/6b1b6b96-3c53-4746-9681-2067da598840, diego-cell/f7905975-ea2f-420d-a47b-23435552df4c, diego-cell/0282622a-d2f8-4d65-bbde-ac39df3ece3c, diego-cell/b2f7f233-f886-4583-8dff-2eb466a88feb, diego-cell/4c0dc3ae-171b-41ec-a049-2a4e095a0c44, diego-cell/816db6bb-796b-4f90-8743-19ec728c7130, diego-cell/67e20eeb-307f-4a2d-8a85-94c38aebf126, diego-cell/720170ca-88dc-41fb-8491-7db0aeb7304c, diego-cell/582bb3f7-7213-4673-a007-352c86fc19c5, diego-cell/0add597e-9c98-4534-9ad2-671bdd97f61c, diego-cell/d9d93ed3-8265-4d73-abca-60ccd61859df, diego-cell/c39c8507-2f2f-4bd0-a85c-019030df59c0, diego-cell/4cd09447-d55b-4d75-afa5-f3b17a3ef458, diego-cell/c61afa15-dbe2-4697-bbf4-a8bb9cf67d28, diego-cell/82a3a13d-5af3-4ecf-b0bc-62f098be02a7, diego-cell/72df252a-ffe6-412a-8958-56949add3fc6, diego-cell/547618b0-e6c6-4e30-98d7-74f4943cccfa, diego-cell/70f0131b-e72f-44cd-93d5-5fd20e20d25b, diego-cell/812b2eb4-e68f-4554-8b34-b52c541bcefe, diego-cell/11b7745e-c255-4093-b954-2fba6fe11466, diego-cell/b13b8a80-f604-4ef4-b341-ac1f6d3c0ae0, diego-cell/c00353c7-4f55-4eb1-8fd3-c52200cdf1ea, diego-cell/e03033c2-a874-43ef-a40c-c3702efe1764, diego-cell/f678c22a-d619-464a-a332-667146eb8339, diego-cell/55e19f72-686e-44d5-b80d-068a2cb2e03d, diego-cell/3dd07ee3-3bc4-4b0e-87e5-d44a48caefdf, diego-cell/7f87ed0b-1b62-45f3-8253-82449572b815, diego-cell/758fccb8-f262-4fb9-841b-314992d6cc54, diego-cell/2f279282-7737-4d16-bcf8-6a7e29936317, diego-cell/f85b8b4a-2c6b-46f0-9373-008ecc431f1c, diego-cell/528511f9-d61e-4d73-8787-fd585d581281, diego-cell/331c7212-5918-40f3-ab90-b65fe8d38d0f, diego-cell/549b8107-51d8-4103-ad04-9ef0a75a4f94, diego-cell/ddc728fb-2946-4562-9223-4c69b7aed5a1, diego-cell/58437757-d2f6-431b-b9a4-5b0fb2e0a8d6, diego-cell/797038b6-a795-43fd-b720-5be2ed3d066f, diego-cell/b1ff9459-75d4-454c-a646-71e027033e7a, diego-cell/3cca60cb-5e04-4591-84f3-9f37c038b04e, diego-cell/70227045-83fd-455b-89ac-f386125fec41; deployment: 'cf_cells_1'; 41 of 132 agents are unhealthy (31.1%)
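For readers following along: the "Skipping resurrection" line means the resurrector considers the deployment to be in meltdown because too large a share of its agents look unhealthy. A minimal sketch of that percentage check, using the numbers from the alert above; the 0.2 and 5 defaults are assumptions based on the documented `hm.resurrector.percent_threshold` and `hm.resurrector.minimum_down_jobs` properties, and the real plugin logic may differ:

```ruby
# Rough sketch of the meltdown check, assuming the documented resurrector
# defaults (percent_threshold: 0.2, minimum_down_jobs: 5); not the actual code.
def meltdown?(unhealthy, total, percent_threshold: 0.2, minimum_down_jobs: 5)
  return false if total.zero?
  unhealthy >= minimum_down_jobs && unhealthy.to_f / total > percent_threshold
end

puts meltdown?(41, 132)  # => true  (31.1% unhealthy, as in the alert above)
puts meltdown?(2, 132)   # => false (isolated failures would still be resurrected)
```

With roughly 31% of agents looking unhealthy, a 20% threshold is clearly crossed, which matches the severity 1 alert in the log.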
beyhan commented 4 months ago

@max-soe is there a bosh director version since which you observe this behaviour?

max-soe commented 4 months ago

No; as it is only a false-positive alert, nobody looked into it. But in the last month the number of alerts has increased. My assumption is that we meanwhile use faster and faster VM types, and because of that this happens more often.

jpalermo commented 4 months ago

I think the health monitor logs all the VM NATS updates. So you should be able to reconstruct a history of what happened and how it ended up in this state. Are the updates not being received/processed by the health monitor? Is it a timing issue?
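A rough way to do that reconstruction from the health monitor log, assuming heartbeat lines contain the agent ID and the word "heartbeat" (the pattern and file name are illustrative and need adjusting to the real log format):

```ruby
# Illustrative log scan: collect timestamps of heartbeat lines for one agent
# and print the gaps, to spot windows where no heartbeat was processed.
require 'time'

def heartbeat_times(log_path, agent_id)
  File.foreach(log_path).filter_map do |line|
    next unless line.include?(agent_id) && line =~ /heartbeat/i
    # Lines look like: "I, [2024-06-04T00:28:59.173218 #7]  INFO : ..."
    Time.parse(Regexp.last_match(1)) if line =~ /\[(\d{4}-\d{2}-\d{2}T[\d:.]+)/
  end
end

times = heartbeat_times('health_monitor.log', 'e35a914e-233a-44c2-a01a-bb52bdd7faa3')
times.each_cons(2) { |a, b| puts "gap: #{(b - a).round(1)}s" }  # gaps well above ~30s are suspicious
```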

max-soe commented 4 months ago

Yes, I have the full log, but it's too big for this thread... But I can summarize it.

  1. Start of health-monitoring after the bosh director is recreated:

    I, [2024-06-04T00:28:59.173218 #7]  INFO : Connected to NATS at 'nats://bosh.cf-eu10-003.internal:4222'
    I, [2024-06-04T00:28:59.173585 #7]  INFO : Logging delivery agent is running...
    I, [2024-06-04T00:28:59.173828 #7]  INFO : Event logger is running...
    I, [2024-06-04T00:28:59.174100 #7]  INFO : Resurrector is running...
    I, [2024-06-04T00:28:59.175149 #7]  INFO : HTTP server is starting on port 25923...
    I, [2024-06-04T00:28:59.201964 #7]  INFO : BOSH HealthMonitor 0.0.0 is running...
    I, [2024-06-04T00:28:59.202087 #7]  INFO : connection.graphite-connected
    I, [2024-06-04T00:28:59.761777 #7]  INFO : Resurrection config remains the same
  2. Found deployments and adding agents:

    I, [2024-06-04T00:28:59.802495 #7]  INFO : Found deployment 'app-autoscaler'
    I, [2024-06-04T00:28:59.881586 #7]  INFO : Adding agent e35a914e-233a-44c2-a01a-bb52bdd7faa3 (metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d) to app-autoscaler...
  3. Start analysing and reporting some alerts:

    I, [2024-06-04T00:30:29.179033 #7]  INFO : Analyzing agents...
    I, [2024-06-04T00:30:29.179254 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
    I, [2024-06-04T00:30:29.251761 #7]  INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: e35a914e-233a-44c2-a01a-bb52bdd7faa3 has timed out
  4. Meltdown alerts:

    I, [2024-06-04T00:30:29.256503 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
    I, [2024-06-04T00:30:29.256565 #7]  INFO : (Event logger) notifying director about event: Alert @ 2024-06-04 00:30:29 UTC, severity 2: app-autoscaler has instances with timed out agents
    I, [2024-06-04T00:30:29.308410 #7]  INFO : [ALERT] Alert @ 2024-06-04 00:30:29 UTC, severity 1: Skipping resurrection for instances: metricsforwarder/0d04ccc0-7a15-4e9d-9ec1-0f9c6eb7897d, metricsforwarder/fb8e7a9b-d7d0-47cd-bfd4-f1e59b880f35, metricsserver/fe5527ab-57fa-4c24-a457-acebe8e65edb, metricsgateway/bd0b06a0-180e-4a1b-8d3e-0588c5eeefe7, operator/d199c4db-5d05-456b-91c0-ed634d40bd07, operator/f40b3de1-6e01-4e8a-8a42-ab99336536f1, eventgenerator/7bb09c5b-a81f-47ac-bc3c-57ce20e25c54, scalingengine/dda1e1f7-3290-43eb-b605-1e040614966f; deployment: 'app-autoscaler'; 8 of 23 agents are unhealthy (34.8%)

If the issue does not occur, the reports are similar except for the meltdown alerts. We get timeout alerts on all landscapes, but they do not hit the threshold every time. As the other deployments, like app-autoscaler here, are not updated, I think this is a timing issue. But how can we fix that to avoid these false positives?
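A back-of-the-envelope reading of the timestamps in the summary above does point at a timing race, assuming agents heartbeat roughly every 30 seconds:

```ruby
# Timing sketch from the excerpts above: HM connected at 00:28:59 and ran its
# first "Analyzing agents..." pass at 00:30:29, a 90-second window.
require 'time'

hm_started    = Time.parse('2024-06-04T00:28:59Z')
first_analyze = Time.parse('2024-06-04T00:30:29Z')
heartbeat_interval = 30  # seconds; assumed agent heartbeat cadence

window = first_analyze - hm_started
puts "window before first analysis: #{window.to_i}s"                          # => 90s
puts "heartbeats expected per agent: ~#{(window / heartbeat_interval).floor}" # => ~3
# If none of those heartbeats were received and processed for an agent in that
# window, the agent still carries its initial timestamp and gets reported as
# timed out, which would explain the false-positive meltdown alerts.
```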

jpalermo commented 3 months ago

Working backward. To get the meltdown alert, the agents need to be timed out.

Agent objects, when they are created, are seeded with the current time as their updated_at value, which is then updated to the current time any time the agent heartbeats.

So why didn't the health monitor get a heartbeat from the agent before it was timed out? You can check the health monitor logs to see when it got heartbeats from the agent. Agents send heartbeats every 30 seconds, so you should be able to work forward or backward from other heartbeats in the logs to see what the health monitor was doing when it should have received it.
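A minimal sketch of that mechanism, with an assumed 60-second timeout (the actual configured agent timeout may differ):

```ruby
# Sketch of the described lifecycle: the record starts with the current time,
# every heartbeat refreshes it, and it counts as timed out once the timestamp
# is older than the timeout. The 60s timeout is an assumption for illustration.
require 'time'

class AgentRecord
  attr_reader :updated_at

  def initialize(now: Time.now)
    @updated_at = now            # seeded at creation, before any heartbeat
  end

  def process_heartbeat(now: Time.now)
    @updated_at = now            # each heartbeat pushes the timestamp forward
  end

  def timed_out?(timeout: 60, now: Time.now)
    (now - @updated_at) > timeout
  end
end

agent = AgentRecord.new(now: Time.parse('2024-06-04T00:28:59Z'))
puts agent.timed_out?(now: Time.parse('2024-06-04T00:30:29Z'))  # => true: 90s with no heartbeat seen
```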

Might also check the agent logs in this situation to see if there was any sort of error when sending the heartbeat.

max-soe commented 3 months ago

We cannot reproduce this issue anymore after migrating from bosh version 280.0.20 to 280.0.25. Maybe this was fixed by the metric server improvements. I will close this ticket for now.