concourse / hush-house

Concourse k8s-based environment
https://hush-house.pivotal.io

Investigate worker recovery for k8s ungracefully restarting pods out-of-band due to high memory consumption #94

Open · deniseyu opened this issue 4 years ago

deniseyu commented 4 years ago

Related: https://github.com/pivotal/concourse-ops/issues/170

We have observed on Hush House that sometimes, a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior

Concourse doesn't know about this restart, and when the pod is recreated, it re-registers itself using the same name. However, when k8s recreates a pod in this way, it deletes the volumes; Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC. The symptoms of this are that Hush House users will see error messages like this:

volume '7615ac07-7c88-4555-6888-9e87f49317bf' disappeared from worker 'workers-worker-10'

Short term pain reduction

In the short term, we should look at emitting more descriptive errors and adding some operational guidelines and documentation around how to manage this problem. We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following:

fly -t <target> land-worker -w <worker-name>
fly -t <target> prune-worker -w <worker-name>

Running prune-worker removes the worker from the ATC DB.
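
As an aside, the usual fly listings help identify the affected worker and the volumes the ATC still attributes to it (a sketch; the target name is a placeholder):

fly -t <target> workers    # look for the worker in a stalled/landing state
fly -t <target> volumes    # lists the volumes the ATC still thinks exist, per worker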

Long term fixes?

We could look into modifying the hard eviction limits, or investigating whether we can add a lifecycle hook that fires off a retire-worker request (or something similar) to the ATC, to keep the ATC DB and the k8s cluster in sync.

We can also look into changing the behaviour of the init container so that volumes are not totally destroyed when restarts happen. ~We really only want this to be true if a worker is ephemeral, but in the case of long-lived workers, it would be worthwhile to see if we can relocate and reattach those volumes when a pod comes back.~ spoke with @kcmannem - it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back

Note: We are using anti-affinity so we only have one pod per node, but it looks like if that one pod's memory is too high it still gets "evicted".

xtreme-sameer-vohra commented 4 years ago

Experienced this again the week of Dec 16 - Dec 20th (primarily on Dec 17th and 20th).

Here are some screenshots of what we observed.

Interesting to note: one node entered a read-only filesystem mode. We had to manually kill the node via the gcloud console.
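
For the record, a rough CLI equivalent of that manual recovery (a sketch only: the node name and zone are placeholders, and deleting the instance relies on GKE's managed instance group recreating it):

kubectl describe node <node-name>                            # inspect the node's conditions (read-only filesystem, NotReady, etc.)
gcloud compute instances delete <node-name> --zone <zone>    # the node pool's managed instance group brings up a replacement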

cirocosta commented 4 years ago

Hey,

With regards to the following:

Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC.

do you remember if you've seen the error happening for long periods of time, and never recovering?

it turns out that we do have the proper diffing in place during the flow of reporting volumes and getting them marked as missing or not.

e.g., during the ReportWorkerVolumes (initiated from the worker through the `volumesweeper`) flow:

    i. `missing_since` gets updated during the ReportWorkerVolumes flow

            this is supposed to account for the case where the database has volumes
            that the worker doesn't

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).UpdateVolumesMissingSince
                            /src/atc/db/volume_repository.go:125 +0x9d5
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:65 +0xc68
                    net/http.HandlerFunc.ServeHTTP
                            /usr/local/go/src/net/http/server.go:2007 +0x44

            on the db, we can verify that being set:

                    concourse=# select handle, worker_name, missing_since from volumes;

                                    handle                | worker_name |         missing_since
                    --------------------------------------+-------------+-------------------------------
                     2a2cfbb8-ae97-4ada-4e80-6d8d64031cb7 | worker      | 2019-12-24 16:00:52.270141+00
                     89503487-3816-4bac-7b3c-b7109cc8b946 | worker      | 2019-12-24 16:00:52.270141+00
                    (2 rows)

    ii. volumes not known by the db get transitioned to `destroying`

            on a later tick, `destroy`ing volumes get actually destroyed by
            the volume sweeper (in the worker itself)

                    github.com/concourse/concourse/atc/db.diff
                            /src/atc/db/container_repository.go:37 +0x4e
                    github.com/concourse/concourse/atc/db.(*volumeRepository).DestroyUnknownVolumes
                            /src/atc/db/volume_repository.go:577 +0x3d6
                    github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
                            /src/atc/api/volumeserver/report.go:47 +0x859

later, during gc:

    i. volumes with `missing_since > threshold` get removed
            (literally, `DELETE`d)

ps.: by default, that threshold is 5min.

This means that as long as the worker comes back up, the stale references should get cleaned up after 5m (or whatever threshold is configured for the installation).
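
As a sanity check against a live database, something like the following (a sketch: the connection string is a placeholder, and the column names are the ones shown in the output above) lists the rows that have crossed the default threshold and are due to be DELETEd on the next gc tick:

psql "$ATC_DATABASE_URL" -c "
  SELECT handle, worker_name, missing_since
    FROM volumes
   WHERE missing_since IS NOT NULL
     AND missing_since < NOW() - INTERVAL '5 minutes';"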

To see this in practice, change docker-compose.yml so that the worker has a fixed name and uses a non-persistent directory for its volumes:

diff --git a/docker-compose.yml b/docker-compose.yml
index 400598ee0..cf8e83957 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -41,12 +41,10 @@ services:
     - 7788:7788
     stop_signal: SIGUSR2
     environment:
+      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_BAGGAGECLAIM_DRIVER: naive
+      CONCOURSE_BIND_IP: 0.0.0.0
       CONCOURSE_LOG_LEVEL: debug
       CONCOURSE_TSA_HOST: web:2222
-
-      # avoid using loopbacks
-      CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
-
-      # so we can reach Garden/Baggageclaim for debugging
-      CONCOURSE_BIND_IP: 0.0.0.0
-      CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+      CONCOURSE_WORK_DIR: /tmp
+      CONCOURSE_NAME: worker

then run a task; this will result in volumes being created.

now, ungracefully remove the worker (docker rm -f concourse_worker_1), and then re-create it (docker-compose up -d worker would do it).

this will bring the worker back with a mismatch between its state (no volumes in the worker) and the db (volumes marked as being in the worker).

if you try to run the task again, that exact error will pop up, but after a few minutes, web will get rid of the extra volumes in the db.
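
Putting the reproduction together as a script (a sketch; the fly target and task config are placeholders, the rest are the commands mentioned above):

docker-compose up -d                    # web + worker, with the diff above applied
fly -t local execute -c task.yml        # run any task so that volumes get created

docker rm -f concourse_worker_1         # ungracefully remove the worker
docker-compose up -d worker             # bring it back under the same name

fly -t local execute -c task.yml        # fails with the "volume ... disappeared" error
# after ~5m (the default missing_since threshold), web deletes the stale rows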

cirocosta commented 4 years ago

it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back

Yeah, that's because prior to [gc containers and vols present in the worker but not on the db], the worker would come back with a set of volumes that differed from what the db had, and that'd cause problems: we'd never gc vols that were in the worker but not in the db, so they'd accumulate forever. Given that we now have better logic for dealing with the differences, we could think about not having the removal anymore (the system would converge towards the right state in any case).

Some more context on the shutdown behavior

During a graceful shutdown, the worker retires:

by always removing all of the volumes during initialization, we can ensure that they don't persist forever in that pvc

e.g.:

    worker-1
        vols: 1,2,3,4   (in the pvc: `vols/{1,2,3,4}`)

kubectl delete pod worker-1

t0      ==> kill -s SIGUSR2 worker-1 (preStop)
t1      ==> wait for the proc to die ...

t2  ==> worker-1 retires
t3      ==> container goes away
t4      ==> pvc stays around
            vols/{1,2,3,4}

t5  worker-1 gets scheduled to another node ...
t6      pvc-worker-1 gets attached to worker-1
            vols/{1,2,3,4} are there

t7  worker-1's initContainer gets started
        ==> cleans `vols/` from the pvc

t8  worker-1 container starts
        ==> state matches the `db` (no vols here, no vols there)

but then, if there's an ungraceful shutdown (anything that does not go through preStop), the db will not be cleaned up with regard to that worker (there'll never be a kill -s SIGUSR2 $concourse that makes it go through the retire process), and if it turns out to be a whole-pod shutdown, the initContainer will indeed remove the contents, which will cause the mismatch:

    worker-1
        vols: 1,2,3,4   (in the pvc: `vols/{1,2,3,4}`)

    t0  ==> destroy

As I've mentioned in the previous comment, the mismatch should be fine - within a few ticks, Concourse would deal with those volumes not existing in the worker by marking them with missing_since and, later, deleting them.

cirocosta commented 4 years ago

We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following:

oh, actually, a kubectl delete pod would end up putting the worker through "the right termination process": it goes through preStop, which takes care of running retire-worker before termination. That would've cleaned up the database side first, and then, on re-initialization, cleaned up the pvc (leaving vols_in_db=0 and vols_in_the_worker=0)
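
Put differently, the graceful path can be exercised directly (a sketch; the namespace and fly target are placeholders, the pod name is the one from the error above):

kubectl delete pod workers-worker-10 -n <namespace>   # triggers preStop: SIGUSR2, i.e. retire-worker before termination
fly -t <target> workers                               # the worker should drop off the list once retired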


We could look into modifying the hard eviction limits

hmmm I don't think that's really the answer, as that's something that's up to the consumers of the Chart; even if we did remove the limits we have set under resources, it wouldn't really take the problem away - the kubelet could still do a hard eviction based on node-configured values that we, as pod spec authors, don't know about (aside from any other form of ungraceful termination that will eventually happen).

e.g.:

kubectl proxy --port=8001
curl -sSL "http://localhost:8001/api/v1/nodes/gke-hush-house-workers-3-687beb5a-3fgz/proxy/configz" | jq '.kubeletconfig.evictionHard'
{
  "memory.available": "100Mi",
  "nodefs.available": "10%",
  "nodefs.inodesFree": "5%",
  "pid.available": "10%"
}

cirocosta commented 4 years ago

tl;dr

Having both [remount volumes on restart] and [gc containers and vols present in the worker but not on the db] in place, it seems sensible to me to consider going the route of not removing the volumes on initialization at all.

(this can be activated in the chart via --set worker.cleanUpWorkDirOnStart=false)

Given that we now diff both ways and have the volumes properly mounted when coming up (in any case), I think we're safe to go without the initial removal - Concourse will take care of removing any leftovers on either side.
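
Roughly, for a chart-based installation that would look like this (a sketch: the release and chart references are placeholders; only the worker.cleanUpWorkDirOnStart key comes from the comment above):

helm upgrade <release> <concourse-chart> \
  --reuse-values \
  --set worker.cleanUpWorkDirOnStart=false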

ps.: while I'm aware of the fixes for doing the proper mounting for overlay, I'm not entirely sure if everything would still be alright for btrfs in the case where loopback devices are used (although I think it's alright already)

vito commented 4 years ago

@cirocosta I think we might have talked about this previously, but my memory is failing me: workers coming back under the same name goes against the intent of the current worker lifecycle design, where the name is supposed to correspond directly to the lifecycle of both the processes and data (i.e. containers and volumes) managed by the worker. If those go away, the worker should just get a new name. These kinds of problems seem to be caused by not following that rule.

What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + ephemeral: true? With the chart becoming the preferred method to deploy Concourse, it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'. Sorry if this is reviving an old discussion, since I think it is. :sweat_smile: If you remember where we chatted about this previously feel free to just link it.

cirocosta commented 4 years ago

heey @vito,

These kinds of problems seem to be caused by not following that rule.

ooh, totally!

it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'.

my impression is that with the way we nowadays perform the diffs in both directions, we're actually very close (if not already fine) to handling state that sticks around 🤔 (I might be very wrong though lol)


What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + ephemeral: true?

We've been using non-ephemeral names just because of the use of StatefulSets for our workers (see worker-statefulset.yaml):

`StatefulSets` are valuable for applications that require one or more of the
following:

- **Stable, unique network identifiers**.
- **Stable, persistent storage**.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.

(from Using StatefulSets)

As you mentioned, it diverges from what we want in terms of both the stable naming across pod starts [1] and having the storage truly persistent (regardless of where the pod goes, the volume and its contents go together with it) - properties of StatefulSets.

But, still, it gives us something that's very valuable: the ability to have templated persistent volume claims. This is what allows each pod to get a persistent volume attached to it. Without that, we'd end up consuming whatever storage is set up for ephemeral storage in that machine (in the case of GKE, e.g., that'd be the same storage used for the whole machine, including the storage for container images, other workloads, etc) [2].

Leveraging StatefulSets then, we can declare that we want our workers to come with a 1TB disk and have that disk just for the worker pods, regardless of whatever is used for the kubelet and other workloads in the machine (i.e., dedicated IOPS for the worker). With a Deployment, that's not possible.



[1]: we never tweaked CONCOURSE_NAME in the chart in any form; it's using the hostname that gets generated by default ($statefulset-$index), but that could be changed, for sure.

[2]: that's partially true - it's possible to have a Deployment bind-mount a local disk that's separate from the regular disk used by the kubelet and other workloads; you'd just need to request that extra local disk, make sure you're the only one bind-mounting it, know in advance what the device name is, have it properly formatted, etc.

cirocosta commented 4 years ago

From what I understand, the best case would be to have PVCs (so that the worker pod can have all the IOPS and storage to itself, regardless of where it lands [1]), with pods being ephemeral and the PVCs' lifecycle bound to the pods' lifecycle (when pods go away, PVCs go away too, so that we don't keep any state around for pods that were terminated) - i.e., a blend of properties from Deployments and StatefulSets.

With the way that kubernetes is extensible (e.g., see the operator pattern), that's something we could pursue - have a CRD like "EphemeralStatefulSet" and then manually control the lifecycle of pods and pvcs in a controller. should we? I don't think so :sweat_smile:

While I agree that by making Concourse very smart we can end up with all sorts of edge cases and problems to deal with, shooting ourselves in the foot, I think the kind of resiliency we're talking about here is worth having baked in. Even in the case of ephemeral names + ephemeral disk, we can get things like a container restart (where the pod does not get recreated and the container comes back faster than web would notice a stall) where we'd need to fix up the state, or scenarios where we can't even have ephemeral: true given the nature of unreliable networks.


[1]: the IOPS part is only partially true - depending on the vCPU count, the IOPS can differ


damn, sorry for those walls of text 😬

xtreme-sameer-vohra commented 4 years ago

It does appear, based on our GC logic, that volumes in the DB but not on the worker should be cleaned up eventually (worker GC tick to report volumes + 5 min grace time + next volume GC tick).

Our hypothesis was that the workers were in a degraded state (showing 0 containers ~1-2 hrs after they appeared to have OOMed and been re-created by K8s) because of the missing volumes. Our assumption was that since volumes couldn't be found (even base resources), no containers could be created and all builds on those workers would keep failing indefinitely.

However, since our GC should have been cleaning up the volumes from the DB, our assumption appears to be invalid.

If this were to occur again, we can investigate further to validate why the workers remain in an unusable state.

Concrete next steps when a worker appears to be degraded for the same reason again:

cirocosta commented 4 years ago

slightly related: Research Linux OOM killer behavior for cgroups #20

clarafu commented 4 years ago

We most likely saw the same problem happen on hush-house again during the v5.8 upgrade. We investigated further and found that it was caused by the volume collector failing to clean up missing volumes.

It was failing to clean up any missing volumes because of this error: `pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes"`, which was caused by a child volume pointing to a parent base resource type volume that is itself missing. We killed all the orphans and all is well.
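
For reference, a query along these lines (a sketch: the id/parent_id columns are inferred from the constraint name, so double-check the schema before deleting anything) is how one could list the child volumes that block that DELETE:

psql "$ATC_DATABASE_URL" -c "
  SELECT child.handle, child.worker_name
    FROM volumes child
    JOIN volumes parent ON child.parent_id = parent.id
   WHERE parent.missing_since IS NOT NULL;"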

xtreme-sameer-vohra commented 4 years ago

Seeing this again on Hush House today. 3 workers are experiencing what appears to be the same issue, where they are not able to clean up volumes due to:

"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes""

@clarafu mentioned that this has been addressed in v6.x.x, so we're not going to investigate further; we're just retiring the workers to clean up state.