Experienced this again the week of Dec 16th - Dec 20th (primarily on Dec 17th and 20th).
Here are some screenshots of what we observed:
Dec 17th
Dec 20th
Note the workers with 0 containers: these had been restarted/recreated by K8s, and while they successfully re-registered, they were not in a usable state.
One interesting detail: one node entered a read-only filesystem mode. We had to manually kill the node via the gcloud console.
Hey,
Regarding the following:
Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC.
do you remember if you've seen the error happening for long periods of time, and never recovering?
it turns out that we do have the proper diffing in place during the flow of reporting volumes and getting them marked as missing or not, e.g., during the ReportWorkerVolumes flow (initiated from the worker through the `volumesweeper`):
i. `missing_since` gets updated during the ReportWorkerVolumes flow; this is supposed to account for the case where the database has volumes that the worker doesn't:
github.com/concourse/concourse/atc/db.diff
/src/atc/db/container_repository.go:37 +0x4e
github.com/concourse/concourse/atc/db.(*volumeRepository).UpdateVolumesMissingSince
/src/atc/db/volume_repository.go:125 +0x9d5
github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
/src/atc/api/volumeserver/report.go:65 +0xc68
net/http.HandlerFunc.ServeHTTP
/usr/local/go/src/net/http/server.go:2007 +0x44
on the db, we can verify that it gets set:
concourse=# select handle, worker_name, missing_since from volumes;
handle | worker_name | missing_since
--------------------------------------+-------------+-------------------------------
2a2cfbb8-ae97-4ada-4e80-6d8d64031cb7 | worker | 2019-12-24 16:00:52.270141+00
89503487-3816-4bac-7b3c-b7109cc8b946 | worker | 2019-12-24 16:00:52.270141+00
(2 rows)
ii. volumes not known by the db get transitioned to `destroying`; on a later tick, `destroying` volumes actually get destroyed by the volume sweeper (in the worker itself). A rough SQL sketch of both steps follows the trace below:
github.com/concourse/concourse/atc/db.diff
/src/atc/db/container_repository.go:37 +0x4e
github.com/concourse/concourse/atc/db.(*volumeRepository).DestroyUnknownVolumes
/src/atc/db/volume_repository.go:577 +0x3d6
github.com/concourse/concourse/atc/api/volumeserver.(*Server).ReportWorkerVolumes
/src/atc/api/volumeserver/report.go:47 +0x859
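To make those two steps concrete, here's a rough SQL sketch of what they amount to on the db side. It's only illustrative: the handle values are made up, and any column beyond `handle`, `worker_name`, `missing_since`, and `state` that the real schema requires isn't shown.

```sql
-- step i (sketch): volumes the db knows about but the worker did not report
-- get their missing_since stamped.
UPDATE volumes
   SET missing_since = now()
 WHERE worker_name = 'worker'
   AND missing_since IS NULL
   AND handle NOT IN ('reported-handle-1', 'reported-handle-2');

-- step ii (sketch): volumes the worker reported but the db doesn't know about
-- get recorded in the 'destroying' state, so the worker-side volume sweeper
-- removes them on a later tick.
INSERT INTO volumes (handle, worker_name, state)
VALUES ('unknown-handle', 'worker', 'destroying');
```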
later, during gc:
i. volumes with `missing_since > threshold` get removed (literally, `DELETE`d)
ps.: by default, that threshold is 5min.
This means that as long as the worker comes back up, after 5m (or whatever time is configured for the installation) we should have the references cleaned up.
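And the gc step, in the same sketched form (the 5-minute threshold is the default and is configurable per installation):

```sql
-- gc step (sketch): volumes that have been missing for longer than the
-- configured grace period (5 minutes by default) are removed outright.
DELETE FROM volumes
 WHERE missing_since IS NOT NULL
   AND now() - missing_since > interval '5 minutes';
```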
To see this in practice, change docker-compose.yml to give the worker a fixed name and use a non-persistent directory for its volumes:
diff --git a/docker-compose.yml b/docker-compose.yml
index 400598ee0..cf8e83957 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -41,12 +41,10 @@ services:
- 7788:7788
stop_signal: SIGUSR2
environment:
+ CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+ CONCOURSE_BAGGAGECLAIM_DRIVER: naive
+ CONCOURSE_BIND_IP: 0.0.0.0
CONCOURSE_LOG_LEVEL: debug
CONCOURSE_TSA_HOST: web:2222
-
- # avoid using loopbacks
- CONCOURSE_BAGGAGECLAIM_DRIVER: overlay
-
- # so we can reach Garden/Baggageclaim for debugging
- CONCOURSE_BIND_IP: 0.0.0.0
- CONCOURSE_BAGGAGECLAIM_BIND_IP: 0.0.0.0
+ CONCOURSE_WORK_DIR: /tmp
+ CONCOURSE_NAME: worker
then run a task - this will result in volumes being created.
now, ungracefully remove the worker (`docker rm -f concourse_worker_1`), and then re-create it (`docker-compose up -d worker` would do it).
this will bring the worker back with a mismatch between its state (no volumes in the worker) and the db (volumes marked as being in the worker).
if you try to run the task again, that exact error will pop up, but after a few minutes, `web` will get rid of the extra volumes in the db.
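If you want to watch that happen during the repro, a query like this against the Concourse database should show `missing_since` being stamped on the stale rows and, roughly five minutes later, the rows disappearing (just a sketch; the `worker` name comes from the compose change above):

```sql
-- stale rows for the re-created worker: missing_since should get populated on
-- the next report tick, and the rows should be gone after the grace period.
SELECT handle, missing_since
  FROM volumes
 WHERE worker_name = 'worker';
```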
it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back
Yeah, that's because prior to [gc containers and vols present in the worker but not on the db], the worker would come back with a set of volumes that would differ from what the db had, and that'd cause problems (as we'd never gc vols that were in the worker but not in the db) - they'd accumulate forever. But it seems like, given that we now have better logic for dealing with the differences, we could think about not having the removal anymore (as the system would converge towards the right state in any case).
During a graceful shutdown, the worker retires: `preStop` has a `kill -s SIGUSR2 $concourse`, which leads to retire, which means that in the db the vols will go away - but not right away in the worker, as that takes a few ticks (it'll first become orphaned, then get marked for destruction, then get destroyed by the worker - AFAIR). By always removing them all during initialization, we can ensure that they would not persist forever in that pvc.
e.g.:
worker-1
vols: 1,2,3,4 (in the pvc: `vols/{1,2,3,4}`)
kubectl delete pod worker-1
t0 ==> kill -s SIGUSR2 worker-1 (preStop)
t1 ==> wait for the proc to die ...
t2 ==> worker-1 retires
t3 ==> container goes away
t4 ==> pvc stays around
vols/{1,2,3,4}
t5 ==> worker-1 gets scheduled to another node ...
t6 ==> pvc-worker-1 gets attached to worker-1
       vols/{1,2,3,4} are there
t7 ==> worker-1's initContainer gets started
       ==> cleans `vols/` from the pvc
t8 ==> worker-1 container starts
       ==> state matches the `db` (no vols here, no vols there)
but then, if there's an ungraceful shutdown (anything that does not go through `preStop`), the db will not be cleaned up with regards to that worker (as there'll never be a `kill -s SIGUSR2 $concourse` that makes it go through the retire process), and if it turns out to be a whole-pod shutdown, then the initContainer will indeed remove the contents, which will cause the mismatch:
worker-1
vols: 1,2,3,4 (in the pvc: `vols/{1,2,3,4}`)
t0 ==> destroy
As I've mentioned in the previous comment, the mismatch should be fine - within a few ticks, we'd see Concourse dealing with those volumes not existing in the worker by marking them with `missing_since`, and later, having them deleted.
We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following:
oh, actually, a `kubectl delete pod` would end up putting the worker through "the right termination process", i.e. going through `preStop`, which takes care of doing the `retire-worker` before termination. That would've cleaned up the database side first, and then, on re-initialization, cleaned up the pvc (which would make `vols_in_db=0` and `vols_in_the_worker=0`).
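A quick way to confirm the db side really did get cleaned up after such a graceful retire is a count per worker (a sketch; `worker-1` is just the example name from the timeline above):

```sql
-- after a graceful retire + re-initialization, this should report 0 volumes
-- for the freshly re-created worker.
SELECT count(*) AS vols_in_db
  FROM volumes
 WHERE worker_name = 'worker-1';
```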
We could look into modifying the hard eviction limits
hmmm, I don't think that's really the answer, as that's something that's up to the consumers of the Chart; even if we did remove those limits that we have set under `resources`, it'd not really take the problem away - the kubelet could still do a hard eviction based on node-configured values that we don't know about as pod spec authors (aside from any other form of ungraceful termination that will eventually happen).
e.g.:
kubectl proxy --port=8001
curl -sSL "http://localhost:8001/api/v1/nodes/gke-hush-house-workers-3-687beb5a-3fgz/proxy/configz" | jq '.kubeletconfig.evictionHard'
{
"memory.available": "100Mi",
"nodefs.available": "10%",
"nodefs.inodesFree": "5%",
"pid.available": "10%"
}
Having both [remount volumes on restart] and [gc containers and vols present in the worker but not on the db], it seems sensible to me to consider going the route of not removing the volumes on initialization in any case (this can be activated in the chart via `--set worker.cleanUpWorkDirOnStart=false`).
Given that we now diff both ways, and have the volumes properly mounted when coming up (in any case), I think we're safe to go without the initial removals - Concourse will take care of removing any leftovers on either side.
ps.: while I'm aware of the fixes for doing the proper mounting for overlay, I'm not entirely sure if everything would still be alright for btrfs in the case where loopback devices are used (although I think it's alright already).
@cirocosta I think we might have chatted about this previously, but my memory is failing me: workers coming back under the same name goes against the intent of the current worker lifecycle design, where the name is supposed to directly correspond to the lifecycle of both the processes and data (i.e. containers and volumes) managed by the worker. If they go away, the worker should just get a new name. These kinds of problems seem to be caused by not following that rule.
What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + `ephemeral: true`? With the chart becoming the preferred method to deploy Concourse, it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'. Sorry if this is reviving an old discussion, since I think it is. :sweat_smile: If you remember where we chatted about this previously feel free to just link it.
heey @vito,
These kinds of problems seem to be caused by not following that rule.
ooh, totally!
it seems like we may want to either adjust the lifecycle design more fundamentally or change how the chart works to 'fall in line'.
my impression is that with the way we nowadays perform the diffs in both directions, we're actually very close (if not already fine) to handling state persisting around 🤔 (I might be very wrong though lol)
What's the technical reason for the chart to prefer 'stable' names as opposed to just ephemeral names + ephemeral: true?
We've been using non-ephemeral names just because of the use of StatefulSets for our workers (see worker-statefulset.yaml):
`StatefulSets` are valuable for applications that require one or more of the
following:
- **Stable, unique network identifiers**.
- **Stable, persistent storage**.
- Ordered, graceful deployment and scaling.
- Ordered, automated rolling updates.
(from Using StatefulSets)
As you mentioned, it diverges from what we want in terms of both the stable naming across pod starts[1] and having the storage truly persistent (regardless of where the pod goes, the volume + its contents go together with it) - properties of StatefulSets.
But still, it gives us something that's very valuable: the ability to have templated persistent volume claims. This is what allows each pod to get a persistent volume attached to it. Without that, we'd end up consuming whatever storage is set up for ephemeral storage on that machine (in the case of GKE, e.g., that'd be the same storage used for the whole machine, including the storage for container images, other workloads, etc)[2].
Leveraging StatefulSets, then, we can declare that we want our workers to come with a 1TB disk and have that disk just for the worker pods, regardless of whatever is used for the kubelet and other workloads in the machine (i.e., dedicated IOPS for the worker). With a Deployment, that's not possible.
[1]: we never tweaked CONCOURSE_NAME in the chart in any form; it's using the hostname that gets generated by default (`$statefulset-$index`), but that could be changed, for sure.
[2]: that's partially true - it's possible to have a deployment bind-mount a local disk that's separate from the regular disk used by the kubelet and others; you'd only need to request that extra local disk and make sure you're the only one bind-mounting it, have prior knowledge of what the device name is, and also have it properly formatted, etc.
From what I understand, the best case would be to have PVCs (so that the worker pod can have all the IOPS and storage to itself, regardless of where it lands[1]), with pods being ephemeral, and the PVCs' lifecycle bound to the pods' lifecycle (when pods go away, PVCs would go away too, so that we don't keep any state around for pods that were terminated) - i.e., a blend of properties from Deployments and StatefulSets.
With the way that kubernetes is extensible (e.g., see the operator pattern), that's something that we could pursue - have a CRD like "EphemeralStatefulSet" and then manually control the lifecycle of pods and pvcs in a controller. Should we? I don't think so :sweat_smile:
While I agree that by making Concourse very smart we can end up with all sorts of edge cases and problems to deal with, shooting ourselves in the foot, I think the kind of resiliency we're talking about here is worth having baked in. Even in those cases of ephemeral names + ephemeral disk, we can get things like a container restart (where the pod does not get recreated and the container comes up faster than `web` would notice a stall) where we'd need to fix the state, or scenarios where we can't even have `ephemeral: true` given the nature of unreliable networks.
[1]: the IOPS part is partially true - depending on vCPU, the IOPS can be different
damn, sorry for those walls of text 😬
It does appear, based on our GC logic, that volumes in the DB but not on the worker should be cleaned up eventually (worker GC tick to report volumes + 5 min grace time + next volume GC tick).
Our hypothesis was that the workers being in a degraded state (showing 0 containers ~1-2 hrs after they appeared to have OOMed and been re-created by K8s) was caused by missing volumes. Our assumption was that since volumes couldn't be found (even base resources), no containers could be created and all builds on those workers would continue failing indefinitely.
However, since our GC should have been cleaning up the volumes from the DB, our assumption appears to be invalid.
If this were to occur again, we can investigate further to validate why the workers remain in an unusable state.
Concrete next steps when a worker appears to be degraded for the same reason again:
- is the `missing_since` flag being set? (see the query sketch below)
Slightly related: Research Linux OOM killer behavior for cgroups #20
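For that first check, a query along these lines against the Concourse database should answer it (a sketch; substitute the name of the degraded worker):

```sql
-- for a degraded worker, check whether its stale volume rows are actually
-- getting missing_since set (and are therefore candidates for gc).
SELECT handle, missing_since
  FROM volumes
 WHERE worker_name = 'worker-1'  -- replace with the degraded worker's name
 ORDER BY missing_since NULLS FIRST;
```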
We most likely saw the same problem happen to hush-house again on the v5.8 upgrade. We investigated it further and found out that it was caused by the volume collector failing to clean up missing volumes.
It was failing to clean up any missing volumes because of this error:
"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes""
which was caused by a child volume that points to a parent base resource type volume that is missing. We killed all the orphans and all is well.
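For anyone hitting the same constraint violation, something like the following should surface the offending rows (a sketch; it assumes the self-referencing `id`/`parent_id` columns implied by the `volumes_parent_id_fkey` constraint):

```sql
-- child volumes whose parent volume has gone missing: these are the rows that
-- block the DELETE with the volumes_parent_id_fkey violation.
SELECT child.handle, child.worker_name, parent.handle AS parent_handle
  FROM volumes child
  JOIN volumes parent ON child.parent_id = parent.id
 WHERE parent.missing_since IS NOT NULL;
```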
Seeing this again on Hush House today: 3 workers are experiencing what appears to be the same issue, where they are not able to clean up volumes due to
"pq: update or delete on table "volumes" violates foreign key constraint "volumes_parent_id_fkey" on table "volumes""
@clarafu mentioned that this has been addressed in v6.x.x; hence, we're not going to investigate this further. We're just retiring the workers to clean up state.
Related: https://github.com/pivotal/concourse-ops/issues/170
We have observed on Hush House that sometimes, a worker pod is ungracefully restarted by Kubernetes when memory consumption exceeds an internally-configured threshold: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior
Concourse doesn't know about this restart, and when the pod is recreated, it re-registers itself using the same name. However, when k8s recreates a pod in this way, it deletes the volumes; Concourse doesn't know that these volumes were deleted and still maintains references to them in ATC. The symptoms of this are that Hush House users will see error messages like this:
Short term pain reduction
In the short term, we should look at emitting more descriptive errors and adding some operational guidelines and documentation around how to manage this problem. We believe that restarting pods using kubectl won't fix the problem of ATC having an incorrect perception of workers' volumes; so we did the following. Running `prune` against a worker will remove it from the ATC DB.
Long term fixes?
We could look into modifying the hard eviction limits or seeing if there's a way we can add a lifecycle hook to fire off a retire-worker request or something to ATC, to try to keep ATC DB and the k8s cluster synced up.
We can also look into changing the behaviour of the init container so that volumes are not totally destroyed when restarts happen. ~We really only want this to be true if a worker is ephemeral, but in the case of long-lived workers, it would be worthwhile to see if we can relocate and reattach those volumes when a pod comes back.~ spoke with @kcmannem - it looks like this was a deliberate engineering decision made for the Helm chart design. Will have to chat with @cirocosta when he is back
Note: We are using anti-affinity so we only have one pod per node, but it looks like if that one pod's memory is too high it still gets "evicted".