aptenodytes-forsteri opened 6 months ago
I have experienced the same issue.
I mount only one bucket, but use multiple mount points for multiple paths. I wrote fewer than 20 files to the mounted folders (each file was less than 500 KB). Even so, this issue sometimes happened, and it caused my k8s job to fail because it waited too long.
It seemed that all files were written to GCS, but the gcs-fuse sidecar got stuck while handling SIGTERM.
Can running gcs-fuse as an init container (as described in https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/issues/168#issuecomment-2041323536) fix the problem?
Summary
I'm having an issue where, over time, I end up with many "stuck" pods whose main container has errored or exited while the sidecar is still running. This may have to do with having two buckets mounted, or with trying to write a lot of files to the out bucket.
Details
I'm running this version of the csi-driver:
To reproduce the issue, I have mounted two buckets, an "in" and an "out" bucket.
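Roughly, the pod spec looks like the sketch below (bucket names, image, mount paths, and the container command are placeholders, and the Workload Identity / service account setup is omitted):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-bucket-job
  annotations:
    gke-gcsfuse/volumes: "true"   # ask the CSI driver to inject the gcsfuse sidecar
spec:
  restartPolicy: Never
  containers:
  - name: main
    image: python:3.11-slim
    command: ["python", "/app/write_and_fail.py"]
    volumeMounts:
    - name: gcs-in
      mountPath: /data/in
    - name: gcs-out
      mountPath: /data/out
  volumes:
  - name: gcs-in
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-in-bucket    # placeholder "in" bucket
  - name: gcs-out
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: my-out-bucket   # placeholder "out" bucket
```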
The container runs a simple python script that writes to the out bucket and then errors, e.g.:
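A minimal sketch of that kind of script, where the /data/out mount path, file count, and file names are illustrative rather than the exact values from my job:

```python
import sys
from pathlib import Path

# Assumed mount path of the "out" bucket inside the container.
OUT_DIR = Path("/data/out")

def main():
    # Write a batch of small files through the gcsfuse mount.
    for i in range(50):
        (OUT_DIR / f"result-{i}.txt").write_text(f"payload {i}\n")
    # Simulate the job erroring out after the writes:
    # exits with status 1 and prints the message to stderr.
    sys.exit("simulated failure after writing output files")

if __name__ == "__main__":
    main()
```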
If I run this container enough times, I end up with an increasing number of "hanging" pods which never terminate.
Here is some of the log output from the sidecar container:
We see SIGTERM sent to the mount processes for both the input and output buckets, but only the input bucket's mount actually gets terminated.
Expected
The sidecar always terminates when the main container exits.
Actual
The sidecar sometimes stays running because the second mount process never exits.
Hunches
Perhaps there is an implicit assumption of a single bucket, and when two are mounted, some hard-coded shared state makes termination flaky?
Perhaps while gcsfuse is busy writing, it can't handle the SIGTERM?