Closed 7ing closed 9 months ago
In a rare case, if a volume does not contain a metadata file, the driver won't start up (keep crashing) with manager.NewManagerOrDie. This is because the driver tries to ensure all volumes are accessible. If not, it will return an error that cause a panic. https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/storage/filesystem.go#L112-L122
manager.NewManagerOrDie
Here is a sample error code we have seen:
panic: failed to start manager: listing existing volumes: open csi-data-dir/inmemfs/csi-04b4520253413b4a3120e28f454cc781c11a759627c25a8bbd3f536e8c1c2020/metadata.json: no such file or directory goroutine 1 [running]: github.com/cert-manager/csi-lib/manager.NewManagerOrDie({{0x20d1870, 0xc000490450}, 0x0, {0x20cb308, 0xc0004c5b80}, {0x20e4388, 0x3023568}, 0xc000490588, 0x0, {0x7ffdf920c895, ...}, ...}) /go/pkg/mod/github.com/cert-manager/csi-lib@v0.5.0/manager/manager.go:237 +0xa5
This situation could happen when Pod is created --> NodePublishVolume called --> volume created --> OOM. Thus following defer function would never be called: https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/driver/nodeserver.go#L56-L62
Pod is created --> NodePublishVolume called --> volume created --> OOM
To fix this issue, we may simply log / delete the volume and proceed the rest of jobs.
Another thought, why don't we use os.Stat to verify the file existence, instead of fs.Open and fs.Close call here?
os.Stat
fs.Open
fs.Close
Problem Statement
In a rare case, if a volume does not contain a metadata file, the driver won't start up (keep crashing) with
manager.NewManagerOrDie
. This is because the driver tries to ensure all volumes are accessible. If not, it will return an error that cause a panic. https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/storage/filesystem.go#L112-L122Here is a sample error code we have seen:
This situation could happen when
Pod is created --> NodePublishVolume called --> volume created --> OOM
. Thus following defer function would never be called: https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/driver/nodeserver.go#L56-L62The Fix
To fix this issue, we may simply log / delete the volume and proceed the rest of jobs.