cert-manager / csi-lib

A library for building CSI drivers that request certificates from cert-manager
Apache License 2.0
14 stars 13 forks source link

Driver won't recover if a volume does not contain a metadata file #58

Closed 7ing closed 9 months ago

7ing commented 10 months ago

Problem Statement

In a rare case, if a volume does not contain a metadata file, the driver won't start up (keep crashing) with manager.NewManagerOrDie. This is because the driver tries to ensure all volumes are accessible. If not, it will return an error that cause a panic. https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/storage/filesystem.go#L112-L122

Here is a sample error code we have seen:

panic: failed to start manager: listing existing volumes: open csi-data-dir/inmemfs/csi-04b4520253413b4a3120e28f454cc781c11a759627c25a8bbd3f536e8c1c2020/metadata.json: no such file or directory

goroutine 1 [running]:
github.com/cert-manager/csi-lib/manager.NewManagerOrDie({{0x20d1870, 0xc000490450}, 0x0, {0x20cb308, 0xc0004c5b80}, {0x20e4388, 0x3023568}, 0xc000490588, 0x0, {0x7ffdf920c895, ...}, ...})
    /go/pkg/mod/github.com/cert-manager/csi-lib@v0.5.0/manager/manager.go:237 +0xa5

This situation could happen when Pod is created --> NodePublishVolume called --> volume created --> OOM. Thus following defer function would never be called: https://github.com/cert-manager/csi-lib/blob/9811918c72abe4e60bb2c6303346460a804b12b5/driver/nodeserver.go#L56-L62

The Fix

To fix this issue, we may simply log / delete the volume and proceed the rest of jobs.

7ing commented 10 months ago

Another thought, why don't we use os.Stat to verify the file existence, instead of fs.Open and fs.Close call here?