NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

[gfd] Add option to disable automatic cleanup of the features file on gpu-feature-discovery exit #796

Open belo4ya opened 1 month ago

belo4ya commented 1 month ago

Issue description

We use the labels exported by node-feature-discovery and gpu-feature-discovery to monitor GPU issues, including cases where the number of available GPUs on a node unexpectedly decreases. The condition we check is: Target Number == nvidia.com/gpu.count == Node Allocatable.

We have noticed that sometimes, after gpu-feature-discovery restarts, all the features it exports (the nvidia.com/* labels) disappear from the node for a period roughly equal to the nfd-worker sleepInterval (in our case, 1 minute). This causes false positives in our monitoring system.
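To make the false positive concrete, here is a minimal Go sketch of the kind of check described above. The helper `checkGPUCount` and its parameters are hypothetical and only illustrate the comparison; this is not our actual monitoring code. Only the label name `nvidia.com/gpu.count` comes from the cluster.

```go
package main

import (
	"fmt"
	"strconv"
)

// gpuCountLabel is the label exported by gpu-feature-discovery.
const gpuCountLabel = "nvidia.com/gpu.count"

// checkGPUCount is a hypothetical helper: it verifies that
// target == nvidia.com/gpu.count == allocatable and returns an error otherwise.
func checkGPUCount(target int, labels map[string]string, allocatable int) error {
	raw, ok := labels[gpuCountLabel]
	if !ok {
		// This is the branch hit while the labels are temporarily gone.
		return fmt.Errorf("label %s is missing", gpuCountLabel)
	}
	count, err := strconv.Atoi(raw)
	if err != nil {
		return fmt.Errorf("label %s=%q is not an integer: %w", gpuCountLabel, raw, err)
	}
	if target != count || count != allocatable {
		return fmt.Errorf("GPU count mismatch: target=%d label=%d allocatable=%d",
			target, count, allocatable)
	}
	return nil
}

func main() {
	// Normal state: the label is present and all three values agree.
	healthy := map[string]string{gpuCountLabel: "8"}
	fmt.Println("healthy node:", checkGPUCount(8, healthy, 8))

	// Shortly after a gpu-feature-discovery restart: the nvidia.com/* labels
	// have disappeared, so the check fails although no GPU was lost.
	afterRestart := map[string]string{}
	fmt.Println("after restart:", checkGPUCount(8, afterRestart, 8))
}
```

The second call fails even though the allocatable count is unchanged, which is the false alert we see.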

We found that this happens because, unless it is running in one-shot mode, gpu-feature-discovery deletes the features.d/gfd file before terminating (via the removeOutputFile function). nfd-worker then sees no GFD features until the file has been rewritten and its next scan runs.

This behavior is very inconvenient (and undesirable) for us, especially when updating the gpu-feature-discovery version in the cluster.

Feature request

I found that this behavior was added in this commit: https://github.com/NVIDIA/gpu-feature-discovery/commit/bc91c4aec84c2bc3e6da47789d6d0a0326330455. However, I did not find an associated issue justifying the need for it.

Could you please consider:

  1. Adding a flag and/or environment variable (e.g., --no-cleanup-on-exit) that disables the automatic cleanup before gpu-feature-discovery terminates.
  2. Or dropping the automatic cleanup on gpu-feature-discovery shutdown altogether.

An argument for option 2 could be that node-feature-discovery itself does not clean up on exit. Instead, it uses a prune-job.
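For option 1, a rough sketch of what I have in mind follows. It uses Go's standard `flag` package purely for illustration; the real CLI wiring in gpu-feature-discovery may look different. The flag name is the one suggested above, while the environment variable `GFD_NO_CLEANUP_ON_EXIT`, the `config` struct, and both helper functions are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// config holds only the two settings relevant to this sketch.
type config struct {
	oneshot         bool
	noCleanupOnExit bool
}

// parseConfig reads the hypothetical GFD_NO_CLEANUP_ON_EXIT variable as the
// default and lets the --no-cleanup-on-exit flag override it.
func parseConfig(args []string, getenv func(string) string) (config, error) {
	var cfg config
	envDefault := false
	if v := getenv("GFD_NO_CLEANUP_ON_EXIT"); v != "" {
		parsed, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid GFD_NO_CLEANUP_ON_EXIT=%q: %w", v, err)
		}
		envDefault = parsed
	}

	fs := flag.NewFlagSet("gpu-feature-discovery", flag.ContinueOnError)
	fs.BoolVar(&cfg.oneshot, "oneshot", false, "label once and exit")
	fs.BoolVar(&cfg.noCleanupOnExit, "no-cleanup-on-exit", envDefault,
		"keep the features file when the process exits")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	return cfg, nil
}

// shouldCleanupOnExit keeps today's default (cleanup unless one-shot) and
// adds the opt-out requested in this issue.
func shouldCleanupOnExit(cfg config) bool {
	return !cfg.oneshot && !cfg.noCleanupOnExit
}

func main() {
	cfg, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cleanup on exit:", shouldCleanupOnExit(cfg))
}
```

Keeping the default unchanged means existing deployments would behave as before, and only users who opt in would keep the file across restarts.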

belo4ya commented 1 month ago

@elezar, @klueska, @ArangoGutierrez, please take a look at this

elezar commented 2 weeks ago

Thanks @belo4ya, I have created #899 to add this option and we can continue this discussion there.

@ArangoGutierrez one thing that I noted is that we don't do any cleanup when the NodeFeatureAPI is used. How are labels removed in this case?