NVIDIA / go-nvml

Go Bindings for the NVIDIA Management Library (NVML)
Apache License 2.0
bugs for DeviceGetHandleByUUID #41

Open jiaozhentian opened 2 years ago

jiaozhentian commented 2 years ago

I wrote some code to get the status of a GPU, obtaining the device handle by UUID. Sometimes the call returns ERROR_NOT_FOUND (6), but not every time. Here is my code:

    ret := nvml.Init()
    if ret != nvml.SUCCESS {
        log.Printf("Failed to initialize NVML: %s\n", nvml.ErrorString(ret))
    }
    defer func() {
        ret := nvml.Shutdown()
        if ret != nvml.SUCCESS {
            log.Printf("Failed to shut down NVML: %s\n", nvml.ErrorString(ret))
        }
    }()
    device, ret := nvml.DeviceGetHandleByUUID(gpu_uuid)
    for ret != nvml.SUCCESS {
        log.Printf("Failed to get device handle: %s\n", nvml.ErrorString(ret))
        ret = nvml.Shutdown()
        time.Sleep(time.Second * 5)
        ret = nvml.Init()
        time.Sleep(time.Second * 1)
        device, ret = nvml.DeviceGetHandleByUUID(gpu_uuid)
    }
    memory, ret := device.GetMemoryInfo()
    if ret != nvml.SUCCESS {
        log.Printf("Failed to get device memory info: %s\n", nvml.ErrorString(ret))
    }

I tried to work around it by restarting the NVML connection in code (Shutdown followed by Init), but the error still occurs. However, if I keep the debug session running after the function returns and pass a UUID in via gRPC, DeviceGetHandleByUUID works normally, which is weird. Can anyone help me fix this bug?

jiaozhentian commented 2 years ago

I tried using nvml.DeviceGetHandleBySerial instead of nvml.DeviceGetHandleByUUID, and it works smoothly. I have no idea why the UUID lookup sometimes fails.

elezar commented 2 years ago

@jiaozhentian it may be that @klueska addressed this in https://github.com/NVIDIA/go-nvml/pull/48. Would you be able to try with the latest version?

alexbagirov commented 1 year ago

Hi. This problem still persists. Do you know any workarounds?