NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Hibernate/resume doesn't work. #480

Closed shuox closed 1 year ago

shuox commented 1 year ago

NVIDIA Open GPU Kernel Modules Version

530.41.03

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Ubuntu 18.04.5 LTS

Kernel Release

4.15.0-147-generic #151-Ubuntu SMP Fri Jun 18 19:21:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Hardware: GPU

NVIDIA A10

Describe the bug

Hibernating the GPU and then resuming it through the driver's suspend interface fails.

To Reproduce

The steps to reproduce the issue:

ENV: /etc/modprobe.d/nvidia.conf:

    options nvidia NVreg_PreserveVideoMemoryAllocations=1
    options nvidia NVreg_EnableGpuFirmware=0
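
Before reproducing, it may help to confirm that both module options actually took effect. The sketch below is not part of the original report. It assumes the loaded driver lists its effective parameters as "Name: value" lines in /proc/driver/nvidia/params, and `param_value` is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# Assumption: the loaded driver exposes "Name: value" lines in this file.
PARAMS_FILE="${PARAMS_FILE:-/proc/driver/nvidia/params}"

# Hypothetical helper: print the value of one "Name: value" entry read from stdin.
param_value() {
    awk -F': *' -v key="$1" '$1 == key { print $2 }'
}

if [ -r "$PARAMS_FILE" ]; then
    # Report the two options set in /etc/modprobe.d/nvidia.conf.
    for name in PreserveVideoMemoryAllocations EnableGpuFirmware; do
        printf '%s=%s\n' "$name" "$(param_value "$name" < "$PARAMS_FILE")"
    done
else
    echo "cannot read $PARAMS_FILE (is the nvidia module loaded?)" >&2
fi
```

If the printed values differ from those in nvidia.conf, the options were probably not applied when the module loaded (for example, because the module was loaded from an initramfs that predates the config change).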

  1. Run a tensor2tensor training workload in a Docker container.
  2. echo hibernate > /proc/driver/nvidia/suspend
  3. echo 1 > /sys/devices/pci0000\:fe/0000\:fe\:00.0/0000\:ff\:00.0/reset
  4. echo resume > /proc/driver/nvidia/suspend

where 0000:ff:00.0 is the PCI address of the GPU.
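
For convenience, steps 2 to 4 can be collected into one script. This is only a sketch of the sequence described above, not something from the original report. It assumes the GPU is also reachable through the generic sysfs alias /sys/bus/pci/devices/<address>/reset (the report used the full topology path), and the `DRY_RUN` and `GPU_BDF` variables and both function names are hypothetical. By default it only prints the writes it would perform.

```shell
#!/bin/sh
# Dry run by default: prints each write instead of performing it.
# Run as root with DRY_RUN=0 while the training workload is active to really execute.
DRY_RUN="${DRY_RUN:-1}"
GPU_BDF="${GPU_BDF:-0000:ff:00.0}"        # GPU PCI address from the report
NV_SUSPEND=/proc/driver/nvidia/suspend
# Assumption: generic sysfs alias for the device instead of the full topology path.
GPU_RESET="/sys/bus/pci/devices/${GPU_BDF}/reset"

# Write value $1 to file $2, or just print the equivalent command in dry-run mode.
write_node() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'echo %s > %s\n' "$1" "$2"
    else
        printf '%s\n' "$1" > "$2"
    fi
}

# Steps 2-4 of the report: hibernate the driver, reset the device, resume.
reproduce() {
    write_node hibernate "$NV_SUSPEND" &&
    write_node 1 "$GPU_RESET" &&
    write_node resume "$NV_SUSPEND"
}

reproduce
```

Chaining the writes with `&&` stops the sequence at the first failing step, which makes it easier to tell whether the hibernate, the reset, or the resume is the one that goes wrong.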

The result: the tensor2tensor process hangs.

Bug Incidence

Always

nvidia-bug-report.log.gz

No bug report attached; the issue reproduces every time.

More Info

No response

dylanmtaylor commented 1 year ago

I think this is a duplicate of #472.

amrit1711 commented 1 year ago

This issue seems to be a duplicate of #472, and it would be better to track further progress in a single thread. I will go ahead and close this thread now.