DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

NVIDIA-SMI failed on vGPU instance #1170

Open homework36 opened 5 months ago

homework36 commented 5 months ago

Not related to local or staging

Same issue as in #1161. In short, we get

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

despite a seemingly successful installation of everything related. What makes this troubling is that it does not happen immediately: it shows up some time later (potentially in the middle of a job), after we thought everything was running properly.

I just found out this has happened to all 5 instances I have tested for Rodan prod so far, on Ubuntu 20, Ubuntu 22, and Debian 11, with both vGPU flavors, including instances currently not ported to rodan2.simssa.ca (and thus not used by anyone). I suspect it is an issue on the Arbutus side, and I'll send them an email once I have collected all the information about it. Since they do not make their vGPU drivers public, there is not much we can do from our end.
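
Since the driver only dies after some uptime, it might help to log its health periodically so we can pin down exactly when (and after which event) it breaks. A minimal sketch; the log path and interval below are placeholder choices of mine, not part of our deployment:

import subprocess
import time
from datetime import datetime

LOG_FILE = "/var/log/nvidia-health.log"  # placeholder path
INTERVAL = 300                           # check every 5 minutes

def driver_ok():
    """Return True if nvidia-smi can still talk to the driver."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

while True:
    status = "OK" if driver_ok() else "DRIVER UNREACHABLE"
    with open(LOG_FILE, "a") as fh:
        fh.write("{} {}\n".format(datetime.now().isoformat(), status))
    time.sleep(INTERVAL)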

Things tested:

  1. restart NVIDIA service
    root@prod-rodan-slim:~# sudo systemctl restart nvidia-persistenced
    Failed to restart nvidia-persistenced.service: Unit nvidia-persistenced.service not found.
  2. check and reload NVIDIA modules
    root@prod-rodan-slim:~# lsmod | grep nvidia
    root@prod-rodan-slim:~# sudo modprobe nvidia
    modprobe: ERROR: could not insert 'nvidia': Exec format error
  3. detect NVIDIA GPU
    root@prod-rodan-slim:~# lspci | grep -i nvidia
    00:05.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)
  4. check system logs (the module_layout line is discussed right after this list)
    Jun 10 14:35:21 prod-rodan-slim kernel: nvidia: disagrees about version of symbol module_layout
    Jun 10 14:35:23 prod-rodan-slim systemd[1]: nvidia-gridd.service: Scheduled restart job, restart counter is at 12228.
    ░░ Automatic restarting of the unit nvidia-gridd.service has been scheduled, as the result for
    Jun 10 14:35:23 prod-rodan-slim systemd[1]: Stopped NVIDIA Grid Daemon.
    ░░ Subject: A stop job for unit nvidia-gridd.service has finished
    ░░ A stop job for unit nvidia-gridd.service has finished.
    Jun 10 14:35:23 prod-rodan-slim systemd[1]: Starting NVIDIA Grid Daemon...
    ░░ Subject: A start job for unit nvidia-gridd.service has begun execution
    ░░ A start job for unit nvidia-gridd.service has begun execution.
    Jun 10 14:35:23 prod-rodan-slim systemd[1]: Started NVIDIA Grid Daemon.
    ░░ Subject: A start job for unit nvidia-gridd.service has finished successfully
    ░░ A start job for unit nvidia-gridd.service has finished successfully.
    Jun 10 14:35:23 prod-rodan-slim nvidia-gridd[3840500]: Started (3840500)
    Jun 10 14:35:23 prod-rodan-slim nvidia-gridd[3840500]:  Failed to initialise RM client
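
For what it's worth, the Exec format error from modprobe plus the "disagrees about version of symbol module_layout" line usually mean that the nvidia kernel module on disk was built against a different kernel than the one currently running (for example after an unattended kernel upgrade), so it cannot be loaded until the vGPU driver is reinstalled for the new kernel. A quick check, assuming modinfo can still locate the module:

import subprocess

def cmd(args):
    return subprocess.run(args, capture_output=True, text=True).stdout.strip()

running_kernel = cmd(["uname", "-r"])
# vermagic records the kernel version the module was compiled against
module_vermagic = cmd(["modinfo", "-F", "vermagic", "nvidia"])

print("running kernel :", running_kernel)
print("nvidia vermagic:", module_vermagic)
if running_kernel and running_kernel not in module_vermagic:
    print("Mismatch: the module was built for a different kernel; the vGPU "
          "driver likely needs to be reinstalled against the running kernel.")

If that is the case, the driver would need to be reinstalled (or rebuilt, if it is DKMS-managed) whenever the kernel is updated.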

Testing the current prod server with a training job that needs the GPU: I used the pixel zip that works on staging and got this error message:

[2024-06-11 09:33:28,640: INFO/MainProcess] Received task: Training model for Patchwise Analysis of Music Document, Training[0008c93e-9424-4f2d-97ff-40db78d0b374]
[2024-06-11 09:33:29,151: INFO/ForkPoolWorker-4] started running the task!
[2024-06-11 09:33:55,560: INFO/ForkPoolWorker-4] Checking batch size
[2024-06-11 09:33:55,561: INFO/ForkPoolWorker-4] Image 1
[2024-06-11 09:33:56,345: INFO/ForkPoolWorker-4] Checking rgba PNG - Layer 2
[2024-06-11 09:33:56,552: INFO/ForkPoolWorker-4] Checking rgba PNG - Layer 0 (Background)
[2024-06-11 09:33:56,753: INFO/ForkPoolWorker-4] Checking rgba PNG - Layer 1
[2024-06-11 09:33:57,059: WARNING/ForkPoolWorker-4] Training model for Patchwise Analysis of Music Document, Training[0008c93e-9424-4f2d-97ff-40db78d0b374]: Creating data generators...
[2024-06-11 09:33:57,140: WARNING/ForkPoolWorker-4] Training model for Patchwise Analysis of Music Document, Training[0008c93e-9424-4f2d-97ff-40db78d0b374]: Finishing the Fast CM trainer job.
[2024-06-11 09:33:57,145: INFO/ForkPoolWorker-4] ran the task and the returned object is True
[2024-06-11 09:33:57,768: WARNING/ForkPoolWorker-4] The my_error_information method is not implemented properly (or not implemented at all). Exception:
[2024-06-11 09:33:57,769: WARNING/ForkPoolWorker-4] TypeError: 'NoneType' object is not subscriptable
[2024-06-11 09:33:57,769: WARNING/ForkPoolWorker-4] Using default sources for error information.
[2024-06-11 09:33:58,577: ERROR/ForkPoolWorker-4] Task Training model for Patchwise Analysis of Music Document, Training[0008c93e-9424-4f2d-97ff-40db78d0b374] raised unexpected: RuntimeError("The job did not produce the output file for Model 2.\n\n{'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('9343c6cd-16fb-47c6-9db1-3f178093958c'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/6ecbcfd6-6174-426d-8f83-9b60df44911d'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('a00ff6ad-3bb9-4063-888c-47cba17c889f'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/d2d9e84b-faa1-4787-96d7-0c32ebc29a01'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('898958bb-0d0e-4980-83a6-d2f3241ac337'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/86880587-d5f2-4619-a7ed-f635d8dca8ea'}], 'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('9ea6bc87-8a9d-4893-981b-829042da2a35'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/64cb7761-0c02-4cce-811a-3c8c13f683c5'}]}")
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/code/Rodan/rodan/jobs/base.py", line 843, in run
    ).format(opt_name, outputs)
RuntimeError: The job did not produce the output file for Model 2.

{'Model 2': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('9343c6cd-16fb-47c6-9db1-3f178093958c'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/6ecbcfd6-6174-426d-8f83-9b60df44911d'}], 'Model 1': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('a00ff6ad-3bb9-4063-888c-47cba17c889f'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/d2d9e84b-faa1-4787-96d7-0c32ebc29a01'}], 'Background Model': [{'resource_type': 'keras/model+hdf5', 'uuid': UUID('898958bb-0d0e-4980-83a6-d2f3241ac337'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/86880587-d5f2-4619-a7ed-f635d8dca8ea'}], 'Log File': [{'resource_type': 'text/plain', 'uuid': UUID('9ea6bc87-8a9d-4893-981b-829042da2a35'), 'is_list': False, 'resource_temp_path': '/tmp/tmpny1siph9/64cb7761-0c02-4cce-811a-3c8c13f683c5'}]}
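
For context, the RuntimeError above comes from the generic post-run check in rodan/jobs/base.py, which verifies that every declared output port actually had a file written to its temporary path. Roughly, as an illustration only (not the actual Rodan source):

import os

def check_outputs(outputs):
    # `outputs` maps output port names to lists of dicts, each with a
    # `resource_temp_path` where the job was expected to write its result.
    for opt_name, entries in outputs.items():
        for entry in entries:
            path = entry["resource_temp_path"]
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                raise RuntimeError(
                    "The job did not produce the output file for {0}.\n\n{1}".format(
                        opt_name, outputs
                    )
                )

So the missing Model 2 file is a symptom rather than the cause: the trainer logs "Finishing the Fast CM trainer job" less than a second after "Creating data generators...", which suggests training bailed out almost immediately without writing any model files, and the base job then reported the first missing output.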

However, we are able to access the GPU from within this container:

Python 3.7.5 (default, Dec  9 2021, 17:04:37)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-06-11 09:56:35.154146: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2024-06-11 09:58:25.540897: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2024-06-11 09:58:25.560575: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-06-11 09:58:25.561325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: GRID V100D-8C computeCapability: 7.0
coreClock: 1.38GHz coreCount: 80 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 836.37GiB/s
2024-06-11 09:58:25.561421: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
......
2024-06-11 09:58:25.754680: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2024-06-11 09:58:25.755317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
Num GPUs Available:  1

I ran some simple computations with TensorFlow and monitored the GPU usage; it seems to work fine.
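
I did not keep the exact snippet; it was along these lines (a sketch, assuming the container's TensorFlow 2.x API):

import tensorflow as tf

# A throwaway matrix multiplication, just to put load on the GPU
# while watching nvidia-smi in another shell.
with tf.device("/GPU:0"):
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    c = tf.matmul(a, b)

print("result shape:", c.shape)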

root@prod-rodan-slim2:/srv/webapps/Rodan# nvidia-smi
Tue Jun 11 14:22:04 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID V100D-8C       On   | 00000000:00:05.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   7269MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    891984      C   python3                          6709MiB |
+-----------------------------------------------------------------------------+

So the GPU itself is reachable here; the PACO failure might be a different issue.

Update: I just saw this line on the Compute Canada vGPU page, which I missed earlier:

The CUDA toolkit is not pre-installed but you can install it directly from NVIDIA or load it from the CVMFS software stack.

Not sure whether it is related, because our documentation does not ask for the CUDA toolkit, but the GPU-celery container's Dockerfile does require CUDA.
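
If we want to verify, here is a quick check for whether the toolkit (as opposed to just the driver) is present; it can be run both on the host and inside the GPU-celery container. The paths below are the usual CUDA defaults, not anything specific to our setup:

import shutil
from pathlib import Path

# The driver ships libcuda.so; the toolkit ships nvcc and the CUDA runtime (libcudart).
cuda_lib = Path("/usr/local/cuda/lib64")
checks = {
    "nvcc on PATH": shutil.which("nvcc") is not None,
    "/usr/local/cuda exists": Path("/usr/local/cuda").exists(),
    "libcudart under /usr/local/cuda/lib64": cuda_lib.is_dir() and any(cuda_lib.glob("libcudart.so*")),
}

for name, ok in checks.items():
    print("{}: {}".format(name, "yes" if ok else "no"))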

Unrelated fun fact: the GPU used on staging (a Tesla K80) came out in 2014 and is now worth $83.

homework36 commented 5 months ago

New instances work with other GPU jobs but not PACO training (#1181). The GPU is accessible.

homework36 commented 4 months ago

With the updated hypervisors, this issue has come back on our vGPU instances...

homework36 commented 4 months ago

fixed

homework36 commented 4 months ago

This issue came back on staging...