What was the problem/requirement? (What/Why)
When the worker agent runs on a host with multiple NVIDIA GPUs, it crashes. For example, running on a Windows g4dn.metal EC2 instance, we see the following in the worker agent logs:
[2024-10-03 17:01:03,401][INFO ] 👋 Worker Agent starting
[2024-10-03 17:01:03,403][INFO ] AgentInfo
Python Interpreter: C:\Program Files\Python310\pythonservice.exe
Python Version: 3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
Platform: win32
Agent Version: 0.27.2
Installed at: C:\Program Files\Python310\Lib\site-packages
Running as user: deadline-worker
Dependency versions installed:
openjd.model: 0.4.4
openjd.sessions: 0.8.2
deadline.job_attachments: 0.48.8
[2024-10-03 17:01:03,679][INFO ] Number of GPUs: 8
8
8
8
8
8
8
8
[2024-10-03 17:01:03,680][CRITICAL] invalid literal for int() with base 10: '8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8'
[2024-10-03 17:01:03,687][INFO ] Deadline Cloud telemetry is enabled.
[2024-10-03 17:01:04,491][INFO ] 🚪 Worker Agent exiting
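For context, the CRITICAL line above is the ValueError raised when the multi-line nvidia-smi output is handed to Python's int(). A quick reproduction of just that parse step, for illustration only:

```python
# Reproduces the parse failure from the log above: the raw nvidia-smi output on
# an 8-GPU host repeats the count once per GPU, which int() cannot parse.
int("8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8")
# ValueError: invalid literal for int() with base 10: '8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8'
```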
Even if the above crash were fixed, there was no logic for aggregating GPU memory when there are multiple GPUs. The worker agent aims to be compliant with the amount.worker.gpu.memory host capability specified in OpenJD. Before this work, the specification was ambiguous about how a render management system should aggregate GPU memory into a single numeric value when there are multiple GPUs. This was recently clarified in OpenJobDescription/openjd-specifications#51, which now states:
The lower bound of total memory provided by each GPU on the host. For example, if a host has one GPU with 4096 and one GPU with 2048, this value would be 2048. Units: MiB.
What was the solution? (How)
The worker agent ran nvidia-smi --query-gpu=count --format=csv,noheader in a subprocess, captured the output, and tried to parse it as an integer. The problem is that when there are multiple GPUs, nvidia-smi outputs the count once per GPU, producing multiple lines.
To fix this, we pass the -i=0 argument so that the count is reported only once, for the first GPU.
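A minimal sketch of that count query (illustrative only, not the agent's exact code; it restricts the query to GPU index 0 via nvidia-smi's -i flag):

```python
# Illustrative sketch only (not the agent's actual implementation): restrict the
# count query to GPU index 0 so nvidia-smi prints the count exactly once.
import subprocess

def get_gpu_count() -> int:
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=count", "--format=csv,noheader", "-i", "0"],
        text=True,
    )
    # Without restricting to a single GPU, the count is printed once per GPU
    # (e.g. eight "8" lines on a g4dn.metal), which int() cannot parse.
    return int(output.strip())
```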
Modified the GPU memory detection to support multiple GPUs, aggregating by taking the minimum of the total memory reported for each GPU.
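And a similar sketch of the memory aggregation (again illustrative; the real helper names in the agent may differ):

```python
# Illustrative sketch: query the total memory of every GPU (one line per GPU,
# in MiB thanks to "nounits") and aggregate by taking the minimum, matching the
# clarified OpenJD wording for amount.worker.gpu.memory.
import subprocess

def get_gpu_memory_mib() -> int:
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        text=True,
    )
    totals = [int(line) for line in output.splitlines() if line.strip()]
    return min(totals)
```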
What is the impact of this change?
The worker agent will no longer crash when multiple NVIDIA GPUs are detected, and will report the GPU count and the minimum GPU memory across all GPU accelerators on the worker host (as reported by nvidia-smi).
How was this change tested?
Modified the unit tests to account for the logic change and confirmed they now pass (a hypothetical sketch of such a test appears after the log excerpts below).
Ran the modified code end-to-end on both a g4dn.12xlarge EC2 instance (with 4 GPUs) and a g5.xlarge EC2 instance (with a single GPU). Inspected the logs and confirmed they report the correct GPU count and memory:
On g4dn.12xlarge:
[2024-10-09 17:31:15,652][INFO ] Number of GPUs: 4
[2024-10-09 17:31:15,714][INFO ] Minimum total memory of all GPUs: 15360
On g5.xlarge:
[2024-10-09 17:26:40,913][INFO ] Number of GPUs: 1
[2024-10-09 17:26:40,971][INFO ] Minimum total memory of all GPUs: 23028
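For illustration, a unit test along these lines could mock the nvidia-smi output for a multi-GPU host (hypothetical names; the agent's real test module and functions differ):

```python
# Hypothetical test shape for the aggregation sketch above; the agent's real
# tests and module paths differ. Mocks nvidia-smi output for a 4-GPU host.
from unittest import mock

def test_minimum_gpu_memory_across_multiple_gpus():
    fake_output = "15360\n15360\n15360\n15360\n"  # memory.total lines, in MiB
    with mock.patch("subprocess.check_output", return_value=fake_output):
        assert get_gpu_memory_mib() == 15360
```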
Was this change documented?
No
Is this a breaking change?
No
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.