aws-deadline / deadline-cloud-worker-agent

The AWS Deadline Cloud worker agent can be used to run a worker in an AWS Deadline Cloud fleet.
Apache License 2.0

fix: crash when host has multiple NVIDIA GPUs #435

Closed · jusiskin closed this 1 month ago

jusiskin commented 1 month ago

What was the problem/requirement? (What/Why)

  1. When running the worker agent on a host with multiple NVIDIA GPUs, it crashes. For example, running on a Windows g4dn.metal EC2 instance, we see the following in the worker agent logs:

    [2024-10-03 17:01:03,401][INFO    ] 👋 Worker Agent starting
    [2024-10-03 17:01:03,403][INFO    ] AgentInfo 
    Python Interpreter: C:\Program Files\Python310\pythonservice.exe
    Python Version: 3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
    Platform: win32
    Agent Version: 0.27.2
    Installed at: C:\Program Files\Python310\Lib\site-packages
    Running as user: deadline-worker
    Dependency versions installed:
        openjd.model: 0.4.4
        openjd.sessions: 0.8.2
        deadline.job_attachments: 0.48.8
    [2024-10-03 17:01:03,679][INFO    ] Number of GPUs: 8
    8
    8
    8
    8
    8
    8
    8
    [2024-10-03 17:01:03,680][CRITICAL] invalid literal for int() with base 10: '8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8\r\n8'
    [2024-10-03 17:01:03,687][INFO    ] Deadline Cloud telemetry is enabled.
    [2024-10-03 17:01:04,491][INFO    ] 🚪 Worker Agent exiting
  2. Even with the above issue fixed, there was no logic for aggregating GPU memory when there are multiple GPUs. The worker agent aims to be compliant with the amount.worker.gpu.memory host capability specified in OpenJD. Before this work, the specification was ambiguous about how a render management system should aggregate the memory of multiple GPUs into a single numeric value. This was recently clarified in OpenJobDescription/openjd-specifications#51, which now states (a toy illustration of the rule follows the quoted text):

    The lower bound of total memory provided by each GPU on the host. For example, if a host has one GPU with 4096 and one GPU with 2048, this value would be 2048. Units: MiB.
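    A minimal Python sketch of that aggregation rule, using the numbers from the spec excerpt (a toy illustration, not the worker agent's code):

    # Clarified amount.worker.gpu.memory semantics: report the minimum of
    # the per-GPU total memory values (MiB).
    per_gpu_total_memory_mib = [4096, 2048]
    assert min(per_gpu_total_memory_mib) == 2048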

What was the solution? (How)

  1. The worker agent ran nvidia-smi --query-gpu=count --format=csv,noheader in a subprocess, captured the output, and tried to parse it as an integer. The problem is that when there are multiple GPUs, nvidia-smi outputs the count once per GPU.

    To fix this, we pass the -i=0 argument so that nvidia-smi reports the count only once, for the first GPU.

  2. Modified the GPU memory detection to support multiple GPUs, aggregating by taking the minimum total memory across all GPUs (a sketch of both changes follows this list).
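As a rough illustration of the approach described above (not the worker agent's actual code: the function names are hypothetical, nvidia-smi is assumed to be on the PATH, and the -i=0 selector is passed here as separate arguments):

    import subprocess


    def _query_nvidia_smi(*args: str) -> str:
        # Run nvidia-smi in a subprocess and return its captured stdout as text.
        return subprocess.check_output(["nvidia-smi", *args], text=True)


    def get_gpu_count() -> int:
        # Restrict the query to GPU 0 (the -i flag) so that nvidia-smi prints
        # the count a single time instead of once per GPU.
        output = _query_nvidia_smi("--query-gpu=count", "--format=csv,noheader", "-i", "0")
        return int(output.strip())


    def get_min_gpu_memory_mib() -> int:
        # nvidia-smi prints one total-memory value per GPU (plain MiB numbers
        # with the nounits option); take the minimum to match the clarified
        # amount.worker.gpu.memory semantics.
        output = _query_nvidia_smi("--query-gpu=memory.total", "--format=csv,noheader,nounits")
        per_gpu_mib = [int(value) for value in output.split()]
        return min(per_gpu_mib)

On the g4dn.12xlarge instance from the test run below, these would be expected to return 4 and 15360 respectively.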

What is the impact of this change?

With this change, the worker agent will not crash when multiple NVIDIA GPUs are detected, and it will report the GPU count and the minimum total memory across all GPU accelerators on the worker host (as reported by nvidia-smi).

How was this change tested?

  1. Modified the unit tests to account for the logic change and confirmed they now pass
  2. Ran the modified code end-to-end on both a g4dn.12xlarge EC2 instance (with 4 GPUs) and a g5.xlarge EC2 instance (with a single GPU). Inspected the logs and confirmed the correct GPU count and memory are reported:

    On g4dn.12xlarge:

    [2024-10-09 17:31:15,652][INFO    ] Number of GPUs: 4
    [2024-10-09 17:31:15,714][INFO    ] Minimum total memory of all GPUs: 15360

    On g5.xlarge:

    [2024-10-09 17:26:40,913][INFO    ] Number of GPUs: 1
    [2024-10-09 17:26:40,971][INFO    ] Minimum total memory of all GPUs: 23028

Was this change documented?

No

Is this a breaking change?

No


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud