flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[BUG] Single Node Multiple GPU HorovodJob Failure with pyflyte-fast-execute #5800

Open hebiao064 opened 2 weeks ago

hebiao064 commented 2 weeks ago

Describe the bug

Bug Report: Single Node Multiple GPU HorovodJob Failure with pyflyte-fast-execute

Issue Description

When executing a Flyte workflow containing a multi-GPU HorovodJob using pyflyte-fast-execute, the job fails with a "File exists" error. However, single-GPU and multi-node (one GPU per node) configurations work as expected.
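For context, a minimal sketch of the kind of task configuration that hits this, assuming the flytekitplugins-kfmpi HorovodJob plugin; the field names (e.g. slots) and resource values are illustrative and may differ between plugin versions:

```python
# Hypothetical minimal repro: a single-node, multi-GPU Horovod task.
# Assumes flytekitplugins-kfmpi; values are placeholders, not a verified config.
import horovod.torch as hvd
from flytekit import Resources, task, workflow
from flytekitplugins.kfmpi import HorovodJob


@task(
    task_config=HorovodJob(slots=2),          # two Horovod processes on one node
    requests=Resources(gpu="2", mem="16Gi"),  # a single worker pod with two GPUs
)
def train() -> float:
    hvd.init()
    # ... per-process training loop, pinned to hvd.local_rank() ...
    return 0.0


@workflow
def wf() -> float:
    return train()
```

When this task is launched via pyflyte-fast-execute, both Horovod processes land on the same node, which is the scenario described above.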

Error Message

<stderr>:tar: ./main/resources/{redacted}: Cannot open: File exists

Current Behavior

Additional Information

Next Steps

  1. Investigate the file system operations during job execution
  2. Check for potential race conditions in resource allocation
  3. Review HorovodJob implementation for multi-GPU support
  4. Test with different GPU configurations to isolate the issue

Expected behavior


Single-node, multiple-GPU Horovod jobs should succeed when executed with pyflyte-fast-execute.

Additional context to reproduce

Root Cause

The issue stems from the HorovodJob chaining three commands sequentially:

  1. Horovod prefix command: horovodrun ...
  2. Pyflyte fast execute: pyflyte-fast-execute ...
  3. Pyflyte execute task: pyflyte-execute ...

This sequence causes a conflict: pyflyte-fast-execute is run by every Horovod process, so each process unpacks the fast-registration tar.gz onto the same node, and the concurrent extractions collide.
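This is not flytekit's actual code, but one way to illustrate (and avoid) the collision is to let only one process per node perform the extraction while the others wait on a sentinel file. Function and path names below are hypothetical.

```python
# Illustrative sketch only: guard the per-node archive extraction so that
# concurrent pyflyte-fast-execute processes do not race on the same files.
import os
import tarfile
import time


def extract_once(archive: str, dest: str, timeout: float = 300.0) -> None:
    os.makedirs(dest, exist_ok=True)
    done_marker = os.path.join(dest, ".extract_done")
    lock_dir = os.path.join(dest, ".extract_lock")
    try:
        # os.mkdir is atomic: exactly one process on the node wins the lock.
        os.mkdir(lock_dir)
    except FileExistsError:
        # Another process is (or was) extracting; wait for it to finish.
        deadline = time.time() + timeout
        while not os.path.exists(done_marker):
            if time.time() > deadline:
                raise TimeoutError(f"timed out waiting for extraction of {archive}")
            time.sleep(1)
        return
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    open(done_marker, "w").close()
```

With a guard like this, repeated extractions of the fast-registration archive on the same node would no longer step on each other, which is the failure mode the tar error points at.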

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 2 weeks ago

Thank you for opening your first issue here! 🛠

eapolinario commented 2 days ago

@hebiao064 , reproducing this is difficult, as you can imagine; machines that have multiple GPUs are not easy to come by.

Can you paste the rest of the stack trace? That might give us the next clue to look into this bug while we procure a machine to help in the investigation.