Open hebiao064 opened 2 weeks ago
Thank you for opening your first issue here! 🛠
@hebiao064, reproducing this is difficult, as you can imagine: machines with multiple GPUs are not easy to come by.
Can you paste the rest of the stack trace? It might give us the next clue to look into while we procure a machine for the investigation.
Describe the bug
Bug Report: Single Node Multiple GPU HorovodJob Failure with pyflyte-fast-execute
Issue Description
When executing a Flyte workflow containing a Multiple GPU HorovodJob using `pyflyte-fast-execute`, the HorovodJob fails with a `File Exists` error. However, Single GPU or Multiple Node (each node with 1 GPU) configurations work as expected.
Error Message
Current Behavior
Additional Information
pyflyte-fast-execute
Next Steps
Expected behavior
Expected Behavior
Single Node Multiple GPU Horovod Job should succeed when executed with `pyflyte-fast-execute`.
Additional context to reproduce
Root Cause
The issue stems from the HorovodJob attempting to run three command parts sequentially:
1. `horovod run ...`
2. `pyflyte-fast-execute ...`
3. Pyflyte execute task: `pyflyte-execute ...`
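The second step is where several processes on one node would each unpack the same fast registration archive into the same location. As a rough illustration only, here is how a per-node guard could work: an exclusive file lock lets the first process extract while the others skip the step. This is not Flyte's or Horovod's actual code, and `extract_once`, `.extract.lock` and `.extract.done` are hypothetical names. The sketch assumes a POSIX system, since it relies on `fcntl.flock`.

```python
import fcntl
import os
import tarfile


def extract_once(archive_path: str, dest_dir: str) -> bool:
    """Unpack archive_path into dest_dir at most once.

    Returns True if this call performed the extraction, and False if an
    earlier caller had already completed it.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Hypothetical marker files, used only for this illustration.
    lock_path = os.path.join(dest_dir, ".extract.lock")
    done_path = os.path.join(dest_dir, ".extract.done")

    with open(lock_path, "w") as lock_file:
        # Block until this process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if os.path.exists(done_path):
                # Another process already finished unpacking.
                return False
            with tarfile.open(archive_path, "r:gz") as tar:
                tar.extractall(dest_dir)
            # Record completion so later callers skip the extraction.
            with open(done_path, "w"):
                pass
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

In this sketch each process would call `extract_once(archive, dest)` before running its task. Callers that lose the race wait on the lock until extraction finishes, then receive `False` and continue without unpacking again.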
This sequence appears to cause conflicts because the `pyflyte-fast-execute` command is run in every process, so the fast registration tar.gz is unzipped onto the same node multiple times.
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?