apoorvkh / torchrunx

Automatically initialize distributed PyTorch environments
https://torchrunx.readthedocs.io
MIT License

Propagate exceptions #59

Closed apoorvkh closed 2 days ago

apoorvkh commented 2 weeks ago

Currently, the Launcher process just raises a RuntimeError if any agent fails. I think it would be more useful to re-raise the actual exception from the agent. The user would then have more conditional control at the launcher level (e.g. deciding what to do next after an OutOfMemoryError versus some other error).

The only problem might be if multiple agents fail: then which exception do we raise?
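
To make the multiple-failure question concrete, here is a minimal sketch of one possible policy, not torchrunx's actual implementation. It assumes each agent reports back an optional exception. If all failing agents raised the same exception type, that exception is re-raised directly. Otherwise a single aggregate error carries all of them. The names `AgentResult`, `AgentFailedError`, and `raise_agent_failures` are hypothetical, and the built-in `MemoryError` stands in for `OutOfMemoryError` so the example runs without PyTorch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical record of what one agent sends back to the launcher."""

    hostname: str
    exception: BaseException | None = None


class AgentFailedError(RuntimeError):
    """Hypothetical aggregate raised when agents fail with differing exception types."""

    def __init__(self, failures: dict[str, BaseException]) -> None:
        self.failures = failures
        summary = ", ".join(
            f"{host}: {type(exc).__name__}" for host, exc in failures.items()
        )
        super().__init__(f"{len(failures)} agent(s) failed ({summary})")


def raise_agent_failures(results: list[AgentResult]) -> None:
    """Re-raise agent exceptions in the launcher; return normally if none failed."""
    failures = {r.hostname: r.exception for r in results if r.exception is not None}
    if not failures:
        return
    if len({type(exc) for exc in failures.values()}) == 1:
        # Every failing agent raised the same type: surface the first one as-is,
        # so callers can catch it with an ordinary `except` clause.
        raise next(iter(failures.values()))
    # Mixed failure types: keep all of them available on one aggregate error.
    raise AgentFailedError(failures)


if __name__ == "__main__":
    results = [
        AgentResult("node0"),
        AgentResult("node1", MemoryError("out of memory")),
    ]
    try:
        raise_agent_failures(results)
    except MemoryError:
        print("retrying with a smaller batch size")
    except AgentFailedError as err:
        print(f"giving up: {err}")
```

In a real launcher the exceptions would have to cross process (and possibly host) boundaries, so they would need to be picklable, and the original traceback would have to be sent separately, for example as a formatted string. On Python 3.11+ the built-in `ExceptionGroup` with `except*` could replace the custom aggregate class.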

https://github.com/apoorvkh/torchrunx/blob/f081a00543bebe469ddae8a942a0930a45d2fe1a/src/torchrunx/launcher.py#L241-L253

apoorvkh commented 2 weeks ago

https://pytorch.org/docs/stable/elastic/errors.html