Currently, the Launcher process just raises a RuntimeError if any agent fails. I think it would be more useful to re-raise the actual exception from the failed agent. Then the user can have more conditional control at the launcher level (e.g. deciding what to do next if there is an OutOfMemoryError vs. something else).
The only problem is the case where multiple agents fail: which exception do we raise then?
https://github.com/apoorvkh/torchrunx/blob/f081a00543bebe469ddae8a942a0930a45d2fe1a/src/torchrunx/launcher.py#L241-L253
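One possible answer to the multiple-failure question, as a rough sketch rather than a proposal for the actual torchrunx code: pick the exception deterministically (here, from the lowest-ranked failed agent) and attach the remaining failures to it so nothing is lost. The helper name `raise_agent_failure`, the `failures` mapping, and the `agent_rank` / `other_failures` attributes are all hypothetical, and the built-in `MemoryError` stands in for an out-of-memory error so the example runs without torch.

```python
from __future__ import annotations


def raise_agent_failure(failures: dict[int, BaseException]) -> None:
    """Re-raise the original exception from a failed agent (hypothetical helper).

    `failures` maps agent rank -> exception captured from that agent.
    """
    if not failures:
        return
    # Deterministic choice: the lowest-ranked failed agent wins.
    first_rank = min(failures)
    primary = failures[first_rank]
    # Keep the remaining failures reachable for debugging.
    primary.agent_rank = first_rank
    primary.other_failures = {
        rank: exc for rank, exc in failures.items() if rank != first_rank
    }
    raise primary


# Launcher-level handling: branch on the real exception type.
try:
    raise_agent_failure(
        {
            1: ValueError("bad config"),
            0: MemoryError("agent 0 ran out of memory"),  # stand-in for an OOM error
        }
    )
except MemoryError as e:
    handled = f"OOM on agent {e.agent_rank}; {len(e.other_failures)} other failure(s)"
except Exception as e:
    handled = f"other: {e!r}"
```

On Python 3.11+, an alternative would be to wrap all agent exceptions in an `ExceptionGroup` and let the user match with `except*`, which avoids having to pick a single "winner".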