beacon-biosignals / Ray.jl

Julia API for Ray

Allow users to set `max_retries` when submitting a task #213

Closed kleinschmidt closed 1 year ago

kleinschmidt commented 1 year ago

We currently hardcode this as 0 here:

https://github.com/beacon-biosignals/Ray.jl/blob/836b43b580faae468ffbd15ac66cb54651e1a714/build/wrapper.cc#L174

I think it's a simple matter of passing an `Int32` through in `submit_task`.
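A rough sketch of what that pass-through might look like on the C++ side, assuming a stand-in options struct. `TaskOptionsSketch` and `make_task_options` are hypothetical names for illustration only, not the actual contents of `wrapper.cc` or Ray's API:

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for the options the wrapper hands to the core worker.
// The real wrapper builds Ray's own options type; this only models the one field.
struct TaskOptionsSketch {
    std::string name;
    int32_t max_retries;
};

// Instead of fixing max_retries at 0 inside the wrapper, accept it from the
// caller. Defaulting to 0 keeps the current behaviour for existing call sites.
TaskOptionsSketch make_task_options(const std::string &name,
                                    int32_t max_retries = 0) {
    return TaskOptionsSketch{name, max_retries};
}
```

The default of 0 is a design assumption here: callers that don't pass anything would see no change.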

From digging into the actual Ray source a bit, I think the other apparently relevant task options (`retry_exceptions` and `serialized_retry_exception_allowlist`) are for retrying application errors, which we don't want to handle now. I'm not 100% sure about `retry_exceptions`, but it seems to be checked only when the allowlist is consulted in the raylet code...

https://github.com/beacon-biosignals/ray/blob/015518473a40997c1ee1591c24c65483377971cc/src/ray/core_worker/core_worker.h#L834-L837

kleinschmidt commented 1 year ago

Specifically, OOM retries seem to be managed by a separate counter:

https://github.com/beacon-biosignals/ray/blob/015518473a40997c1ee1591c24c65483377971cc/src/ray/common/ray_config_def.h#L93-L100

kleinschmidt commented 1 year ago

...and checks against this counter are gated by `max_retries != 0`:

https://github.com/beacon-biosignals/ray/blob/015518473a40997c1ee1591c24c65483377971cc/src/ray/core_worker/task_manager.cc#L152-L153
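
To make the interaction between the two counters concrete, here is a simplified model of the gating described in the comments above. It is a sketch, not Ray's code: `RetryState` and `should_retry` are hypothetical names, the fall-through to the regular counter is an assumption, and any special sentinel values Ray may use are ignored:

```cpp
#include <cstdint>

// Hypothetical, simplified per-task retry state (not Ray's actual structure).
struct RetryState {
    int32_t retries_left;      // assumed to be seeded from the task's max_retries
    int32_t oom_retries_left;  // assumed to be seeded from a separate OOM budget
};

// Returns true if a failed task should be resubmitted, decrementing whichever
// counter was consulted.
bool should_retry(RetryState &state, int32_t max_retries,
                  bool failed_due_to_oom) {
    // The OOM counter is only consulted when max_retries != 0.
    if (failed_due_to_oom && max_retries != 0) {
        if (state.oom_retries_left > 0) {
            --state.oom_retries_left;
            return true;
        }
        return false;
    }
    // Otherwise fall back to the regular retry counter (assumption).
    if (state.retries_left > 0) {
        --state.retries_left;
        return true;
    }
    return false;
}
```

Under this model, a task submitted with `max_retries` hardcoded to 0 never gets an OOM retry, no matter how large the OOM budget is, which is why exposing `max_retries` matters here.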