beacon-biosignals / Ray.jl

Julia API for Ray
Other
9 stars 1 forks source link

pass `max_retries` during task submission #215

Closed kleinschmidt closed 10 months ago

kleinschmidt commented 11 months ago

Fixes #213 and may fix #143

Given that we're still forbidding retries of application errors, I'm not quite sure how to test this.

codecov[bot] commented 11 months ago

Codecov Report

Merging #215 (ca89ddc) into main (af34bb3) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #215   +/-   ##
=======================================
  Coverage   98.12%   98.12%           
=======================================
  Files          12       12           
  Lines         639      639           
=======================================
  Hits          627      627           
  Misses         12       12           
Flag Coverage Δ
Ray.jl 98.12% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/ray_julia_jll/common.jl 95.30% <100.00%> (ø)
src/runtime.jl 97.12% <ø> (ø)

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

kleinschmidt commented 10 months ago

This is pretty simple but is going to be hard to test, so I'm going to open it for review despite lacking functional tests that tasks are actually retried as expected.

glennmoy commented 10 months ago

I wonder for the sake of testing do we want to allow and then force an applicatoin error? or do we want to be very strict about guarding against that?

kleinschmidt commented 10 months ago

force an applicatoin error

those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...

glennmoy commented 10 months ago

force an applicatoin error

those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...

what if you used mocking tothrow(OutOfMemoryError)?

kleinschmidt commented 10 months ago

force an applicatoin error

those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...

what if you used mocking tothrow(OutOfMemoryError)?

Alas, that would be picked up as an application error. Ray has its own monitor that detects memory pressure on the whole node and starts killing tasks when a certain threshold is reached. So you really do need to trigger a (near) OOM to activate it.

The motivation for this is that the OS doesn't care which process is which and will kill whatever it wants. So it's better for Ray to preemptively kill things that are known to replaceable (i.e., workers running tasks) to avoid triggering the OS OOM killer if at all possible: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html

glennmoy commented 10 months ago

force an applicatoin error

those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...

what if you used mocking tothrow(OutOfMemoryError)?

Alas, that would be picked up as an application error. Ray has its own monitor that detects memory pressure on the whole node and starts killing tasks when a certain threshold is reached. So you really do need to trigger a (near) OOM to activate it.

The motivation for this is that the OS doesn't care which process is which and will kill whatever it wants. So it's better for Ray to preemptively kill things that are known to replaceable (i.e., workers running tasks) to avoid triggering the OS OOM killer if at all possible: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html

ah I see! oh well what can we do