Closed kleinschmidt closed 10 months ago
Merging #215 (ca89ddc) into main (af34bb3) will not change coverage. The diff coverage is
100.00%
.
@@ Coverage Diff @@
## main #215 +/- ##
=======================================
Coverage 98.12% 98.12%
=======================================
Files 12 12
Lines 639 639
=======================================
Hits 627 627
Misses 12 12
Flag | Coverage Δ | |
---|---|---|
Ray.jl | 98.12% <100.00%> (ø) |
Flags with carried forward coverage won't be shown. Click here to find out more.
Files | Coverage Δ | |
---|---|---|
src/ray_julia_jll/common.jl | 95.30% <100.00%> (ø) |
|
src/runtime.jl | 97.12% <ø> (ø) |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
This is pretty simple but is going to be hard to test, so I'm going to open it for review despite lacking functional tests that tasks are actually retried as expected.
I wonder for the sake of testing do we want to allow and then force an applicatoin error? or do we want to be very strict about guarding against that?
force an applicatoin error
those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...
force an applicatoin error
those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...
what if you used mocking tothrow(OutOfMemoryError)
?
force an applicatoin error
those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...
what if you used mocking to
throw(OutOfMemoryError)
?
Alas, that would be picked up as an application error. Ray has its own monitor that detects memory pressure on the whole node and starts killing tasks when a certain threshold is reached. So you really do need to trigger a (near) OOM to activate it.
The motivation for this is that the OS doesn't care which process is which and will kill whatever it wants. So it's better for Ray to preemptively kill things that are known to replaceable (i.e., workers running tasks) to avoid triggering the OS OOM killer if at all possible: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html
force an applicatoin error
those aren't retried at all right now, only OOMs. I played around a bit with trying to force an OOM but that's pretty fragile...
what if you used mocking to
throw(OutOfMemoryError)
?Alas, that would be picked up as an application error. Ray has its own monitor that detects memory pressure on the whole node and starts killing tasks when a certain threshold is reached. So you really do need to trigger a (near) OOM to activate it.
The motivation for this is that the OS doesn't care which process is which and will kill whatever it wants. So it's better for Ray to preemptively kill things that are known to replaceable (i.e., workers running tasks) to avoid triggering the OS OOM killer if at all possible: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html
ah I see! oh well what can we do
Fixes #213 and may fix #143
Given that we're still forbidding retries of application errors, I'm not quite sure how to test this.