kleinschmidt opened this issue 1 year ago
@kleinschmidt this was using Ray.jl v0.0.2 right?
It was on 7f3aec0a39b6db86356ad658ab89d29531303caa (the commit before the workflows landed), and everything built from source in the docker image (using the Ray.jl-provided dockerfile as the base).
Just hit this again, same code, same circumstances (just started a fresh cluster and submitted the job, got through ~4% of the work).
...and again
Can reproduce this pretty regularly (unfortunately not in a way that I can share since it's internal data).
I also ruled out the async reducer as the root cause; using a channel to make sure the `get`s happen synchronously didn't fix it.
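(For concreteness, a minimal sketch of what I mean by the channel approach; names here are illustrative only, and `refs` stands in for the collection of object refs being reduced.)

```julia
using Ray

# Minimal sketch of the channel-based workaround: push all refs onto a channel
# and have a single loop do the `Ray.get` calls, so the gets never overlap.
ref_chan = Channel{Any}(Inf)
foreach(r -> put!(ref_chan, r), refs)   # `refs`: the ObjectRefs from the map step
close(ref_chan)

results = [Ray.get(r) for r in ref_chan]   # gets happen strictly one at a time
reduced = reduce(vcat, results)
```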
I did a bit of poking around at whether we can check `isnull(buffer[])` in `take!` as a guard against this, but when `buffer[]` is a null pointer we get the "RayBuffer deleted" error, not a segfault (at least on my M1 Mac; it may manifest differently on Linux, where I was running this...).
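For reference, the shape of the guard I was poking at (a sketch only, not the real `take!`; `Data`/`Size` are the CxxWrap-wrapped `ray::Buffer` accessors that show up in the stacktrace below, and the pointer handling may need adjusting to the actual wrapper types):

```julia
# Sketch of the guard idea: refuse to touch the wrapped C++ buffer if the
# underlying pointer is already null (approximate, not the actual take!).
function guarded_take!(buffer)
    isnull(buffer[]) && error("RayBuffer deleted")
    ptr = Data(buffer[])     # unsigned char* from the wrapped ray::Buffer
    len = Size(buffer[])
    return copy(unsafe_wrap(Vector{UInt8}, ptr, len))
end
```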
The next step for debugging this would probably be to print the entire byte payload during deserialization whenever `HasMetadata` returns true, to see if we can reconstruct what's going on in a self-contained way.
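Something like this, as a starting point (hypothetical helper; it assumes the metadata and data payloads have already been pulled out as byte vectors):

```julia
# Hypothetical debug hook: hex-dump the raw bytes of any object that carries
# metadata, so a failing payload can be replayed in a self-contained script.
function dump_object_bytes(metadata::Vector{UInt8}, data::Vector{UInt8})
    @info "deserializing object with metadata" metadata_hex = bytes2hex(metadata) data_hex = bytes2hex(data)
end
```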
This was observed during some internal Beacon benchmarking of a large job (~25k tasks + a reduce step) with KubeRay. I haven't been able to reproduce it (re-submitting exactly the same job with no changes has been running smoothly).
`Ray.get` here is being called in the context of an async reduction step. The tasks being reduced over are generated by `map`ping a task that returns a `DataFrame` over the inputs, and the results are then reduced asynchronously as the `get`s complete.
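The reduction looks roughly like this (an illustrative sketch, not the original snippet; `Ray.submit_task`, the toy `make_segment` task, and the input range are stand-ins):

```julia
using DataFrames
using Ray

# Illustrative sketch only: map a task that returns a DataFrame over the
# inputs, then fetch and reduce the results as the async `get`s complete.
make_segment(x) = DataFrame(; id = [x], value = [x^2])   # stand-in task

refs = map(x -> Ray.submit_task(make_segment, (x,)), 1:100)
pieces = asyncmap(Ray.get, refs)   # async `get`s, roughly as in the failing job
result = reduce(vcat, pieces)
```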
My hunch is that there may be some kind of race condition here, where the Julia async `Task`s are somehow yielding in such a way as to cause the underlying memory to be freed, but that's really only a hunch. The full stacktrace from the segfault is below. The other thing I could think of off the top of my head is that there's something we're not handling around non-`LocalMemoryBuffer` buffers, but it's hard to say. The mysterious thing is that we're basically only interacting with the metadata via CoreWorker API methods (we gate a call to `GetMetadata` behind a call to `HasMetadata`):

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/src/ray_julia_jll/common.jl#L258-L267
both of which are directly wrapping the C++ methods:
https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/build/wrapper.cc#L649-L661
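Roughly, that gating amounts to the following (a loose paraphrase, not the linked code verbatim; see the links above for the actual definitions):

```julia
# Loose paraphrase of the gated metadata access (names/types approximate):
# never call GetMetadata unless HasMetadata says there is something to fetch.
function object_metadata(ray_obj)
    HasMetadata(ray_obj) || return nothing
    return take!(GetMetadata(ray_obj))
end
```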
Logs
```
Assembling segments...  4%|▋  |  ETA: 1:51:31 ( 0.28  s/it)
Assembling segments...  4%|▋  |  ETA: 1:50:47 ( 0.28  s/it)
Assembling segments...  4%|▋  |  ETA: 1:50:04 ( 0.28  s/it)
Assembling segments...  4%|▋  |  ETA: 1:49:17 ( 0.28  s/it)
Assembling segments...  4%|▋  |  ETA: 1:48:35 ( 0.28  s/it)
Assembling segments...  4%|▋  |  ETA: 1:52:43 ( 0.29  s/it)
Assembling segments...  5%|▊  |  ETA: 1:43:25 ( 0.27  s/it)
Assembling segments...  5%|▊  |  ETA: 1:40:53 ( 0.26  s/it)
Assembling segments...  5%|▊  |  ETA: 1:40:15 ( 0.26  s/it)

[4357] signal (11.1): Segmentation fault
in expression starting at /tmp/ray/session_2023-10-20_14-17-46_414642_1/runtime_resources/working_dir_files/_ray_pkg_47145e2793bb888e/migration.jl:7
_ZNSt17_Function_handlerIFPhRKN3ray6BufferEEZN5jlcxx11TypeWrapperIS2_E6methodIS0_S2_JEEERS8_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEMT0_KFT_DpT1_EEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_ at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
_ZN5jlcxx6detail11CallFunctorIPhJRKN3ray6BufferEEE5applyEPKvNS_13WrappedCppPtrE at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
Data at /usr/local/share/julia-depot/ab14e38af3/packages/CxxWrap/aXNBY/src/CxxWrap.jl:624 [inlined]
take! at /usr/local/share/julia-depot/4c99afcd78/dev/