beacon-biosignals / Ray.jl

Julia API for Ray

segfault when accessing metadata buffer #212

Open kleinschmidt opened 1 year ago

kleinschmidt commented 1 year ago

This was observed during some internal Beacon benchmarking of a large job (~25k tasks plus a reduce step) with KubeRay. I haven't been able to reproduce it (re-submitting exactly the same job with no changes has been running smoothly).

Ray.get here is being called in the context of an async reduction step. The objects being reduced over are produced by mapping a task that returns a DataFrame, and are then reduced like this:

    out = DataFrame()
    progress = Progress(length(ray_objects); desc="Assembling segments...", showspeed=true)
    @sync for object in ray_objects
        # fetch objects as they are ready (may be out of order)
        @async begin
            result = Ray.get(object)
            append!(out, result; cols=:union)
            next!(progress)
        end
    end

My hunch is that there may be some kind of race condition here, where the Julia async Tasks are somehow yielding in such a way as to cause the underlying memory to be freed. But that's really only a hunch. The other thing I can think of off the top of my head is that there's something we're not handling around non-LocalMemoryBuffer buffers, but it's hard to say. The mysterious thing is that we're basically only interacting with the metadata via CoreWorker API methods (we gate a call to GetMetadata behind a call to HasMetadata):

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/src/ray_julia_jll/common.jl#L258-L267

both of which are directly wrapping the C++ methods:

https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/7f3aec0a39b6db86356ad658ab89d29531303caa/build/wrapper.cc#L649-L661
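For context, the linked Julia code has roughly the following shape (a paraphrase for readability only; the names come from the links and the stack trace below, but signatures and details are approximate, so treat the linked source as authoritative):

```julia
# Approximate shape of the guarded metadata access (paraphrase; see the links
# above for the actual code):
function get_metadata(obj)
    # HasMetadata/GetMetadata are thin CxxWrap wrappers over the C++ methods;
    # the metadata buffer is only requested when the object claims to have one.
    HasMetadata(obj) || return nothing
    buffer = GetMetadata(obj)
    return take!(buffer)  # copy the ray::Buffer contents out as a Vector{UInt8}
end
```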

Logs:

```
Assembling segments... 4%|▋ | ETA: 1:51:31 ( 0.28 s/it)
Assembling segments... 4%|▋ | ETA: 1:50:47 ( 0.28 s/it)
Assembling segments... 4%|▋ | ETA: 1:50:04 ( 0.28 s/it)
Assembling segments... 4%|▋ | ETA: 1:49:17 ( 0.28 s/it)
Assembling segments... 4%|▋ | ETA: 1:48:35 ( 0.28 s/it)
Assembling segments... 4%|▋ | ETA: 1:52:43 ( 0.29 s/it)
Assembling segments... 5%|▊ | ETA: 1:43:25 ( 0.27 s/it)
Assembling segments... 5%|▊ | ETA: 1:40:53 ( 0.26 s/it)
Assembling segments... 5%|▊ | ETA: 1:40:15 ( 0.26 s/it)

[4357] signal (11.1): Segmentation fault
in expression starting at /tmp/ray/session_2023-10-20_14-17-46_414642_1/runtime_resources/working_dir_files/_ray_pkg_47145e2793bb888e/migration.jl:7
_ZNSt17_Function_handlerIFPhRKN3ray6BufferEEZN5jlcxx11TypeWrapperIS2_E6methodIS0_S2_JEEERS8_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEMT0_KFT_DpT1_EEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_ at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
_ZN5jlcxx6detail11CallFunctorIPhJRKN3ray6BufferEEE5applyEPKvNS_13WrappedCppPtrE at /usr/local/share/julia-depot/ab14e38af3/dev/Ray/build/bin/julia_core_worker_lib.so (unknown line)
Data at /usr/local/share/julia-depot/ab14e38af3/packages/CxxWrap/aXNBY/src/CxxWrap.jl:624 [inlined]
take! at /usr/local/share/julia-depot/4c99afcd78/dev//dev/Ray/src/ray_julia_jll/common.jl:284
get_metadata at /usr/local/share/julia-depot/4c99afcd78/dev//dev/Ray/src/ray_julia_jll/common.jl:263
deserialize_from_ray_object at /usr/local/share/julia-depot/4c99afcd78/dev//dev/Ray/src/ray_serializer.jl:117
get at /usr/local/share/julia-depot/4c99afcd78/dev//dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev//src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)

---------------------------------------
Job 'raysubmit_uYMShKRzc8EEcqKJ' failed
---------------------------------------

Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
get at /usr/local/share/julia-depot/4c99afcd78/dev//dev/Ray/src/object_store.jl:35
macro expansion at /usr/local/share/julia-depot/4c99afcd78/dev//src/v0_4_4.jl:172 [inlined]
#12 at ./task.jl:514
unknown function (ip: 0x7fce32938d6f)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 555720146 (Pool: 555458877; Big: 261269); GC: 73
Segmentation fault (core dumped)
```
omus commented 1 year ago

@kleinschmidt this was using Ray.jl v0.0.2 right?

kleinschmidt commented 1 year ago

It was on 7f3aec0a39b6db86356ad658ab89d29531303caa (the commit before the workflows landed), with everything built from source in the Docker image (using the Ray.jl-provided Dockerfile as the base).

kleinschmidt commented 1 year ago

Just hit this again, same code, same circumstances (just started a fresh cluster and submitted the job; it got through ~4% of the work).

kleinschmidt commented 1 year ago

...and again

kleinschmidt commented 1 year ago

Can reproduce this pretty regularly (unfortunately not in a way that I can share since it's internal data).

I also ruled out the async reducer as the root cause; using a channel to make sure the gets happen synchronously didn't fix it.
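For reference, here is a minimal sketch of one way to serialize the fetches through a Channel (illustrative only, not the exact code that was run; it assumes the same setup as the snippet above, i.e. DataFrames, the ProgressMeter `progress`, and `ray_objects`):

```julia
# Sketch only: push the object refs into a Channel and perform every Ray.get
# from a single loop, so no two fetches overlap.
out = DataFrame()
objects = Channel{Any}(Inf)
foreach(obj -> put!(objects, obj), ray_objects)
close(objects)  # iteration below stops once the channel is drained
for object in objects
    result = Ray.get(object)           # fetched strictly one at a time
    append!(out, result; cols=:union)
    next!(progress)
end
```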

I did a bit of poking around at whether we can check isnull(buffer[]) in take! as a guard against this, but when buffer[] is a null pointer we get the RayBuffer deleted error rather than a segfault (at least on my M1 Mac; it may manifest differently on Linux, where I was running this...).

The next step for debugging this would probably be to just print the entire bytes during deserialization if HasMetadata returns true to see if we can reconstruct what's going on in a self-contained way.
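A small, self-contained helper along those lines might look like the sketch below (the helper name and the exact hook point inside deserialize_from_ray_object are assumptions; only `HasMetadata` gating is from the code above):

```julia
# Sketch: given the raw metadata/data bytes as a Vector{UInt8}, log them in a
# form that can be pasted into a standalone reproduction script.
function dump_bytes(label::AbstractString, bytes::Vector{UInt8})
    @info "ray object bytes" label nbytes = length(bytes) hex = bytes2hex(bytes)
end

# e.g. call dump_bytes("metadata", metadata_bytes) right before deserializing,
# gated on HasMetadata returning true.
```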