beacon-biosignals / Ray.jl

Julia API for Ray

stochastic Deadline Exceeded error when running tests #29

Closed glennmoy closed 1 year ago

glennmoy commented 1 year ago

This error gets thrown from time to time when running the tests. The workaround is to just rerun ] test.

(Ray) pkg> test
     Testing Ray
     ...
     ...
Precompiling project...
  1 dependency successfully precompiled in 1 seconds. 23 already precompiled.
     Testing Running tests...
[ Info: Starting local head node
2023-08-18 14:41:43,622 DEBUG node.py:1149 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2023-08-18_14-41-43_620795_53189/logs.
2023-08-18 14:41:43,630 DEBUG node.py:605 -- Failed to send request to gcs, reconnecting. Error failed to connect to all addresses
2023-08-18 14:41:44,980 DEBUG node.py:1187 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2023-08-18_14-41-43_620795_53189/logs.
2023-08-18 14:41:44,981 DEBUG services.py:1960 -- Determine to start the Plasma object store with 2.15 GB memory using /tmp.
calling delete on: 0x1356e61d0
function manager: Error During Test at /Users/glenn/.julia/dev/ray_core_worker_julia_jll/Ray.jl/test/function_manager.jl:23
  Test threw exception
  Expression: wait_for_function(fm, mfd, jobid; timeout_s = 1) == :timed_out
  Unknown: Deadline Exceeded
  Stacktrace:
   [1] Exists(arg1::Union{JuliaGcsClient, CxxWrap.CxxWrapCore.CxxRef{<:JuliaGcsClient}}, arg2::Union{AbstractString, CxxWrap.CxxWrapCore.ConstCxxRef{<:CxxWrap.StdLib.StdString}, CxxWrap.CxxWrapCore.CxxRef{<:CxxWrap.StdLib.StdString}}, arg3::Union{AbstractString, CxxWrap.CxxWrapCore.ConstCxxRef{<:CxxWrap.StdLib.StdString}, CxxWrap.CxxWrapCore.CxxRef{<:CxxWrap.StdLib.StdString}}, arg4::Integer)
     @ ray_core_worker_julia_jll ~/.julia/packages/CxxWrap/aXNBY/src/CxxWrap.jl:624
   [2] #9
     @ ~/.julia/dev/ray_core_worker_julia_jll/Ray.jl/src/function_manager.jl:76 [inlined]
   [3] timedwait(testcb::Ray.var"#9#10"{Int64, FunctionManager, String}, timeout::Int64; pollint::Float64)
     @ Base ./asyncevent.jl:320
   [4] timedwait
     @ ./asyncevent.jl:311 [inlined]
   [5] #wait_for_function#8
     @ ~/.julia/dev/ray_core_worker_julia_jll/Ray.jl/src/function_manager.jl:73 [inlined]
   [6] macro expansion
     @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Test/src/Test.jl:478 [inlined]
   [7] macro expansion
     @ ~/.julia/dev/ray_core_worker_julia_jll/Ray.jl/test/function_manager.jl:23 [inlined]
   [8] macro expansion
     @ /Applications/Julia-1.9.app/Contents/Resources/julia/share/julia/stdlib/v1.9/Test/src/Test.jl:1498 [inlined]
   [9] top-level scope
     @ ~/.julia/dev/ray_core_worker_julia_jll/Ray.jl/test/function_manager.jl:2
kleinschmidt commented 1 year ago

I think that's triggered by a timeout in the Exists call. We probably want to catch and handle that instead of rethrowing. I kinda figured that the timeout wouldn't throw an error, but here we are.
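
For illustration, here is a minimal sketch of what "catch and handle" could look like around a polling callback passed to timedwait. It is not the fix that landed in the repository. `flaky_exists` and `tolerant` are hypothetical names, and `flaky_exists` only simulates a call that throws "Deadline Exceeded" a couple of times before succeeding. The sketch matches on the error message because it assumes the exception surfaced through CxxWrap carries that text; the real exception type is not confirmed here.

```julia
# Hypothetical stand-in for the GCS `Exists` call: throws on the first two
# attempts (simulating an RPC deadline), then reports that the key exists.
function flaky_exists(attempts::Ref{Int})
    attempts[] += 1
    attempts[] < 3 && error("Deadline Exceeded")
    return true
end

# Wrap a predicate so that a "Deadline Exceeded" error is treated as
# "not there yet" (false) rather than escaping the polling loop.
# Any other error is rethrown unchanged.
function tolerant(pred)
    return function ()
        try
            return pred()
        catch e
            # Assumption: the timeout is only identifiable by its message.
            if occursin("Deadline Exceeded", sprint(showerror, e))
                return false
            end
            rethrow()
        end
    end
end

attempts = Ref(0)
# timedwait polls the callback until it returns true or the timeout elapses.
status = timedwait(tolerant(() -> flaky_exists(attempts)), 5; pollint=0.1)
println(status, " after ", attempts[], " attempts")
```

With a wrapper like this, a slow Exists call makes the wait keep polling, so the outer wait reports :timed_out when its own deadline passes instead of throwing.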

kleinschmidt commented 1 year ago

cf. the Python tests: https://github.com/beacon-biosignals/ray/blob/7ad1f47a9c849abf00ca3e8afc7c3c6ee54cda43/python/ray/tests/test_gcs_utils.py#L152-L163

kleinschmidt commented 1 year ago

My hunch is that this is caused by the GCS server taking a bit of time to start up and the timeout being too short (1s).
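
If that hunch is right, one option is to wait for the server to become ready, with a generous deadline, before running any time-sensitive assertions. The sketch below only simulates this: `server_ready` is a hypothetical readiness check that turns true one second after "launch" and does not talk to a real GCS server. The timeouts are illustrative values.

```julia
# Simulated server that needs about one second before it answers requests.
launched = time()
server_ready() = time() - launched > 1.0   # hypothetical readiness probe

# A deadline shorter than the startup time gives up too early...
short = timedwait(server_ready, 0.2; pollint=0.05)

# ...while a longer deadline lets the same probe succeed.
long = timedwait(server_ready, 10; pollint=0.05)

println("short deadline: ", short, ", long deadline: ", long)
```

The longer deadline costs nothing once the server is up, because timedwait returns as soon as the probe succeeds.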

omus commented 1 year ago

Fixed by #37