beacon-biosignals / Ray.jl

Julia API for Ray
Other
9 stars 1 forks source link

Use GCSClient instead of PythonGCSClient #211

Closed glennmoy closed 10 months ago

glennmoy commented 11 months ago

Break off of https://github.com/beacon-biosignals/Ray.jl/pull/202

Closes #76

Two things worth calling out:

codecov[bot] commented 11 months ago

Codecov Report

Merging #211 (9bb6921) into main (caddc0e) will increase coverage by 0.46%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
+ Coverage   97.65%   98.12%   +0.46%     
==========================================
  Files          12       12              
  Lines         640      639       -1     
==========================================
+ Hits          625      627       +2     
+ Misses         15       12       -3     
Flag Coverage Δ
Ray.jl 98.12% <100.00%> (+0.46%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/function_manager.jl 97.77% <100.00%> (+5.47%) :arrow_up:
src/ray_julia_jll/common.jl 95.30% <100.00%> (+0.19%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

omus commented 10 months ago

Failures definitely look related:

 [ Info: Connecting function manager to GCS at 10.0.0.105:6379...
terminate called after throwing an instance of 'std::runtime_error'
  what():  NotFound: Failed to find the key

signal (6): Aborted
in expression starting at /home/runner/work/Ray.jl/Ray.jl/test/ray_serializer.jl:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)

https://github.com/beacon-biosignals/Ray.jl/actions/runs/6645852901/job/18058052127?pr=211#step:12:215

 [ Info: Connecting function manager to GCS at 127.0.0.1:6379...
libc++abi: terminating due to uncaught exception of type std::runtime_error: NotFound: Failed to find the key

signal (6): Abort trap: 6
in expression starting at /Users/runner/work/Ray.jl/Ray.jl/test/ray_serializer.jl:81
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 44751822 (Pool: 44725417; Big: 26405); GC: 51

https://github.com/beacon-biosignals/Ray.jl/actions/runs/6645852901/job/18058052370?pr=211#step:12:211

omus commented 10 months ago

I can manage to reproduce this on my local machine:

┌ Debug: Skipping registering ownership for ObjectRef("00ffffffffffffffffffffffffffffffffffffff4300000013000000") as object has known owner
└ @ Ray ~/.julia/dev/Ray/src/object_ref.jl:128
libc++abi: terminating due to uncaught exception of type std::runtime_error: NotFound: Failed to find the key

[1307] signal (6): Abort trap: 6
in expression starting at /Users/cvogt/.julia/dev/Ray/test/runtests.jl:19
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 30846806 (Pool: 30821062; Big: 25744); GC: 44
ERROR: Package Ray errored during testing (received signal: 6)
omus commented 10 months ago

The issue noticed above was just due to us failing to Disconnect the JuliaGcsClient used in the function_manager.jl tests. I've added a destructor to the JuliaGcsClient to automatically disconnect the client to avoid the SIGABRT failure. However, since users of JuliaGcsClient should really be handling disconnecting themselves I added in a STDERR warning from C++ indicating that a connected JuliaGcsClient failed to disconnect on their own. As Ray.jl users aren't exposed to this class mostly this is just a reminder for Ray.jl developers to be explicit about disconnecting.

glennmoy commented 10 months ago

The issue noticed above was just due to us failing to Disconnect the JuliaGcsClient used in the function_manager.jl tests. I've added a destructor to the JuliaGcsClient to automatically disconnect the client to avoid the SIGABRT failure. However, since users of JuliaGcsClient should really be handling disconnecting themselves I added in a STDERR warning from C++ indicating that a connected JuliaGcsClient failed to disconnect on their own. As Ray.jl users aren't exposed to this class mostly this is just a reminder for Ray.jl developers to be explicit about disconnecting.

yeah when I removed the finalizer from the inner constructor as per your suggestion above I should have done so by reverting https://github.com/beacon-biosignals/Ray.jl/pull/211/commits/6b98717caa778604f27918461dc30f6cc7b3d9bf as I missed adding back the Disconnect