omus closed this issue 12 months ago
Happened again against main (6d019f5a5bc9c18497d25169a047009d00a8b82c):
Random theory: maybe the Bazel cache is behaving badly in that it uses the incorrect version of the shared library?
– https://github.com/beacon-biosignals/Ray.jl/actions/runs/6384165745/job/17327409543
Suspicious that our old friend `RegisterOwnershipInfoAndResolveFuture` shows up in the stack trace; I'll look at the others as well.
this one seems to be mangling the owner address somehow (27 bytes instead of 28, maybe a string conversion issue?) https://github.com/beacon-biosignals/Ray.jl/issues/177#issuecomment-1745342806
> Random theory: maybe the Bazel cache is behaving badly in that it uses the incorrect version of the shared library?
I don't remember seeing anything like this before the sub-package/sub-module shuffle, but then again the source hasn't changed since then so I don't really know how that could be responsible. I kinda find it hard to believe that we just were getting lucky though...
> Random theory: maybe the Bazel cache is behaving badly in that it uses the incorrect version of the shared library?
In #179 I changed the cache names and also fixed an issue where the `hashFiles` check wasn't being used. If this were just a bad Bazel cache issue, I wouldn't expect the problem to have persisted past that PR. Also, since re-running the failed jobs passes and those re-runs would be based off the same cache information, I'm doubtful this theory is correct anymore.
https://github.com/beacon-biosignals/Ray.jl/actions/runs/6436720204/job/17480614493?pr=186#step:12:210 suggests that the worker is dying for some reason (error code `0` is `WORKER_DIED`)

...which makes sense because we see a `RAY_CHECK` failure right after:
```
[2023-10-06 21:43:21,492 C 8415 8415] id_def.h:25: Check failed: binary.size() == Size() || binary.size() == 0 expected size is 28, but got data ��"��mr,�+�|��y~ of size 20
```
There are two things I can think of off the top of my head that might be causing this:
So maybe we should at least temporarily add some more `println` debugging to print out the address string before we try to register ownership?
Had a breakthrough with #189:
```
Error During Test at /home/runner/work/Ray.jl/Ray.jl/test/task.jl:40
  Test threw exception
  Expression: Ray.get(return_ref) == remote_ref
  Encountered unhandled metadata from `ObjectRef("f4402ec78d3a2607ffffffffffffffffffffffff0100000001000000")`: 0
```
– https://github.com/beacon-biosignals/Ray.jl/actions/runs/6437112800/job/17481699838?pr=189
The `0` indicates an `ErrorType` of `WORKER_DIED`.
It occurred to me that with the latest discovery we should definitely upload the `/tmp/ray/session_latest` CI logs to an Artifact for further analysis. I'll work on a PR to put that in place.
Finally got a failure again after #192 was merged which now lets us inspect the server side logs: https://github.com/beacon-biosignals/Ray.jl/actions/runs/6474941536
Interesting parts from the backend logs below. The specific worker that failed (PID 6021) didn't report anything interesting.
yeah, this is the kinda message I've been seeing in the other failed jobs' logs:
```
[2023-10-10 21:22:40,279 C 6021 6021] id_def.h:25: Check failed: binary.size() == Size() || binary.size() == 0 expected size is 28, but got data [<AE>^F<D1>e⊇w<BB>
::t<87>D<8D> of size 16
```
somehow the owner address is still getting mangled...
Calling out that it is very likely that #203 fixes this issue; however, we were unable to reliably reproduce the original CI problem. If this problem is noticed again, feel free to re-open this issue.
Saw this failure while working on #176 for Julia 1.9.3:
– https://github.com/beacon-biosignals/Ray.jl/actions/runs/6354776658/job/17261813047?pr=176