In an attempt to bring Ray.jl in alignment with upstream Ray trunk, we replaced the "log file parsing" method of discovering connection parameters with the blessed method of using the GlobalStateAccessor. In the process we also switched the backend of the JuliaGCSClient from the PythonGCSClient to the C++ GCSClient. Somewhere along the way, mysterious segfaults started to appear when running jobs on our Kubernetes cluster (#226).
Given Beacon's current priorities, we can't dedicate the engineering effort to understanding the root cause, but we still want to leave a release in a good state so that at least our internal users can use Ray with some level of confidence. To do that, I propose we:
- roll back the changes from
  - #211
  - #214
  - #225
- merge in-flight PRs
  - [x] #215
  - [ ] #221
- re-test everything to make sure it all still works!
  - CI
  - local machine benchmark workload (beacon-internal)
  - k8s cluster benchmark workload (beacon-internal)
- cut a release
- file issues to follow up on the root cause of the GSA/GCS-related segfaults on k8s