> My current best guess is that there's some kind of misconfiguration we're doing in initializing the worker and/or failing to release the worker after we're done executing a task.
I think this makes sense - there's a bunch of comments around the core worker code about shutting down workers properly.
/// Core worker's deallocation lifecycle
///
/// Shutdown API must be called before deallocating a core worker.
/// Otherwise, it can have various destruction order related memory corruption.
///
/// If the core worker is initiated at a driver, the driver is responsible for calling
/// the shutdown API before terminating. If the core worker is initiated at a worker,
/// shutdown must be called before terminating the task execution loop.
/// To shutdown a worker in the current process, return a system exit status (with status
/// code `IntentionalSystemExit` or `UnexpectedSystemExit`) in the task execution
/// callback.
///
/// How does core worker process deallocation work?
///
/// For an individual core worker thread's shutdown process, please check core_worker.h.
/// For the core worker process, it has 2 ways to properly shutdown the worker.
///
/// If it is a driver:
/// If the core worker is initialized at a driver, ShutdownDriver must be called
/// before deallocating the core worker process.
///
/// If it is a worker:
/// If the core worker is initialized at a worker, it is expected to be shutdown
/// when the task execution loop is terminated from each core worker instance.
/// Core worker ensures this by having a strong check there.
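To make that concrete, the shutdown path this comment describes is driven entirely by what the registered task execution callback returns. Here's a minimal sketch of just that return-value convention (the real callback in the core worker options takes many more arguments, and the exact `Status` constructor names are assumptions to double-check against `src/ray/common/status.h`):

```cpp
// Sketch only: the real TaskExecutionCallback has a much longer parameter
// list; what matters here is the return value. Status constructor names are
// assumptions to verify against src/ray/common/status.h.
#include <string>

#include "ray/common/status.h"

ray::Status ExecuteTaskSketch(bool worker_should_exit, bool fatal_error) {
  if (fatal_error) {
    // Worker exits and the raylet treats it as a crash.
    return ray::Status::UnexpectedSystemExit("task execution failed");
  }
  if (worker_should_exit) {
    // Graceful, requested exit: RunTaskExecutionLoop terminates and the core
    // worker is shut down before the process deallocates it.
    return ray::Status::IntentionalSystemExit("worker asked to exit");
  }
  // Normal case: task finished; the worker stays alive and goes back to the
  // raylet's pool for the next task.
  return ray::Status::OK();
}
```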
Here's another breadcrumb: the other language runtimes get a "startup token" via command line arg https://github.com/beacon-biosignals/ray/blob/7ad1f47a9c849abf00ca3e8afc7c3c6ee54cda43/src/ray/raylet/worker_pool.cc#L329-L337
we're setting it to 0 every time: https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/9126de8f614c342a575abdc0e56d96dc88149c3f/deps/wrapper.cc#L63
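If that token matters for worker registration, the fix on the wrapper side would presumably be to parse whatever the raylet appends to the worker command line (see the worker_pool.cc link above) and forward that into the core worker options instead of hardcoding 0. Rough sketch, with the flag spelling and the `startup_token` field name as assumptions to check against the linked code and wrapper.cc:

```cpp
// Sketch: parse the raylet-provided startup token from argv instead of
// hardcoding 0. The flag spelling ("--startup-token=") and the options field
// it feeds are assumptions; verify against worker_pool.cc and the core worker
// options header.
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

int64_t ParseStartupToken(int argc, char **argv, int64_t fallback = 0) {
  const std::string prefix = "--startup-token=";
  for (int i = 1; i < argc; ++i) {
    if (std::strncmp(argv[i], prefix.c_str(), prefix.size()) == 0) {
      return std::strtoll(argv[i] + prefix.size(), /*end=*/nullptr, /*base=*/10);
    }
  }
  return fallback;  // e.g. for the driver, which isn't launched by the raylet
}

// ...then, when filling in the options in wrapper.cc, something like
//   options.startup_token = ParseStartupToken(argc, argv);
// rather than the current hardcoded 0.
```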
AHA that seems to do it...
Was fixed by the two above PRs
Here's the behavior I've observed: from julia, I call `submit_task` once, which returns an object id. Then we can `get` that object, which contains the function return value as expected. When I call `submit_task` a second time and try to `get` the object ID, it deadlocks julia (doesn't respond to Ctrl-C/SIGINT), presumably because it's waiting in the C++ code for a response from the object store.

The desired behavior is for the second (and nth, more generally) task submission to behave like the first.
Okay, some more details/breadcrumbs:
The `raylet.err` file shows that it keeps trying to spawn workers to handle this task. That's surprising to me given that (I'd have thought) the worker that executed the first task would be available again. Here's an example of the log messages that seem relevant:

**raylet.err: Initial worker startup**
```
2023-08-18 10:51:54,670 DEBUG context.py:46 -- deserialize: {'command_prefix': [], 'env_vars': {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}, 'py_executable': '/Users/dkleinschmidt/work/ray/venv/bin/python3.11', 'resources_dir': None, 'container': {}, 'java_jars': []}
2023-08-18 10:51:54,670 DEBUG context.py:51 -- Worker context env: {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}
2023-08-18 10:51:54,670 DEBUG context.py:99 -- Exec'ing worker with command: julia -e 'using Ray; start_worker()' -- /Users/dkleinschmidt/work/ray/python/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/raylet --ray_node_manager_port=52301 --ray_address=127.0.0.1:6379 --ray_redis_password= --ray_session_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975 --ray_logs_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/logs --ray_node_ip_address=127.0.0.1 --runtime-env-hash=-916701441
[2023-08-18 10:51:57,430 I 83708 2543240] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 83708
[2023-08-18 10:51:57,432 I 83708 2543240] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-18 10:51:57,433 I 83708 2543240] grpc_server.cc:140: worker server started, listening on port 10003.
[2023-08-18 10:51:57,434 I 83708 2543240] core_worker.cc:217: Initializing worker at address: 127.0.0.1:10003, worker ID 8bd9e2c2e6595bd8fbbb05213112515d3be4fa475a6fecb188b0cda4, raylet 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745
[2023-08-18 10:51:57,434 I 83708 2543240] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-08-18 10:51:57,435 I 83708 2543292] accessor.cc:611: Received notification for node id = 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745, IsAlive = 1
[2023-08-18 10:51:57,435 I 83708 2543292] core_worker.cc:4134: Number of alive nodes:1
[2023-08-18 10:51:57,435 I 83708 2543240] core_worker.cc:605: Adjusted worker niceness to 15
```

**raylet.err: Second worker startup logs**
```
2023-08-18 10:52:52,265 DEBUG context.py:46 -- deserialize: {'command_prefix': [], 'env_vars': {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}, 'py_executable': '/Users/dkleinschmidt/work/ray/venv/bin/python3.11', 'resources_dir': None, 'container': {}, 'java_jars': []}
2023-08-18 10:52:52,265 DEBUG context.py:51 -- Worker context env: {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}
2023-08-18 10:52:52,265 DEBUG context.py:99 -- Exec'ing worker with command: julia -e 'using Ray; start_worker()' -- /Users/dkleinschmidt/work/ray/python/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/raylet --ray_node_manager_port=52301 --ray_address=127.0.0.1:6379 --ray_redis_password= --ray_session_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975 --ray_logs_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/logs --ray_node_ip_address=127.0.0.1 --runtime-env-hash=-916701441
[2023-08-18 10:52:55,112 I 86310 2547502] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 86310
[2023-08-18 10:52:55,113 I 86310 2547502] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-18 10:52:55,114 I 86310 2547502] grpc_server.cc:140: worker server started, listening on port 10004.
[2023-08-18 10:52:55,115 I 86310 2547502] core_worker.cc:217: Initializing worker at address: 127.0.0.1:10004, worker ID 02fa3aacc11b0a32e167a159ee3371fe09980eabf9418ffc3ebb27da, raylet 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745
[2023-08-18 10:52:55,116 I 86310 2547502] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-08-18 10:52:55,116 I 86310 2548799] accessor.cc:611: Received notification for node id = 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745, IsAlive = 1
[2023-08-18 10:52:55,116 I 86310 2548799] core_worker.cc:4134: Number of alive nodes:1
[2023-08-18 10:52:55,116 I 86310 2547502] core_worker.cc:605: Adjusted worker niceness to 15
[2023-08-18 10:52:55,116 I 86310 2548799] core_worker.cc:553: Event stats:
[2023-08-18 10:52:57,436 I 83708 2543292] core_worker.cc:553: Event stats:
[2023-08-18 10:53:05,155 W 86310 2548794] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-08-18 10:53:52,029 E 81980 2540600] (raylet) worker_pool.cc:544: Some workers of the worker process(86309) have not registered within the timeout. The process is still alive, probably it's hanging during start.
```

That basic pattern then repeats over and over again, with new julia processes being spawned every minute or so. The PIDs recorded in the log filenames are still live:
However, digging around a bit more, the `getpid() - 1` processes are all the juliaup launcher scripts; only the most recently spawned and original workers still have their launcher processes around as well, and the others are all zombies. This is a bit confusing to me, since killing the juliaup launcher does seem to kill the actual julia process it spawns... but maybe the raylet is doing something else to tell the worker processes to shut down that kills the juliaup launcher but leaves the others as zombies...

So one possibility is that we're somehow not indicating that the worker is no longer being used. I kind of thought that returning `Status::OK()` would be sufficient for that, since AFAICT that's all that the python workers do.

I'm pretty sure that there isn't some kind of runtime error on the workers (those show up in raylet.err and cause the process to die), and with a bit of extra println debugging it looks like the `initialize_coreworker_worker` function is in fact executing `CoreWorkerProcess::Initialize` and `CoreWorkerProcess::RunTaskExecutionLoop` as usual. But the `task_execution_callback` doesn't seem to be getting called at all.

My current best guess is that there's some kind of misconfiguration we're doing in initializing the worker and/or failing to release the worker after we're done executing a task.
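For reference, the worker-side flow I'd expect our wrapper to be following is roughly the sketch below, so the question is which of these options we're getting wrong. Everything here beyond the function names already mentioned above (namespaces, headers, option field names, the callback) is an assumption to verify against the pinned Ray fork and wrapper.cc:

```cpp
// Sketch of the expected worker-side lifecycle, pieced together from the
// function names mentioned above. Namespaces, headers, and option field names
// are assumptions; check the core_worker headers in the pinned Ray fork.
#include <cstdint>

#include "ray/core_worker/core_worker.h"
#include "ray/core_worker/core_worker_process.h"

void RunWorkerSketch(int64_t startup_token) {
  ray::core::CoreWorkerOptions options;
  options.worker_type = ray::WorkerType::WORKER;  // worker, not driver
  options.startup_token = startup_token;          // parsed from argv, see the earlier sketch
  // options.task_execution_callback = ...;       // the callback that dispatches into Julia

  // Registers with the raylet and starts the gRPC worker server we see in
  // raylet.err ("worker server started, listening on port ...").
  ray::core::CoreWorkerProcess::Initialize(options);

  // Blocks, invoking task_execution_callback once per assigned task, until the
  // callback returns a system-exit status (per the lifecycle comment above).
  ray::core::CoreWorkerProcess::RunTaskExecutionLoop();
}
```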