> My current best guess is that there's some kind of misconfiguration we're doing in initializing the worker and/or failing to release the worker after we're done executing a task.
I think this makes sense - there's a bunch of comments around the core worker code about shutting down workers properly.
/// Core worker's deallocation lifecycle
///
/// Shutdown API must be called before deallocating a core worker.
/// Otherwise, it can have various destruction order related memory corruption.
///
/// If the core worker is initiated at a driver, the driver is responsible for calling
/// the shutdown API before terminating. If the core worker is initiated at a worker,
/// shutdown must be called before terminating the task execution loop.
/// To shutdown a worker in the current process, return a system exit status (with status
/// code `IntentionalSystemExit` or `UnexpectedSystemExit`) in the task execution
/// callback.
///
/// How does core worker process deallocation work?
///
/// For an individual core worker thread's shutdown process, please check core_worker.h.
/// For the core worker process, it has 2 ways to properly shutdown the worker.
///
/// If it is a driver:
/// If the core worker is initialized at a driver, ShutdownDriver must be called
/// before deallocating the core worker process.
///
/// If it is a worker:
/// If the core worker is initialized at a worker, it is expected to be shutdown
/// when the task execution loop is terminated from each core worker instance.
/// Core worker ensures this by having a strong check there.
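To make that concrete, the shutdown path this comment describes is driven entirely by what the registered task execution callback returns. Here's a minimal sketch of just that return-value convention (the real callback in the core worker options takes many more arguments, and the exact `Status` constructor names are assumptions to double-check against `src/ray/common/status.h`):

```cpp
// Sketch only: the real TaskExecutionCallback has a much longer parameter
// list; what matters here is the return value. Status constructor names are
// assumptions to verify against src/ray/common/status.h.
#include <string>

#include "ray/common/status.h"

ray::Status ExecuteTaskSketch(bool worker_should_exit, bool fatal_error) {
  if (fatal_error) {
    // Worker exits and the raylet treats it as a crash.
    return ray::Status::UnexpectedSystemExit("task execution failed");
  }
  if (worker_should_exit) {
    // Graceful, requested exit: RunTaskExecutionLoop terminates and the core
    // worker is shut down before the process deallocates it.
    return ray::Status::IntentionalSystemExit("worker asked to exit");
  }
  // Normal case: task finished; the worker stays alive and goes back to the
  // raylet's pool for the next task.
  return ray::Status::OK();
}
```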
Here's another breadcrumb: the other language runtimes get a "startup token" via command line arg https://github.com/beacon-biosignals/ray/blob/7ad1f47a9c849abf00ca3e8afc7c3c6ee54cda43/src/ray/raylet/worker_pool.cc#L329-L337
we're setting it to 0 every time: https://github.com/beacon-biosignals/ray_core_worker_julia_jll.jl/blob/9126de8f614c342a575abdc0e56d96dc88149c3f/deps/wrapper.cc#L63
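If that token matters for worker registration, the fix on the wrapper side would presumably be to parse whatever the raylet appends to the worker command line (see the worker_pool.cc link above) and forward that into the core worker options instead of hardcoding 0. Rough sketch, with the flag spelling and the `startup_token` field name as assumptions to check against the linked code and wrapper.cc:

```cpp
// Sketch: parse the raylet-provided startup token from argv instead of
// hardcoding 0. The flag spelling ("--startup-token=") and the options field
// it feeds are assumptions; verify against worker_pool.cc and the core worker
// options header.
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

int64_t ParseStartupToken(int argc, char **argv, int64_t fallback = 0) {
  const std::string prefix = "--startup-token=";
  for (int i = 1; i < argc; ++i) {
    if (std::strncmp(argv[i], prefix.c_str(), prefix.size()) == 0) {
      return std::strtoll(argv[i] + prefix.size(), /*end=*/nullptr, /*base=*/10);
    }
  }
  return fallback;  // e.g. for the driver, which isn't launched by the raylet
}

// ...then, when filling in the options in wrapper.cc, something like
//   options.startup_token = ParseStartupToken(argc, argv);
// rather than the current hardcoded 0.
```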
AHA that seems to do it...
Was fixed by the two above PRs
Here's the behavior I've observed: from julia, I call `submit_task` once, which returns an object id. Then we can `get` that object, which contains the function return value as expected. When I call `submit_task` a second time and try to `get` the object ID, it deadlocks julia (doesn't respond to Ctrl-C/SIGINT), presumably because it's waiting in the C++ code for a response from the object store.

The desired behavior is for the second (and nth, more generally) task submission to behave like the first.
Okay, some more details/breadcrumbs:
The `raylet.err` file shows that it keeps trying to spawn workers to handle this task. That's surprising to me given that (I'd have thought) the worker that executed the first task would be available again. Here's an example of the log messages that seem relevant:

**raylet.err: Initial worker startup**
```
2023-08-18 10:51:54,670 DEBUG context.py:46 -- deserialize: {'command_prefix': [], 'env_vars': {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}, 'py_executable': '/Users/dkleinschmidt/work/ray/venv/bin/python3.11', 'resources_dir': None, 'container': {}, 'java_jars': []}
2023-08-18 10:51:54,670 DEBUG context.py:51 -- Worker context env: {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}
2023-08-18 10:51:54,670 DEBUG context.py:99 -- Exec'ing worker with command: julia -e 'using Ray; start_worker()' -- /Users/dkleinschmidt/work/ray/python/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/raylet --ray_node_manager_port=52301 --ray_address=127.0.0.1:6379 --ray_redis_password= --ray_session_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975 --ray_logs_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/logs --ray_node_ip_address=127.0.0.1 --runtime-env-hash=-916701441
[2023-08-18 10:51:57,430 I 83708 2543240] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 83708
[2023-08-18 10:51:57,432 I 83708 2543240] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-18 10:51:57,433 I 83708 2543240] grpc_server.cc:140: worker server started, listening on port 10003.
[2023-08-18 10:51:57,434 I 83708 2543240] core_worker.cc:217: Initializing worker at address: 127.0.0.1:10003, worker ID 8bd9e2c2e6595bd8fbbb05213112515d3be4fa475a6fecb188b0cda4, raylet 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745
[2023-08-18 10:51:57,434 I 83708 2543240] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-08-18 10:51:57,435 I 83708 2543292] accessor.cc:611: Received notification for node id = 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745, IsAlive = 1
[2023-08-18 10:51:57,435 I 83708 2543292] core_worker.cc:4134: Number of alive nodes:1
[2023-08-18 10:51:57,435 I 83708 2543240] core_worker.cc:605: Adjusted worker niceness to 15
```

**raylet.err: Second worker startup logs**
```
2023-08-18 10:52:52,265 DEBUG context.py:46 -- deserialize: {'command_prefix': [], 'env_vars': {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}, 'py_executable': '/Users/dkleinschmidt/work/ray/venv/bin/python3.11', 'resources_dir': None, 'container': {}, 'java_jars': []}
2023-08-18 10:52:52,265 DEBUG context.py:51 -- Worker context env: {'JULIA_PROJECT': '/Users/dkleinschmidt/work/ray_core_worker_julia_jll.jl/Ray.jl'}
2023-08-18 10:52:52,265 DEBUG context.py:99 -- Exec'ing worker with command: julia -e 'using Ray; start_worker()' -- /Users/dkleinschmidt/work/ray/python/ray/cpp/default_worker --ray_plasma_store_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/plasma_store --ray_raylet_socket_name=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/sockets/raylet --ray_node_manager_port=52301 --ray_address=127.0.0.1:6379 --ray_redis_password= --ray_session_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975 --ray_logs_dir=/tmp/ray/session_2023-08-18_10-51-24_326840_81975/logs --ray_node_ip_address=127.0.0.1 --runtime-env-hash=-916701441
[2023-08-18 10:52:55,112 I 86310 2547502] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 86310
[2023-08-18 10:52:55,113 I 86310 2547502] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-08-18 10:52:55,114 I 86310 2547502] grpc_server.cc:140: worker server started, listening on port 10004.
[2023-08-18 10:52:55,115 I 86310 2547502] core_worker.cc:217: Initializing worker at address: 127.0.0.1:10004, worker ID 02fa3aacc11b0a32e167a159ee3371fe09980eabf9418ffc3ebb27da, raylet 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745
[2023-08-18 10:52:55,116 I 86310 2547502] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-08-18 10:52:55,116 I 86310 2548799] accessor.cc:611: Received notification for node id = 2094214605657fee83bc844f06d1453a7edb0c72254af0131509a745, IsAlive = 1
[2023-08-18 10:52:55,116 I 86310 2548799] core_worker.cc:4134: Number of alive nodes:1
[2023-08-18 10:52:55,116 I 86310 2547502] core_worker.cc:605: Adjusted worker niceness to 15
[2023-08-18 10:52:55,116 I 86310 2548799] core_worker.cc:553: Event stats:
[2023-08-18 10:52:57,436 I 83708 2543292] core_worker.cc:553: Event stats:
[2023-08-18 10:53:05,155 W 86310 2548794] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-08-18 10:53:52,029 E 81980 2540600] (raylet) worker_pool.cc:544: Some workers of the worker process(86309) have not registered within the timeout. The process is still alive, probably it's hanging during start.
```

That basic pattern then repeats over and over again, with new julia processes being spawned every minute or so. The PIDs recorded in the log filenames are still live:
However, digging around a bit more, the `getpid() - 1` processes are all the juliaup launcher scripts; only the most recently spawned and original workers still have their launcher processes around as well, and the others are all zombies. This is a bit confusing to me, since killing the juliaup launcher does seem to kill the actual julia process it spawns... but maybe the raylet is doing something else to tell the worker processes to shut down that kills the juliaup launcher but leaves the others as zombies...

So one possibility is that we're somehow not indicating that the worker is no longer being used. I kind of thought that returning `Status::OK()` would be sufficient for that, since AFAICT that's all that the python workers do.

I'm pretty sure that there isn't some kind of runtime error on the workers (those show up in raylet.err and cause the process to die), and with a bit of extra println debugging it looks like the `initialize_coreworker_worker` function is in fact executing `CoreWorkerProcess::Initialize` and `CoreWorkerProcess::RunTaskExecutionLoop` as usual. But the `task_execution_callback` doesn't seem to be getting called at all.

My current best guess is that there's some kind of misconfiguration we're doing in initializing the worker and/or failing to release the worker after we're done executing a task.
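For reference, the worker-side flow I'd expect our wrapper to be following is roughly the sketch below, so the question is which of these options we're getting wrong. Everything here beyond the function names already mentioned above (namespaces, headers, option field names, the callback) is an assumption to verify against the pinned Ray fork and wrapper.cc:

```cpp
// Sketch of the expected worker-side lifecycle, pieced together from the
// function names mentioned above. Namespaces, headers, and option field names
// are assumptions; check the core_worker headers in the pinned Ray fork.
#include <cstdint>

#include "ray/core_worker/core_worker.h"
#include "ray/core_worker/core_worker_process.h"

void RunWorkerSketch(int64_t startup_token) {
  ray::core::CoreWorkerOptions options;
  options.worker_type = ray::WorkerType::WORKER;  // worker, not driver
  options.startup_token = startup_token;          // parsed from argv, see the earlier sketch
  // options.task_execution_callback = ...;       // the callback that dispatches into Julia

  // Registers with the raylet and starts the gRPC worker server we see in
  // raylet.err ("worker server started, listening on port ...").
  ray::core::CoreWorkerProcess::Initialize(options);

  // Blocks, invoking task_execution_callback once per assigned task, until the
  // callback returns a system-exit status (per the lifecycle comment above).
  ray::core::CoreWorkerProcess::RunTaskExecutionLoop();
}
```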