Steps/Code to Reproduce
Start a Ray cluster and pass --num-cpus 0 to at least one node in the cluster.
Run any flower job.
Expected Results
The job should succeed.
Actual Results
After connecting to the Ray cluster, the following error is raised:
Traceback (most recent call last):
  File "flower_sim/main.py", line 77, in main
    history = fl.simulation.start_simulation(
  File "venv/lib/python3.10/site-packages/flwr/simulation/app.py", line 242, in start_simulation
    pool = VirtualClientEngineActorPool(
  File "venv/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 184, in __init__
    num_actors = pool_size_from_resources(client_resources)
  File "venv/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 116, in pool_size_from_resources
    num_cpus = node_resources["CPU"]
KeyError: 'CPU'
Thanks for flagging this @lbhm! Let’s think of a good way to solve this. I agree that excluding clients from running on the head node makes sense in some settings.
Describe the bug
The new VCE currently expects each Ray node to have CPU resources as long as it has any resources at all (https://github.com/adap/flower/blob/main/src/py/flwr/simulation/ray_transport/ray_actor.py#L128). This behavior is undesirable in a setting where a Ray cluster is managed by a head node that is not supposed to run tasks (and therefore started with --num-cpus 0). Instead, the VCE should default to assuming zero CPU resources if the key is absent, similar to how GPU resources are queried (https://github.com/adap/flower/blob/main/src/py/flwr/simulation/ray_transport/ray_actor.py#L129).
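The proposed behavior could look roughly like the sketch below. It is a simplified illustration, not Flower's actual implementation: `count_actor_slots` and `example_nodes` are hypothetical names, and the node dictionaries only imitate the shape of entries returned by `ray.nodes()` with made-up values. The only point is that reading `"CPU"` with a default of 0 lets a node without CPU resources contribute zero actors instead of raising `KeyError`.

```python
from typing import Any, Dict, List


def count_actor_slots(
    nodes: List[Dict[str, Any]], client_resources: Dict[str, float]
) -> int:
    """Count how many client actors fit on the given nodes (hypothetical helper)."""
    cpus_per_client = client_resources["num_cpus"]
    gpus_per_client = client_resources.get("num_gpus", 0)
    total = 0
    for node in nodes:
        resources = node.get("Resources", {})
        # Default to 0 so that a node started with --num-cpus 0 adds no slots
        # instead of raising KeyError: 'CPU'.
        num_cpus = resources.get("CPU", 0)
        num_gpus = resources.get("GPU", 0)
        slots = int(num_cpus / cpus_per_client)
        if gpus_per_client > 0:
            # A client that needs GPUs is also limited by the node's GPU count.
            slots = min(slots, int(num_gpus / gpus_per_client))
        total += slots
    return total


# Dictionaries shaped like entries of ray.nodes(); all values are made up.
example_nodes = [
    {"Resources": {"memory": 1e9}},  # head node without a "CPU" key
    {"Resources": {"CPU": 8.0, "GPU": 1.0, "memory": 1e9}},
    {"Resources": {"CPU": 4.0, "memory": 1e9}},
]

slots = count_actor_slots(example_nodes, {"num_cpus": 2})
```

With this default, the head node is simply skipped when sizing the pool, which matches how the GPU lookup already treats nodes without GPUs.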