adap / flower

Flower: A Friendly Federated Learning Framework
https://flower.ai
Apache License 2.0

New VCE crashes if a Ray cluster contains nodes without CPU resources #2478

Open lbhm opened 11 months ago

lbhm commented 11 months ago

Describe the bug

The new Virtual Client Engine (VCE) currently expects every Ray node that reports any resources at all to also report CPU resources (https://github.com/adap/flower/blob/main/src/py/flwr/simulation/ray_transport/ray_actor.py#L128). This is a problem when a Ray cluster is managed by a head node that is not supposed to run tasks and is therefore started with --num-cpus 0: such a node has no "CPU" key in its resources.

Instead, the VCE should default to assuming zero CPU resources if the key is absent, similar to how GPU resources are queried (https://github.com/adap/flower/blob/main/src/py/flwr/simulation/ray_transport/ray_actor.py#L129).
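To illustrate the proposed behavior, here is a minimal sketch rather than the actual Flower code. The helper name `count_actor_slots` and the example node dictionaries are hypothetical; the sketch takes plain resource dicts as input instead of querying Ray, and it assumes `client_resources["num_cpus"]` is positive. The only point is that a missing "CPU" key is treated as zero, the same way a missing "GPU" key is.

```python
from typing import Dict, List, Union

Number = Union[int, float]


def count_actor_slots(
    node_resource_list: List[Dict[str, Number]],
    client_resources: Dict[str, Number],
) -> int:
    """Hypothetical helper: count how many client actors fit across all nodes."""
    total = 0
    for node_resources in node_resource_list:
        # Nodes that report no resources at all are skipped.
        if not node_resources:
            continue
        # Default to 0 when a key is absent, e.g. a head node
        # started with --num-cpus 0 has no "CPU" entry.
        num_cpus = node_resources.get("CPU", 0)
        num_gpus = node_resources.get("GPU", 0)
        slots = int(num_cpus / client_resources["num_cpus"])
        # Only constrain by GPUs when the client actually requests some.
        if client_resources.get("num_gpus", 0) > 0:
            slots = min(slots, int(num_gpus / client_resources["num_gpus"]))
        total += slots
    return total


# Assumed example cluster: a CPU-less head node plus two workers.
nodes = [
    {"memory": 8e9, "node:10.0.0.1": 1.0},
    {"CPU": 8.0, "GPU": 1.0, "memory": 32e9},
    {"CPU": 4.0, "memory": 16e9},
]

cpu_only_slots = count_actor_slots(nodes, {"num_cpus": 2})
gpu_slots = count_actor_slots(nodes, {"num_cpus": 2, "num_gpus": 0.5})
```

With this lookup the head node simply contributes zero slots instead of raising KeyError, so no clients are scheduled on it.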

Steps/Code to Reproduce

Expected Results

The job should succeed.

Actual Results

After connecting to the Ray cluster, the following error is raised:

Traceback (most recent call last):
  File "flower_sim/main.py", line 77, in main
    history = fl.simulation.start_simulation(
  File "venv/lib/python3.10/site-packages/flwr/simulation/app.py", line 242, in start_simulation
    pool = VirtualClientEngineActorPool(
  File "venv/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 184, in __init__
    num_actors = pool_size_from_resources(client_resources)
  File "venv/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 116, in pool_size_from_resources
    num_cpus = node_resources["CPU"]
KeyError: 'CPU'

jafermarq commented 11 months ago

Thanks for flagging this, @lbhm! Let's think of a good way to solve this. I agree that excluding clients from running on the head node makes sense in some settings.