adap / flower

Flower: A Friendly Federated AI Framework
https://flower.ai
Apache License 2.0

ValueError: ActorPool is empty. Stopping Simulation. Check 'client_resources' passed to `start_simulation` #4138

Closed · nguyenhongson1902 closed this issue 1 week ago

nguyenhongson1902 commented 2 months ago

Describe the bug

I've used start_simulation to simulate a federated learning task on MNIST, running the script on an Ubuntu machine with 6 GPUs, 24 GB each (see the first image below). When all GPUs are idle, my code works fine, but when GPU 3 is already occupied, I get the error below (see the second image). I've tried tweaking the client_resources parameter, but it doesn't help. Is this a bug? What should I do to fix it? Thank you so much!

Note: I refer to the tutorial code at this link

[Image 1: GPU status of the 6-GPU machine]

[Image 2: the error raised when GPU 3 is already in use]

Steps/Code to Reproduce

Run the Python script: `python main.py num_rounds=3`

Expected Results

The simulation starts and completes all 3 federated rounds without raising an error.

Actual Results

Error executing job with overrides: ['num_rounds=3']
Traceback (most recent call last):
  File "/home/ubuntu/son.nh/venomancer_flower/main.py", line 47, in main
    history = fl.simulation.start_simulation(
  File "/home/ubuntu/miniconda3/envs/pytorch-gpu/lib/python3.10/site-packages/flwr/simulation/app.py", line 289, in start_simulation
    pool = VirtualClientEngineActorPool(
  File "/home/ubuntu/miniconda3/envs/pytorch-gpu/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 175, in __init__
    num_actors = pool_size_from_resources(client_resources)
  File "/home/ubuntu/miniconda3/envs/pytorch-gpu/lib/python3.10/site-packages/flwr/simulation/ray_transport/ray_actor.py", line 132, in pool_size_from_resources
    raise ValueError(
ValueError: ActorPool is empty. Stopping Simulation. Check 'client_resources' passed to `start_simulation`
jafermarq commented 2 months ago

Hi @nguyenhongson1902, that tutorial is a bit outdated... Could you indicate what client_resources you pass to start_simulation? Recall that if your system has 6 GPUs and you want one client per GPU, you'd need to set num_gpus=1. With num_gpus=0.25, 4 clients will run on each GPU and, since you have 6 GPUs, a total of 24 clients could be running in parallel. However, this is true if and only if you have enough CPUs to support it. The number of CPUs reserved per client is set via the num_cpus option. So, following the previous example, if you set num_cpus=1, num_gpus=0.25, you'll be able to have 4x6 clients training in parallel as long as you have 24 CPUs (which I presume your server has).
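For reference, here is a minimal sketch of passing those resources to the legacy `start_simulation` API; the `client_fn` stub and the `num_clients` value are placeholders, not taken from this issue:

```python
import flwr as fl

def client_fn(cid: str) -> fl.client.Client:
    # Placeholder: build and return the Flower client for partition `cid`
    # (in the reporter's setup, an MNIST client).
    raise NotImplementedError

history = fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=100,  # placeholder value
    config=fl.server.ServerConfig(num_rounds=3),
    # 1 CPU + a quarter of a GPU per client -> up to 4 clients per GPU,
    # i.e. at most 24 concurrent clients on a 6-GPU / 24-CPU machine.
    client_resources={"num_cpus": 1, "num_gpus": 0.25},
)
```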

Note that Flower has been updated considerably since then (and we are launching new tutorials soon). Let me point you to newer examples:

Note that for the latter you'd need to edit the pyproject.toml and indicate the client_resources as an option for your backend. You can follow how it's done in the flowertune-vit example I mention above (in particular, see the local-simulation-gpu federation defined in its pyproject.toml). A rough sketch is shown below.
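As a rough sketch (the supernode count is a placeholder, and the exact keys should be double-checked against the flowertune-vit example's pyproject.toml), such a federation could look like this:

```toml
# Hypothetical GPU-simulation federation; adjust the values to your hardware.
[tool.flwr.federations.local-simulation-gpu]
options.num-supernodes = 100
options.backend.client-resources.num-cpus = 1     # CPUs reserved per client
options.backend.client-resources.num-gpus = 0.25  # fraction of a GPU per client
```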

I hope this helps! Ping us if you have any questions.

jafermarq commented 1 week ago

Hi @nguyenhongson1902, were you able to solve it? Should we close this issue then?

nguyenhongson1902 commented 1 week ago

> Hi @nguyenhongson1902, were you able to solve it? Should we close this issue then?

Hi @jafermarq. Thank you for your explanation of how num_cpus and num_gpus work. Yes, I solved the issue by setting CUDA_VISIBLE_DEVICES to specify which GPUs I want to use for a run. By the way, do you have a new tutorial video for the newest Flower version yet?
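For example, launching with `CUDA_VISIBLE_DEVICES=0,1,2,4,5 python main.py num_rounds=3` would hide the occupied GPU 3 from the simulation (the device indices here are just an illustration, not taken from the reporter's setup).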

jafermarq commented 1 week ago

@nguyenhongson1902 very soon, actually! I'm planning to record an update to the simulation series sometime next week. Are there any points in particular you'd find interesting to include?

nguyenhongson1902 commented 1 week ago

> @nguyenhongson1902 very soon, actually! I'm planning to record an update to the simulation series sometime next week. Are there any points in particular you'd find interesting to include?

Thanks for your quick response. I'm very interested in a few things:

Hope to see all that next week.

jafermarq commented 1 week ago

I'll add those to the topics to cover. Many thanks for bringing these up 💯. I'll close this issue now. Happy to continue the conversation over Slack or on https://discuss.flower.ai/ 🙌