Hi @nguyenhongson1902, that tutorial is a bit outdated... Could you indicate what `client_resources` you pass to `start_simulation`? Recall that if your system has 6 GPUs and you want one client per GPU, you'd need to set `num_gpus=1`; with `num_gpus=0.25`, 4 clients will run on each GPU and, since you have 6 GPUs, a total of 24 clients could be running in parallel. However, this is true if and only if you have enough CPUs to support it. The number of CPUs that get reserved per client is set via the `num_cpus` option. So, following the previous example, if you set `num_cpus=1, num_gpus=0.25`, you'll be able to have 4x6 clients training in parallel if you have 24 CPUs (which I presume your server has).
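As a minimal sketch using the legacy `start_simulation` API (the dummy client below is just a placeholder for your real client logic, and `.to_client()` is only needed on recent Flower versions):

```python
import flwr as fl

# Placeholder client so the snippet is self-contained; substitute your
# real NumPyClient with actual fit/evaluate logic.
class DummyClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return []

    def fit(self, parameters, config):
        return [], 0, {}

    def evaluate(self, parameters, config):
        return 0.0, 0, {}

def client_fn(cid: str):
    return DummyClient().to_client()  # older versions: return DummyClient() directly

# With num_cpus=1 and num_gpus=0.25 on a 24-CPU / 6-GPU machine:
# min(24 // 1, 6 / 0.25) = 24 clients can train concurrently.
fl.simulation.start_simulation(
    client_fn=client_fn,
    num_clients=24,
    config=fl.server.ServerConfig(num_rounds=3),
    client_resources={"num_cpus": 1, "num_gpus": 0.25},
)
```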
Note that Flower has been updated considerably since then (and we are launching new tutorials soon). Let me point you to newer examples:
Note that for the latter you'd need to edit the `pyproject.toml` and indicate the `client_resources` as an option to your backend. You can follow how it's done in the `flowertune-vit` example I mention above (in particular, see the `local-simulation-gpu` federation defined in its `pyproject.toml`), sketched below.
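The relevant bit of such a `pyproject.toml` looks roughly like this (the values below are illustrative; check the actual `flowertune-vit` file for the exact ones):

```toml
[tool.flwr.federations.local-simulation-gpu]
options.num-supernodes = 10
# Resources reserved per client; with num-gpus = 0.25, four clients
# share each GPU, subject to having enough CPUs available.
options.backend.client-resources.num-cpus = 1
options.backend.client-resources.num-gpus = 0.25
```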
I hope this helps! Ping us if you have any questions.
Hi @nguyenhongson1902, were you able to solve it? Should we close this issue then?
Hi @jafermarq. Thank you for your explanation of how `num_cpus` and `num_gpus` work. Yes, I solved the issue by setting `CUDA_VISIBLE_DEVICES` to specify the GPUs that I want to use for a run. Btw, have you guys made a new tutorial video for the newest Flower version yet?
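For reference, that looks like the following (the GPU indices here are illustrative; in my case GPU3 was the one already occupied, so I exposed only the other five):

```bash
# Hide GPU3 from the process; Flower/Ray will only see and schedule
# clients onto GPUs 0, 1, 2, 4 and 5.
CUDA_VISIBLE_DEVICES=0,1,2,4,5 python main.py num_rounds=3
```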
@nguyenhongson1902 very soon, actually! I'm planning to record an update to the simulation series sometime next week. Are there some points in particular you'd find interesting to include?
Thanks for your quick response. I'm very interested in a few things, among them how `context` works as well. Hope to see all that next week.
I'll add those to the topics to cover. Many thanks for bringing these up 💯. I'll close this issue now. Happy to continue the conversation over Slack or on https://discuss.flower.ai/ 🙌
Describe the bug
I've used `start_simulation` to simulate a federated learning task on MNIST and run the script on an Ubuntu machine with 6 GPUs, 24 GB each (see the first image below). I've noticed that when all GPUs are free my code works well, but when GPU3 is already occupied I see this bug (see the second image). I've tried to tweak the `client_resources` parameter but it doesn't work. Is it a bug? What should I do to fix it? Thank you so much!

Note: I refer to the tutorial code at this link.
Steps/Code to Reproduce
Run the Python script:
`python main.py num_rounds=3`
Expected Results
Actual Results