AIoT-MLSys-Lab / FedRolex

[NeurIPS 2022] "FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction" by Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang
Apache License 2.0

Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 0.15}. Add suitable node types to this cluster to resolve this issue. #2

Closed Sherrylife closed 1 year ago

Sherrylife commented 1 year ago

Hi @samiul272, I downloaded your code and ran the following commands:

    pip install -r requirements.txt
    pip install tensorboard

Then I ran:

    python main_resnet.py --data_name CIFAR10 \
        --model_name resnet18 \
        --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
        --exp_name roll_test \
        --algo roll \
        --g_epoch 3200 \
        --l_epoch 1 \
        --lr 2e-4 \
        --schedule 1200 \
        --seed 31 \
        --num_experiments 3 \
        --devices 0 1 2

However, I got an error: [image]

Since I am not familiar with the Ray framework, could you help me solve this error?

samiul272 commented 1 year ago

Hi, what GPU are you using? Ray does not always honor the GPU: 0.15 restriction. It seems that 10 clients do not fit within your system's resources. Try setting the participation rate to 0.05 to confirm. You can follow the code on Stack Overflow to see how to make changes for cases where the total number of clients does not fit in the available VRAM.
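For readers following along, the resource arithmetic behind this suggestion can be sketched in plain Python. This is a back-of-the-envelope check, not code from FedRolex: the helper names are hypothetical, the reading of `1_100_0.1` in `--control_name` as "100 users, 0.1 participation rate" is an assumption based on the "10 clients" mentioned above, and the sketch assumes each fractional-GPU task is placed on a single GPU rather than split across devices.

```python
import math
from fractions import Fraction


def active_clients(num_users, participation_rate):
    """Clients sampled per round (hypothetical helper, rounding up)."""
    # Fraction(str(...)) avoids float surprises such as 0.1 * 100 != 10.
    return math.ceil(Fraction(str(participation_rate)) * num_users)


def max_concurrent_tasks(num_gpus, gpu_per_task):
    """Tasks that fit at once if each task must sit on one GPU (assumption)."""
    per_task = Fraction(str(gpu_per_task))
    if not 0 < per_task <= 1:
        raise ValueError("gpu_per_task must be in (0, 1]")
    tasks_per_gpu = int(1 // per_task)  # e.g. 0.15 GPU each gives 6 per GPU
    return num_gpus * tasks_per_gpu


clients = active_clients(100, 0.1)       # 10 clients per round
capacity = max_concurrent_tasks(3, 0.15)  # 18 slots on 3 GPUs
print(clients, capacity, clients <= capacity)
```

Under these assumptions, 10 clients at 0.15 GPU each would fit on 3 visible GPUs, while a capacity of 0 (no GPUs detected by the scheduler) means no request can ever be placed.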

Sherrylife commented 1 year ago

Hi @samiul272, thanks for your reply. My GPU is a GeForce RTX 3090, and the GPU resources on my server are as follows: [image] Even after I changed the client participation ratio to 0.01, the code still reported the same error: [image]

Sherrylife commented 1 year ago

Hi @samiul272, I have successfully run your code. The only extra changes I made were replacing ray.init() with ray.init(num_gpus=5) and training on all GPUs by setting --devices 0 1 2 3 4. It looks like I did not have enough GPU resources to run the default code. [image]
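A minimal sketch of the change described above, for anyone hitting the same error. It assumes the number of GPUs declared to Ray should match the number of indices passed to `--devices`, which is how the fix in this comment was applied; the helper names `ray_init_kwargs` and `start_ray` are hypothetical and not part of FedRolex. `ray.init(num_gpus=...)` and `ray.cluster_resources()` are standard Ray calls.

```python
def ray_init_kwargs(devices):
    """Build keyword arguments for ray.init from a list of GPU indices.

    `devices` mirrors the values given to --devices (assumption: one
    Ray GPU resource per listed device).
    """
    return {"num_gpus": len(devices)}


def start_ray(devices):
    """Start Ray with an explicit GPU count instead of auto-detection."""
    import ray  # imported lazily so the helper above works without Ray

    ray.init(**ray_init_kwargs(devices))
    # Print what Ray believes is available, to confirm the GPU count.
    print(ray.cluster_resources())


print(ray_init_kwargs([0, 1, 2, 3, 4]))  # {'num_gpus': 5}
```

Calling `start_ray([0, 1, 2, 3, 4])` on the machine in question would be equivalent to `ray.init(num_gpus=5)`; checking `ray.cluster_resources()` afterwards is one way to see whether Ray registered any GPUs at all.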

samiul272 commented 1 year ago

Awesome, thanks for letting me know! I will try to find out why it runs on my end without any errors but fails in your setup, and push changes if needed.

Sherrylife commented 1 year ago

Thank you again for your patient response.