Closed Sherrylife closed 1 year ago
Hi, what GPU are you using? Ray does not always enforce the GPU:0.15 restriction. It seems that 10 clients do not fit within the resources of your system. Try using 0.05 as the participation rate to confirm. You can follow the code in stackoverflow to see how to make changes for cases where the total number of clients does not fit in the available VRAM.
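As a rough illustration of the resource arithmetic in this reply, the sketch below counts how many clients are sampled per round and how many logical GPU slots a fractional request such as 0.15 leaves available. The function names are hypothetical, and the figure of 100 total clients is an assumption read from the `1_100_0.1_...` control name. This only models logical scheduling slots; it says nothing about actual VRAM usage, which is the part the fractional setting may not enforce.

```python
from fractions import Fraction
import math


def clients_per_round(num_clients: int, participation_rate: str) -> int:
    """Number of clients sampled in one round (hypothetical helper)."""
    # Fraction parses the decimal string exactly, avoiding float rounding.
    return math.ceil(num_clients * Fraction(participation_rate))


def concurrent_slots(num_gpus: int, gpu_per_client: str) -> int:
    """Logical slots when each client requests a fraction of one GPU.

    Assumes fractional requests are packed per GPU, so a GPU holds
    floor(1 / fraction) clients at once.
    """
    per_gpu = Fraction(1) // Fraction(gpu_per_client)
    return num_gpus * int(per_gpu)


if __name__ == "__main__":
    # Assumed setup: 100 clients, participation rate 0.1, 0.15 GPU per client.
    print(clients_per_round(100, "0.1"))   # clients sampled per round
    print(concurrent_slots(1, "0.15"))     # slots if only one GPU is visible
    print(concurrent_slots(3, "0.15"))     # slots with three GPUs
```

Under these assumptions, a rate of 0.1 samples 10 clients, while a single visible GPU offers only 6 logical slots at 0.15 each, so lowering the rate to 0.05 (5 clients) is one way to check whether resources are the bottleneck.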
Hi @samiul272, thanks for your reply. My GPU is a GeForce RTX 3090, and the GPU resources on my server are as follows:
Even after I changed the client participation ratio to 0.01, my code still reported the same error:
Hi @samiul272, I have successfully run your code. The only extra things I did were changing ray.init() to ray.init(num_gpus=5) and using all GPUs for training by setting --devices 0 1 2 3 4. It looks like I don't have enough GPU resources to run the default code.
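A minimal sketch of the workaround described above, assuming the GPU count passed to Ray should match the number of device IDs given on the command line. The helper `gpu_settings` is hypothetical and is not part of the repository; the Ray call is shown only as a comment because it requires Ray to be installed.

```python
def gpu_settings(devices):
    """Derive a GPU count and a CUDA_VISIBLE_DEVICES string from device IDs.

    `devices` mirrors the values passed to --devices, e.g. ["0", "1", "2"].
    This is a hypothetical helper for illustration only.
    """
    ids = [int(d) for d in devices]
    return {
        "num_gpus": len(ids),
        "CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in ids),
    }


settings = gpu_settings(["0", "1", "2", "3", "4"])

# With Ray installed, the count could be passed explicitly instead of
# relying on auto-detection, as in the workaround from this thread:
#   import ray
#   ray.init(num_gpus=settings["num_gpus"])
```

Passing the count explicitly keeps the value given to ray.init consistent with the --devices list, which is what the combination reported above (num_gpus=5 with devices 0 to 4) does by hand.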
Awesome, thanks for letting me know! I will try and check why it runs at my end without any errors but fails in your setup and push changes if needed.
Thank you again for your patient response.
Hi @samiul272, I downloaded your code and ran the commands
pip install -r requirements.txt
pip install tensorboard
Then, I ran the command
python main_resnet.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 3 \
    --devices 0 1 2
However, there was an error:
Since I am not familiar with the Ray framework, can you help me solve this error?