Closed Guncuke closed 5 months ago
Thanks for your attention to this project.
When I train with pfedla (CIFAR-10, 500 clients, Dirichlet alpha=1), I find that CPU usage is over 10000% while GPU usage is only 3%. I'm running in serial mode with the default settings and use_cuda = True.
pfedla needs to store a set of parameters for each client (a personal model and a hypernetwork), and the server-side aggregation runs on the CPU only.
With your 500-client FL setting, there are 500 sets of parameters stored in memory, and the server-side aggregation has to load all 500 of them, which can cause very high CPU load. On the other hand, when you split CIFAR-10 into 500 pieces, each piece is small, and a client training locally on such a small data shard produces only a low GPU load.
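To see why the aggregation step is CPU-bound, here is a minimal NumPy sketch of FedAvg-style weighted averaging over many per-client parameter sets (the function name and shapes are illustrative, not the project's actual code):

```python
import numpy as np

def fedavg_aggregate(client_params, weights):
    """Weighted average of per-client parameter dicts (FedAvg-style).
    This loop runs entirely on the CPU, so with hundreds of clients
    the CPU load is high while the GPU sits idle."""
    total = sum(weights)
    agg = {}
    for name in client_params[0]:
        agg[name] = sum(w * p[name] for w, p in zip(weights, client_params)) / total
    return agg

# 500 clients, each holding its own copy of the model parameters in memory
rng = np.random.default_rng(0)
clients = [{"w": rng.normal(size=(64, 64))} for _ in range(500)]
agg = fedavg_aggregate(clients, [1.0] * 500)
print(agg["w"].shape)  # (64, 64)
```

Keeping 500 such dicts resident at once is also what drives the high memory footprint described above.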
Another question: when I use Dirichlet with alpha=0.3 and client_num=500, it becomes very slow. Is that normal?
When you set a large client_num with a small alpha, it is harder to satisfy the target Dirichlet distribution condition than with a smaller client_num at the same alpha. So in your situation (--client_num 500 --alpha 0.3), the program may keep generating random numbers trying to satisfy the condition. You can try setting --least_sample smaller, --alpha bigger, or --client_num smaller.
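The slowdown described above can be sketched with a minimal Dirichlet partitioner that rejects and resamples until every client has enough data (the function name and parameters mirror the flags in this thread but are otherwise hypothetical, not the repo's actual implementation):

```python
import numpy as np

def dirichlet_partition(labels, client_num, alpha, least_sample, seed=0):
    """Partition sample indices across clients with a Dirichlet prior
    over per-class proportions. The while-loop keeps resampling until
    every client holds at least `least_sample` samples -- with a small
    alpha and a large client_num, many draws fail this check, which is
    exactly what makes partitioning slow."""
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    idx_by_class = [np.flatnonzero(labels == c) for c in range(n_classes)]
    while True:  # rejection loop: retry until the constraint holds
        client_idx = [[] for _ in range(client_num)]
        for idx in idx_by_class:
            rng.shuffle(idx)
            # fraction of this class assigned to each client
            props = rng.dirichlet([alpha] * client_num)
            splits = (np.cumsum(props) * len(idx)).astype(int)[:-1]
            for cid, shard in enumerate(np.split(idx, splits)):
                client_idx[cid].extend(shard.tolist())
        if min(len(s) for s in client_idx) >= least_sample:
            return client_idx

# a mild setting terminates quickly; try client_num=500, alpha=0.3 to
# watch the rejection loop spin for much longer
labels = np.random.default_rng(1).integers(0, 10, size=5000)
parts = dirichlet_partition(labels, client_num=20, alpha=0.5, least_sample=10)
print(min(len(p) for p in parts))
```

Each of the suggested flags attacks the same bottleneck: a smaller --least_sample loosens the acceptance check, while a bigger --alpha or a smaller --client_num makes each draw more balanced, so fewer draws get rejected.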
Thanks!