Describe the bug
Line 105 of cfl.py, shown below, may lead to a division-by-zero error, which can occur when none of the models in the cluster have been updated.
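Roughly, the cluster-wise aggregation averages only the models of cluster members that were updated in the current round, so the divisor is zero whenever no member of a cluster was selected (quite likely with join_ratio: 0.1). A minimal, self-contained sketch of that pattern, with illustrative names rather than the actual FL-bench code:

import numpy as np

# Illustrative only -- not the real cfl.py. Each client's model is a flat
# parameter vector; `updated_clients` are the clients sampled this round.
def aggregate_cluster(client_params: dict[int, np.ndarray],
                      cluster: list[int],
                      updated_clients: set[int]) -> np.ndarray:
    updates = [client_params[cid] for cid in cluster if cid in updated_clients]
    # If no member of this cluster was updated, `updates` is empty and the
    # division below is 0 / 0 -> ZeroDivisionError (the suspected crash).
    return sum(updates) / len(updates)

params = {cid: np.full(4, float(cid)) for cid in range(4)}
aggregate_cluster(params, cluster=[2, 3], updated_clients={0, 1})  # raises ZeroDivisionError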
Here is the cifar10.yml config I was running:

# Full explaination are listed on README.md
mode: serial # [parallel, serial]
parallel: # It's fine to keep these configs.
  # Go check doc of `https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html` for more details.
  ray_cluster_addr: null # [null, auto, local]
  # `null` implies that all cpus/gpus are included.
  num_cpus: null
  num_gpus: null
  # should be set larger than 1, or training mode fallback to `serial`
  # Set a larger `num_workers` can further boost efficiency, also let each worker have less computational resources.
  num_workers: 2
common:
  # [mnist, cifar10, cifar100, emnist, fmnist, femnist, medmnist, medmnistA, medmnistC, covid19, celeba, synthetic, svhn, tiny_imagenet, cinic10, domain]
  dataset: cifar10
  seed: 42
  model: res18
  join_ratio: 0.1
  global_epoch: 100
  local_epoch: 5
  finetune_epoch: 0
  batch_size: 32
  test_interval: 100
  straggler_ratio: 0
  straggler_min_local_epoch: 0
  external_model_params_file: null
  buffers: local # [local, global, drop]
  optimizer:
    name: adam # [sgd, adam, adamw, rmsprop, adagrad]
    lr: 0.0001
    dampening: 0 # for SGD
    weight_decay: 0
    momentum: 0 # for [SGD, RMSprop]
    alpha: 0.99 # for RMSprop
    nesterov: false # for SGD
    betas: [0.9, 0.999] # for [Adam, AdamW]
    amsgrad: false # for [Adam, AdamW]
  lr_scheduler:
    name: null # [null, step, cosine, constant, plateau]
    step_size: 10 # an arg example for setting step lr_scheduler
  eval_test: true
  eval_val: false
  eval_train: false
  verbose_gap: 10
  visible: null # [null, visdom, tensorboard]
  use_cuda: true
  save_log: true
  save_model: false
  save_fig: true
  save_metrics: true
  delete_useless_run: true
# You can set specific arguments for FL methods also
# FL-bench uses FL method arguments by args.<method>.<arg>
# e.g.
fedprox:
  mu: 0.01
pfedsim:
  warmup_round: 0.7
# ...
# NOTE: For those unmentioned arguments, the default values are set in `get_hyperparams()` in `class <method>Server` in `src/server/<method>.py`
Fixing the bug
To fix it, adding the two lines below just before line 104 should be enough.
It should work because, in the case where no model has been updated, there is nothing to be done in that cluster.
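The guard itself would be roughly the two lines marked in this sketch of the cluster-wise loop (again with illustrative names, not the actual cfl.py code):

import numpy as np

def aggregate_clusterwise(client_params: dict[int, np.ndarray],
                          clusters: list[list[int]],
                          updated_clients: set[int]) -> dict[int, np.ndarray]:
    cluster_models = {}
    for idx, cluster in enumerate(clusters):
        updates = [client_params[cid] for cid in cluster if cid in updated_clients]
        # Proposed two-line guard: skip clusters in which no model was updated.
        if len(updates) == 0:
            continue
        cluster_models[idx] = sum(updates) / len(updates)
    return cluster_models

A cluster with no updated members is then simply left unchanged for that round, which matches the reasoning above.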
To Reproduce
I got the error toward the end of training (around 70% done) when running those specific lines, where cifar10.yml is the config shown above.