KarhouTam / FL-bench

Benchmark of federated learning. Dedicated to the community. 🤗

Division by zero on cfl.py #77

Closed LawrenceLeitgib closed 4 months ago

LawrenceLeitgib commented 4 months ago

Describe the bug Line 105 of cfl.py, shown below, may lead to a division-by-zero error if none of the models in the cluster have been updated.

104      weights = torch.ones(len(model_params_diff_list)) * (
105                    1 / len(model_params_diff_list)
106               )

Fixing the bug Adding the two lines below just before line 104 should be enough to fix it.

if len(model_params_diff_list) == 0:
    continue

It should work because, when no model in the cluster has been updated, there is nothing to do for that cluster.
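For context, here is a minimal sketch of how the guard would sit in the cluster aggregation step. The surrounding loop and names such as `client_clusters` and `client_params_diff` are assumptions for illustration, not the exact identifiers in cfl.py; only the `weights` computation and the added guard reflect the snippet above.

import torch

def aggregate_clusterwise(client_clusters, client_params_diff):
    for cluster in client_clusters:
        # Collect updates only from clients in this cluster that trained this round.
        model_params_diff_list = [
            client_params_diff[i] for i in cluster if i in client_params_diff
        ]

        # Proposed guard: with no updates there is nothing to aggregate,
        # and 1 / len(model_params_diff_list) would divide by zero.
        if len(model_params_diff_list) == 0:
            continue

        weights = torch.ones(len(model_params_diff_list)) * (
            1 / len(model_params_diff_list)
        )
        # ... aggregate the collected diffs with `weights` ...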

To Reproduce I got the error near the end of training (around 70% done) when running these commands

python generate_data.py -d cifar10 -a 0.1 -cn 100
python main.py cfl config/cifar10.yml

where cifar10.yml is this:

# Full explanations are listed in README.md

mode: serial # [parallel, serial]

parallel: # It's fine to keep these configs.
  # See the doc at `https://docs.ray.io/en/latest/ray-core/api/doc/ray.init.html` for more details.
  ray_cluster_addr: null # [null, auto, local]

  # `null` implies that all cpus/gpus are included.
  num_cpus: null
  num_gpus: null

  # Should be set larger than 1, or training mode falls back to `serial`.
  # Setting a larger `num_workers` can further boost efficiency, but also leaves each worker with fewer computational resources.
  num_workers: 2

common:
  # [mnist, cifar10, cifar100, emnist, fmnist, femnist, medmnist, medmnistA, medmnistC, covid19, celeba, synthetic, svhn, tiny_imagenet, cinic10, domain]
  dataset: cifar10
  seed: 42
  model: res18
  join_ratio: 0.1
  global_epoch: 100
  local_epoch: 5
  finetune_epoch: 0
  batch_size: 32
  test_interval: 100
  straggler_ratio: 0
  straggler_min_local_epoch: 0
  external_model_params_file: null
  buffers: local # [local, global, drop]
  optimizer:
    name: adam # [sgd, adam, adamw, rmsprop, adagrad]
    lr: 0.0001
    dampening: 0 # for SGD
    weight_decay: 0
    momentum: 0 # for [SGD, RMSprop]
    alpha: 0.99 # for RMSprop
    nesterov: false # for SGD
    betas: [0.9, 0.999] # for [Adam, AdamW]
    amsgrad: false # for [Adam, AdamW]

  lr_scheduler:
    name: null # [null, step, cosine, constant, plateau]
    step_size: 10 # an arg example for setting step lr_scheduler

  eval_test: true
  eval_val: false
  eval_train: false

  verbose_gap: 10
  visible: null # [null, visdom, tensorboard]
  use_cuda: true
  save_log: true
  save_model: false
  save_fig: true
  save_metrics: true
  delete_useless_run: true

# You can also set specific arguments for FL methods
# FL-bench accesses an FL method's arguments via args.<method>.<arg>
# e.g. (see the sketch after this config)
fedprox:
  mu: 0.01
pfedsim:
  warmup_round: 0.7
# ...

# NOTE: For those unmentioned arguments, the default values are set in `get_hyperparams()` in `class <method>Server` in `src/server/<method>.py`
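As an illustration of the args.<method>.<arg> pattern mentioned in the config comments above, here is a hypothetical sketch of reading a method-specific argument once the YAML is parsed. The yaml/SimpleNamespace loader below is only for demonstration and is not FL-bench's actual configuration code.

import yaml
from types import SimpleNamespace

def to_namespace(obj):
    # Recursively convert nested dicts into attribute-accessible namespaces.
    if isinstance(obj, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in obj.items()})
    return obj

with open("config/cifar10.yml") as f:
    args = to_namespace(yaml.safe_load(f))

print(args.fedprox.mu)           # 0.01
print(args.common.optimizer.lr)  # 0.0001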
KarhouTam commented 4 months ago

Thanks for pointing this out, and for your solution.