TL-System / plato

A federated learning framework to support scalable and reproducible research
Apache License 2.0

[BUG] IndexError when running examples/fei/fei.py #160

Closed: cuiboyuan closed this issue 2 years ago

cuiboyuan commented 2 years ago

Describe the bug When running examples/fei/fei.py with the config fei_FashionMNIST_lenet5.yml, an IndexError is raised in rl_server.py when the server tries to perform federated averaging in the first round.

To Reproduce Steps to reproduce the behavior:

  1. Run python examples/fei/fei.py -c fei_FashionMNIST_lenet5.yml
  2. Wait until the server does the first round of federated averaging
  3. Encounter IndexError

Expected behavior No error should be raised

Screenshots The following snippet is the Traceback of the error

[INFO][14:40:48]: [Server #5476] All 10 client report(s) received. Processing.
[INFO][14:40:48]: [RL Agent] Preparing action...
[INFO][14:40:48]: [RL Agent] Selecting action...
[ERROR][14:40:48]: Task exception was never retrieved
...
Traceback (most recent call last):
...
  File "path-to-plato\plato\plato\servers\fedavg.py", line 127, in aggregate_weights
    update = await self.federated_averaging(updates)
  File "path-to-plato\plato\plato\utils\reinforcement_learning\rl_server.py", line 84, in federated_averaging
    avg_update[name] += delta * self.smart_weighting[i][0]
IndexError: invalid index to scalar variable.

Additional context I tried removing the [0] at the end of line 84 in rl_server.py, and the program seems to proceed normally without errors. However, when I checked the value of self.smart_weighting, it was always a vector of ten 0.1 values in every round. I'm unsure whether that is the expected behavior.
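For readers who want to see the failure in isolation, here is a minimal sketch of how this kind of IndexError can arise. It assumes that self.smart_weighting is a 1-D NumPy array, which the error message suggests but the thread does not confirm. The names flat_weights, column_weights, and first_weight are hypothetical and do not come from Plato.

```python
import numpy as np

# Hypothetical stand-in for self.smart_weighting: a 1-D vector of ten weights.
flat_weights = np.full(10, 0.1)


def first_weight(weights, i):
    """Mimic the indexing pattern weights[i][0] seen in the traceback."""
    return weights[i][0]


# With a 1-D array, weights[i] is already a NumPy scalar,
# so the extra [0] has nothing left to index and raises IndexError.
try:
    first_weight(flat_weights, 0)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print("1-D array:", error_message)

# The same pattern works when each weight sits in its own length-1 row.
column_weights = flat_weights.reshape(-1, 1)  # shape (10, 1)
print("2-D array:", first_weight(column_weights, 0))
```

If that assumption holds, the expression in rl_server.py expects one length-1 entry per client, while the old config produces a flat vector. That would also explain why dropping the [0] makes the error go away.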

silviafeiwang commented 2 years ago

Thanks for pointing out the bug and providing the details. self.smart_weighting is a vector of ten 0.1 values in each round at the beginning of training because, by FEI's design, the agent first follows the FedAvg aggregation policy. The parameter algorithm:start_steps in the YAML config file determines how many steps the agent follows this policy before switching to the RL policy. In the given config file, every client has the same number of data samples, so FedAvg divides the total weight of 1 evenly among the clients.
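The weighting rule described above can be sketched in a few lines. This is an illustration of standard FedAvg weighting (each client's weight is its share of the total number of samples), not code from Plato or FEI. The function name fedavg_weights and the sample counts are made up for the example.

```python
def fedavg_weights(num_samples):
    """Return each client's FedAvg weight: its share of all training samples."""
    total = sum(num_samples)
    return [n / total for n in num_samples]


# Ten clients with the same (hypothetical) number of samples:
# every weight is 1/10, and the weights sum to 1.
equal_weights = fedavg_weights([600] * 10)
print(equal_weights)

# Clients with different (hypothetical) sample counts get uneven weights.
uneven_weights = fedavg_weights([100, 300, 600])
print(uneven_weights)
```

With equal sample counts, every client ends up with the same weight, which matches the ten 0.1 values observed. Uneven sample counts produce uneven weights that still sum to 1.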

As for the bug, after reproducing it I suspect that the error is caused by the config file fei_FashionMNIST_lenet5.yml you used, which may be incompatible with the latest version of the framework because of certain parameter settings. This is my fault: I failed to keep things updated on GitHub and to maintain clear documentation for using FEI. I have updated some example config files that I'm currently using, under the directory examples/fei/ (for training FEI from scratch). The index error should not occur with these config files. I also set data:variable_partition to true in them, so even at the beginning of training, under the FedAvg aggregation policy, you should see less uniform values in self.smart_weighting.

cuiboyuan commented 2 years ago

Thank you for the detailed explanation. The error is now resolved. Thanks!