SymbioticLab / FedScale

FedScale is a scalable and extensible open-source federated learning (FL) platform.
https://fedscale.ai
Apache License 2.0

FileNotFoundError in load_global_model function #55

Closed: etesami closed this issue 2 years ago

etesami commented 2 years ago

Description

Occasionally, a FileNotFoundError is raised for temp_model_path in the load_global_model function of executor.py: https://github.com/SymbioticLab/FedScale/blob/33de86d2a086084b71a5198db3aa5e6cdbfbf624/core/executor.py#L238-L242

Environment

OS: Ubuntu 18.03 with bash as the default shell
Running on CPU with more than one worker

Error logs

(11-20) 15:03:22 INFO     [executor.py:326] Executor 2: Received (Event:TRAIN) from aggregator
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_3.pth.tar'
Traceback (most recent call last):
  File "/home/ubuntu/FedScale/core/executor.py", line 349, in <module>
    executor.run()
  File "/home/ubuntu/FedScale/core/executor.py", line 205, in run
    self.event_monitor()
  File "/home/ubuntu/FedScale/core/executor.py", line 332, in event_monitor
    train_res = self.training_handler(clientId=clientId, conf=client_conf)
  File "/home/ubuntu/FedScale/core/executor.py", line 266, in training_handler
    client_model = self.load_global_model()
  File "/home/ubuntu/FedScale/core/executor.py", line 240, in load_global_model
    with open(self.temp_model_path, 'rb') as model_in:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/FedScale/core/evals/logs/femnist/1120_150258/executor/model_2.pth.tar'
(11-20) 15:03:23 INFO     [executor.py:326] Executor 1: Received (Event:TRAIN) from aggregator

Config file:

ps_ip: 10.30.72.19

worker_ips: 
    - 10.30.72.9:[1]
    - 10.30.72.29:[1]
    - 10.30.72.30:[1]
    # - 10.30.72.31:[1]
    # - 10.30.72.32:[1]
    # - 10.30.72.33:[1]
    # - 10.30.72.34:[1]

exp_path: $HOME/FedScale/core

executor_entry: executor.py

aggregator_entry: aggregator.py

auth:
    ssh_user: ""
    ssh_private_key: ~/.ssh/id_rsa

setup_commands:
    - source $HOME/anaconda3/bin/activate fedscale
    - export NCCL_SOCKET_IFNAME='eth0' 

job_conf: 
    - job_name: femnist
    - log_path: $HOME/FedScale/core/evals
    - total_worker: 4
    - data_set: femnist
    - data_dir: $HOME/FedScale/dataset/data/femnist
    - data_map_file: $HOME/FedScale/dataset/data/femnist/client_data_mapping/train.csv
    - gradient_policy: yogi
    - eval_interval: 30
    - epochs: 3
    - filter_less: 21
    - num_loaders: 1
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 20
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - malicious_factor: 4

fanlai0990 commented 2 years ago

Hi Ehsan. This error ("No such file or directory") seems a bit odd: I think the code logic should work fine, and we have never encountered it. Have you checked why it happens?

Btw, thanks for your PR. But it can lead to a performance issue: directly reusing self.model in the exception handler will force the next client to start from the last client's model instead of the global model. Lol.
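
The concern about reusing self.model can be illustrated with a toy example. This is only a sketch under stated assumptions: `ToyExecutor`, `load_global_model_with_fallback`, and `train_client` are hypothetical names, the model is a plain dict serialized with pickle, and none of this is FedScale's actual code.

```python
import os
import pickle
import tempfile


class ToyExecutor:
    """Hypothetical stand-in for an executor; not FedScale's actual class."""

    def __init__(self, temp_model_path):
        self.temp_model_path = temp_model_path
        self.model = {"weights": [0.0, 0.0]}  # last model held in memory

    def load_global_model_with_fallback(self):
        # Fallback pattern under discussion: when the checkpoint file is
        # missing, silently reuse whatever model is held in memory.
        try:
            with open(self.temp_model_path, "rb") as model_in:
                return pickle.load(model_in)
        except FileNotFoundError:
            return self.model

    def train_client(self, delta):
        model = self.load_global_model_with_fallback()
        new_weights = [w + delta for w in model["weights"]]
        # The in-memory model now holds this client's result.
        self.model = {"weights": new_weights}
        return new_weights


# The path below never exists, so every load takes the fallback branch.
missing_path = os.path.join(tempfile.mkdtemp(), "model_1.pth.tar")
executor = ToyExecutor(missing_path)
first = executor.train_client(delta=1.0)
second = executor.train_client(delta=1.0)
```

Here the second client ends at `[2.0, 2.0]` rather than `[1.0, 1.0]`, because it started from the first client's weights instead of the shared global model.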

etesami commented 2 years ago

Hey @fanlai0990, I have not been able to find the root cause. However, I see this occasionally when I run an experiment multiple times. I suspect it is related to cleaning up the previous run. I am closing this issue for now and may reopen it if I have more information.
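
If the missing file really does come from a race between cleanup (or an unfinished write) and an executor reading the checkpoint, which is unconfirmed, one generic mitigation is to write the file atomically and to retry the read a bounded number of times before failing. The following is a minimal sketch, not FedScale's code: `save_model_atomically` and `load_model_with_retry` are hypothetical helpers, and pickle serialization is an assumption.

```python
import os
import pickle
import tempfile
import time


def save_model_atomically(model, path):
    """Write to a temp file in the target directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as model_out:
            pickle.dump(model, model_out)
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_model_with_retry(path, attempts=5, delay_s=0.2):
    """Retry briefly on FileNotFoundError, then re-raise the error."""
    for attempt in range(attempts):
        try:
            with open(path, "rb") as model_in:
                return pickle.load(model_in)
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # surface the error instead of hiding it
            time.sleep(delay_s)
```

Unlike a fallback to an in-memory model, this still raises once the retries are exhausted, so a file that is truly gone is reported rather than masked.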