SymbioticLab / FedScale

FedScale is a scalable and extensible open-source federated learning (FL) platform.
https://fedscale.ai
Apache License 2.0

AttributeError: 'numpy.ndarray' object has no attribute 'to' #229

Closed. li1553770945 closed this issue 1 year ago.

li1553770945 commented 1 year ago

What happened + What you expected to happen

Traceback (most recent call last):
  File "/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/aggregator.py", line 899, in <module>
    aggregator.run()
  File "/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/aggregator.py", line 370, in run
    self.event_monitor()
  File "/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/aggregator.py", line 873, in event_monitor
    self.deserialize_response(data))
  File "/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/aggregator.py", line 425, in client_completion_handler
    self.update_weight_aggregation(results)
  File "/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/aggregator.py", line 443, in update_weight_aggregation
    self.model_wrapper.set_weights(copy.deepcopy(self.model_weights))
  File "/data/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/internal/torch_model_adapter.py", line 35, in set_weights
    self.optimizer.update_round_gradient(weights, current_grad_weights, self.model)
  File "/data/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/optimizers.py", line 37, in update_round_gradient
    last_model = [x.to(device=self.device) for x in last_model]
  File "/data/home/yaning/CodeFolder/FedScaleOrigin/fedscale/cloud/aggregation/optimizers.py", line 37, in <listcomp>
    last_model = [x.to(device=self.device) for x in last_model]
AttributeError: 'numpy.ndarray' object has no attribute 'to'
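
The error distills to two lines of plain Python (an illustration, not FedScale code): torch tensors have a .to() method, numpy arrays do not.

import numpy as np
import torch

torch.zeros(3).to(device="cpu")   # fine: Tensor.to moves/copies the tensor
np.zeros(3).to(device="cpu")      # AttributeError: 'numpy.ndarray' object has no attribute 'to'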

In fedscale/cloud/internal/torch_model_adapter.py, line 35 (set_weights), weights is a list of np.ndarray. set_weights passes it to optimizer.update_round_gradient in fedscale/cloud/aggregation/optimizers.py, and line 37 there, last_model = [x.to(device=self.device) for x in last_model], fails because each x is an np.ndarray rather than a torch.Tensor, so it has no to method. (The traceback reaches this path because the config below uses gradient_policy: fed-yogi.)
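
A possible workaround (my own sketch, not an official patch) is to coerce numpy arrays to tensors before the device move, as a drop-in replacement for line 37 of optimizers.py:

# Hypothetical fix for fedscale/cloud/aggregation/optimizers.py, line 37:
# convert numpy arrays to tensors before moving them to the target device.
# Assumes numpy (as np) and torch are in scope, and that each x is either
# an np.ndarray or a torch.Tensor.
last_model = [
    (torch.from_numpy(x) if isinstance(x, np.ndarray) else x).to(device=self.device)
    for x in last_model
]

Equivalently, set_weights in torch_model_adapter.py could convert the weight list to tensors before calling update_round_gradient, so the optimizer only ever sees torch.Tensor values.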

Versions / Dependencies

fedscale==0.5

python==3.7.16

os:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"

Reproduction script

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
ps_ip: 10.128.201.124

# ip address of each worker : # of available GPU processes on each GPU in this node
# Note that if we collocate the ps and a worker on the same GPU, decrease the number of available processes on that GPU by 1
# E.g., if the master node has 4 available processes, 1 goes to the ps, and the worker entry should be set to worker:3
worker_ips: 
    # - 10.128.201.129:[5,5] # worker_ip: [(# processes on gpu) for gpu in available_gpus]
    - 10.128.201.124:[3,3]

exp_path: $FEDSCALE_HOME/fedscale/cloud

# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py

aggregator_entry: aggregation/aggregator.py

auth:
    ssh_user: "yaning"
    ssh_private_key: ~/.ssh/id_rsa

# Commands to run before we can actually run FAR (in order)
setup_commands:
    - source $HOME/anaconda3/bin/activate fedscaleorigin   
    - export NCCL_SOCKET_IFNAME='enp94s0f0'         # Run "ifconfig" to ensure the right NIC for nccl if you have multiple NICs

# ========== Additional job configuration ========== 
# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found

job_conf: 
    - job_name: google_speech                   # Generate logs under this folder: log_path/job_name/time_stamp
    - use_cuda: True
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - task: speech
    - num_participants: 50                     # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: google_speech                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/google_speech    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/google_speech/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: resnet34                            # Models: e.g., shufflenet_v2_x2_0, mobilenet_v2, resnet34, albert-base-v2
    - gradient_policy: fed-yogi              # {"fed-yogi", "fed-prox", "fed-avg"}, "fed-avg" by default; this fed-yogi path triggers the error above (see the sketch after this config)
    - eval_interval: 10                    # How many rounds to run a testing on the testing set
    - rounds: 1000                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 4
    - yogi_eta: 3e-3 
    - yogi_tau: 1e-8
    - local_steps: 30
    - learning_rate: 0.05
    - batch_size: 16
    - test_bsz: 20
    - sample_mode: oort             
    - save_checkpoint: False
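
For context on why update_round_gradient needs everything on one device: with gradient_policy: fed-yogi, the server applies the FedYogi adaptive update of Reddi et al. (Adaptive Federated Optimization, 2021), which combines the previous model and the newly aggregated weights elementwise. A rough sketch under the assumption that all inputs are torch tensors (the names here are illustrative, not FedScale's exact internals; eta and tau correspond to yogi_eta and yogi_tau in the config above):

import torch

def yogi_step(last_model, new_model, m, v, eta=3e-3, tau=1e-8,
              beta1=0.9, beta2=0.99):
    # One FedYogi server round over per-layer tensors. All four lists must
    # live on the same device for the elementwise ops below, which is why
    # update_round_gradient calls .to(device=...) on last_model first.
    updated = []
    for x_old, x_new, m_i, v_i in zip(last_model, new_model, m, v):
        delta = x_new - x_old                                  # server pseudo-gradient
        m_i.mul_(beta1).add_(delta, alpha=1 - beta1)           # first moment
        v_i.sub_((1 - beta2) * delta.pow(2)
                 * torch.sign(v_i - delta.pow(2)))             # Yogi second moment
        updated.append(x_old + eta * m_i / (v_i.sqrt() + tau))
    return updated

Feeding np.ndarray weights into a step like this is what the traceback above boils down to.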

Issue Severity

High: It blocks me from completing my task.