FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.
Apache License 2.0

Nooo GPU Device FedGKT #417

Closed vanwayplus closed 2 years ago

vanwayplus commented 2 years ago

I'm trying to run a simulation on my server, which is equipped with 8 GPUs. My code is based on https://github.com/FedML-AI/FedML/tree/master/python/examples/simulation/mpi_torch_fedgkt_mnist_lr_example, and I redefined gpu_mapping.yaml as: mapping_default: air-01-GPU:[3,3,3,2]

And this is my config/fedml_config.yaml:

common_args:
  training_type: "simulation"
  random_seed: 0

data_args:
  dataset: "cifar10"
  data_cache_dir: ~/fedml_data
  partition_method: "hetero"
  partition_alpha: 0.5

model_args:
  model: "resnet56"

train_args:
  federated_optimizer: "FedGKT"
  client_id_list: "[]"
  client_num_in_total: 5
  client_num_per_round: 3
  comm_round: 2
  epochs: 1
  batch_size: 100
  client_optimizer: sgd
  learning_rate: 0.03
  weight_decay: 0.001
  server_optimizer: sgd
  lr: 0.001
  server_lr: 0.001
  wd: 0.001
  ci: 0
  server_momentum: 0.9
  no_bn_wd: false
  optimizer: "SGD"
  temperature: 100
  whether_training_on_client: false
  epochs_client: 2
  epochs_server: 2
  sweep: 1
  whether_distill_on_the_server: true
  alpha: 0.5 #knowledge distillation parameter

validation_args:
  frequency_of_the_test: 5

device_args:
  worker_num: 5
  using_gpu: true
  gpu_mapping_file: config/gpu_mapping.yaml
  gpu_mapping_key: mapping_default
  multi_gpu_server: true

comm_args:
  backend: "MPI"
  is_mobile: 0

tracking_args:
  log_file_dir: ./log
  enable_wandb: false
  wandb_key: ee0b5f53d949c84cee7decbe7a629e63fb2f8408
  wandb_project: my_fedml
  wandb_name: fedml_torch_fedavg_mnist_lr
  using_mlops: false

And when I run the example, this is the shell output:

======== FedML (https://fedml.ai) ========
FedML version: 0.7.212
Execution path:/xxx/python3.7/site-packages/fedml/__init__.py

======== Running Environment ========
OS: Linux-5.15.0-41-generic-x86_64-with-debian-bullseye-sid
Hardware: x86_64
Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
PyTorch version: 1.12.0
MPI4py is installed

======== CPU Configuration ========
The CPU usage is : 9%
Available CPU Memory: 191.6 G / 251.69030380249023G

======== GPU Configuration ========
No GPU devices
[]
args.client_id_list = None
args.client_id_list is not None
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 00:01:14] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config/fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config/fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'cifar10', 'data_cache_dir': '/data/aiot1/fedml_data', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'resnet56', 'federated_optimizer': 'FedGKT', 'client_id_list': '[]', 'client_num_in_total': 5, 'client_num_per_round': 3, 'comm_round': 2, 'epochs': 1, 'batch_size': 100, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'server_optimizer': 'sgd', 'lr': 0.001, 'server_lr': 0.001, 'wd': 0.001, 'ci': 0, 'server_momentum': 0.9, 'no_bn_wd': False, 'optimizer': 'SGD', 'temperature': 100, 'whether_training_on_client': False, 'epochs_client': 2, 'epochs_server': 2, 'sweep': 1, 'whether_distill_on_the_server': True, 'alpha': 0.5, 'frequency_of_the_test': 5, 'worker_num': 11, 'using_gpu': True, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'multi_gpu_server': True, 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'my_fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x7f204634b770>, 'process_id': 2, 'sys_perf_profiling': True}
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 00:01:14] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config/fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config/fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'cifar10', 'data_cache_dir': '/data/aiot1/fedml_data', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'resnet56', 'federated_optimizer': 'FedGKT', 'client_id_list': '[]', 'client_num_in_total': 5, 'client_num_per_round': 3, 'comm_round': 2, 'epochs': 1, 'batch_size': 100, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'server_optimizer': 'sgd', 'lr': 0.001, 'server_lr': 0.001, 'wd': 0.001, 'ci': 0, 'server_momentum': 0.9, 'no_bn_wd': False, 'optimizer': 'SGD', 'temperature': 100, 'whether_training_on_client': False, 'epochs_client': 2, 'epochs_server': 2, 'sweep': 1, 'whether_distill_on_the_server': True, 'alpha': 0.5, 'frequency_of_the_test': 5, 'worker_num': 11, 'using_gpu': True, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'multi_gpu_server': True, 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'my_fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x7f10f1e0f730>, 'process_id': 1, 'sys_perf_profiling': True}
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 00:01:14] [INFO] [gpu_mapping_mpi.py:25:mapping_processes_to_gpu_device_from_yaml_file_mpi] gpu_util = xxx:[3,3,3,2]

and the

No GPU devices

line under GPU Configuration is what stops me.

Actually, there's another kind of error that occurred:

======== FedML (https://fedml.ai) ========
FedML version: 0.7.212
Execution path:/xxx/fedml/lib/python3.7/site-packages/fedml/__init__.py

======== Running Environment ========
OS: Linux-5.15.0-41-generic-x86_64-with-debian-bullseye-sid
Hardware: x86_64
Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) 
[GCC 7.5.0]
PyTorch version: 1.12.0
MPI4py is installed

======== CPU Configuration ========
The CPU usage is : 7%
Available CPU Memory: 193.3 G / 251.69030380249023G

======== GPU Configuration ========
No GPU devices
[]
args.client_id_list = None
args.client_id_list is not None
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 00:30:43] [ERROR] [mlops_runtime_log.py:32:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "fedgkt_cifar10_resnet56_step_by_step.py", line 9, in <module>
    device = fedml.device.get_device(args)
  File "/xxxx/fedml/lib/python3.7/site-packages/fedml/device/device.py", line 43, in get_device
    args.gpu_mapping_key,
  File "/xxxl/lib/python3.7/site-packages/fedml/device/gpu_mapping_mpi.py", line 28, in mapping_processes_to_gpu_device_from_yaml_file_mpi
    for host, gpus_util_map_host in gpu_util.items():
AttributeError: 'str' object has no attribute 'items'
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 00:30:43] [ERROR] [mlops_runtime_log.py:32:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "fedgkt_cifar10_resnet56_step_by_step.py", line 9, in <module>
    device = fedml.device.get_device(args)
  File "/xxx/lib/python3.7/site-packages/fedml/device/device.py", line 43, in get_device
    args.gpu_mapping_key,
  File "/xxx/fedml/lib/python3.7/site-packages/fedml/device/gpu_mapping_mpi.py", line 28, in mapping_processes_to_gpu_device_from_yaml_file_mpi
    for host, gpus_util_map_host in gpu_util.items():
AttributeError: 'str' object has no attribute 'items'

@chaoyanghe

chaoyanghe commented 2 years ago

air-01-GPU:[3,3,3,2] means you will use 10 clients (plus 1 server),

but

your fedml_config.yaml shows

  client_num_in_total: 5
  client_num_per_round: 3
  worker_num: 5

Please align them. You can set

  client_num_in_total: 10
  client_num_per_round: 10
  worker_num: 10
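
For reference, a rough sanity check along these lines (a hypothetical helper, not part of FedML) can catch the mismatch before launching: the processes allocated in the GPU mapping should sum to worker_num plus one server process.

import yaml

def check_gpu_mapping(mapping_path, mapping_key, worker_num):
    # Load the mapping; the expected shape is {hostname: [processes per GPU, ...]}.
    with open(mapping_path) as f:
        mapping = yaml.safe_load(f)[mapping_key]
    total_procs = sum(sum(gpu_list) for gpu_list in mapping.values())
    expected = worker_num + 1  # worker_num clients plus one server process
    if total_procs != expected:
        raise ValueError(
            f"GPU mapping allocates {total_procs} processes, "
            f"but worker_num={worker_num} requires {expected}"
        )

check_gpu_mapping("config/gpu_mapping.yaml", "mapping_default", worker_num=10)
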
vanwayplus commented 2 years ago

Hello, I've set the config as you mentioned:

common_args:
  training_type: "simulation"
  random_seed: 0

data_args:
  dataset: "cifar10"
  data_cache_dir: ~/fedml_data
  partition_method: "hetero"
  partition_alpha: 0.5

model_args:
  model: "resnet56"

train_args:
  federated_optimizer: "FedGKT"
  client_id_list: "[]"
  client_num_in_total: 10
  client_num_per_round: 10
  comm_round: 2
  epochs: 1
  batch_size: 100
  client_optimizer: sgd
  learning_rate: 0.03
  weight_decay: 0.001
  server_optimizer: sgd
  lr: 0.001
  server_lr: 0.001
  wd: 0.001
  ci: 0
  server_momentum: 0.9
  no_bn_wd: false
  optimizer: "SGD"
  temperature: 100
  whether_training_on_client: false
  epochs_client: 2
  epochs_server: 2
  sweep: 1
  whether_distill_on_the_server: true
  alpha: 0.5 #knowledge distillation parameter

validation_args:
  frequency_of_the_test: 5

device_args:
  worker_num: 10
  using_gpu: true
  gpu_mapping_file: config/gpu_mapping.yaml
  gpu_mapping_key: mapping_default
  multi_gpu_server: true

comm_args:
  backend: "MPI"
  is_mobile: 0

tracking_args:
  log_file_dir: ./log
  enable_wandb: false
  wandb_key: ee0b5f53d949c84cee7decbe7a629e63fb2f8408
  wandb_project: my_fedml
  wandb_name: fedml_torch_fedavg_mnist_lr
  using_mlops: false

These are my params as shown in the log:

[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 10:59:58] [INFO] [__init__.py:96:init] args = {'yaml_config_file': 'config/fedml_config.yaml', 'run_id': '0', 'rank': 0, 'local_rank': 0, 'node_rank': 0, 'role': 'client', 'yaml_paths': ['config/fedml_config.yaml'], 'training_type': 'simulation', 'random_seed': 0, 'dataset': 'cifar10', 'data_cache_dir': '/data/aiot1/fedml_data', 'partition_method': 'hetero', 'partition_alpha': 0.5, 'model': 'resnet56', 'federated_optimizer': 'FedGKT', 'client_id_list': '[]', 'client_num_in_total': 10, 'client_num_per_round': 10, 'comm_round': 2, 'epochs': 1, 'batch_size': 100, 'client_optimizer': 'sgd', 'learning_rate': 0.03, 'weight_decay': 0.001, 'server_optimizer': 'sgd', 'lr': 0.001, 'server_lr': 0.001, 'wd': 0.001, 'ci': 0, 'server_momentum': 0.9, 'no_bn_wd': False, 'optimizer': 'SGD', 'temperature': 100, 'whether_training_on_client': False, 'epochs_client': 2, 'epochs_server': 2, 'sweep': 1, 'whether_distill_on_the_server': True, 'alpha': 0.5, 'frequency_of_the_test': 5, 'worker_num': 11, 'using_gpu': True, 'gpu_mapping_file': 'config/gpu_mapping.yaml', 'gpu_mapping_key': 'mapping_default', 'multi_gpu_server': True, 'backend': 'MPI', 'is_mobile': 0, 'log_file_dir': './log', 'enable_wandb': False, 'wandb_key': 'ee0b5f53d949c84cee7decbe7a629e63fb2f8408', 'wandb_project': 'my_fedml', 'wandb_name': 'fedml_torch_fedavg_mnist_lr', 'using_mlops': False, 'comm': <mpi4py.MPI.Intracomm object at 0x7f2b1d8a3770>, 'process_id': 1, 'sys_perf_profiling': True}
[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 10:59:58] [INFO] [gpu_mapping_mpi.py:25:mapping_processes_to_gpu_device_from_yaml_file_mpi] gpu_util = air-01-GPU:[3,3,3,2]

but it's still telling me:

[FedML-Server(0) @device-id-0] [Thu, 28 Jul 2022 10:59:58] [ERROR] [mlops_runtime_log.py:32:handle_exception] Uncaught exception
Traceback (most recent call last):
  File "fedgkt_cifar10_resnet56_step_by_step.py", line 9, in <module>
    device = fedml.device.get_device(args)
  File "xxx/fedml/lib/python3.7/site-packages/fedml/device/device.py", line 43, in get_device
    args.gpu_mapping_key,
  File "xxx/fedml/lib/python3.7/site-packages/fedml/device/gpu_mapping_mpi.py", line 28, in mapping_processes_to_gpu_device_from_yaml_file_mpi
    for host, gpus_util_map_host in gpu_util.items():
AttributeError: 'str' object has no attribute 'items'

But all 8 GPUs are available:

>>> print(torch.cuda.is_available()) 
True
>>> print(torch.cuda.device_count()) 
8
>>> 

That's really strange. @chaoyanghe

chaoyanghe commented 2 years ago

It clearly tells you that the mapping config has a string format issue. The correct format should be:

mapping_default:
    ChaoyangHe-GPU-RTX2080Tix4: [3, 3, 3, 2]
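
To see why the one-line form fails, here is a quick check with plain PyYAML (run outside FedML): the one-line value is parsed as a single string, while the nested form produces the host-to-GPU-list dict that gpu_mapping_mpi.py iterates over with .items().

import yaml

# One-line form from the original gpu_mapping.yaml: the whole value is read
# as one plain scalar (a str), so calling .items() on it raises AttributeError.
broken = yaml.safe_load("mapping_default: air-01-GPU:[3,3,3,2]")
print(type(broken["mapping_default"]))  # <class 'str'>

# Nested form: the value is a dict mapping hostname -> processes per GPU.
fixed = yaml.safe_load("mapping_default:\n    air-01-GPU: [3, 3, 3, 2]\n")
print(type(fixed["mapping_default"]))   # <class 'dict'>
for host, gpus_util_map_host in fixed["mapping_default"].items():
    print(host, gpus_util_map_host)     # air-01-GPU [3, 3, 3, 2]
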
vanwayplus commented 2 years ago

It clearly tells you that the mapping config has a string format issue. The correct format should be:

mapping_default:
    ChaoyangHe-GPU-RTX2080Tix4: [3, 3, 3, 2]

Thanks! That's really helpful. It was indeed the format error, and it works fine now.