Chris-hughes10 / pytorch-accelerated

A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal, but extensible training loop which is flexible enough to handle the majority of use cases, and capable of utilizing different hardware options with no code changes required. Docs: https://pytorch-accelerated.readthedocs.io/en/latest/
Apache License 2.0

Distributed training results in slow convergence #59

Open biewubwerqwe opened 3 months ago

biewubwerqwe commented 3 months ago

Hi, I am using the sample code for timm model training. There is a mismatch between the results when I accelerate the code on GPU and when I don't. What can be the reason for this?
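(For context, assuming the sample script follows the library's quickstart pattern, the training setup would be something along these lines; the model, dummy data, and hyperparameters below are placeholders, not the actual timm example:)

```python
# Minimal sketch of a pytorch-accelerated training script, assuming the
# quickstart pattern; model, dummy data, and hyperparameters are illustrative.
import timm
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset
from pytorch_accelerated import Trainer

model = timm.create_model("resnet18", pretrained=False, num_classes=10)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Dummy tensors standing in for the real dataset
train_dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 10, (64,)))
eval_dataset = TensorDataset(torch.randn(16, 3, 224, 224), torch.randint(0, 10, (16,)))

trainer = Trainer(model=model, loss_func=loss_func, optimizer=optimizer)
trainer.train(
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    num_epochs=1,
    per_device_batch_size=32,
)
```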

[Image: training curves comparing the three runs]

There are 3 results in the image:

  1. baseline_batch-32 is the result of just running `python train.py`
  2. baseline_batch-32_nodist is the result of using the accelerate config `accelerate_config_nodist.yaml`
  3. baseline_batch-32_1gpu is the result of using the accelerate config `accelerate_config_1gpu.yaml`

The config for nodist is:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '1'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

The config for 1 gpu is:

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 3,
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
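(For reference, run 1 presumably used plain `python train.py`, while runs 2 and 3 would have been launched with `accelerate launch --config_file accelerate_config_nodist.yaml train.py` and `accelerate launch --config_file accelerate_config_1gpu.yaml train.py` respectively; the exact invocations aren't shown in the issue.)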

Chris-hughes10 commented 3 months ago

Hi, what is the difference between your configs? They look the same to me.

biewubwerqwe commented 3 months ago

Sorry, that was a mistake. I pasted the same config twice. I have corrected it now and you can see the difference.