A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal but extensible training loop that is flexible enough to handle the majority of use cases and capable of utilizing different hardware options with no code changes required. Docs: https://pytorch-accelerated.readthedocs.io/en/latest/
Apache License 2.0
Distributed training results in slow convergence #59
Hi,
I am using the sample code for timm model training. There is a mismatch in results when I accelerate the code with a GPU versus when I do not. What could be the reason for this?
There are 3 results in the image:
`baseline_batch-32` is the result of just running `python train.py`
`baseline_batch-32_nodist` is the result of using accelerate config `accelerate_config_nodist.yaml`
`baseline_batch-32_1gpu` is the result of using accelerate config `accelerate_config_1gpu.yaml`
The config for nodist is
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '1'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
The config for 1 gpu is
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 3,
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
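One common cause of convergence differences when switching to distributed training is the change in effective batch size: under data-parallel training, each of the N processes draws its own batch, so the optimizer effectively steps on `per_device_batch * N` samples, and the learning rate is often scaled to compensate. Note that in both configs above `num_processes` is 1, so this particular effect should not apply here, and the cause may lie elsewhere (e.g. a different GPU, nondeterminism, or DDP wrapping side effects). A minimal sketch of the arithmetic, assuming a per-device batch size of 32 (the function names and numbers are illustrative, not part of the library):

```python
def effective_batch_size(per_device_batch: int, num_processes: int) -> int:
    # Each data-parallel process sees its own batch per step,
    # so gradients are averaged over per_device_batch * num_processes samples.
    return per_device_batch * num_processes


def linearly_scaled_lr(base_lr: float, num_processes: int) -> float:
    # The linear scaling rule is a common heuristic (not a guarantee):
    # multiply the single-process learning rate by the number of processes.
    return base_lr * num_processes


# With num_processes = 1 (as in both configs), nothing changes:
print(effective_batch_size(32, 1))      # same as the baseline batch size
# With e.g. 2 GPUs, the effective batch doubles and the LR is often doubled too:
print(effective_batch_size(32, 2))
print(linearly_scaled_lr(1e-3, 2))
```

If results still diverge with a single process, it may be worth pinning the same physical GPU in both runs (the configs above select GPU `1` and GPU `3` respectively) and fixing random seeds before comparing curves.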