GaParmar / img2img-turbo

One-step image-to-image with Stable Diffusion turbo: sketch2image, day2night, and more

ValueError: Attempting to unscale FP16 gradients. #18

Open mutd-bru8 opened 3 months ago

mutd-bru8 commented 3 months ago

Hi, I am trying to run train_pix2pix_turbo.py on WSL2 Ubuntu, and I get the error below.

ValueError: Attempting to unscale FP16 gradients.

How can I fix this error? Does anyone know?

GaParmar commented 3 months ago

Could you share what accelerate training config you were using when you encountered this error?

-Gaurav

mutd-bru8 commented 3 months ago

@GaParmar Thank you for replying. Here is my accelerate config.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
GaParmar commented 3 months ago

It looks like you are trying to do mixed-precision (fp16) training. That might be the source of the issue.
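For reference, this failure mode is easy to reproduce in isolation. A minimal sketch (assuming a CUDA device; this is not code from the repo): torch's GradScaler refuses to unscale gradients that are stored in fp16, which happens whenever the trainable parameters themselves are fp16.

import torch

model = torch.nn.Linear(4, 4).cuda().half()   # trainable weights stored in fp16
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()                 # gradients come out in fp16
scaler.unscale_(opt)                          # ValueError: Attempting to unscale FP16 gradients.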

mutd-bru8 commented 3 months ago

@GaParmar When I use that config, I also pass fp16 as the mixed_precision argument.

Could you tell me how the config should be set up? I still get the same error when I change the mixed_precision setting.

GaParmar commented 3 months ago

Could you try with a config like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
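Assuming you save this as a YAML file (the filename below is only an example), training would then be launched with something like:

accelerate launch --config_file train_config.yaml src/train_pix2pix_turbo.py

followed by the script's usual training arguments. The change relevant to this error is mixed_precision: 'no'; adjust gpu_ids and num_processes to match your machine.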
kosmels commented 3 months ago

Hello! Firstly, @GaParmar thank you for sharing this repository with us! :slightly_smiling_face:

I am encountering the same issue as @mutd-bru8. The reason is that we actually WANT to do fp16 training, and that's why we are getting this error. The question is whether it is possible to do fp16 training within this repository.

With the accelerate config you provided, fp16 training is still not possible and the error stays the same.

Here is the full error message:

  File "/root/img2img-turbo/src/train_pix2pix_turbo.py", line 307, in <module>
    main(args)
  File "/root/img2img-turbo/src/train_pix2pix_turbo.py", line 190, in main
    accelerator.clip_grad_norm_(layers_to_opt, args.max_grad_norm)
  File "/root/img2img-turbo/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2145, in clip_grad_norm_
    self.unscale_gradients()
  File "/root/img2img-turbo/.venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2095, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/root/img2img-turbo/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/root/img2img-turbo/.venv/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
GaParmar commented 3 months ago

Ah, I wrote the training code for fp32 training. It should be possible to change the training script to support fp16. I will take a look at doing this soon. But if you are familiar with this, feel free to implement it and open a PR!
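For anyone who wants to attempt this in the meantime, the usual pattern in diffusers-style training scripts is to keep the trainable parameters in fp32 and let autocast run the forward pass in fp16, so the scaler never sees fp16 gradients. A rough sketch (the helper name is hypothetical; layers_to_opt is the parameter list the script already builds before creating the optimizer):

import torch

def upcast_trainable_params(params):
    # Upcast only the trainable parameters to fp32; frozen weights can
    # stay in fp16, since the scaler only touches gradients of trained params.
    for p in params:
        if p.requires_grad:
            p.data = p.data.to(torch.float32)

# in train_pix2pix_turbo.py, before the optimizer is constructed:
# upcast_trainable_params(layers_to_opt)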