LambdaLabsML / examples

Deep Learning Examples

Error when trying to reproduce on Colab Pro Plus with A100 GPU #14

Open mrm8488 opened 2 years ago

mrm8488 commented 2 years ago

I am getting the following error when running the notebook on Colab Pro+ with one A100 GPU:

  File "main.py", line 812, in <module>
    trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/properties.py", line 421, in from_argparse_args
    return from_argparse_args(cls, args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 52, in from_argparse_args
    return cls(**trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 446, in __init__
    terminate_on_nan,
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 50, in on_trainer_init
    self.configure_accumulated_gradients(accumulate_grad_batches)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 66, in configure_accumulated_gradients
    raise TypeError("Gradient accumulation supports only int and dict types")
TypeError: Gradient accumulation supports only int and dict types

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 909, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
madaan commented 2 years ago

This can happen if the model checkpoint (sd-v1-4-full-ema.ckpt) is not where main.py expects it to be (the local directory). Is the model present in your local directory, or at the path you pass to --finetune_from?
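A quick way to check is the minimal sketch below; it reuses the hf_hub_download call quoted later in this thread and assumes huggingface_hub is installed and you are logged in:

import os
from huggingface_hub import hf_hub_download

# Resolve the checkpoint and confirm it exists before launching main.py.
ckpt_path = hf_hub_download(
    repo_id="CompVis/stable-diffusion-v-1-4-original",
    filename="sd-v1-4-full-ema.ckpt",
    use_auth_token=True,
)
print(ckpt_path, os.path.exists(ckpt_path))  # expect a real path and True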

viba98 commented 2 years ago

I'm facing the same issue. But in the Pokemon example notebook, ckpt_path is already defined here: ckpt_path = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4-full-ema.ckpt", use_auth_token=True). Am I missing something?

justinpinkney commented 2 years ago

The TypeError: Gradient accumulation supports only int and dict types suggests that the accumulate_grad_batches argument is wrong.

This is set by lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES"; typically I set it to 1.
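For illustration, a minimal sketch against the pytorch_lightning 1.x shown in the traceback above (accepted types may differ in newer releases):

from pytorch_lightning import Trainer

Trainer(accumulate_grad_batches=1)       # OK: an int, accumulate every step
Trainer(accumulate_grad_batches={5: 2})  # OK: a dict mapping epoch -> factor
Trainer(accumulate_grad_batches="")      # raises the TypeError above: only
                                         # int and dict are accepted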

devonbrackbill commented 1 year ago

Has anyone been able to resolve this?

  1. I don't think it's a path issue, because ckpt_path = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4-full-ema.ckpt", use_auth_token=True) returns the correct path to the .ckpt file under the /root directory.
  2. And I don't think it's a wrong argument to lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES", because I have ACCUMULATE_BATCHES set to 1.

For reference, I'm trying to train this on a single GPU, and from the sounds of it so is the OP (running "the notebook on colab pro plus with one A100 GPU"), so I don't know whether that affects the setup here (in particular, does ACCUMULATE_BATCHES need to be a different value? see the sketch below).
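As an aside, a rough sketch of the arithmetic (standard Lightning behavior, not specific to this repo): gradient accumulation only scales the effective batch size, so 1 is a valid value on a single GPU.

# Rough sketch with the values quoted in this thread: accumulation just
# multiplies the effective batch size, so ACCUMULATE_BATCHES = 1 means
# plain per-step updates and is fine on a single GPU.
BATCH_SIZE = 4
NUM_GPUS = 1
ACCUMULATE_BATCHES = 1

effective_batch = BATCH_SIZE * NUM_GPUS * ACCUMULATE_BATCHES
print(f"{effective_batch} samples per optimizer step")  # -> 4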

For reference, my python main.py call is below (I had to change --gpus "$gpu_list" \ to --auto_select_gpus \ because I was getting a different error: error: argument --gpus: invalid _gpus_allowed_type value: ''):

python main.py \
    -t \
    --base "$YAML_PATH" \
    --auto_select_gpus \
    --scale_lr False \
    --num_nodes 1 \
    --check_val_every_n_epoch 10 \
    --finetune_from "$ckpt_path" \
    data.params.batch_size="$BATCH_SIZE" \
    lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES" \
    data.params.validation.params.n_gpus="$NUM_GPUS"

with

BATCH_SIZE = 4
N_GPUS = 1
ACCUMULATE_BATCHES = 1
treksis commented 1 year ago

Solved: change N_GPUS to NUM_GPUS instead. Running on Colab Pro+ with an A100:

# A100:
BATCH_SIZE = 4
NUM_GPUS = 1
ACCUMULATE_BATCHES = 1

gpu_list = ",".join((str(x) for x in range(NUM_GPUS))) + ","
print(f"Using GPUs: {gpu_list}")
lvsi-qi commented 1 year ago

> Solved: change N_GPUS to NUM_GPUS instead. Running on Colab Pro+ with an A100.

Hello, I encountered this problem on Colab. Have you encountered it? How can I solve it?

treksis commented 1 year ago

It has been a while. I would rather use this Colab notebook for training: https://github.com/Linaqruf/kohya-trainer