mrm8488 opened 2 years ago

I am getting the following error when running the notebook on Colab Pro+ with one A100 GPU: `TypeError: Gradient accumulation supports only int and dict types`
This can happen if the model (`sd-v1-4-full-ema.ckpt`) is not present where `main.py` expects it to be (the local directory). Is the model present in your local directory, or at the path you pass to `--finetune_from`?
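If it helps, one quick way to rule this out is to check that the file exists before launching training. This is just a minimal sketch; `ckpt_path` here is a placeholder for wherever your checkpoint actually lives:

```python
import os

# Sanity check: confirm the checkpoint exists before passing it to --finetune_from.
# Adjust ckpt_path to the location you downloaded the model to.
ckpt_path = "sd-v1-4-full-ema.ckpt"
if not os.path.isfile(ckpt_path):
    raise FileNotFoundError(f"Checkpoint not found at {ckpt_path}")
```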
I face the same issue. But in the Pokemon example notebook, the `ckpt_path` is already defined here: `ckpt_path = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4-full-ema.ckpt", use_auth_token=True)`. Am I missing something?
This `TypeError: Gradient accumulation supports only int and dict types` suggests that the accumulate-batches argument is wrong. It is set by `lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES"`; typically I set it to 1.
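For context, Lightning only accepts an int (a fixed accumulation factor) or a dict (mapping epoch to factor) for this option; anything else, such as a string or `None` left over from an empty shell variable, triggers exactly this error. A minimal sketch, assuming a PyTorch Lightning version from around this release:

```python
import pytorch_lightning as pl

pl.Trainer(accumulate_grad_batches=1)       # OK: fixed int factor
pl.Trainer(accumulate_grad_batches={0: 1})  # OK: dict mapping epoch -> factor

# If $ACCUMULATE_BATCHES is empty or unset, the value reaches the Trainer as a
# string or None instead of an int, and Lightning raises:
#   TypeError: Gradient accumulation supports only int and dict types
# pl.Trainer(accumulate_grad_batches="1")
```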
Has anyone been able to resolve this?
`ckpt_path = hf_hub_download(repo_id="CompVis/stable-diffusion-v-1-4-original", filename="sd-v1-4-full-ema.ckpt", use_auth_token=True)` returns the right path to the `.ckpt` file in the `/root` directory. And `lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES"` shouldn't be the problem, because I have `ACCUMULATE_BATCHES` set to 1.

For reference, I'm trying to train this on a single GPU, and so is the OP from the sounds of it (running "the notebook on colab pro plus with one A100 GPU"), so I don't know if that affects the setup here (in particular, does `ACCUMULATE_BATCHES` need to be a different value?).
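My understanding (general Lightning behavior, not specific to this repo) is that accumulation only scales the effective batch size, so 1 should be fine on a single GPU, but I'd welcome a correction:

```python
# Effective batch size = per-GPU batch size * number of GPUs * accumulation steps.
BATCH_SIZE, NUM_GPUS, ACCUMULATE_BATCHES = 4, 1, 1
effective_batch_size = BATCH_SIZE * NUM_GPUS * ACCUMULATE_BATCHES
print(effective_batch_size)  # 4
```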
For reference, my `python main.py` call is below (I had to change `--gpus "$gpu_list" \` to `--auto_select_gpus \` because I was getting a different error: `error: argument --gpus: invalid _gpus_allowed_type value: ''`):
```bash
python main.py \
    -t \
    --base "$YAML_PATH" \
    --auto_select_gpus \
    --scale_lr False \
    --num_nodes 1 \
    --check_val_every_n_epoch 10 \
    --finetune_from "$ckpt_path" \
    data.params.batch_size="$BATCH_SIZE" \
    lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES" \
    data.params.validation.params.n_gpus="$NUM_GPUS"
```
with:

```python
BATCH_SIZE = 4
N_GPUS = 1
ACCUMULATE_BATCHES = 1
```
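One more detail that may matter here: if I'm reading main.py right, the trailing `key=value` arguments are parsed as an OmegaConf dotlist and merged into the YAML config, which is where a bad shell expansion turns into a bad type. A small illustration of that behavior (my own sketch, not code from the repo):

```python
from omegaconf import OmegaConf

# A well-formed override parses to an int, which Lightning accepts.
cli = OmegaConf.from_dotlist(["lightning.trainer.accumulate_grad_batches=1"])
print(type(cli.lightning.trainer.accumulate_grad_batches))  # <class 'int'>

# An empty shell variable produces "...=" and the value comes through as None,
# which Lightning then rejects with the TypeError above.
empty = OmegaConf.from_dotlist(["lightning.trainer.accumulate_grad_batches="])
print(empty.lightning.trainer.accumulate_grad_batches)  # None
```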
Solved: change `N_GPUS` to `NUM_GPUS` instead, so the name matches the variable the command actually reads. Running on Colab Pro+ with an A100:

```python
# A100:
BATCH_SIZE = 4
NUM_GPUS = 1
ACCUMULATE_BATCHES = 1
gpu_list = ",".join(str(x) for x in range(NUM_GPUS)) + ","
print(f"Using GPUs: {gpu_list}")
```
Hello, I encountered this problem on Colab. Have you encountered it? How do you solve it?
It has been a while. I would rather use this Colab notebook for training now: https://github.com/Linaqruf/kohya-trainer