
bugfix(env): fix accumulate batches, gpu list #65

Open megatran opened 1 year ago

megatran commented 1 year ago

The current code has a few typos and mismatched Python/Bash environment variable names, which cause the following exception:

    raise TypeError("Gradient accumulation supports only int and dict types")
TypeError: Gradient accumulation supports only int and dict types

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 939, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
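
The root cause: the Python cell and the shell command referenced differently named variables, so the shell expansion came up empty and accumulate_grad_batches never reached the Trainer as an int. A minimal illustration of that kind of mismatch (the second variable name is hypothetical):

import os

os.environ["ACCUMULATE_BATCHES"] = "1"                # what the Python cell exports
print(os.environ.get("ACCUMULATE_GRAD_BATCHES", ""))  # what a mistyped reference sees: ""

The NameError is just fallout: main.py appears to bind trainer inside a try block, so when Trainer construction fails, the exception handler touches a name that was never assigned. A sketch of that pattern, inferred from the traceback rather than copied from main.py:

from pytorch_lightning import Trainer

try:
    # A non-int value reproduces the first error with the repo's pinned
    # pytorch_lightning; note that 'trainer' is never bound.
    trainer = Trainer(accumulate_grad_batches="")
except Exception:
    if trainer.global_rank == 0:  # NameError: name 'trainer' is not defined
        raise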

I propose the following fix:

import os

# 2x A6000:
BATCH_SIZE = 4
N_GPUS = 2
ACCUMULATE_BATCHES = 1

# Keep the trailing comma: older Lightning versions parse a comma-free string
# like "2" as a GPU *count*, while "0," or "0,1," is an explicit device list.
GPU_LIST = ",".join(str(x) for x in range(N_GPUS)) + ","
print(f"Using GPUs: {GPU_LIST}")

# Export for the shell commands below; environment values must be strings.
os.environ["BATCH_SIZE"] = str(BATCH_SIZE)
os.environ["N_GPUS"] = str(N_GPUS)
os.environ["ACCUMULATE_BATCHES"] = str(ACCUMULATE_BATCHES)
os.environ["GPU_LIST"] = GPU_LIST
os.environ["CKPT_PATH"] = ckpt_path  # ckpt_path is defined earlier in the notebook

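# Note: IPython expands $VARS in ! commands from the Python namespace first,
# and the os.environ exports above cover subprocesses that read the environment.
# Sanity-check that the shell sees the expected values: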
!echo "$BATCH_SIZE"
!echo "$N_GPUS"
!echo "$ACCUMULATE_BATCHES"
!echo "$GPU_LIST"
!echo "$CKPT_PATH"
# Run training
!(python main.py \
    -t \
    --base configs/stable-diffusion/pokemon.yaml \
    --gpus "$GPU_LIST" \
    --scale_lr False \
    --num_nodes 1 \
    --check_val_every_n_epoch 10 \
    --finetune_from "$CKPT_PATH" \
    data.params.batch_size="$BATCH_SIZE" \
    lightning.trainer.accumulate_grad_batches="$ACCUMULATE_BATCHES" \
    data.params.validation.params.n_gpus="$N_GPUS" \
)
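
For anyone wondering why string-valued overrides work on the command line when the direct Trainer kwarg failed: assuming main.py follows the CompVis stable-diffusion layout and forwards unknown key=value args to OmegaConf.from_dotlist, values are YAML-parsed, so "1" arrives as an int:

from omegaconf import OmegaConf

# Hypothetical override, parsed the way main.py is assumed to parse its CLI extras.
cli = OmegaConf.from_dotlist(["lightning.trainer.accumulate_grad_batches=1"])
print(type(cli.lightning.trainer.accumulate_grad_batches))  # <class 'int'>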

Now everything works!