a-l-e-x-d-s-9 opened this issue 1 year ago (status: Open)
Describe the bug

I'm training locally on an 8 GB VRAM card. To make training faster, I changed gradient_accumulation_steps from 1 to 16 in the settings, and I also ran "accelerate config" and changed the value there. But now when I run the script with these settings:

accelerate launch --mixed_precision="fp16" train_dreambooth.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --instance_data_dir="$INSTANCE_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --instance_prompt="audra miller" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=16 \
  --gradient_checkpointing \
  --learning_rate=4e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=2800 \
  --save_interval 300 \
  --save_min_steps 1000

the script doesn't stop at 2800 steps; it keeps running past 3600 steps, and I had to terminate it manually. The checkpoint was generated fine.

Reproduction

Set gradient_accumulation_steps=16. The step number equals the number of training images multiplied by 100. Training does not stop at the defined number of steps.

Logs

Steps: 3663it [44:35, 1.85it/s, loss=0.145, lr=4e-6] [2023-02-07 12:41:03,231] [INFO] [timer.py:197:stop] 0/3664, RunningAvgSamplesPerSec=1.3890020784507544, CurrSamplesPerSec=0.2603498963207169, MemAllocated=1.67GB, MaxMemAllocated=4.91GB
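One plausible mechanism for the overrun, sketched below as a standalone illustration (this is not the actual train_dreambooth.py code, and the epoch size of 1000 micro-batches is an arbitrary assumption): under gradient accumulation, --max_train_steps counts optimizer steps, while a progress-bar/iteration counter that ticks once per micro-batch will display a much larger number. If the loop's stopping condition and the displayed counter refer to different quantities, the bar can run far past the configured limit before training actually stops.

```python
# Hypothetical sketch of step counting under gradient accumulation.
# max_train_steps limits *optimizer* steps; the iteration counter
# (what a tqdm bar typically shows) ticks once per *micro-batch*.
gradient_accumulation_steps = 16
max_train_steps = 2800

global_step = 0          # optimizer steps (what max_train_steps limits)
micro_batches_seen = 0   # what the progress bar displays

done = False
while not done:  # loop over epochs
    for _ in range(1000):  # assumed micro-batches per epoch (dummy value)
        micro_batches_seen += 1
        # One optimizer update per full accumulation cycle.
        if micro_batches_seen % gradient_accumulation_steps == 0:
            global_step += 1
        if global_step >= max_train_steps:
            done = True
            break

print(global_step)         # 2800
print(micro_batches_seen)  # 44800 (= 2800 * 16)
```

In other words, a bar that stops only when the optimizer-step counter hits the limit will display gradient_accumulation_steps times more iterations than max_train_steps; conversely, a loop that breaks on the micro-batch counter would stop 16x too early. Whether the reported overrun comes from this mismatch, or from accumulation being applied in both the accelerate config and the script flag, would need confirmation in the script itself.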
System Info
diffusers version: 0.12.1