TencentARC / BrushNet

[ECCV 2024] The official implementation of paper "BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion"
https://tencentarc.github.io/BrushNet/

Training stops prematurely at the wrong step and doesn't reach the specified number of steps #57

Closed cs-mshah closed 2 months ago

cs-mshah commented 2 months ago

Describe the bug

I tried training BrushNet on my custom data with my own dataloader, but I noticed something strange. I am using the following command for training:

accelerate launch examples/brushnet/train_brushnet.py \
--pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
--output_dir runs/logs/sd15_full \
--train_data_dir data/ \
--resolution 512 \
--seed 1234 \
--learning_rate 1e-5 \
--train_batch_size 2 \
--max_train_steps 100000 \
--tracker_project_name brushnet \
--report_to wandb \
--resume_from_checkpoint latest \
--validation_steps 2000 \
--checkpointing_steps 10000

I am training on 8 V100 GPUs, so the total batch size is 16. There is no gradient checkpointing. Although the tqdm progress bar runs out to 100000 steps, training stops at step 34400 without any error and saves the final checkpoint. The intermediate checkpoints do get written at the correct steps, i.e. at intervals of 10k, and these events show up correctly in the progress bar. Is there some issue in how the tqdm state is updated, perhaps around accelerator.sync_gradients? The progress bar gives a wrong impression of the time required to train and of whether training completed successfully.

Has anyone else faced this issue? Could you kindly help?
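
For reference, diffusers-style training scripts typically advance the bar and step counter roughly like this (a simplified sketch of the common accelerate pattern, not the exact BrushNet code; accelerator, model, optimizer, train_dataloader, first_epoch and args are assumed to come from the script's setup, and compute_loss is a placeholder):

# Simplified sketch of the usual diffusers/accelerate bookkeeping, not the exact
# BrushNet script; `compute_loss` is a placeholder for the real loss computation.
from tqdm.auto import tqdm

progress_bar = tqdm(range(args.max_train_steps),
                    disable=not accelerator.is_local_main_process)
global_step = 0

for epoch in range(first_epoch, args.num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        with accelerator.accumulate(model):
            loss = compute_loss(model, batch)
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()

        # The bar and global_step only advance on real optimizer updates.
        if accelerator.sync_gradients:
            progress_bar.update(1)
            global_step += 1

        if global_step >= args.max_train_steps:
            break

Note that the outer loop is bounded by args.num_train_epochs, so the run can also end before global_step reaches args.max_train_steps if the epoch count is computed too low.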

Reproduction

accelerate launch examples/brushnet/train_brushnet.py \
--pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
--output_dir runs/logs/sd15_full \
--train_data_dir data/ \
--resolution 512 \
--seed 1234 \
--learning_rate 1e-5 \
--train_batch_size 2 \
--max_train_steps 100000 \
--tracker_project_name brushnet \
--report_to wandb \
--resume_from_checkpoint latest \
--validation_steps 2000 \
--checkpointing_steps 10000

Output:

07/14/2024 15:14:10 - INFO - __main__ - ***** Running training *****
07/14/2024 15:14:10 - INFO - __main__ -   Num examples = 195198
07/14/2024 15:14:10 - INFO - __main__ -   Num batches each epoch = 97599
07/14/2024 15:14:10 - INFO - __main__ -   Num Epochs = 2
07/14/2024 15:14:10 - INFO - __main__ -   Instantaneous batch size per device = 2
07/14/2024 15:14:10 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
07/14/2024 15:14:10 - INFO - __main__ -   Gradient Accumulation steps = 1
07/14/2024 15:14:10 - INFO - __main__ -   Total optimization steps = 100000

Logs

No response

System Info

Same diffusers version as provided in this repository.

Who can help?

No response

cs-mshah commented 2 months ago

This isn't a code bug. After resuming from the checkpoint, training does continue toward the desired number of steps. I think this might have something to do with job scheduling. I tried resuming from the checkpoint and the training again stopped after about 10 hours. But if the job were being killed, the run would crash, and it isn't crashing either. So I'm not sure whether this is something within Hugging Face accelerate or diffusers, or a server constraint imposed on the job.

cs-mshah commented 2 months ago

I removed the model code and just ran the dataloader with accelerate logging of the lr, but the steps again do not reach the full length of the progress bar. Has anyone else faced this issue?
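
For context, a stripped-down run of this kind looks roughly like the following (a sketch, not my exact script; the dummy TensorDataset, tracker name, and lr value are placeholders):

# Rough sketch of the stripped-down run: no model or optimizer, just the prepared
# dataloader and accelerate logging of a dummy lr.
import math
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset
from tqdm.auto import tqdm

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("brushnet_debug")

train_dataloader = DataLoader(TensorDataset(torch.zeros(195198, 1)), batch_size=2)

max_train_steps = 100_000
# Epoch count derived from the dataloader length before prepare(), mirroring the script.
num_update_steps_per_epoch = len(train_dataloader)  # gradient_accumulation_steps = 1
num_train_epochs = math.ceil(max_train_steps / num_update_steps_per_epoch)

train_dataloader = accelerator.prepare(train_dataloader)

progress_bar = tqdm(range(max_train_steps),
                    disable=not accelerator.is_local_main_process)
global_step = 0
for epoch in range(num_train_epochs):
    for batch in train_dataloader:
        accelerator.log({"lr": 1e-5}, step=global_step)  # dummy lr, no real scheduler
        progress_bar.update(1)
        global_step += 1
        if global_step >= max_train_steps:
            break

accelerator.end_training()

Launched across 8 processes, this shows the same behaviour: the progress bar stops well short of the 100000 steps.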

cs-mshah commented 2 months ago

Found the issue:

num_update_steps_per_epoch = math.ceil(train_dataloader_len / args.gradient_accumulation_steps)

train_dataloader_len is computed before accelerator.prepare and never updated afterwards, so it doesn't reflect the prepared dataloader being sharded across processes. num_update_steps_per_epoch is therefore over-estimated, num_train_epochs comes out too small, and the epoch loop ends well before max_train_steps. If you are using a custom dataset, change this to use the prepared dataloader:

num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
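
In a diffusers-style script this recomputation belongs right after accelerator.prepare, roughly as sketched below (illustrative only, not the exact BrushNet code; accelerator, model, optimizer, train_dataloader, lr_scheduler and args come from the surrounding script):

import math

# accelerate shards the dataloader across processes here, which shortens its
# per-process length compared with the pre-prepare train_dataloader_len.
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

# Recompute from the *prepared* dataloader so the epoch count matches the
# number of optimizer steps each epoch actually provides.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / args.gradient_accumulation_steps
)
args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

With this change the epoch loop runs long enough for global_step to actually reach max_train_steps.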