Closed — cs-mshah closed this issue 2 months ago
This isn't a code bug. After resuming from the checkpoint, training does continue for the desired number of steps. I think this might be related to job scheduling. I tried resuming from the checkpoint and the training again stopped after 10 hours. But if the job were being killed, the run would crash, and it isn't crashing either. So I'm not sure whether the cause lies within huggingface accelerate, diffusers, or the server constraints imposed on the job.
I removed the model code and ran only the dataloader with accelerate logging to log the lr, but again the steps do not reach the full progress-bar length. Has anyone else faced this issue?
Found the issue:

```python
num_update_steps_per_epoch = math.ceil(train_dataloader_len / args.gradient_accumulation_steps)
```

`train_dataloader_len` doesn't get updated after `accelerator.prepare`. So if using a custom dataset, change this to:

```python
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
```
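A minimal sketch of why the stale length overcounts: after `accelerator.prepare`, each process iterates only over its shard of the data, so the prepared dataloader is roughly `num_processes` times shorter than the raw one. The numbers below are illustrative, not taken from the actual run:

```python
import math

# Hypothetical numbers for illustration: 100,000 samples, per-device
# batch size 2, 8 GPUs, no gradient accumulation.
dataset_len = 100_000
per_device_batch_size = 2
num_processes = 8
gradient_accumulation_steps = 1

# Length of the *raw* dataloader, measured before accelerator.prepare():
train_dataloader_len = math.ceil(dataset_len / per_device_batch_size)

# After accelerator.prepare(), each process sees only its shard, so the
# prepared dataloader is roughly num_processes times shorter:
prepared_len = math.ceil(train_dataloader_len / num_processes)

# The stale value overcounts updates per epoch by ~num_processes:
stale = math.ceil(train_dataloader_len / gradient_accumulation_steps)     # 50000
correct = math.ceil(prepared_len / gradient_accumulation_steps)           # 6250
print(stale, correct)
```

With the stale count, the derived `max_train_steps` (and hence the tqdm total) is inflated, which matches the observed behavior of the bar ending well short of its displayed total.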
Describe the bug
I tried training brushnet on my custom data with my own dataloader, but I noticed something strange. I am using the following command for training:
I am training on 8 V100 GPUs, hence the effective batch size is 16. There is no gradient checkpointing. Although the tqdm progress bar shows a total of 100,000 steps, training stops at 34,400 without any error and saves the final checkpoint. The intermediate checkpoints also update correctly, with weights stored at the expected steps, i.e. at intervals of 10k, and this is reflected in the progress bar. Is there some issue in how the tqdm state is updated, perhaps around `accelerator.sync_gradients`? The progress bar gives a wrong impression of the time required to train and of whether training completed successfully. Has someone else faced this issue? Can you kindly help?
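For context, the diffusers training scripts typically advance the progress bar only on optimizer steps, i.e. when gradients are synchronized once per accumulation window. The snippet below emulates that condition with plain Python (no accelerate import; the `sync_gradients` flag and all numbers are illustrative stand-ins):

```python
# Emulate a loop of 32 micro-batches with gradient accumulation over 4:
# the bar should advance once per accumulation window, i.e. 8 times.
micro_batches = 32
gradient_accumulation_steps = 4

global_step = 0
for i in range(micro_batches):
    # accelerator.sync_gradients is True on the last micro-batch of each
    # accumulation window; we emulate that condition here.
    sync_gradients = (i + 1) % gradient_accumulation_steps == 0
    if sync_gradients:
        global_step += 1  # progress_bar.update(1) in the real script

print(global_step)  # 8
```

So the bar itself advances correctly per update step; the mismatch reported here comes from the inflated *total*, not from the per-step updates.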
Reproduction
Output:
Logs
No response
System Info
Same diffusers version as provided in this repository.
Who can help?
No response