deepglugs / dalle


vqvae train issue #4

Closed skywo1f closed 3 years ago

skywo1f commented 3 years ago

I seem to be having an issue training the vqvae. It runs fine until about step 4300-4500, then it suddenly quits:

saving complete
epoch 1/2 step 4100 loss: 0.000803 - 5it/s
epoch 1/2 step 4200 loss: 0.000568 - 5it/s
epoch 1/2 step 4300 loss: 0.00045 - 5it/s

any ideas?

deepglugs commented 3 years ago

In train_vae(), add a print(len(generator)) around line 160. Is it not finding all your data?

skywo1f commented 3 years ago

70242 ... looks about right for number of images

skywo1f commented 3 years ago

With a batch size of 16, 16*4500 = 72000, which is just over the 70242 images... so it seems like it's failing when starting the second epoch?
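
The arithmetic above can be checked with a short sketch. It assumes that len(generator) counts individual images, that a final partial batch is kept, and that progress is logged every 100 steps (as the pasted log suggests); the variable names are illustrative and not taken from the repository.

```python
import math

num_images = 70242  # value reported by print(len(generator)) in this thread
batch_size = 16     # batch size mentioned in this thread
log_every = 100     # assumption: the pasted log prints every 100 steps

# Number of optimizer steps needed to see every image once,
# assuming the last partial batch is not dropped.
steps_per_epoch = math.ceil(num_images / batch_size)

# Last step at which a progress line would be printed within one epoch.
last_logged_step = (steps_per_epoch // log_every) * log_every

print(steps_per_epoch)   # 4391
print(last_logged_step)  # 4300
```

Under these assumptions, one epoch ends at step 4391, which is consistent with the last log line appearing at step 4300 and the run stopping where a second epoch should begin.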

deepglugs commented 3 years ago

Maybe there's an off-by-one error going on. The epoch loop might need to be for epoch in range(1, args.epochs + 1)

skywo1f commented 3 years ago

that seems to have fixed it, thanks!
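
For anyone hitting the same symptom, here is a minimal sketch of why the suggested range change matters. It assumes the original loop was written as range(1, epochs), which the thread does not show, and uses a plain epochs variable in place of args.epochs.

```python
epochs = 2  # the log in this thread shows "epoch 1/2"

# Hypothetical original loop bounds: range() excludes its end value,
# so only epoch 1 is visited and training stops after the first epoch.
epochs_before_fix = list(range(1, epochs))

# Bounds suggested in the thread: both epoch 1 and epoch 2 are visited.
epochs_after_fix = list(range(1, epochs + 1))

print(epochs_before_fix)  # [1]
print(epochs_after_fix)   # [1, 2]
```

Because Python's range() stops one short of its second argument, a loop that numbers epochs from 1 needs epochs + 1 as the upper bound to run the full count.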