LambdaLabsML / examples

Deep Learning Examples
MIT License

can't reproduce and got noise issue #33

Closed jnkr36 closed 1 year ago

jnkr36 commented 1 year ago

TRAIN: I followed the example on a V100 to reproduce the results. The only change I made was reducing the batch size from 4 to 1 in configs/stable-diffusion/pokemon.yaml:

python main.py -t --base configs/stable-diffusion/pokemon.yaml --gpus 1 --scale_lr False --num_nodes 1 --check_val_every_n_epoch 10 --finetune_from sd-v1-4-full-ema.ckpt

TEST: After training for about 300 epochs, I used scripts/txt2img.py to test.

1. First, I tested the original checkpoint sd-v1-4-full-ema.ckpt and got the result below:

python scripts/txt2img.py --prompt 'robotic cat with wings' --outdir '/outputs/generated_pokemon' --H 512 --W 512 --n_samples 4 --config '/configs/stable-diffusion/pokemon.yaml' --ckpt 'sd-v1-4-full-ema.ckpt'

image

2. Then I tested again with epoch=000002.ckpt, epoch=000004.ckpt, epoch=000007.ckpt, epoch=000009.ckpt, epoch=000012.ckpt, and so on. The results look more and more like noise, and in the end I only get completely black pictures.

python scripts/txt2img.py --prompt 'robotic cat with wings' --outdir '/outputs/generated_pokemon' --H 512 --W 512 --n_samples 4 --config '/configs/stable-diffusion/pokemon.yaml' --ckpt 'logs/2022-10-28T12-32-02_pokemon/checkpoints/epoch=000002.ckpt'

1) epoch=000002.ckpt result: image
2) epoch=000004.ckpt result: image
3) epoch=000007.ckpt result: image
4) epoch=000009.ckpt result: image
5) epoch=000012.ckpt result: image
6) epoch=000014.ckpt result: image
......
7) epoch=000048.ckpt result: image

Has anyone else run into the same issue? Could someone help me solve the problem?
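For anyone repeating this per-checkpoint test, the commands could be generated rather than typed by hand. The following is only a sketch: the helper name `txt2img_command` is hypothetical, the flags and paths are copied from the commands in this comment, and the epoch list covers only the checkpoints named above. It prints the commands instead of running them.

```python
import shlex
from typing import List


def txt2img_command(ckpt: str, prompt: str = "robotic cat with wings") -> List[str]:
    """Build the txt2img.py invocation from this thread for one checkpoint.

    Hypothetical helper: only the flags quoted in the issue are used.
    """
    return [
        "python", "scripts/txt2img.py",
        "--prompt", prompt,
        "--outdir", "/outputs/generated_pokemon",
        "--H", "512",
        "--W", "512",
        "--n_samples", "4",
        "--config", "/configs/stable-diffusion/pokemon.yaml",
        "--ckpt", ckpt,
    ]


# Checkpoint directory and epochs as reported in the issue.
log_dir = "logs/2022-10-28T12-32-02_pokemon/checkpoints"
epochs = [2, 4, 7, 9, 12, 14, 48]

commands = [txt2img_command(f"{log_dir}/epoch={e:06d}.ckpt") for e in epochs]

# Print shell-safe command lines; pass a list to subprocess.run to execute one.
for cmd in commands:
    print(shlex.join(cmd))
```

Using the same prompt for every checkpoint makes it easier to see at which epoch the samples start to degrade.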

lk-wq commented 1 year ago

You mentioned you set the batch size from 4 to 1, did you also scale the learning rate down?

jnkr36 commented 1 year ago

> You mentioned you set the batch size from 4 to 1, did you also scale the learning rate down?

Thanks so much for your suggestion. I scaled base_learning_rate in configs/stable-diffusion/pokemon.yaml from 1.0e-04 to 2.0e-05 and tried again, and it seems to work now. Training has currently reached epoch 129; I will let it run longer and hope to get better results.

1) epoch=000052.ckpt result: image

2) epoch=000100.ckpt result: image

3) epoch=000129.ckpt result: image
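For reference, the adjustment discussed here is close to what the common linear scaling heuristic would suggest: keep the ratio of learning rate to batch size roughly constant. The sketch below is an illustration, not code from this repository; the function name `scale_learning_rate` is hypothetical, and it assumes the 1.0e-04 learning rate was tuned for the original batch size of 4 on a single GPU.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling heuristic: keep lr / batch_size constant.

    Hypothetical helper for illustration; this is a rule of thumb,
    not a guarantee of stable training.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Values from this thread: the config uses batch size 4 with
# base_learning_rate 1.0e-04; training was run with batch size 1.
suggested = scale_learning_rate(1.0e-04, base_batch_size=4, new_batch_size=1)
print(f"{suggested:.1e}")  # prints 2.5e-05
```

The 2.0e-05 used above is slightly below this estimate, which is consistent with training no longer collapsing into noise.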

tenghui98 commented 1 year ago

Wow, I also encountered this problem. Thanks!

xuzekai1997 commented 1 year ago

Hello, could you share your yaml file? When I run main.py, it always fails, but I have no idea which part is wrong. Thanks a lot.