NJU-PCALab / AddSR

Unable to reproduce #15

Closed YaqiWangCV closed 1 week ago

YaqiWangCV commented 2 weeks ago

Hello, thank you very much for your excellent work.

During training, controlnet0, controlnet2, unet1, and unet3 are saved. In the testing phase, I loaded controlnet2 and unet3, but the resulting super-resolution images are completely unrelated to the inputs; the content is totally out of control (e.g., every output is a face, or some kind of landscape image). Could you please tell me what might be causing this?
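
For reference, a minimal sketch of how such checkpoints might be loaded for testing, assuming they were written in diffusers' save_pretrained layout; the subfolder names and the plain diffusers classes below are assumptions, since AddSR may use its own ControlNet/UNet subclasses:

import torch
from diffusers import ControlNetModel, UNet2DConditionModel

ckpt_dir = "output_dir/checkpoint-50000"  # hypothetical checkpoint path
controlnet = ControlNetModel.from_pretrained(f"{ckpt_dir}/controlnet2", torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(f"{ckpt_dir}/unet3", torch_dtype=torch.float16)
controlnet.eval()
unet.eval()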

CSRuiXie commented 2 weeks ago

Thanks for your attention to our work! Can you show me some examples?

YaqiWangCV commented 2 weeks ago

(attachment: Weixin Image_20240624131015) It seems that the content of the generated images is very different across epochs.

CSRuiXie commented 2 weeks ago

This phenomenon is very strange; we did not encounter such a situation during the training process. Can you test the results using the pre-trained models we provided? If there are no issues, then the problem might be with the training process. If the issue persists, then it might be a problem with the testing process.

YaqiWangCV commented 2 weeks ago

Testing directly with the pre-trained models you provided works fine; the problem lies in the training process.

This is my training script:

accelerate launch train_addsr.py \
--pretrained_model_name_or_path="/pretrained_models/stable-diffusion-2-base" \
--controlnet_model_name_or_path_Tea='/pretrained_models/SeeSR/models/seesr' \
--unet_model_name_or_path_Tea='/pretrained_models/SeeSR/models/seesr' \
--controlnet_model_name_or_path_Stu='/pretrained_models/SeeSR/models/seesr' \
--unet_model_name_or_path_Stu='/pretrained_models/SeeSR/models/seesr' \
--output_dir ${output_dir} \
--root_folders '/dataset/lowlevel/seesr_ori' \
--ram_ft_path '/pretrained_models/SeeSR/models/DAPE.pth' \
--enable_xformers_memory_efficient_attention \
--mixed_precision="fp16" \
--resolution=512 \
--learning_rate=2e-5 \
--train_batch_size=6 \
--gradient_accumulation_steps=2 \
--null_text_ratio=0.5 \
--dataloader_num_workers=4 \
--max_train_steps=50000 \
--checkpointing_steps=500

CSRuiXie commented 2 weeks ago

From your provided training script, the problem might lie in the train_batch_size setting. Since the train_addsr.py script we provided only supports train_batch_size=2, setting it to any other number will result in incorrect loss calculations and training. You can either rewrite the training code to match train_batch_size=6 or set it to 2, and then the problem should disappear. Sorry for the inconvenience.
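
As a minimal, hypothetical illustration (not the actual train_addsr.py code) of why this goes wrong silently: a loss that hard-codes batch indices for a batch of 2 still runs at train_batch_size=6, it just ignores most of the batch.

import torch

def paired_loss_bs2(pred, target):
    # Hard-coded for a batch of exactly two samples; anything beyond
    # index 1 is silently ignored instead of raising an error.
    return ((pred[0] - target[0]) ** 2).mean() + ((pred[1] - target[1]) ** 2).mean()

pred, target = torch.randn(6, 4, 64, 64), torch.randn(6, 4, 64, 64)  # batch of 6
loss = paired_loss_bs2(pred, target)  # runs without error, but samples 2..5 never contribute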

YaqiWangCV commented 2 weeks ago

I'm very sorry, I tried setting train_batch_size=2, but the issue still persists. Within each epoch, the model tends to generate the same kind of content for every image, for example, all faces.

CSRuiXie commented 2 weeks ago

Can you show me the loss curves?
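
If it helps, the curve can be read back from the TensorBoard event files that accelerate writes during training; a small sketch, assuming a logging directory under the output folder and a scalar tag such as "train_loss" (both the path and the tag name are assumptions):

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
import matplotlib.pyplot as plt

acc = EventAccumulator("output_dir/logs")  # hypothetical logging directory
acc.Reload()
events = acc.Scalars("train_loss")         # tag name is an assumption
plt.plot([e.step for e in events], [e.value for e in events])
plt.xlabel("step")
plt.ylabel("loss")
plt.savefig("loss_curve.png")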

CSRuiXie commented 1 week ago

I retrained the network for 5k iterations, and it performed well. Based on your examples, I suspect there might be an issue with how your model is being loaded. Can you provide the detailed training log?

YaqiWangCV commented 1 week ago

Thank you very much. I found out it was the training data.