omerb01 opened this issue 2 years ago
I'm also experiencing the same issues. Are your results also very unsaturated?
I'm not sure if you have tried this, but what about setting "clip_denoised" to False (instead of True, which is the default)? It might result in more saturated results.
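For context on what this flag typically does: in guided-diffusion-style samplers, `clip_denoised` clamps the model's reconstruction of x0 to the valid image range at every sampling step. A minimal sketch of that logic, assuming the standard DDPM parameterization (`predict_x0` is an illustrative name, not this repo's actual function):

```python
import torch

def predict_x0(x_t, noise_pred, alpha_cumprod_t, clip_denoised=True):
    # Standard DDPM reconstruction of x0 from x_t and the predicted noise:
    # x0 = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
    x0 = (x_t - (1 - alpha_cumprod_t).sqrt() * noise_pred) / alpha_cumprod_t.sqrt()
    if clip_denoised:
        # Clamping keeps x0 in the valid image range [-1, 1]; disabling it
        # lets predictions overshoot, which can change how saturated the
        # sampled images look.
        x0 = x0.clamp(-1.0, 1.0)
    return x0
```

So flipping the flag only changes whether the per-step x0 estimate is clamped; it does not change the trained weights.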
^ I will try this for my task and let you know how it goes
Thanks @xenova , waiting for your update
After training for another 3 hours with clip_denoised=False, I haven't seen any improvement. Perhaps @Janspiry can provide some extra assistance.
@xenova @omerb01 Hello, did you solve the issue? I am still having problems in colorization task.
Nope, still struggling with colorization
@ksunho9508 @xenova I am still unable to obtain reliable results. In my opinion, the Flickr dataset does not contain enough data to generalize this task via diffusion-based methods. The authors of the original paper applied their method to the ImageNet dataset, which contains much more training data.
Hi guys, sorry for this problem.
Like @omerb01 said, I share the view that the Flickr dataset is too small for colorization of natural scenes. Maybe you should do this task on ImageNet or Places2. More information can be found in #17
@Janspiry I've also tried on my custom dataset (with millions of images), and I get the same results :/ ... I'm really not sure how this is the only task that is facing these issues; all other tasks seem to work fine.
@xenova I'll make sure there are no bugs in the colorization part of the code
@Janspiry Thank you. And could you also add a config file for super-resolution?
I also ran into this problem. I trained on my own small-scale dataset, but still failed to get good results after many epochs. @Janspiry
@omerb01 Have you tried running experiments under the same conditions after changing GroupNorm to BatchNorm? It seems that using BatchNorm instead of GroupNorm can perform colorization to some extent by distinguishing between the background and objects.
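For anyone who wants to try this, here is a hedged sketch of such a swap: a generic recursive replacement, not code from this repo. Note the new BatchNorm layers start freshly initialized, so this is for retraining from scratch, not for loading existing checkpoints.

```python
import torch.nn as nn

def swap_groupnorm_for_batchnorm(module: nn.Module) -> nn.Module:
    # Recursively replace every GroupNorm with a BatchNorm2d over the
    # same number of channels. Weights are re-initialized, not copied.
    for name, child in module.named_children():
        if isinstance(child, nn.GroupNorm):
            setattr(module, name, nn.BatchNorm2d(child.num_channels))
        else:
            swap_groupnorm_for_batchnorm(child)
    return module
```

Applied to the UNet before training, this leaves the architecture otherwise untouched.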
I experienced the same problem.
BTW, have you guys checked the training log? According to mine, it seems the network is suffering from severe overfitting:
'''
INFO: Begin model train.
INFO: train/mse_loss: 0.1167483588039875
INFO: train/mse_loss: 0.0724316855113022
INFO: train/mse_loss: 0.06527451830048543
INFO: epoch: 1
INFO: iters: 23488
INFO: train/mse_loss: 0.020401993506137254
INFO: train/mse_loss: 0.018878939009419112
INFO: train/mse_loss: 0.018366146380821978
INFO: epoch: 2
INFO: iters: 46976
INFO: train/mse_loss: 0.014938667484635498
INFO: train/mse_loss: 0.0148746125182753
INFO: train/mse_loss: 0.014505743447781326
INFO: train/mse_loss: 0.014465472793432741
INFO: epoch: 3
INFO: iters: 70464
INFO: train/mse_loss: 0.014389766222024227
INFO: train/mse_loss: 0.013453237237986066
INFO: train/mse_loss: 0.013306563555842919
INFO: epoch: 4
INFO: iters: 93952
INFO: train/mse_loss: 0.012647044245178611
INFO: train/mse_loss: 0.012807737045385967
INFO: train/mse_loss: 0.011968838741840434
INFO: epoch: 5
INFO: iters: 117440
INFO:
------------------------------Validation Start------------------------------
INFO: val/mae: 0.3139403760433197
INFO:
------------------------------Validation End------------------------------
INFO: train/mse_loss: 0.011829124199711352
INFO: epoch: 6
INFO: iters: 140938
INFO: train/mse_loss: 0.010201521161369924
INFO: epoch: 7
INFO: iters: 164426
INFO: train/mse_loss: 0.010018873226117376
INFO: epoch: 8
INFO: iters: 187914
INFO: train/mse_loss: 0.009995935927926308
INFO: epoch: 9
INFO: iters: 211402
INFO: train/mse_loss: 0.009544536813287326
INFO: epoch: 10
INFO: iters: 234890
INFO: Saving the self at the end of epoch 10
INFO:
------------------------------Validation Start------------------------------
INFO: val/mae: 0.43820616602897644
INFO:
------------------------------Validation End------------------------------
'''
No. The diffusion model's loss is the MSE between the sampled noise and the predicted noise; see https://github.com/Janspiry/Palette-Image-to-Image-Diffusion-Models/issues/26#issue-1282232897. Besides, diffusion model inference is highly stochastic, so this kind of result is quite normal.
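To illustrate that point: the training loss only measures noise prediction and never compares against the colorized output directly, so a low train/mse_loss does not by itself guarantee good samples. A minimal DDPM-style sketch (illustrative names, not this repo's actual code):

```python
import torch
import torch.nn.functional as F

def diffusion_mse_loss(model, x0, alphas_cumprod):
    # Sample a timestep and noise, form the noised input x_t, and compare
    # the model's noise prediction against the true noise. The ground-truth
    # image only enters through x_t, never through the loss target.
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)
```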
23-09-03 04:09:21.974 - INFO: train/mse_loss: 0.004320403648868778
23-09-03 04:09:21.974 - INFO: epoch: 1423
23-09-03 04:09:21.974 - INFO: iters: 2072372
23-09-03 04:09:21.974 - INFO: Saving the self at the end of epoch 1423
23-09-03 04:09:23.265 - INFO:
------------------------------Validation Start------------------------------
23-09-03 04:20:12.848 - INFO: val/1-ssim: 0.1557578444480896
23-09-03 04:20:12.848 - INFO:
------------------------------Validation End------------------------------
23-09-03 04:23:16.320 - INFO: train/mse_loss: 0.004661682129078468
23-09-03 04:23:16.320 - INFO: epoch: 1424
23-09-03 04:23:16.320 - INFO: iters: 2073832
23-09-03 04:23:16.320 - INFO: Saving the self at the end of epoch 1424
23-09-03 04:23:17.622 - INFO:
------------------------------Validation Start------------------------------
23-09-03 04:34:06.690 - INFO: val/1-ssim: 0.10180902481079102
23-09-03 04:34:06.690 - INFO:
------------------------------Validation End------------------------------
23-09-03 04:37:05.177 - INFO: train/mse_loss: 0.004233014806692961
23-09-03 04:37:05.177 - INFO: epoch: 1425
23-09-03 04:37:05.177 - INFO: iters: 2075292
23-09-03 04:37:05.177 - INFO: Saving the self at the end of epoch 1425
23-09-03 04:37:06.475 - INFO:
------------------------------Validation Start------------------------------
23-09-03 04:47:56.020 - INFO: val/1-ssim: 0.1559600830078125
23-09-03 04:47:56.020 - INFO:
------------------------------Validation End------------------------------
23-09-03 04:50:55.078 - INFO: train/mse_loss: 0.004784488215476157
23-09-03 04:50:55.078 - INFO: epoch: 1426
23-09-03 04:50:55.078 - INFO: iters: 2076752
23-09-03 04:50:55.078 - INFO: Saving the self at the end of epoch 1426
23-09-03 04:50:56.547 - INFO:
------------------------------Validation Start------------------------------
23-09-03 05:01:45.988 - INFO: val/1-ssim: 0.06806707382202148
Hello, my current problem seems to be overfitting. My dataset has 10k images; I trained on two 3090s for 12 hours, and then on val it can only generate noise. The val loss also stays around 0.7.
It happens for me as well. The training loss decreases very quickly and drops to 0.02 after 5 epochs, but the validation results are terrible. Does anyone have an idea?
@yuanc3 @1228967342 @AlanZhang1995
Hello, my current problem seems to be overfitting. My dataset has 10k images; I trained on two 3090s for 12 hours, and then on val it can only generate noise. The val loss also stays around 0.7.
I think it is actually quite normal that the val loss is much larger than the training loss, considering that the loss is calculated differently during inference than during training.
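To make that concrete: the two logged numbers live in different spaces, so their magnitudes are not directly comparable. A toy sketch (illustrative only):

```python
import torch

def train_loss_noise_space(noise_pred, noise_true):
    # train/mse_loss compares the predicted noise to the sampled noise,
    # i.e. it lives in noise space.
    return torch.mean((noise_pred - noise_true) ** 2)

def val_metric_image_space(generated, target):
    # val/mae compares the fully sampled image to the ground-truth image,
    # i.e. it lives in image space after the whole stochastic reverse process.
    return torch.mean(torch.abs(generated - target))
```

A near-perfect noise predictor can therefore log a tiny training loss while the sampled images still show a large image-space error.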
My training results were not good at first either. With the default MSE loss, after training for 1200 epochs the colorized images had a severe color cast. After switching to a different loss function it got better, but I never ran into the case where val only produces noise. Attaching some poor val images.
My training set has only 1500 images, but after changing the loss function the results on the test set are acceptable, so 10k images shouldn't overfit that easily. You could try running inference on the training images; you may find you can't get good colorization even on the training set. I suggest trying a different loss function. My current training results are decent.
Hey, glad to hear it! At least it proves the correctness of this repo. Would you mind sharing more details of the loss function? Is it an image-level loss function, for example a structural-similarity loss?
BW, Jingsong
The hybrid loss function you mentioned is a mixture of the true variational lower bound and BCE.
Am I correct?
A hybrid loss function.
No, it's just a very simple mixture.
@1228967342 Hello, I ran into the same problem. Could you share your hybrid loss design?
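The exact mixture was never shared in this thread, so the following is only a hypothetical illustration of what a "simple mixture" of losses could look like: a weighted sum of MSE and L1 on the predicted noise. The function name and weights are made up.

```python
import torch
import torch.nn.functional as F

def simple_hybrid_loss(noise_pred, noise_true, w_mse=1.0, w_l1=1.0):
    # Hypothetical example only: a weighted combination of MSE and L1
    # between predicted and true noise. The commenter's actual mixture
    # was not disclosed.
    return w_mse * F.mse_loss(noise_pred, noise_true) + \
           w_l1 * F.l1_loss(noise_pred, noise_true)
```

Any such combination drops into the training loop wherever the plain MSE loss was computed.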
I downloaded the Flickr25k dataset, preprocessed it, and trained a model with these modifications in the config file:
The rest of the configuration remained as in the current config file. Even after 1000 training epochs, the model still produces bad results.
Is there anything I'm missing? Thanks.