NVlabs / stylegan2-ada-pytorch

StyleGAN2-ADA - Official PyTorch implementation
https://arxiv.org/abs/2006.06676

The model collapsed after 5,000 kimg !!! #159

Open thusinh1969 opened 3 years ago

thusinh1969 commented 3 years ago

Describe the bug
{"results": {"fid50k_full": 63.96892505971819}, "metric": "fid50k_full", "total_time": 531.3764505386353, "total_time_str": "8m 51s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-001000.pkl", "timestamp": 1628160268.1425486}
{"results": {"fid50k_full": 58.525298795452244}, "metric": "fid50k_full", "total_time": 499.5405957698822, "total_time_str": "8m 20s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-001200.pkl", "timestamp": 1628173970.3931735}
{"results": {"fid50k_full": 59.17925678982728}, "metric": "fid50k_full", "total_time": 511.87381625175476, "total_time_str": "8m 32s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-001400.pkl", "timestamp": 1628187664.9752276}
{"results": {"fid50k_full": 58.91571258942934}, "metric": "fid50k_full", "total_time": 490.9983820915222, "total_time_str": "8m 11s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-001600.pkl", "timestamp": 1628201344.2747865}
{"results": {"fid50k_full": 65.731556858339}, "metric": "fid50k_full", "total_time": 535.4826281070709, "total_time_str": "8m 55s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-001800.pkl", "timestamp": 1628215060.7330396}
{"results": {"fid50k_full": 63.375963647276905}, "metric": "fid50k_full", "total_time": 511.8039803504944, "total_time_str": "8m 32s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-002000.pkl", "timestamp": 1628228768.9927657}
{"results": {"fid50k_full": 56.59078343235615}, "metric": "fid50k_full", "total_time": 486.64528465270996, "total_time_str": "8m 07s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-002200.pkl", "timestamp": 1628242459.7890496}
{"results": {"fid50k_full": 64.13712240926766}, "metric": "fid50k_full", "total_time": 518.2649567127228, "total_time_str": "8m 38s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-002400.pkl", "timestamp": 1628256193.5893624}
{"results": {"fid50k_full": 61.16040457778731}, "metric": "fid50k_full", "total_time": 512.4933640956879, "total_time_str": "8m 32s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-002600.pkl", "timestamp": 1628269920.9185064}
{"results": {"fid50k_full": 65.9086423577536}, "metric": "fid50k_full", "total_time": 513.1752188205719, "total_time_str": "8m 33s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-002800.pkl", "timestamp": 1628283618.4029875}
{"results": {"fid50k_full": 58.66206269264016}, "metric": "fid50k_full", "total_time": 504.11704897880554, "total_time_str": "8m 24s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-003000.pkl", "timestamp": 1628297485.3196752}
{"results": {"fid50k_full": 393.68631320557506}, "metric": "fid50k_full", "total_time": 498.02411937713623, "total_time_str": "8m 18s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-003200.pkl", "timestamp": 1628311353.32463}
{"results": {"fid50k_full": 663.4169879390231}, "metric": "fid50k_full", "total_time": 499.12365078926086, "total_time_str": "8m 19s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-003400.pkl", "timestamp": 1628325105.837137}
{"results": {"fid50k_full": 441.2410973363145}, "metric": "fid50k_full", "total_time": 532.7644124031067, "total_time_str": "8m 53s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-003600.pkl", "timestamp": 1628338763.0266988}
{"results": {"fid50k_full": 384.7078889179583}, "metric": "fid50k_full", "total_time": 486.72313618659973, "total_time_str": "8m 07s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-003800.pkl", "timestamp": 1628352413.0977838}
{"results": {"fid50k_full": 360.7007085293537}, "metric": "fid50k_full", "total_time": 485.4319291114807, "total_time_str": "8m 05s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-004000.pkl", "timestamp": 1628365983.0285559}
{"results": {"fid50k_full": 402.3055173261641}, "metric": "fid50k_full", "total_time": 482.9598653316498, "total_time_str": "8m 03s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-004200.pkl", "timestamp": 1628379510.9942532}
{"results": {"fid50k_full": 663.417733276038}, "metric": "fid50k_full", "total_time": 480.29705834388733, "total_time_str": "8m 00s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-004400.pkl", "timestamp": 1628393103.6210883}
{"results": {"fid50k_full": 663.417733276038}, "metric": "fid50k_full", "total_time": 493.3043894767761, "total_time_str": "8m 13s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-004600.pkl", "timestamp": 1628406707.223027}
{"results": {"fid50k_full": 663.417733276038}, "metric": "fid50k_full", "total_time": 502.3715898990631, "total_time_str": "8m 22s", "num_gpus": 2, "snapshot_pkl": "network-snapshot-004800.pkl", "timestamp": 1628420324.29108}
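[Editor's note] A minimal sketch of scanning these per-snapshot records for the collapse point, assuming they come from the metric-fid50k_full.jsonl file that training writes into the run directory (the path and the threshold of 100 below are assumptions, not part of the original report):

import json

# Hypothetical path and cutoff -- adjust to the actual run directory and FID plateau.
METRIC_FILE = "./results/00000-images-cond-mirror-paper512/metric-fid50k_full.jsonl"
COLLAPSE_THRESHOLD = 100.0  # assumed: well above the ~56-66 plateau seen before kimg 3200

last_good = None
with open(METRIC_FILE) as f:
    for line in f:
        rec = json.loads(line)
        fid = rec["results"]["fid50k_full"]
        snap = rec["snapshot_pkl"]
        if fid < COLLAPSE_THRESHOLD:
            last_good = (snap, fid)  # most recent healthy snapshot so far
        else:
            print(f"Collapse detected at {snap} (FID {fid:.1f})")
            break

print("Last snapshot worth resuming from:", last_good)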

Training: python train.py --outdir="./results/" --gpus=2 --batch=16 --data="./images" --cond=True --mirror=True --cfg=paper512 --kimg=25000 --aug=ada --resume="./results/network-snapshot-001400.pkl"

All generated samples are black! What a waste of time. Can someone help explain why? Is it the augmentation? Same model, same code, no changes. Custom dataset of 200k images, 20 classes of indoor and outdoor furniture.

Thanks, Steve

thusinh1969 commented 3 years ago

I am changing cfg to stylegan2 (which changes gamma and EMA substantially) and reducing the learning rate of both G and D, with two runs at different rates: 1e-4 and 3e-4 (applied to both). Continuing training from the last non-collapsed snapshot. Let's see.

Steve
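
[Editor's note] For what it's worth, a sketch of what that retry might look like on the command line, reusing the flags from the run above; the resume snapshot shown is simply the last non-collapsed one in the log, and as far as I know per-network learning rates are not exposed as train.py flags in this repo, so the 'lrate' values in the cfg presets inside train.py would have to be edited instead:

python train.py --outdir="./results/" --gpus=2 --batch=16 --data="./images" --cond=True --mirror=True --cfg=stylegan2 --kimg=25000 --aug=ada --resume="./results/network-snapshot-003000.pkl"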

Continue7777 commented 3 years ago

Using --aug=noaug looks better, but it still ends in mode collapse.
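
[Editor's note] For reference, a sketch of such a run with ADA disabled, assuming the flag set from the command earlier in the thread (the resume snapshot is illustrative):

python train.py --outdir="./results/" --gpus=2 --batch=16 --data="./images" --cond=True --mirror=True --cfg=paper512 --kimg=25000 --aug=noaug --resume="./results/network-snapshot-003000.pkl"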

thusinh1969 commented 3 years ago

Using --aug=noaug looks better, but it still ends in mode collapse.

What do you mean by "looks better"?

Weird that the exact same model leads to mode collapse?

Steve

Continue7777 commented 3 years ago

I fixed my dataset and so far the output looks good. [image]

thusinh1969 commented 3 years ago

I fixed my dataset and so far the output looks good. [image]

What happened to the "weird parts" in the body and face? I have them in my results as well after 25,000 kimg...!

Continue7777 commented 3 years ago

I have trained down to an FID of 9 and it still has these bad parts; it may be the size of the dataset (7.5k after x-flip) and the alignment. Training is too expensive o(╥﹏╥)o

thusinh1969 commented 3 years ago

I have trained down to an FID of 9 and it still has these bad parts; it may be the size of the dataset (7.5k after x-flip) and the alignment. Training is too expensive o(╥﹏╥)o

My dataset is large, nearly 100k images. FID never got close to 40, and training is past 50,000 kimg now!!! It has already cost me 5k USD on GCP. Something is not right.

Steve

moyix commented 3 years ago

I think I'm seeing the same bug - after 6600 kimg (dataset size is ~150K images) FID shot up from ~7.8 to 300-500. The samples look like:

https://imgur.com/a/OOn7EOw

zhanjiahui commented 3 years ago

I have a similar bug too!! After 10,080 kimg: [image]

After 25,000 kimg: [image]

Maybe the color transformations are leaking into my generator? The augment value in my log.txt looks terrible too. [image]
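
[Editor's note] If leaking color transforms are the suspicion, one thing that could be tried (my suggestion, not something confirmed in this thread) is restricting the ADA pipeline to blit+geometric and tightening the ADA target via train.py's --augpipe and --target options; whether this fixes this particular collapse is an assumption. Placeholders in angle brackets stand in for the poster's actual paths:

python train.py --outdir=<outdir> --gpus=<N> --data=<dataset> --cfg=paper512 --aug=ada --augpipe=bg --target=0.5 --resume=<last good snapshot>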

49xxy commented 2 years ago

I have a similar bug too!! After 10,080 kimg: [image]

After 25,000 kimg: [image]

Maybe the color transformations are leaking into my generator? The augment value in my log.txt looks terrible too. [image]

Hi, why are your fakes so swirly?

zhanjiahui commented 2 years ago

I have a similar bug too!! After 10,080 kimg: [image] After 25,000 kimg: [image] Maybe the color transformations are leaking into my generator? The augment value in my log.txt looks terrible too. [image]

Hi, why are your fakes so swirly?

I don't know; maybe it is related to the data augmentation.

xLuge commented 2 years ago

I have the same problem. How should I solve it?

JavinYang commented 1 year ago

I have the same problem. How should I solve it?

Has this problem been solved? I have the same problem.

JavinYang commented 1 year ago

I think I'm seeing the same bug - after 6600 kimg (dataset size is ~150K images) FID shot up from ~7.8 to 300-500. The samples look like:

https://imgur.com/a/OOn7EOw

why?