Zhendong-Wang / Diffusion-GAN

Official PyTorch implementation for paper: Diffusion-GAN: Training GANs with Diffusion
MIT License

RuntimeError: Both events must be recorded before calculating elapsed time. #22

Status: Closed (octadion closed this issue 1 year ago)

octadion commented 1 year ago

Bug description: training crashes after evaluating metrics.

Training: diffusion-stylegan2

Environment: same as environment.yml

Cfg: cfg=paper256 --aug no --target 0.6 --noise_sd 0.05 --ts_dist priority

Loading training set...

Num images: 20000
Image shape: [3, 256, 256]
Label shape: [0]

Constructing networks...
Setting up augmentation...
Resuming from "/home/octa/diffusion-gan/Diffusion-GAN/pretrained/diffusion-stylegan2-lsun-bedroom.pkl"
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Generator             Parameters  Buffers  Output shape         Datatype

mapping.fc0           262656      -        [16, 512]            float32
mapping.fc1           262656      -        [16, 512]            float32
mapping.fc2           262656      -        [16, 512]            float32
mapping.fc3           262656      -        [16, 512]            float32
mapping.fc4           262656      -        [16, 512]            float32
mapping.fc5           262656      -        [16, 512]            float32
mapping.fc6           262656      -        [16, 512]            float32
mapping.fc7           262656      -        [16, 512]            float32
mapping               -           512      [16, 14, 512]        float32
synthesis.b4.conv1    2622465     32       [16, 512, 4, 4]      float32
synthesis.b4.torgb    264195      -        [16, 3, 4, 4]        float32
synthesis.b4:0        8192        16       [16, 512, 4, 4]      float32
synthesis.b4:1        -           -        [16, 512, 4, 4]      float32
synthesis.b8.conv0    2622465     80       [16, 512, 8, 8]      float32
synthesis.b8.conv1    2622465     80       [16, 512, 8, 8]      float32
synthesis.b8.torgb    264195      -        [16, 3, 8, 8]        float32
synthesis.b8:0        -           16       [16, 512, 8, 8]      float32
synthesis.b8:1        -           -        [16, 512, 8, 8]      float32
synthesis.b16.conv0   2622465     272      [16, 512, 16, 16]    float32
synthesis.b16.conv1   2622465     272      [16, 512, 16, 16]    float32
synthesis.b16.torgb   264195      -        [16, 3, 16, 16]      float32
synthesis.b16:0       -           16       [16, 512, 16, 16]    float32
synthesis.b16:1       -           -        [16, 512, 16, 16]    float32
synthesis.b32.conv0   2622465     1040     [16, 512, 32, 32]    float16
synthesis.b32.conv1   2622465     1040     [16, 512, 32, 32]    float16
synthesis.b32.torgb   264195      -        [16, 3, 32, 32]      float16
synthesis.b32:0       -           16       [16, 512, 32, 32]    float16
synthesis.b32:1       -           -        [16, 512, 32, 32]    float32
synthesis.b64.conv0   1442561     4112     [16, 256, 64, 64]    float16
synthesis.b64.conv1   721409      4112     [16, 256, 64, 64]    float16
synthesis.b64.torgb   132099      -        [16, 3, 64, 64]      float16
synthesis.b64:0       -           16       [16, 256, 64, 64]    float16
synthesis.b64:1       -           -        [16, 256, 64, 64]    float32
synthesis.b128.conv0  426369      16400    [16, 128, 128, 128]  float16
synthesis.b128.conv1  213249      16400    [16, 128, 128, 128]  float16
synthesis.b128.torgb  66051       -        [16, 3, 128, 128]    float16
synthesis.b128:0      -           16       [16, 128, 128, 128]  float16
synthesis.b128:1      -           -        [16, 128, 128, 128]  float32
synthesis.b256.conv0  139457      65552    [16, 64, 256, 256]   float16
synthesis.b256.conv1  69761       65552    [16, 64, 256, 256]   float16
synthesis.b256.torgb  33027       -        [16, 3, 256, 256]    float16
synthesis.b256:0      -           16       [16, 64, 256, 256]   float16
synthesis.b256:1      -           -        [16, 64, 256, 256]   float32

Total                 24767458    175568   -                    -

Discriminator  Parameters  Buffers  Output shape         Datatype

b256.fromrgb   256         16       [16, 64, 256, 256]   float16
b256.skip      8192        16       [16, 128, 128, 128]  float16
b256.conv0     36928       16       [16, 64, 256, 256]   float16
b256.conv1     73856       16       [16, 128, 128, 128]  float16
b256           -           16       [16, 128, 128, 128]  float16
b128.skip      32768       16       [16, 256, 64, 64]    float16
b128.conv0     147584      16       [16, 128, 128, 128]  float16
b128.conv1     295168      16       [16, 256, 64, 64]    float16
b128           -           16       [16, 256, 64, 64]    float16
b64.skip       131072      16       [16, 512, 32, 32]    float16
b64.conv0      590080      16       [16, 256, 64, 64]    float16
b64.conv1      1180160     16       [16, 512, 32, 32]    float16
b64            -           16       [16, 512, 32, 32]    float16
b32.skip       262144      16       [16, 512, 16, 16]    float16
b32.conv0      2359808     16       [16, 512, 32, 32]    float16
b32.conv1      2359808     16       [16, 512, 16, 16]    float16
b32            -           16       [16, 512, 16, 16]    float16
b16.skip       262144      16       [16, 512, 8, 8]      float32
b16.conv0      2359808     16       [16, 512, 16, 16]    float32
b16.conv1      2359808     16       [16, 512, 8, 8]      float32
b16            -           16       [16, 512, 8, 8]      float32
b8.skip        262144      16       [16, 512, 4, 4]      float32
b8.conv0       2359808     16       [16, 512, 8, 8]      float32
b8.conv1       2359808     16       [16, 512, 4, 4]      float32
b8             -           16       [16, 512, 4, 4]      float32
mapping.embed  1024        -        [16, 512]            float32
mapping.fc0    262656      -        [16, 512]            float32
mapping.fc1    262656      -        [16, 512]            float32
mapping.fc2    262656      -        [16, 512]            float32
mapping.fc3    262656      -        [16, 512]            float32
mapping.fc4    262656      -        [16, 512]            float32
mapping.fc5    262656      -        [16, 512]            float32
mapping.fc6    262656      -        [16, 512]            float32
mapping.fc7    262656      -        [16, 512]            float32
b4.mbstd       -           -        [16, 513, 4, 4]      float32
b4.conv        2364416     16       [16, 512, 4, 4]      float32
b4.fc          4194816     -        [16, 512]            float32
b4.out         262656      -        [16, 512]            float32
b4             -           -        [16, 1]              float32

Total          26365504    416      -                    -

Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Skipping tfevents export: No module named 'tensorboard'
Training for 25000 kimg...

tick 12450 kimg 50212.2 time 1m 20s sec/tick 12.6 sec/kimg 197.43 maintenance 66.9 cpumem 3.70 gpumem 11.26 augment 0.997 T 10.0
Evaluating metrics...
{"results": {"fid50k_full": 271.45591156832063}, "metric": "fid50k_full", "total_time": 540.888575553894, "total_time_str": "9m 01s", "num_gpus": 1, "snapshot_pkl": "network-snapshot.pkl", "timestamp": 1680022555.4954345}
Traceback (most recent call last):
  File "/home/octa/diffusion-gan/Diffusion-GAN/diffusion-stylegan2/train.py", line 531, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/octa/diffusion-gan/Diffusion-GAN/diffusion-stylegan2/train.py", line 524, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "/home/octa/diffusion-gan/Diffusion-GAN/diffusion-stylegan2/train.py", line 357, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/home/octa/diffusion-gan/Diffusion-GAN/diffusion-stylegan2/training/training_loop.py", line 437, in training_loop
    value = phase.start_event.elapsed_time(phase.end_event)
  File "/home/octa/anaconda3/envs/difgan/lib/python3.9/site-packages/torch/cuda/streams.py", line 204, in elapsed_time
    return super(Event, self).elapsed_time(end_event)
RuntimeError: Both events must be recorded before calculating elapsed time.

octadion commented 1 year ago

Oh, my bad. The issue is actually solved in #9.

Jiangshouyu1 commented 6 months ago

Hello, may I ask how you resolved it?

Jiangshouyu1 commented 6 months ago

@octadion
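
For readers who hit the same traceback: this thread does not show what the fix in #9 actually is, so the following is only a generic sketch of one way to avoid the crash, not the repository's solution. The error is raised when `elapsed_time` is called on a pair of CUDA events that were not both recorded, so a small guard can skip the timing value in that case. The helper name `safe_elapsed_ms` and the `FakeEvent` stand-in (used so the example runs without a GPU) are hypothetical and do not come from the Diffusion-GAN code.

```python
from typing import Optional


def safe_elapsed_ms(start_event, end_event) -> Optional[float]:
    """Return the elapsed time in ms between two CUDA-style events.

    Returns None instead of raising when either event is missing or
    was never recorded (the situation reported in the traceback).
    """
    if start_event is None or end_event is None:
        return None
    try:
        # Wait for the end event so the measurement is complete.
        end_event.synchronize()
        return start_event.elapsed_time(end_event)
    except RuntimeError:
        # Raised when an event was not recorded; treat as "no timing".
        return None


class FakeEvent:
    """Hypothetical stand-in for torch.cuda.Event, for demonstration only."""

    def __init__(self, recorded: bool, t_ms: float = 0.0):
        self.recorded = recorded
        self.t_ms = t_ms

    def synchronize(self) -> None:
        pass

    def elapsed_time(self, end_event: "FakeEvent") -> float:
        if not (self.recorded and end_event.recorded):
            raise RuntimeError(
                "Both events must be recorded before calculating elapsed time."
            )
        return end_event.t_ms - self.t_ms


if __name__ == "__main__":
    print(safe_elapsed_ms(FakeEvent(True, 1.0), FakeEvent(True, 3.5)))
    print(safe_elapsed_ms(FakeEvent(True, 1.0), FakeEvent(False)))
```

In a training loop, the call that fails in the traceback (`phase.start_event.elapsed_time(phase.end_event)`) could be routed through such a guard, with the timing statistic simply skipped whenever it returns `None`. Whether that matches the behavior intended by the authors is an assumption; check #9 for the actual change.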