NVlabs / stylegan3

Official PyTorch implementation of StyleGAN3

Runtime error on training run #100

Open crazyMage17 opened 2 years ago

crazyMage17 commented 2 years ago

I have a Windows 10 setup with 4 Nvidia Quadro RTX 8GB graphics cards. As far as I can tell all the drivers are up to date; I am using CUDA 11.5 and Python 3.8.7. Can anyone help out?

Using this command line

python train.py --outdir=F:/training-runs --cfg=stylegan3-t --data=F:/datasets/tester.zip --gpus=4 --batch=32 --gamma=32 --batch-gpu=2 --mbstd-group=2

It runs through the first stages okay, then hits the error at the end.

Output from the command

Num images:  10012
Image shape: [3, 1024, 1024]
Label shape: [0]

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "filtered_lrelu_plugin"... Done.

Generator                      Parameters  Buffers  Output shape         Datatype
---                            ---         ---      ---                  ---
mapping.fc0                    262656      -        [2, 512]             float32
mapping.fc1                    262656      -        [2, 512]             float32
mapping                        -           512      [2, 16, 512]         float32
synthesis.input.affine         2052        -        [2, 4]               float32
synthesis.input                262144      1545     [2, 512, 36, 36]     float32
synthesis.L0_36_512.affine     262656      -        [2, 512]             float32
synthesis.L0_36_512            2359808     25       [2, 512, 36, 36]     float32
synthesis.L1_36_512.affine     262656      -        [2, 512]             float32
synthesis.L1_36_512            2359808     25       [2, 512, 36, 36]     float32
synthesis.L2_52_512.affine     262656      -        [2, 512]             float32
synthesis.L2_52_512            2359808     37       [2, 512, 52, 52]     float32
synthesis.L3_52_512.affine     262656      -        [2, 512]             float32
synthesis.L3_52_512            2359808     25       [2, 512, 52, 52]     float32
synthesis.L4_84_512.affine     262656      -        [2, 512]             float32
synthesis.L4_84_512            2359808     37       [2, 512, 84, 84]     float32
synthesis.L5_148_512.affine    262656      -        [2, 512]             float32
synthesis.L5_148_512           2359808     37       [2, 512, 148, 148]   float16
synthesis.L6_148_512.affine    262656      -        [2, 512]             float32
synthesis.L6_148_512           2359808     25       [2, 512, 148, 148]   float16
synthesis.L7_276_323.affine    262656      -        [2, 512]             float32
synthesis.L7_276_323           1488707     37       [2, 323, 276, 276]   float16
synthesis.L8_276_203.affine    165699      -        [2, 323]             float32
synthesis.L8_276_203           590324      25       [2, 203, 276, 276]   float16
synthesis.L9_532_128.affine    104139      -        [2, 203]             float32
synthesis.L9_532_128           233984      37       [2, 128, 532, 532]   float16
synthesis.L10_1044_81.affine   65664       -        [2, 128]             float32
synthesis.L10_1044_81          93393       37       [2, 81, 1044, 1044]  float16
synthesis.L11_1044_51.affine   41553       -        [2, 81]              float32
synthesis.L11_1044_51          37230       25       [2, 51, 1044, 1044]  float16
synthesis.L12_1044_32.affine   26163       -        [2, 51]              float32
synthesis.L12_1044_32          14720       25       [2, 32, 1044, 1044]  float16
synthesis.L13_1024_32.affine   16416       -        [2, 32]              float32
synthesis.L13_1024_32          9248        25       [2, 32, 1024, 1024]  float16
synthesis.L14_1024_3.affine    16416       -        [2, 32]              float32
synthesis.L14_1024_3           99          1        [2, 3, 1024, 1024]   float16
synthesis                      -           -        [2, 3, 1024, 1024]   float32
---                            ---         ---      ---                  ---
Total                          22313167    2480     -                    -

Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Discriminator                  Parameters  Buffers  Output shape         Datatype
---                            ---         ---      ---                  ---
b1024.fromrgb                  128         16       [2, 32, 1024, 1024]  float16
b1024.skip                     2048        16       [2, 64, 512, 512]    float16
b1024.conv0                    9248        16       [2, 32, 1024, 1024]  float16
b1024.conv1                    18496       16       [2, 64, 512, 512]    float16
b1024                          -           16       [2, 64, 512, 512]    float16
b512.skip                      8192        16       [2, 128, 256, 256]   float16
b512.conv0                     36928       16       [2, 64, 512, 512]    float16
b512.conv1                     73856       16       [2, 128, 256, 256]   float16
b512                           -           16       [2, 128, 256, 256]   float16
b256.skip                      32768       16       [2, 256, 128, 128]   float16
b256.conv0                     147584      16       [2, 128, 256, 256]   float16
b256.conv1                     295168      16       [2, 256, 128, 128]   float16
b256                           -           16       [2, 256, 128, 128]   float16
b128.skip                      131072      16       [2, 512, 64, 64]     float16
b128.conv0                     590080      16       [2, 256, 128, 128]   float16
b128.conv1                     1180160     16       [2, 512, 64, 64]     float16
b128                           -           16       [2, 512, 64, 64]     float16
b64.skip                       262144      16       [2, 512, 32, 32]     float32
b64.conv0                      2359808     16       [2, 512, 64, 64]     float32
b64.conv1                      2359808     16       [2, 512, 32, 32]     float32
b64                            -           16       [2, 512, 32, 32]     float32
b32.skip                       262144      16       [2, 512, 16, 16]     float32
b32.conv0                      2359808     16       [2, 512, 32, 32]     float32
b32.conv1                      2359808     16       [2, 512, 16, 16]     float32
b32                            -           16       [2, 512, 16, 16]     float32
b16.skip                       262144      16       [2, 512, 8, 8]       float32
b16.conv0                      2359808     16       [2, 512, 16, 16]     float32
b16.conv1                      2359808     16       [2, 512, 8, 8]       float32
b16                            -           16       [2, 512, 8, 8]       float32
b8.skip                        262144      16       [2, 512, 4, 4]       float32
b8.conv0                       2359808     16       [2, 512, 8, 8]       float32
b8.conv1                       2359808     16       [2, 512, 4, 4]       float32
b8                             -           16       [2, 512, 4, 4]       float32
b4.mbstd                       -           -        [2, 513, 4, 4]       float32
b4.conv                        2364416     16       [2, 512, 4, 4]       float32
b4.fc                          4194816     -        [2, 512]             float32
b4.out                         513         -        [2, 1]               float32
---                            ---         ---      ---                  ---
Total                          29012513    544      -                    -

Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

Traceback (most recent call last):
  File "train.py", line 286, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "train.py", line 281, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "train.py", line 98, in launch_training
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(c, temp_dir), nprocs=c.num_gpus)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 188, in start_processes
    while not context.join():
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\multiprocessing\spawn.py", line 59, in _wrap
    fn(i, *args)
  File "C:\stylegan3-main\train.py", line 47, in subprocess_fn
    training_loop.training_loop(rank=rank, **c)
  File "C:\stylegan3-main\training\training_loop.py", line 287, in training_loop
    torch.distributed.all_reduce(flat)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1292, in all_reduce
    work.wait()
RuntimeError: [..\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 1800000ms for recv operation to complete
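For context on the error message: the failing call is torch.distributed.all_reduce over the gloo backend (the backend StyleGAN3 uses on Windows, where NCCL is unavailable), and the reported 1800000 ms is gloo's default 30-minute collective timeout. A minimal sketch of that arithmetic, plus a hypothetical (untested for this setup) way to request a longer timeout via torch.distributed.init_process_group's timeout parameter:

```python
import datetime

# The error reports "Timed out waiting 1800000ms" -- convert to a timedelta.
GLOO_TIMEOUT_MS = 1_800_000
timeout = datetime.timedelta(milliseconds=GLOO_TIMEOUT_MS)
print(timeout)  # 0:30:00, i.e. the 30-minute default

# Hypothetical workaround sketch (not confirmed in this thread): pass a larger
# timeout where the process group is created, e.g. in training_loop.py:
#
# torch.distributed.init_process_group(
#     backend='gloo', init_method=init_method, rank=rank,
#     world_size=num_gpus, timeout=datetime.timedelta(hours=3))
```

This only buys more time; if one rank has genuinely hung (e.g. a stalled worker process on one GPU), the all_reduce will still time out eventually.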

814087820 commented 1 year ago

Did you fix this, please?

lijain commented 1 year ago

Did you fix this, please?