ShuhongChen / panic3d-anime-reconstruction

CVPR 2023: PAniC-3D Stylized Single-view 3D Reconstruction from Portraits of Anime Characters
https://github.com/ShuhongChen/panic3d-anime-reconstruction

'elevations' key error in training script and performance questions #36

Closed TomoyaYamada194 closed 11 months ago

TomoyaYamada194 commented 11 months ago

Sorry for asking so many questions. I get the following error:

 Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 542, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 537, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 112, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 61, in subprocess_fn
    getattr(training_loop, c.training_loop_version)(rank=rank, **c)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/training_loop_v0.py", line 360, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, real_cond=real_cond, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/loss_orthocondA.py", line 483, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(gen_z, gen_c, real_cond, swapping_prob=swapping_prob, neural_rendering_resolution=neural_rendering_resolution)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/loss_orthocondA.py", line 171, in run_G
    gen_output = self.G.f({
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/triplane.py", line 404, in f
    x['elevations'][i],
 KeyError: 'elevations'

Commenting out x['elevations'][i] causes x['azimuths'][0] to be a key error. You also use 'elevation' in generate.py and other scripts, do you use what is defined in those scripts?

Also, my environment is as follows

CPU: Core™ i7-13700K
GPU: GeForce RTX 4070
Memory: 32GB
Drive: 2TB

We are currently adjusting batch size, etc., but is it still too hard to run with this computer performance? I would like to get your help in terms of views, parameter settings, etc.

ShuhongChen commented 11 months ago

KeyError: 'elevations' Commenting out x['elevations'][i] causes x['azimuths'][0] to be a key error. You also use 'elevation' in generate.py and other scripts, do you use what is defined in those scripts?

A key error means there is no 'elevations' entry in the dict; please check the contents of x. Elevation is how high the camera is placed, in degrees, so its value differs depending on the purpose of each script.
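One way to check the contents of x is to list which expected keys are missing just before the failing lookup. The sketch below is a hypothetical helper, not part of the repository: the key names 'elevations' and 'azimuths' come from the traceback above, while the function names and the idea of calling it on x before the lookup are assumptions.

```python
def missing_keys(batch, required=("elevations", "azimuths")):
    """Return the required keys that are absent from the dict `batch`."""
    return [key for key in required if key not in batch]


def check_batch(batch, required=("elevations", "azimuths")):
    """Raise a KeyError that lists both the missing and the available keys."""
    missing = missing_keys(batch, required)
    if missing:
        raise KeyError(
            f"missing keys {missing}; available keys: {sorted(batch.keys())}"
        )
    return batch
```

Calling check_batch(x) right before the lookup would show which keys the dataloader actually produced, which is usually more informative than the bare KeyError.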

We are currently adjusting batch size, etc., but is it still too hard to run with this computer performance?

If I remember correctly, the original model was trained with 8 GPUs, each with at least 20 GB of VRAM; looking at your setup, you might be able to run it with minimal batch settings.

TomoyaYamada194 commented 11 months ago

Thank you for your reply

As for the key error, I used the setting that existed on lines 595-596 in the same script by declaring it in the function.

https://github.com/ShuhongChen/panic3d-anime-reconstruction/blob/ed49f931b0fbd7b73f484d10f723d9455a943793/_train/eg3dc/src/training/triplane.py#L595-L596

I have resolved the key error, but now I get the following error

 Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 543, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 538, in main
    launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 113, in launch_training
    subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
  File "/home/dl/prodata/panic3d-anime-reconstruction/_train/eg3dc/trainers/train_eclustrousC.py", line 62, in subprocess_fn
    getattr(training_loop, c.training_loop_version)(rank=rank, **c)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/training_loop_v0.py", line 360, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, real_cond=real_cond, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/loss_orthocondA.py", line 483, in accumulate_gradients
    gen_img, _gen_ws = self.run_G(gen_z, gen_c, real_cond, swapping_prob=swapping_prob, neural_rendering_resolution=neural_rendering_resolution)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/loss_orthocondA.py", line 171, in run_G
    gen_output = self.G.f({
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/triplane.py", line 475, in f
    synth = self.synthesis(
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/triplane.py", line 193, in synthesis
    planes = self.backbone.synthesis(ws, cond, update_emas=update_emas, latent_injection=latent_injection, stop_level=stop_level, **synthesis_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/networks_stylegan2.py", line 550, in forward
    x, img = block(x, img, cur_ws, **block_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/training/networks_stylegan2.py", line 460, in forward
    misc.assert_shape(x, [None, self.in_channels, self.resolution // 2, self.resolution // 2])
  File "/home/dl/prodata/panic3d-anime-reconstruction/./_train/eg3dc/src/torch_utils/misc.py", line 97, in assert_shape
    raise AssertionError(f'Wrong size for dimension {idx}: got {size}, expected {ref_size}')
AssertionError: Wrong size for dimension 1: got 3, expected 1

I would like to know which tensors or parts of the code need to be fixed.
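For readers unfamiliar with this kind of assertion, here is a minimal sketch of how a message like the one above arises. It is a simplified stand-in written for illustration, not the repository's misc.assert_shape: it works on plain shape tuples, and the values of in_channels, resolution, and the input shape are hypothetical.

```python
def assert_shape(shape, ref_shape):
    """Compare a shape tuple to a reference; None in ref_shape matches any size."""
    if len(shape) != len(ref_shape):
        raise AssertionError(
            f"Wrong number of dimensions: got {len(shape)}, expected {len(ref_shape)}"
        )
    for idx, (size, ref_size) in enumerate(zip(shape, ref_shape)):
        if ref_size is None:
            continue  # this dimension is unconstrained (e.g. batch size)
        if size != ref_size:
            raise AssertionError(
                f"Wrong size for dimension {idx}: got {size}, expected {ref_size}"
            )


# Hypothetical values: a block built with in_channels=1 receives a 3-channel input.
in_channels, resolution = 1, 8
try:
    assert_shape((4, 3, 4, 4), [None, in_channels, resolution // 2, resolution // 2])
except AssertionError as err:
    message = str(err)
    print(message)
```

In the call shown in the traceback, dimension 1 is checked against self.in_channels, so the message says the incoming tensor has 3 channels where the block was built to accept 1.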

The options currently in use include:

 --training_loop_version=training_loop_v0 \
 --loss_module=training.loss.StyleGAN2LossOrthoCondA \
 --cond_mode=ortho_front.cond_img_norm_4.concatfront.crossavg_4.reschonk_add_512.inj_6b_4 \
 --data_subset=rutileEB \
 --gpus=1 \
 --batch=4 \
 --kimg=25000 \
 --snap=10 \
 --resume_discrim=./_data/eg3d/networks/ffhqrebalanced512-64.pkl \
 --triplane_depth=3 \
 --triplane_width=16 \
 --sr_channels_hidden=16 \
 --backbone_resolution=256 \
 --cbase_g=100 --cmax_g=32 \
 --cbase_d=32768 --cmax_d=512 \
 --neural_rendering_resolution_initial=48 \
 --neural_rendering_resolution_final=48 \
 --neural_rendering_resolution_fade_kimg=1000 \
 --gamma=5.0 \
 --density_reg=0.5 \
 --dlr=0.0010 \
 --glr=0.0025 \
 --blur_fade_kimg=0 \
 --lambda_gcond_lpips=20.0 \
 --lambda_gcond_l1=4.0 \
 --lambda_gcond_alpha_l2=1.0 \
 --lambda_gcond_depth_l2=1000.0 \
 --lambda_gcond_sides_lpips=20.0 \
 --lambda_gcond_sides_l1=4.0 \
 --lambda_gcond_sides_alpha_l2=1.0 \
 --lambda_gcond_sides_depth_l2=1000.0 \
 --lambda_gcond_back_lpips=20.0 \
 --lambda_gcond_back_l1=4.0 \
 --lambda_gcond_back_alpha_l2=1.0 \
 --lambda_gcond_back_depth_l2=1000.0

I would like to know if there are any mistakes or advice on how to improve the operation.

ShuhongChen commented 11 months ago

As for the key error, I used the setting that existed on lines 595-596 in the same script by declaring it in the function.

The elevation and azimuth need to be set properly based on the true camera angle the training image was rendered from. If the parameters don't appear in x, it's better to debug the dataloader until it loads the right parameters than to set them to a constant zero.
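To illustrate why a constant zero is a poor substitute, the sketch below converts elevation and azimuth angles into a camera position on a sphere. The axis convention (y up, azimuth 0 on the +z axis) and the function name are assumptions made for this example and may not match the repository's own camera code.

```python
import math


def camera_position(elevation_deg, azimuth_deg, radius=1.0):
    """Place a camera on a sphere around the origin.

    Assumed convention (not taken from the repository): y is up, azimuth 0
    sits on the +z axis, and elevation is measured up from the xz-plane.
    """
    elev = math.radians(elevation_deg)
    azim = math.radians(azimuth_deg)
    x = radius * math.cos(elev) * math.sin(azim)
    y = radius * math.sin(elev)
    z = radius * math.cos(elev) * math.cos(azim)
    return (x, y, z)
```

Under this convention, hard-coding both angles to zero places every training image's camera at the same point directly in front of the subject, regardless of where the image was actually rendered from.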

I have resolved the key error, but now I get the following error

I would recommend adding a debug print statement here at inference time to see the correct tensor shape, then comparing it to what prints during your training run that doesn't work.
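A small helper can make that comparison easier to read. This is only a sketch: describe_shapes is a hypothetical name, and it assumes the values of interest expose a .shape attribute, as PyTorch tensors do.

```python
def describe_shapes(tag, **tensors):
    """Format the shapes of named array-like objects as a single log line."""
    parts = []
    for name, value in tensors.items():
        shape = getattr(value, "shape", None)
        if shape is not None:
            parts.append(f"{name}={tuple(shape)}")
        else:
            # Fall back to the type name for values without a shape (e.g. None).
            parts.append(f"{name}={type(value).__name__}")
    return f"[{tag}] " + ", ".join(parts)
```

For example, print(describe_shapes("inference", x=x, img=img)) placed at the same spot in both runs, with a different tag for the training run, gives two lines that can be compared directly.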

TomoyaYamada194 commented 11 months ago

The elevation key error was resolved by changing the point where the rutileE/ was used when creating the front.pkl file back to virtualyoutuberE/.

https://github.com/ShuhongChen/panic3d-anime-reconstruction/blob/ed49f931b0fbd7b73f484d10f723d9455a943793/_databacks/lustrous_renders_v1.py#L152-L153

The tensor shape error has not improved. On inspection, I found that after passing through the following statements, the tensor no longer has the shape I was expecting.

https://github.com/ShuhongChen/panic3d-anime-reconstruction/blob/ed49f931b0fbd7b73f484d10f723d9455a943793/_train/eg3dc/src/training/networks_stylegan2.py#L613-L623

Commenting out this part will result in the following error. I would like to get some ideas on the solution.

RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM

ShuhongChen commented 11 months ago

The elevation key error was resolved by changing the point where the rutileE/ was used when creating the front.pkl file back to virtualyoutuberE/.

These are two different datasets with very different settings, so substituting one for the other like this will not get what you want. The dataloader needs to load the correct camera parameters.

Commenting out this part will result in the following error. I would like to get some ideas on the solution.

Please don't comment parts out of the code; it won't work if you break the wrong things. I still recommend what I suggested last time:

I would recommend adding a debug print statement here at inference time to see the correct tensor shape, then comparing it to what prints during your training run that doesn't work.