chungyiweng / humannerf

HumanNeRF turns a monocular video of moving people into a 360 free-viewpoint video.
MIT License

"KeyError: 'rgb'" when training for monocular dataset #49

Closed cocoshe closed 1 year ago

cocoshe commented 1 year ago

When I train on my own data, I run python train.py --cfg configs/human_nerf/wild/monocular/single_gpu.yaml in my terminal; here is the log:

# python train.py --cfg configs/human_nerf/wild/monocular/single_gpu.yaml
------------------ GPU Configurations ------------------
Primary GPUs: [0]
Secondary GPUs: [0]
--------------------------------------------------------

######################### CONFIG #########################

N_samples: 128
bbox_offset: 0.3
bgcolor: [255.0, 255.0, 255.0]
canonical_mlp:
  i_embed: 0
  mlp_depth: 8
  mlp_width: 256
  module: core.nets.human_nerf.canonical_mlps.mlp_rgb_sigma
  multires: 10
category: human_nerf
chunk: 32768
embedder:
  module: core.nets.human_nerf.embedders.fourier
eval_iter: 10000000
experiment: single_gpu
freeview:
  batch_size: 1
  dataset: monocular_test
  dataset_module: core.data.human_nerf.freeview
  drop_last: False
  frame_idx: 0
  shuffle: False
ignore_non_rigid_motions: False
load_net: latest
logdir: experiments/human_nerf/wild/monocular/single_gpu
lr_updater_module: core.train.trainers.human_nerf.lr_updaters.exp_decay
movement:
  batch_size: 1
  dataset: monocular_test
  dataset_module: core.data.human_nerf.train
  drop_last: False
  shuffle: False
mweight_volume:
  dst_voxel_size: 0.0625
  embedding_size: 256
  module: core.nets.human_nerf.mweight_vol_decoders.deconv_vol_decoder
  volume_size: 32
n_gpus: 1
netchunk_per_gpu: 300000
network_module: core.nets.human_nerf.network
non_rigid_embedder:
  module: core.nets.human_nerf.embedders.hannw_fourier
non_rigid_motion_mlp:
  condition_code_size: 69
  full_band_iter: 200000
  i_embed: 0
  kick_in_iter: 100000
  mlp_depth: 6
  mlp_width: 128
  module: core.nets.human_nerf.non_rigid_motion_mlps.mlp_offset
  multires: 6
  skips: [4]
num_workers: 4
optimizer_module: core.train.optimizers.human_nerf.optimizer
patch:
  N_patches: 6
  sample_subject_ratio: 0.8
  size: 20
perturb: 1.0
pose_decoder:
  embedding_size: 69
  kick_in_iter: 20000
  mlp_depth: 4
  mlp_width: 256
  module: core.nets.human_nerf.pose_decoders.mlp_delta_body_pose
primary_gpus: [0]
progress:
  batch_size: 1
  dataset: monocular_test
  dataset_module: core.data.human_nerf.train
  drop_last: False
  dump_interval: 5000
  shuffle: False
render_folder_name: 
render_frames: 100
render_skip: 1
resize_img_scale: 0.5
resume: False
save_all: True
secondary_gpus: [0]
sex: neutral
show_alpha: False
show_truth: False
subject: monocular
task: wild
test_keyfilter: ['rays', 'target_rgbs', 'motion_bases', 'motion_weights_priors', 'cnl_bbox', 'dst_posevec_69']
total_bones: 24
tpose:
  batch_size: 1
  dataset: monocular_test
  dataset_module: core.data.human_nerf.tpose
  drop_last: False
  shuffle: False
train:
  batch_size: 1
  dataset: monocular_train
  dataset_module: core.data.human_nerf.train
  drop_last: False
  log_interval: 20
  lossweights:
    lpips: 1.0
    mse: 0.2
  lr: 0.0005
  lr_mweight_vol_decoder: 5e-05
  lr_non_rigid_mlp: 5e-05
  lr_pose_decoder: 5e-05
  lrate_decay: 500
  maxiter: 400000
  optimizer: adam
  perturb: 1.0
  ray_shoot_mode: patch
  save_checkpt_interval: 2000
  save_model_interval: 50000
  shuffle: True
train_keyfilter: ['rays', 'motion_bases', 'motion_weights_priors', 'cnl_bbox', 'dst_posevec_69']
trainer_module: core.train.trainers.human_nerf.trainer

##########################################################

********** learnable parameters **********

mweight_vol_decoder.const_embedding: lr = 5e-05
mweight_vol_decoder.decoder.block_mlp.0.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_mlp.0.bias: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.0.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.0.bias: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.2.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.2.bias: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.4.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.4.bias: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.6.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.6.bias: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.8.weight: lr = 5e-05
mweight_vol_decoder.decoder.block_conv.8.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.0.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.0.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.2.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.2.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.4.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.4.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.6.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.6.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.8.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.8.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.10.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.10.bias: lr = 5e-05
non_rigid_mlp.module.block_mlps.12.weight: lr = 5e-05
non_rigid_mlp.module.block_mlps.12.bias: lr = 5e-05
cnl_mlp.module.pts_linears.0.weight: lr = 0.0005
cnl_mlp.module.pts_linears.0.bias: lr = 0.0005
cnl_mlp.module.pts_linears.2.weight: lr = 0.0005
cnl_mlp.module.pts_linears.2.bias: lr = 0.0005
cnl_mlp.module.pts_linears.4.weight: lr = 0.0005
cnl_mlp.module.pts_linears.4.bias: lr = 0.0005
cnl_mlp.module.pts_linears.6.weight: lr = 0.0005
cnl_mlp.module.pts_linears.6.bias: lr = 0.0005
cnl_mlp.module.pts_linears.8.weight: lr = 0.0005
cnl_mlp.module.pts_linears.8.bias: lr = 0.0005
cnl_mlp.module.pts_linears.10.weight: lr = 0.0005
cnl_mlp.module.pts_linears.10.bias: lr = 0.0005
cnl_mlp.module.pts_linears.12.weight: lr = 0.0005
cnl_mlp.module.pts_linears.12.bias: lr = 0.0005
cnl_mlp.module.pts_linears.14.weight: lr = 0.0005
cnl_mlp.module.pts_linears.14.bias: lr = 0.0005
cnl_mlp.module.output_linear.0.weight: lr = 0.0005
cnl_mlp.module.output_linear.0.bias: lr = 0.0005
pose_decoder.block_mlps.0.weight: lr = 5e-05
pose_decoder.block_mlps.0.bias: lr = 5e-05
pose_decoder.block_mlps.2.weight: lr = 5e-05
pose_decoder.block_mlps.2.bias: lr = 5e-05
pose_decoder.block_mlps.4.weight: lr = 5e-05
pose_decoder.block_mlps.4.bias: lr = 5e-05
pose_decoder.block_mlps.6.weight: lr = 5e-05
pose_decoder.block_mlps.6.bias: lr = 5e-05
pose_decoder.block_mlps.8.weight: lr = 5e-05
pose_decoder.block_mlps.8.bias: lr = 5e-05

******************************************

********** Init Trainer ***********
Save checkpoint to experiments/human_nerf/wild/monocular/single_gpu/init.tar ...
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /root/humannerf/third_parties/lpips/weights/v0.1/vgg.pth
Load Progress Dataset ...
[Dataset Path] dataset/wild/monocular
 -- Total Frames: 16
************************************
[Dataset Path] dataset/wild/monocular
 -- Total Frames: 2061
Traceback (most recent call last):
  File "train.py", line 32, in <module>
    main()
  File "train.py", line 26, in main
    train_dataloader=train_loader)
  File "core/train/trainers/human_nerf/trainer.py", line 151, in train
    div_indices=data['patch_div_indices'])
  File "core/train/trainers/human_nerf/trainer.py", line 104, in get_loss
    rgb = net_output['rgb']
KeyError: 'rgb'

I tried to print some values for debugging:

********** Init Trainer ***********
Save checkpoint to experiments/human_nerf/wild/monocular/single_gpu/init.tar ...
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /root/humannerf/third_parties/lpips/weights/v0.1/vgg.pth
Load Progress Dataset ...
[Dataset Path] dataset/wild/monocular
 -- Total Frames: 16
************************************
[Dataset Path] dataset/wild/monocular
 -- Total Frames: 2061
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
dict_keys(['ray_mask', 'rays', 'near', 'far', 'bgcolor', 'patch_div_indices', 'patch_masks', 'target_patches', 'dst_Rs', 'dst_Ts', 'cnl_gtfms', 'motion_weights_priors', 'cnl_bbox_min_xyz', 'cnl_bbox_max_xyz', 'cnl_bbox_scale_xyz', 'dst_posevec', 'iter_val'])
tensor([  0,   0,   0, 400, 400, 400, 400], device='cuda:0')
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1111
dict_keys(['rgb', 'alpha', 'depth'])
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++2222
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
dict_keys(['ray_mask', 'rays', 'near', 'far', 'bgcolor', 'patch_div_indices', 'patch_masks', 'target_patches', 'dst_Rs', 'dst_Ts', 'cnl_gtfms', 'motion_weights_priors', 'cnl_bbox_min_xyz', 'cnl_bbox_max_xyz', 'cnl_bbox_scale_xyz', 'dst_posevec', 'iter_val'])
tensor([  0,   0, 400, 400, 400, 400, 400], device='cuda:0')
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1111
dict_keys(['rgb', 'alpha', 'depth'])
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++2222
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
dict_keys(['ray_mask', 'rays', 'near', 'far', 'bgcolor', 'patch_div_indices', 'patch_masks', 'target_patches', 'dst_Rs', 'dst_Ts', 'cnl_gtfms', 'motion_weights_priors', 'cnl_bbox_min_xyz', 'cnl_bbox_max_xyz', 'cnl_bbox_scale_xyz', 'dst_posevec', 'iter_val'])
tensor([0, 0, 0, 0, 0, 0, 0], device='cuda:0')
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++1111
dict_keys([])
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++2222
Traceback (most recent call last):
  File "train.py", line 32, in <module>
    main()
  File "train.py", line 26, in main
    train_dataloader=train_loader)
  File "core/train/trainers/human_nerf/trainer.py", line 157, in train
    div_indices=data['patch_div_indices'])
  File "core/train/trainers/human_nerf/trainer.py", line 106, in get_loss
    rgb = net_output['rgb']
KeyError: 'rgb'

I just cloned the project and followed the readme:

  1. made the directory dataset/wild/monocular,
  2. prepared the dataset and put it in dataset/wild/monocular:

[image]

But when I train, it fails.

cocoshe commented 1 year ago

And I just found that select_inds is empty in some of my batches. Here's the log and debug print:

[image]

[image]

qzane commented 1 year ago

I have the same issue, but it only happens when I'm training with single_gpu.yaml; it's okay when I train with adventure.yaml. It also happens very randomly: I got this error at the 6th epoch, and some progress images like 'prog_000300.jpg' had already been generated successfully in previous epochs.

cocoshe commented 1 year ago

I have the same issue, but it only happens when I'm training with single_gpu.yaml; it's okay when I train with adventure.yaml. It also happens very randomly: I got this error at the 6th epoch, and some progress images like 'prog_000300.jpg' had already been generated successfully in previous epochs.

Yes, it happens very randomly. I also checked the frame_name and tried to find the problem. Even when I prepare only 2 frames, training runs fine for some batches at first, but then the input data becomes empty... I found that select_inds is empty, so it can't sample any rays; that's why the input data is empty, and in turn the output data is empty.
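
A small guard along these lines (a sketch, not the repo's code; the dict layout and values come from the debug print above) can detect such empty batches before get_loss() fails on net_output['rgb']:

import torch

def batch_has_rays(data):
    # In the debug print above, a healthy batch ends with 400 sampled rays
    # (tensor([0, 0, 0, 400, 400, 400, 400])), while the crashing batch is
    # tensor([0, 0, 0, 0, 0, 0, 0]), i.e. zero rays in every patch.
    return int(data['patch_div_indices'][-1]) > 0

good = {'patch_div_indices': torch.tensor([0, 0, 0, 400, 400, 400, 400])}
bad = {'patch_div_indices': torch.tensor([0, 0, 0, 0, 0, 0, 0])}
print(batch_has_rays(good), batch_has_rays(bad))  # True False

Skipping such a batch in the training loop (e.g. a continue before the loss is computed) avoids the crash, although it does not explain why the sampler returned no rays in the first place.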

And I only have one GPU, so can I run adventure.yaml? Or have you found out how to solve the single_gpu.yaml problem?

cocoshe commented 1 year ago

I have the same issue, but it only happens when I'm training with single_gpu.yaml; it's okay when I train with adventure.yaml. It also happens very randomly: I got this error at the 6th epoch, and some progress images like 'prog_000300.jpg' had already been generated successfully in previous epochs.

I just tried adventure.yaml but got the same error, still the rgb KeyError. Maybe something is wrong with my own dataset? Could you share your dataset (maybe via Google Drive)? I have been stuck here for days and still can't figure out the reason. I'd appreciate it!

qzane commented 1 year ago

Well, I'm just playing with the standard ZJU-MoCap dataset with subject 387. I just found that if I'm training on my local desktop (which has two Quadro RTX 6000, 24GB, CUDA 11.8), it never fails, but if I'm training on a remote server (which has two RTX 3090, 24GB, CUDA 11.6), it will fail at some point. I really don't understand this either.

cocoshe commented 1 year ago

Well, I'm just playing with the standard ZJU-MoCap dataset with subject 387. I just found that if I'm training on my local desktop (which has two Quadro RTX 6000, 24GB, CUDA 11.8), it never fails, but if I'm training on a remote server (which has two RTX 3090, 24GB, CUDA 11.6), it will fail at some point. I really don't understand this either.

My own computer's GPU is AMD, so I can't run it locally. I ran the code on a cloud server (Tesla V100, 32GB, CUDA 11.0) and it failed with the rgb KeyError. Strange... I don't get it.

qzane commented 1 year ago

Well, I'm just playing with the standard ZJU-MoCap dataset with subject 387. I just found that if I'm training on my local desktop (which has two Quadro RTX 6000, 24GB, CUDA 11.8), it never fails, but if I'm training on a remote server (which has two RTX 3090, 24GB, CUDA 11.6), it will fail at some point. I really don't understand this either.

My own computer's GPU is AMD, so I can't run it locally. I ran the code on a cloud server (Tesla V100, 32GB, CUDA 11.0) and it failed with the rgb KeyError. Strange... I don't get it.

I have a workaround for this issue. Edit the file humannerf/configs/config.py and change line 13 to _C.resume = True. Then run the training like this:

for i in `seq 9999`; do python train.py --cfg XXXXX.yaml; done

This way, whenever the code breaks, it will resume from where it crashed.
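
A bounded alternative to the shell loop (a sketch along the same lines; it assumes _C.resume = True has been set as described above, and the attempt limit is arbitrary):

import subprocess
import sys

CFG = "configs/human_nerf/wild/monocular/single_gpu.yaml"

for attempt in range(1, 101):  # bounded, unlike `seq 9999`
    ret = subprocess.call([sys.executable, "train.py", "--cfg", CFG])
    if ret == 0:
        break  # train.py finished without crashing
    print(f"train.py exited with code {ret}, resuming (attempt {attempt})")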

cocoshe commented 1 year ago

Well, I'm just playing with the standard ZJU-MoCap dataset with subject 387. I just found that if I'm training on my local desktop (which has two Quadro RTX 6000, 24GB, CUDA 11.8), it never fails, but if I'm training on a remote server (which has two RTX 3090, 24GB, CUDA 11.6), it will fail at some point. I really don't understand this either.

My own computer's GPU is AMD, so I can't run it locally. I ran the code on a cloud server (Tesla V100, 32GB, CUDA 11.0) and it failed with the rgb KeyError. Strange... I don't get it.

I have a workaround for this issue. Edit the file humannerf/configs/config.py and change line 13 to _C.resume = True. Then run the training like this:

for i in `seq 9999`; do python train.py --cfg XXXXX.yaml; done

This way, whenever the code breaks, it will resume from where it crashed.

I still don't understand how to solve the problem completely, though. Rerunning the code every time it breaks doesn't feel good and seems really strange...

cocoshe commented 1 year ago

Well, I'm just playing with the standard ZJU-MoCap dataset with subject 387. I just found that if I'm training on my local desktop (which has two Quadro RTX 6000, 24GB, CUDA 11.8), it never fails, but if I'm training on a remote server (which has two RTX 3090, 24GB, CUDA 11.6), it will fail at some point. I really don't understand this either.

My own computer's GPU is AMD, so I can't run it locally. I ran the code on a cloud server (Tesla V100, 32GB, CUDA 11.0) and it failed with the rgb KeyError. Strange... I don't get it.

I have a workaround for this issue. Edit the file humannerf/configs/config.py and change line 13 to _C.resume = True. Then run the training like this:

for i in `seq 9999`; do python train.py --cfg XXXXX.yaml; done

This way, whenever the code breaks, it will resume from where it crashed.

I tried it on my local computer, but still got the "KeyError: 'rgb'"...

Miles629 commented 1 year ago

I got the same error when I trained on the wild part. I noticed the error happened when the output images were poorly cropped. After I changed bbox_offset from 0.3 to 1.5, the error never happened again and the output images were cropped correctly.
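
For reference, bbox_offset is the top-level key shown as 0.3 in the CONFIG dump above; a minimal override (a sketch, assuming your wild yaml accepts the same top-level keys as the dump) would be:

# configs/human_nerf/wild/monocular/single_gpu.yaml
bbox_offset: 1.5   # default 0.3; a larger offset enlarges the estimated subject bounding box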

ateplyuk commented 1 year ago

@cocoshe Did you solve the problem ("KeyError: 'rgb'")? I have the same issue when running training on a single GPU.

cocoshe commented 1 year ago

@cocoshe Did you solve the problem ("KeyError: 'rgb'")? I have the same issue when running training on a single GPU.

Something was wrong with the dataset preparation; I didn't figure out exactly why. I remember there are two ways to prepare your own dataset:

  1. First, split the video into frames, then feed all the frames to a segmentation model to get image-mask pairs;
  2. First, feed the video to a segmentation model to get the image-mask pairs directly, then split the original video and the mask video into frames, with the fps set to be the same...

However, I didn't figure out why, and I've nearly forgotten the details. So if you are doing it like 1, just try 2, and if you are following 2, try 1.

The problem disappeared for me; I hope you can figure out why.
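
Whichever preparation order you use, every extracted frame needs a mask with the same name; a quick consistency check (a sketch, with the dataset paths assumed rather than taken from the repo) looks like:

import os

img_dir = "dataset/wild/monocular/images"
mask_dir = "dataset/wild/monocular/masks"

imgs = sorted(os.path.splitext(f)[0] for f in os.listdir(img_dir))
masks = set(os.path.splitext(f)[0] for f in os.listdir(mask_dir))
missing = [name for name in imgs if name not in masks]
print(f"{len(imgs)} images, {len(masks)} masks, {len(missing)} images without a mask")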

Dipankar1997161 commented 1 year ago

@cocoshe Did you solve the problem ("KeyError: 'rgb'")? I have the same issue when running training on a single GPU.

The KeyError on 'rgb' happens when the masked image used during training does not align with the camera values, so the camera rays fail to project onto the image and an empty ray_mask is returned at the end.

I processed my data with too large a T value and got the rgb error, but once I divided T by 1000 the rendering worked properly.

So make sure the alpha masks are well aligned with the images, and that the image center of the masks stays consistent with the camera intrinsic matrix (K) in case you resize the images.

Hope this helps.
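
A minimal sketch of the two adjustments described above (the variable names and numbers are illustrative, not the repo's preprocessing code):

import numpy as np

# 1) Translation T exported in millimetres; dividing by 1000 puts it in metres,
#    which is the T / 1000 adjustment mentioned above.
T_mm = np.array([120.0, -350.0, 2800.0])
T = T_mm / 1000.0

# 2) If you resize the images and masks yourself, scale the intrinsics K the
#    same way so the focal length and principal point still match the frames.
K = np.array([[1111.0,    0.0, 960.0],
              [   0.0, 1111.0, 540.0],
              [   0.0,    0.0,   1.0]])
scale = 0.5
K_resized = K.copy()
K_resized[:2] *= scale   # fx, fy, cx, cy all scale with the image size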

TheChildishMillennial commented 5 months ago

In my case, this error was caused by OpenCV. OpenCV uses BGR channel order, so you have to convert frames to RGB before saving them, by running something like cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).
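
For example, when frames are decoded with OpenCV but saved through a library that expects RGB (a sketch; the file names are placeholders):

import cv2
from PIL import Image

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()          # OpenCV decodes frames in BGR order
cap.release()

if ok:
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    Image.fromarray(rgb).save("frame_000000.png")   # PIL expects RGB

(If you read and write exclusively with cv2.imread/cv2.imwrite, the BGR order cancels out; the conversion matters when mixing OpenCV with RGB-based tools.)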

Dipankar1997161 commented 5 months ago

In my case, this error was caused by OpenCV. OpenCV uses BGR channel order, so you have to convert frames to RGB before saving them, by running something like cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).

There are endlessly more reasons for the RGB error; one you will encounter a lot is when the camera rays miss the target and return null values.