Training error on custom dataset a few epochs in, after preprocessing

Command run: python main.py data/ian/ --workspace trial_ian/ -O --iters 200000

I'm trying to train my own model. The path in the error exists: 95.png is in my Drive, and the Drive is mounted. It is weird, because the same data loaded fine in the earlier epochs. It seems like Colab may have just messed up, but the runtime was still connected.

Error:
Namespace(path='data/ian/', O=True, test=False, test_train=False, data_range=[0, -1], workspace='trial_ian/', seed=0, iters=200000, lr=0.005, lr_net=0.0005, ckpt='latest', num_rays=65536, cuda_ray=True, max_steps=16, num_steps=16, upsample_steps=0, update_extra_interval=16, max_ray_batch=4096, fp16=True, lambda_amb=0.1, bg_img='', fbg=False, exp_eye=True, fix_eye=-1, smooth_eye=False, torso_shrink=0.8, color_space='srgb', preload=0, bound=1, scale=4, offset=[0, 0, 0], dt_gamma=0.00390625, min_near=0.05, density_thresh=10, density_thresh_torso=0.01, patch_size=1, finetune_lips=False, smooth_lips=False, torso=False, head_ckpt='', gui=False, W=450, H=450, radius=3.35, fovy=21.24, max_spp=1, att=2, aud='', emb=False, ind_dim=4, ind_num=10000, ind_dim_torso=8, amb_dim=2, part=False, part2=False, train_camera=False, smooth_path=False, smooth_path_window=7, asr=False, asr_wav='', asr_play=False, asr_model='cpierse/wav2vec2-large-xlsr-53-esperanto', asr_save_feats=False, fps=50, l=10, m=50, r=10)
[INFO] load 2030 train frames.
[INFO] load aud_features: torch.Size([2229, 44, 16])
Loading train data: 100% 2030/2030 [00:04<00:00, 503.51it/s]
[INFO] eye_area: 0.14190673828125 - 0.35076141357421875
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=AlexNet_Weights.IMAGENET1K_V1. You can also use weights=AlexNet_Weights.DEFAULT to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/alexnet-owt-7be5be79.pth" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
100% 233M/233M [00:01<00:00, 138MB/s]
Loading model from: /usr/local/lib/python3.10/dist-packages/lpips/weights/v0.1/alex.pth
[INFO] Trainer: ngp | 2023-06-26_17-51-59 | cuda | fp16 | trial_ian/
[INFO] #parameters: 3024277
[INFO] Loading latest checkpoint ...
[WARN] No checkpoint found, model randomly initialized.
[INFO] load 100 val frames.
[INFO] load aud_features: torch.Size([2229, 44, 16])
Loading val data: 100% 100/100 [00:00<00:00, 485.16it/s]
[INFO] eye_area: 0.20160675048828125 - 0.33855438232421875
[INFO] maxepoch = 99
==> Start Training Epoch 1, lr=0.000500 ...
loss=0.0009 (0.0020), lr=0.000488: 100% 2030/2030 [04:27<00:00, 7.59it/s]
==> Finished Epoch 1.
==> Start Training Epoch 2, lr=0.000488 ...
loss=0.0004 (0.0010), lr=0.000477: 100% 2030/2030 [03:41<00:00, 9.17it/s]
==> Finished Epoch 2.
++> Evaluate at epoch 2 ...
loss=0.0005 (0.0006): 100% 100/100 [00:16<00:00, 5.96it/s]
PSNR = 32.244503
LPIPS (alex) = 0.073856
++> Evaluate epoch 2 Finished.
==> Start Training Epoch 3, lr=0.000477 ...
loss=0.0013 (0.0009), lr=0.000466: 100% 2030/2030 [03:48<00:00, 8.89it/s]
==> Finished Epoch 3.
==> Start Training Epoch 4, lr=0.000466 ...
loss=0.0004 (0.0009), lr=0.000455: 100% 2030/2030 [03:45<00:00, 9.02it/s]
==> Finished Epoch 4.
++> Evaluate at epoch 4 ...
loss=0.0004 (0.0005): 100% 100/100 [00:16<00:00, 6.16it/s]
PSNR = 32.979616
LPIPS (alex) = 0.063454
++> Evaluate epoch 4 Finished.
==> Start Training Epoch 5, lr=0.000455 ...
loss=0.0017 (0.0009), lr=0.000445: 100% 2030/2030 [03:41<00:00, 9.18it/s]
==> Finished Epoch 5.
==> Start Training Epoch 6, lr=0.000445 ...
loss=0.0008 (0.0008), lr=0.000435: 100% 2030/2030 [03:36<00:00, 9.38it/s]
==> Finished Epoch 6.
++> Evaluate at epoch 6 ...
loss=0.0004 (0.0006): 100% 100/100 [00:15<00:00, 6.64it/s]
PSNR = 32.863394
LPIPS (alex) = 0.060131
++> Evaluate epoch 6 Finished.
==> Start Training Epoch 7, lr=0.000435 ...
loss=0.0008 (0.0008), lr=0.000425: 100% 2030/2030 [03:38<00:00, 9.30it/s]
==> Finished Epoch 7.
==> Start Training Epoch 8, lr=0.000425 ...
loss=0.0008 (0.0008), lr=0.000415: 100% 2030/2030 [03:38<00:00, 9.28it/s]
==> Finished Epoch 8.
++> Evaluate at epoch 8 ...
loss=0.0005 (0.0006): 100% 100/100 [00:15<00:00, 6.56it/s]
PSNR = 32.951550
LPIPS (alex) = 0.056809
++> Evaluate epoch 8 Finished.
==> Start Training Epoch 9, lr=0.000415 ...
loss=0.0004 (0.0008), lr=0.000405: 100% 2030/2030 [03:39<00:00, 9.26it/s]
==> Finished Epoch 9.
==> Start Training Epoch 10, lr=0.000405 ...
loss=0.0005 (0.0008), lr=0.000396: 100% 2030/2030 [03:40<00:00, 9.22it/s]
==> Finished Epoch 10.
++> Evaluate at epoch 10 ...
loss=0.0005 (0.0006): 100% 100/100 [00:14<00:00, 6.74it/s]
PSNR = 32.993330
LPIPS (alex) = 0.055514
++> Evaluate epoch 10 Finished.
==> Start Training Epoch 11, lr=0.000396 ...
loss=0.0006 (0.0008), lr=0.000387: 100% 2030/2030 [03:41<00:00, 9.16it/s]
==> Finished Epoch 11.
==> Start Training Epoch 12, lr=0.000387 ...
loss=0.0002 (0.0007), lr=0.000380: 76% 1534/2030 [02:48<00:58, 8.41it/s]
[ WARN:0@2748.867] global loadsave.cpp:244 findDecoder imread('data/ian/torso_imgs/95.png'): can't open/read file: check file path/integrity
Traceback (most recent call last):
  File "/content/drive/MyDrive/RAD-NeRF/main.py", line 235, in <module>
  File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 906, in train
  File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 1156, in train_one_epoch
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/content/drive/MyDrive/RAD-NeRF/nerf/provider.py", line 670, in collate
cv2.error: OpenCV(4.7.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

Exception ignored in atexit callback: <function FileWriter.__init__.<locals>.cleanup at 0x7fb6a66939a0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorboardX/writer.py", line 108, in cleanup
    self.event_writer.close()
  File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 156, in close
    self.flush()
  File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 148, in flush
    self._ev_writer.flush()
  File "/usr/local/lib/python3.10/dist-packages/tensorboardX/event_file_writer.py", line 69, in flush
    self._py_recordio_writer.flush()
  File "/usr/local/lib/python3.10/dist-packages/tensorboardX/record_writer.py", line 193, in flush
    self._writer.flush()
OSError: [Errno 107] Transport endpoint is not connected
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
loss=0.0002 (0.0007), lr=0.000380: 76% 1534/2030 [02:48<00:54, 9.10it/s]
Exception ignored in: <function Trainer.__del__ at 0x7fb6cf2e5900>
Traceback (most recent call last):
  File "/content/drive/MyDrive/RAD-NeRF/nerf/utils.py", line 704, in __del__
OSError: [Errno 107] Transport endpoint is not connected
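
The two "[Errno 107] Transport endpoint is not connected" errors, together with imread returning an empty image for a file that exists, might mean the Drive mount dropped mid-run rather than that 95.png is damaged. Below is a minimal sketch for checking that, assuming plain file reads are enough to tell whether the mount is healthy. The helper names read_with_retry and find_unreadable are hypothetical and are not part of RAD-NeRF, OpenCV, or PyTorch.

```python
import os
import time


def read_with_retry(path, attempts=3, delay=1.0):
    """Return the raw bytes of `path`, retrying on OSError or an empty read.

    Hypothetical helper: a dropped network mount usually surfaces as an
    OSError (for example Errno 107), so a short pause and a retry can tell
    a transient mount problem apart from a file that is really missing.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                data = f.read()
            if data:
                return data
            last_err = OSError(f"empty file: {path}")
        except OSError as err:
            last_err = err
        if attempt < attempts - 1:
            time.sleep(delay)
    raise last_err


def find_unreadable(image_dir, attempts=3, delay=1.0):
    """Return the sorted paths of files in `image_dir` that cannot be read."""
    bad = []
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            read_with_retry(path, attempts, delay)
        except OSError:
            bad.append(path)
    return bad
```

Calling find_unreadable("data/ian/torso_imgs") before restarting training would list any frames that still fail to open. If the list is empty, the files themselves are probably fine and the failure was more likely a temporary loss of the mount.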

ashawkey / RAD-NeRF, issue #61