samhodge-aiml opened 1 year ago
Options attached. options.zip
After 2 hours it had not completed 10 iterations; what am I doing wrong?
@samhodge @samhodge-aiml I'm not super confident whether BARF would work well on your data, as the viewpoint coverage is not as dense as what we had been experimenting with before. My estimate of the runtime on a 3090 would be 8-10 hours, but I don't have one to benchmark with, so I cannot say for sure (also, it has been quite a while since I developed this project). The training shouldn't get stuck at 10 iterations though -- could you share the training log?
That is the thing: the GPU was loaded (RTX 3090, 24 GB, 39% GPU compute, <5% GPU memory), but nothing is really being logged at all.
I will try running it again and see if I can get something to share with you.
There was no error and no TensorBoard logs to speak of, but there was a file in the output directory, so write permission was OK. I turned off visdom.
Let me give you everything I have so far and we can get to the bottom of it.
Thanks a million for the response.
Here is the stdout:
python3 train.py --group=samh --model=barf --yaml=barf_iphone --name=bakerst006 --data.scene=bakerst --barf_c2f=[0.1,0.5] --visdom!
Process ID: 18377
[train.py] (PyTorch code for training NeRF/BARF)
setting configurations...
loading options/base.yaml...
loading options/nerf_llff.yaml...
loading options/barf_llff.yaml...
loading options/barf_iphone.yaml...
* H: 480
* W: 640
* arch:
* density_activ: softplus
* layers_feat: [None, 256, 256, 256, 256, 256, 256, 256, 256]
* layers_rgb: [None, 128, 3]
* posenc:
* L_3D: 10
* L_view: 4
* skip: [4]
* tf_init: True
* barf_c2f: [0.1, 0.5]
* batch_size: None
* camera:
* model: perspective
* ndc: False
* noise: None
* cpu: False
* data:
* augment:
* center_crop: None
* dataset: iphone
* image_size: [480, 640]
* num_workers: 4
* preload: True
* root: None
* scene: bakerst
* train_sub: None
* val_on_test: False
* val_ratio: 0.1
* val_sub: None
* device: cuda:0
* freq:
* ckpt: 5000
* scalar: 200
* val: 2000
* vis: 1000
* gpu: 0
* group: samh
* load: None
* loss_weight:
* render: 0
* render_fine: None
* max_epoch: None
* max_iter: 10
* model: barf
* name: bakerst006
* nerf:
* density_noise_reg: None
* depth:
* param: inverse
* range: [1, 0]
* fine_sampling: False
* rand_rays: 2048
* sample_intvs: 128
* sample_intvs_fine: None
* sample_stratified: True
* setbg_opaque: None
* view_dep: True
* optim:
* algo: Adam
* lr: 0.001
* lr_end: 0.0001
* lr_pose: 0.003
* lr_pose_end: 1e-05
* sched:
* gamma: None
* type: ExponentialLR
* sched_pose:
* gamma: None
* type: ExponentialLR
* test_iter: 100
* test_photo: True
* warmup_pose: None
* output_path: output/samh/bakerst006
* output_root: output
* resume: False
* seed: 0
* tb:
* num_images: [4, 8]
* visdom: False
* yaml: barf_iphone
(creating new options file...)
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Loading model from: /media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/lpips/weights/v0.1/alex.pth
loading training data...
number of samples: 75
loading test data...
number of samples: 8
building networks...
setting up optimizers...
initializing weights from scratch...
setting up visualizers...
TRAINING START
validating: 0%| | 0/8 [00:00<?, ?it/s]/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680557665316/work/aten/src/ATen/native/TensorShape.cpp:3483.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
It is just sitting at this point.
nvidia-smi
Wed Aug 30 18:37:26 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 Off | 00000000:06:00.0 On | N/A |
| 66% 68C P2 243W / 350W | 1047MiB / 24576MiB | 39% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 Off | 00000000:0A:00.0 Off | N/A |
| 32% 43C P8 23W / 350W | 15MiB / 24576MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2400 G /usr/lib/xorg/Xorg 134MiB |
| 0 N/A N/A 4361 G /usr/bin/gnome-shell 95MiB |
| 0 N/A N/A 14240 G ...sion,SpareRendererForSitePerProcess 53MiB |
| 0 N/A N/A 18377 C python3 666MiB |
| 0 N/A N/A 18794 G ...4151621,13186568319809438527,262144 77MiB |
| 1 N/A N/A 2400 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
One hour later there is no progress. I will leave it running overnight and see if anything happens.
It has been running for over 10 hours now with no progress, so I am going to save the electricity.
This shouldn't happen. Could you help pinpoint which line it hangs at?
I can certainly keyboard-interrupt the job and give you the stack trace.
Traceback (most recent call last):
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/train.py", line 32, in <module>
main()
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/train.py", line 29, in main
m.train(opt)
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 54, in train
if self.iter_start==0: self.validate(opt,0)
^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/barf.py", line 66, in validate
super().validate(opt,ep=ep)
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/base.py", line 152, in validate
var = self.graph.forward(opt,var,mode="val")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 210, in forward
ret = self.render_by_slices(opt,pose,intr=var.intr,mode=mode) if opt.nerf.rand_rays else \
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 267, in render_by_slices
ret = self.render(opt,pose,intr=intr,ray_idx=ray_idx,mode=mode) # [B,R,3],[B,R,1]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/model/nerf.py", line 236, in render
center,ray = camera.get_center_and_ray(opt,pose,intr=intr) # [B,HW,3]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/camera.py", line 241, in get_center_and_ray
grid_3D = cam2world(grid_3D,pose) # [B,HW,3]
^^^^^^^^^^^^^^^^^^^^^^^
File "/media/sam/aimlwork/github/bundle-adjusting-NeRF/camera.py", line 213, in cam2world
return X_hom@pose_inv.transpose(-1,-2)
^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Could it be that the focal length for the camera is causing an unsolvable matrix?
I might try this tomorrow: https://camp-nerf.github.io/
Yes, it is likely stuck in the loop as in #76. If you use batch size 1 the issue will likely go away -- I have not been able to figure out exactly where the bug was. CamP should be a quite decent improvement over BARF in joint camera optimization. I would definitely encourage you to try it out if they have the code released.
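For reference, and assuming the same command-line override syntax as the run above, that would be something like:

python3 train.py --group=samh --model=barf --yaml=barf_iphone --name=bakerst006 --data.scene=bakerst --barf_c2f=[0.1,0.5] --visdom! --batch_size=1

(or, equivalently, setting batch_size: 1 in the yaml).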
No code yet; batch size of one it is.
Batch size of one didn't seem to work for me either.
Hi,
while I was working with this codebase I faced a similar issue (training stuck in an endless loop). It turned out that during sampling along the ray there was (roughly) exponential growth in depth for the last few samples, with the last ones as big as a few thousand (or even 10000 on one occasion). This caused gradients to explode during backpropagation, and some of the parameters became NaNs, hence the calculated rays got NaN values in them. I wasn't able to pinpoint a specific error in the implementation. Bear in mind that I was experimenting on a heavily modified architecture, so I encourage you to check for abnormal values (details in the doc).
There are a number of strategies to deal with this problem (assuming that gradient explosion is what is causing it); the simplest is to clip abnormal samples, which is a very fast workaround. This can affect the results, but erroneous samples make up a very small proportion of the total training data, so it shouldn't be too bad.
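A minimal sketch of that kind of clipping, assuming depth samples shaped [B,HW,N,1] as in this repo's composite() (the 1e3 ceiling is an arbitrary choice, not a value taken from the original code):

import torch

def sanitize_depth_samples(depth_samples, max_depth=1e3):
    # Flag non-finite values early so NaNs are noticed before they reach the loss.
    if not torch.isfinite(depth_samples).all():
        print("warning: non-finite depth samples detected")
    # Clamp runaway depths so dist_samples (and therefore sigma_delta) stays bounded
    # and the backward pass cannot blow up through the last few samples.
    return depth_samples.clamp(min=0.0, max=max_depth)

Calling it on depth_samples at the top of composite() would probably be the least invasive place to hook it in.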
Thanks a million. Maybe tomorrow I can eke out a little time to see if I can make this into a PR.
The information is very generous, but I am not sure my skills are ready right now to debug and patch the issue. Still, why die wondering, right? I will see what I can do.
@SwirtaB thanks for the feedback! I hadn't been able to deterministically reproduce this issue, and did not realize it had to do with the sampled coordinates. In this case, this line is likely the culprit, where the depth of the last sample is set to a very large number (1e10). @samhodge if you find that tweaking the code to lower it to e.g. 1e3 would help, please let me know and I'm happy to make a hotfix.
Yeah I can certainly write a smoothstep function to roll it off to a limit.
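For what it's worth, a smoothstep-based roll-off toward a ceiling could look like the sketch below (my own construction, not code from this repo; the limit and knee values are arbitrary). It leaves values below the knee untouched and saturates smoothly at the limit instead of hard-clipping, so the mapping stays differentiable everywhere:

import torch

def smoothstep(edge0, edge1, x):
    # Classic smoothstep: 0 below edge0, 1 above edge1, C1-smooth in between.
    t = ((x - edge0) / (edge1 - edge0)).clamp(0.0, 1.0)
    return t * t * (3.0 - 2.0 * t)

def roll_off(x, limit=1e3, knee=0.5):
    # Identity below knee*limit, then blend monotonically toward `limit`.
    s = smoothstep(knee * limit, limit, x)
    return x * (1.0 - s) + limit * s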
Trying the simpler change first (lowering the 1e10 fill value to 1e3):
diff --git a/model/nerf.py b/model/nerf.py
index b0dcb2c..eefef60 100644
--- a/model/nerf.py
+++ b/model/nerf.py
@@ -393,7 +393,7 @@ class NeRF(torch.nn.Module):
ray_length = ray.norm(dim=-1,keepdim=True) # [B,HW,1]
# volume rendering: compute probability (using quadrature)
depth_intv_samples = depth_samples[...,1:,0]-depth_samples[...,:-1,0] # [B,HW,N-1]
- depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e10)],dim=2) # [B,HW,N]
+ depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e3)],dim=2) # [B,HW,N]
dist_samples = depth_intv_samples*ray_length # [B,HW,N]
sigma_delta = density_samples*dist_samples # [B,HW,N]
alpha = 1-(-sigma_delta).exp_() # [B,HW,N]
That one didn't work, so I have another idea:
https://numpy.org/doc/stable/reference/generated/numpy.heaviside.html
Other things that did not work:
diff --git a/data/iphone.py b/data/iphone.py
index 05cf1d5..e34bcc8 100644
--- a/data/iphone.py
+++ b/data/iphone.py
@@ -17,7 +17,7 @@ from util import log,debug
class Dataset(base.Dataset):
def __init__(self,opt,split="train",subset=None):
- self.raw_H,self.raw_W = 1080,1920
+ self.raw_H,self.raw_W = 3024,4032
super().__init__(opt,split)
self.root = opt.data.root or "data/iphone"
self.path = "{}/{}".format(self.root,opt.data.scene)
@@ -62,7 +62,7 @@ class Dataset(base.Dataset):
return image
def get_camera(self,opt,idx):
- self.focal = self.raw_W*4.2/(12.8/2.55)
+ self.focal = self.raw_W*1.6*35.0
intr = torch.tensor([[self.focal,0,self.raw_W/2],
[0,self.focal,self.raw_H/2],
[0,0,1]]).float()
diff --git a/model/nerf.py b/model/nerf.py
index b0dcb2c..9a02e77 100644
--- a/model/nerf.py
+++ b/model/nerf.py
@@ -391,9 +391,11 @@ class NeRF(torch.nn.Module):
def composite(self,opt,ray,rgb_samples,density_samples,depth_samples):
ray_length = ray.norm(dim=-1,keepdim=True) # [B,HW,1]
+ ray_length = numpy.clip(ray_length, 0, 1e3)
+
# volume rendering: compute probability (using quadrature)
depth_intv_samples = depth_samples[...,1:,0]-depth_samples[...,:-1,0] # [B,HW,N-1]
- depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e10)],dim=2) # [B,HW,N]
+ depth_intv_samples = torch.cat([depth_intv_samples,torch.empty_like(depth_intv_samples[...,:1]).fill_(1e3)],dim=2) # [B,HW,N]
dist_samples = depth_intv_samples*ray_length # [B,HW,N]
sigma_delta = density_samples*dist_samples # [B,HW,N]
alpha = 1-(-sigma_delta).exp_() # [B,HW,N]
diff --git a/options/barf_iphone.yaml b/options/barf_iphone.yaml
index f344c7b..d58794b 100644
--- a/options/barf_iphone.yaml
+++ b/options/barf_iphone.yaml
@@ -2,5 +2,7 @@ _parent_: options/barf_llff.yaml
data: # data options
dataset: iphone # dataset name
- scene: IMG_0239 # scene name
+ scene: bakerst # scene name
image_size: [480,640] # input image sizes [height,width]
+max_iter: 10
+batch_size: 1
diff --git a/requirements.yaml b/requirements.yaml
index 0baf8b0..2865db4 100644
--- a/requirements.yaml
+++ b/requirements.yaml
@@ -2,6 +2,7 @@ name: barf-env
channels:
- conda-forge
- pytorch
+ - nvidia
dependencies:
- numpy
- scipy
@@ -10,7 +11,8 @@ dependencies:
- easydict
- imageio
- ipdb
- - pytorch>=1.9.0
+ - pytorch
+ - pytorch-cuda=11.8
- torchvision
- tensorboard
- visdom
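(Side note on the ray_length hunk in the model/nerf.py diff above: numpy.clip cannot operate on a CUDA tensor, so that line will raise rather than clip. If limiting the ray length is the goal, a torch-native clamp of the same idea would be, roughly:

ray_length = ray.norm(dim=-1, keepdim=True)  # [B,HW,1]
ray_length = torch.clamp(ray_length, max=1e3)  # stays on the GPU, differentiable

This is only a sketch of the same clipping idea, not a tested fix.)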
@chenhsuanlin no problem. I gave your suggestion a try and it only delayed the problem for me; training hung much later. Then I cross-checked your implementation of composite with the NeRF article and their official implementation. By my understanding, the whole equation (3) from the article reduces to alpha compositing. In their implementation they calculate it slightly differently (original impl), so I gave it a try. I commented out the T calculation and calculated prob as:
prob = (alpha * torch.cumprod(1.0 - alpha + 1e-10, dim=2))[..., None]
Unfortunately that didn't solve the problem, only delayed it again. That being said, any workaround that ensures proper sample values (either by clipping or something else) works quite well. Maybe that is the proper solution, since NeRFs are still neural networks and improper inputs can lead to all sorts of problems.
EDIT: the T calculation and the one from the original implementation are identical (in the mathematical sense) and differ only in the numerical approach; I hadn't noticed that at the beginning.
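For reference, a PyTorch transcription of the original implementation's exclusive cumprod (prepending a column of ones so each sample is weighted by the transmittance of the samples in front of it) would be roughly:

T = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]), 1.0 - alpha + 1e-10], dim=2), dim=2)[..., :-1]
prob = (alpha * T)[..., None]

This assumes alpha has shape [B,HW,N] with the sample dimension at dim=2, matching the snippet above.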
I have a series of photos: https://drive.google.com/drive/folders/1ZZgZUrFrnP47rx8bN5K6yvYnSC50a-9G?usp=drive_link
They were taken with an iPhone 13 Pro Max.
I have used this dataset with Instant NGP from NVIDIA and with Gaussian Splatting to produce a good radiance field.
Do you think this dataset will work with the code in this repository?
My changes are recorded here, and I removed "IMG_" from the file names.
I am training the model now.
Do you have an estimate of how long this might take on an RTX 3090?
What viewer can I use to make renders from the radiance field produced from this training run?
Example image below; the EXIF information should be intact:
Sam