RuntimeError: CUDA error: out of memory

henrypearce4D commented 1 month ago

I'm running an A6000 48 GB GPU. Any suggestions to get this to run?

ns-train splatfacto-w --data data/phototourism/trevi-fountain/

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

KevinXu02 commented 1 month ago

Hi! Can you try adding nerf-w-data-parser-config --data_name trevi? It should then run within 8 GB of GPU memory. (With the full dataset, training cannot get through every image between opacity resets, so the Gaussians will never be culled.) Alternatively, set reset_alpha_every to more than 30 and use the colmap dataparser to train on all the images.
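The opacity-reset point can be checked with a little arithmetic. The sketch below assumes splatfacto-w keeps the convention of nerfstudio's splatfacto, where opacities are reset every reset_alpha_every * refine_every steps, that refine_every defaults to 100, and that each training step uses one image. The function names and the image count in the usage lines are hypothetical.

```python
def steps_between_opacity_resets(reset_alpha_every, refine_every=100):
    """Training steps between two opacity resets.

    Assumes the splatfacto convention: reset_alpha_every is counted in
    refinement steps, and a refinement happens every `refine_every` steps.
    """
    return reset_alpha_every * refine_every


def sees_every_image_between_resets(num_train_images, reset_alpha_every,
                                    refine_every=100):
    """True if one pass over the training images fits between two resets.

    Assumes one training image per step (an assumption, not confirmed here).
    """
    interval = steps_between_opacity_resets(reset_alpha_every, refine_every)
    return interval >= num_train_images


# Hypothetical dataset of 5000 training images.
print(steps_between_opacity_resets(30))               # 3000
print(sees_every_image_between_resets(5000, 30))      # False
print(sees_every_image_between_resets(5000, 60))      # True
```

Under these assumptions, a dataset with more images than the reset interval needs either a smaller split or a larger reset_alpha_every, which matches the two options suggested above.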

KevinXu02 commented 1 month ago

Thank you for catching this. Readme is updated accordingly.

henrypearce4D commented 1 month ago

I also edited the line here from brandenburg-gate to trevi: https://github.com/KevinXu02/splatfacto-w/blob/c6aab6b6386fe66796a16df9d40d44a69a80b061/splatfactow/nerfw_dataparser.py#L51 Am I right in saying that adding nerf-w-data-parser-config --data_name trevi will also set this for me?

KevinXu02 commented 1 month ago

You might need to change both lines 51 and 53. The data name is used for reading the data split. https://github.com/KevinXu02/splatfacto-w/blob/c6aab6b6386fe66796a16df9d40d44a69a80b061/splatfactow/nerfw_dataparser.py#L53

KevinXu02 commented 1 month ago

The command should be

ns-train splatfacto-w --data data/phototourism/trevi-fountain/ nerf-w-data-parser-config --data_name trevi

which will make all of these changes.

henrypearce4D commented 1 month ago

I'm still getting the same error.

I also tried adding this before running the command:

export TORCH_CUDA_ARCH_LIST="8.6"
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1

Here's a longer snippet of the error:

[15:43:22] Caching / undistorting train images                                            splatfactow_datamanager.py:213
Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 22.4927
VanillaPipeline.get_train_loss_dict: 22.4919
Traceback (most recent call last):
  File "/home/infinite/miniconda3/envs/nerfstudio/bin/ns-train", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/scripts/train.py", line 262, in entrypoint
    main(
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/scripts/train.py", line 247, in main
    launch(
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/engine/trainer.py", line 262, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/utils/profiler.py", line 111, in inner
    out = func(*args, **kwargs)
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/engine/trainer.py", line 497, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/utils/profiler.py", line 111, in inner
    out = func(*args, **kwargs)
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/site-packages/nerfstudio/pipelines/base_pipeline.py", line 299, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
  File "/mnt/d/wslworkspace/nerfstudio/splatfacto-w/splatfactow/splatfactow_datamanager.py", line 333, in next_train
    data = deepcopy(self.cached_train[image_idx])
  File "/home/infinite/miniconda3/envs/nerfstudio/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/mnt/d/wslworkspace/nerfstudio/splatfacto-w/splatfactow/splatfactow_datamanager.py", line 162, in cached_train
    return self._load_images("train", cache_images_device=self.config.cache_images)
  File "/mnt/d/wslworkspace/nerfstudio/splatfacto-w/splatfactow/splatfactow_datamanager.py", line 235, in _load_images
    cache["image"] = cache["image"].pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

KevinXu02 commented 1 month ago

Could you please try removing the pin_memory() call in this line? cache["image"] = cache["image"].pin_memory()
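Some context on why this can help: pin_memory() asks the CUDA driver for page-locked host RAM, so a failure there is reported as a CUDA out-of-memory error even when the GPU itself has plenty of free memory. As an alternative to deleting the call, a guarded version could fall back to ordinary pageable memory. This is only a sketch: pin_if_possible is a hypothetical helper, not part of splatfacto-w, and it assumes its argument behaves like a CPU torch.Tensor (anything exposing pin_memory()).

```python
def pin_if_possible(tensor):
    """Return a pinned copy of `tensor`, or `tensor` itself if pinning fails.

    Hypothetical helper. `tensor` is expected to expose `pin_memory()`,
    as a CPU torch.Tensor does.
    """
    try:
        return tensor.pin_memory()
    except RuntimeError as err:
        # Page-locked allocation failed. Pageable memory still works for
        # training; host-to-device copies may just be slower.
        print(f"pin_memory failed ({err}); keeping pageable memory")
        return tensor
```

In _load_images the line would then read cache["image"] = pin_if_possible(cache["image"]). Whether the fallback is worth keeping depends on how much pinning speeds up the copies on a given machine.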

henrypearce4D commented 1 month ago

It's now training, thank you!

Should I be able to change the appearance embedding as it trains? It doesn't seem to switch instantly, but it looks like it is gradually changing.

KevinXu02 commented 1 month ago

Yes, you should be able to. But it needs about 15k iterations to converge a bit first.

henrypearce4D commented 1 month ago

And finally: my images are captured from a static multi-camera rig while I change the lighting, so I presume the cameras will overlap each other in the viewer, since they are the same cameras in the same positions. Will I still be able to select them?

henrypearce4D commented 1 month ago

@KevinXu02 My test actually worked, but I can't visualise the different camera indices in viser because the overlapping cameras can't be selected individually. I see you added a to-do list for rendering and export; do you think index selection could be added to viser?

KevinXu02 commented 1 month ago

Technically yes, but it might not be ideal: nerfstudio has lots of methods, and that textbox would only work for splatfacto-w. The overlapping is indeed tricky, and currently the only fix I can think of is hacking the code.
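One possible starting point for such a hack, independent of the viewer, is to work out which training indices share a camera position so they can be listed and chosen by number instead of by clicking. The sketch below is purely illustrative: group_overlapping_cameras is a hypothetical helper, the positions are assumed to be plain (x, y, z) tuples, and it does not use any nerfstudio or viser API.

```python
from collections import defaultdict


def group_overlapping_cameras(positions, decimals=4):
    """Group camera indices whose positions coincide after rounding.

    `positions` is a sequence of (x, y, z) tuples. Rounding is a crude
    tolerance: two positions that differ only by float noise can still
    land in different groups if they straddle a rounding boundary.
    """
    groups = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (round(x, decimals), round(y, decimals), round(z, decimals))
        groups[key].append(idx)
    # Groups come out in order of first appearance.
    return list(groups.values())


# Two rig cameras captured under two lighting setups, plus one extra camera.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
             (2.0, 2.0, 2.0)]
print(group_overlapping_cameras(positions))  # [[0, 2], [1, 3], [4]]
```

Each group then gives the image indices that sit on top of each other for one rig position.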

henrypearce4D commented 1 month ago

OK, thanks for the info! Amazing work!

KevinXu02 / splatfacto-w

RuntimeError: CUDA error: out of memory #6