While investigating #192, I noticed that `--tracer.raymarch-type voxel` triggers an OutOfMemoryError, as shown below:
```
other traceback lines
...
  File "/home/atsushi/workspace/wisp211/wisp/tracers/packed_rf_tracer.py", line 130, in trace
    hit_ray_d = rays.dirs.index_select(0, ridx)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.15 GiB (GPU 0; 11.69 GiB total capacity; 10.22 GiB already allocated; 133.44 MiB free; 10.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
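One way to tell genuine exhaustion from fragmentation (the error message distinguishes allocated from reserved memory) would be to log the allocator state right before the failing call. The snippet below is only a diagnostic sketch, not code that exists in wisp; `rays` and `ridx` are the local variables visible in the traceback.

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # Memory the caching allocator has handed out to live tensors (allocated)
    # vs. memory it has claimed from the CUDA driver (reserved).
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Inside trace() in packed_rf_tracer.py, just before the failing line:
#     log_cuda_memory("before index_select")
#     print("ridx elements:", ridx.numel())  # this count determines the request size
#     hit_ray_d = rays.dirs.index_select(0, ridx)
```

If allocated is close to reserved at that point, the 4.15 GiB request is simply larger than what remains; if there is a large gap, the max_split_size_mb setting suggested in the error message might be worth trying.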
```
❯ nvidia-smi
Sat Jun 29 01:30:32 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 40C P8 14W / 285W | 848MiB / 12282MiB | 41% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1750 G /usr/lib/xorg/Xorg 416MiB |
| 0 N/A N/A 1943 C+G ...libexec/gnome-remote-desktop-daemon 195MiB |
| 0 N/A N/A 1995 G /usr/bin/gnome-shell 98MiB |
| 0 N/A N/A 5488 G ...57,262144 --variations-seed-version 109MiB |
| 0 N/A N/A 8436 G /app/bin/wezterm-gui 9MiB |
+-----------------------------------------------------------------------------------------+
```
As the traceback shows, the tracer tries to allocate 4.15 GiB while 10.22 GiB are already allocated. I observed similar results regardless of whether the interactive app is loaded or not. At first I suspected that other applications were simply occupying a large amount of VRAM, so I checked their usage by running `nvidia-smi` immediately after attempting to train a NeRF. As the output above shows, however, less than 1 GiB is in use. My assumption is that the NeRF app allocates a sequence of quite large VRAM chunks and fails at some point.
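To get a sense of the scale, here is a rough back-of-the-envelope estimate of the failed request. It assumes, and I have not verified this, that `rays.dirs` is an (N, 3) float32 tensor, so each row gathered by `index_select` costs 12 bytes:

```python
# Rough estimate of how many rows the failing index_select tried to gather.
# Assumption (not verified): rays.dirs is an (N, 3) float32 tensor,
# so each selected row costs 3 * 4 = 12 bytes.
request_bytes = 4.15 * 2**30   # the failed 4.15 GiB allocation from the traceback
bytes_per_row = 3 * 4          # 3 float32 components per ray direction
approx_rows = request_bytes / bytes_per_row
print(f"~{approx_rows / 1e6:.0f} million selected rows")  # ~371 million
```

If that estimate is in the right ballpark, `ridx` holds several hundred million entries, which would match the idea that the tracer materializes a few very large per-sample tensors one after another until one of them no longer fits.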
Does anybody know a potential cause of this issue? Thanks in advance!