creiser / kilonerf

Code for KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

Segmentation fault #5

Open bruinxiong opened 3 years ago

bruinxiong commented 3 years ago

Is there any good solution for the segmentation fault that occurs randomly when training the pretrained model?

creiser commented 3 years ago

I have never experienced a segmentation fault. It is especially weird that this happens during vanilla NeRF pretraining, since for that the implementation of Yen-Chen Lin is used more or less 1:1 (no custom CUDA kernels or anything like that).

Is it maybe just an out-of-memory error? Can you provide a stack trace? What machine (OS, GPU, driver version) are you using? Is this occurring for a custom scene or one of the scenes from the paper?
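If it helps with getting a stack trace: a minimal way to see the Python-side stack when the interpreter segfaults is the standard-library faulthandler module, and for CUDA illegal-access errors synchronous launches make the reported location meaningful. The script name below is only an example of how you launch training, not necessarily the exact entry point:

```python
# Enable faulthandler so a segfault dumps the Python stack of every thread.
# Either export PYTHONFAULTHANDLER=1 before launching training, or put this
# at the very top of the training script (script name here is just an example):
import faulthandler
faulthandler.enable()

# For "CUDA error: an illegal memory access", synchronous kernel launches make
# the reported traceback point at the offending call:
#   CUDA_LAUNCH_BLOCKING=1 python run_nerf.py --config <your config>
```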

bruinxiong commented 3 years ago

@creiser Thanks for your reply. I'm also puzzled that I ran into so many errors while training the pretrained model: random segmentation faults, CUDA error: an illegal memory access, and even double free or corruption (!prev). All of them are memory access issues. After asking Google for several days, I finally resolved them by upgrading the CUDA toolkit from 11.1.1 to 11.2.0, PyTorch from 1.8.1 to 1.9.0, and cuDNN to 8.1, and even installing libtcmalloc-minimal4. However, with the default pretraining config (e.g. TanksAndTemple/Truck) and the same GPU as yours (NVIDIA 1080 Ti, 11 GB), I always hit an out-of-memory error on the GPU. So I had to decrease hidden_layer_size from 256 to 128 and disable render_testset (loading the test data onto the GPU for evaluation) to keep the training procedure running. I tried to move the testing part to another GPU but failed, because the code is too tightly coupled to a single GPU to decouple easily. Do you have any ideas or suggestions for this?
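For reference, the generic PyTorch pattern I was going for is sketched below; the function, model, and data names are only placeholders and not the actual KiloNeRF code, which is where the coupling problem shows up:

```python
import copy
import torch

def evaluate_on_second_gpu(model, test_rays, render_fn):
    """Run evaluation on cuda:1 while training keeps using cuda:0.

    `model`, `test_rays` and `render_fn` are placeholders for whatever the
    training loop actually uses; the point is only the device handling.
    """
    eval_device = torch.device('cuda:1')
    # Copy the network weights to the second GPU instead of moving them,
    # so the training copy on cuda:0 is left untouched.
    eval_model = copy.deepcopy(model).to(eval_device).eval()
    with torch.no_grad():
        # The test data has to live on the same device as the evaluation copy.
        rays = test_rays.to(eval_device)
        return render_fn(eval_model, rays)
```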

bruinxiong commented 3 years ago

BTW, an update on my environment: OS: Ubuntu 18.04 LTS, GPU: 3x NVIDIA GTX 1080 Ti, driver 460.80. I run the code inside Docker.

creiser commented 3 years ago

@bruinxiong I am not sure why you are getting these OOM errors. Are other processes perhaps taking away a portion of the GPU memory? To avoid OOM errors during rendering you can decrease chunk_size in the config file.
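The idea behind that setting is simply to feed the test rays through the network in smaller batches. Roughly something like the following generic sketch (not the repository's exact rendering code, and the default value here is only illustrative):

```python
import torch

def render_in_chunks(model, rays, chunk_size=1024 * 32):
    """Query the network chunk_size rays at a time to bound peak GPU memory."""
    outputs = []
    with torch.no_grad():
        for i in range(0, rays.shape[0], chunk_size):
            outputs.append(model(rays[i:i + chunk_size]))
    return torch.cat(outputs, dim=0)
```

Halving the chunk size roughly halves the peak activation memory of a rendering pass, at the cost of more forward passes.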