facebookresearch / localrf

An algorithm for reconstructing the radiance field of a large-scale scene from a single casually captured video.
MIT License
956 stars 62 forks

CUDA out of memory even though unused models are stored in CPU memory #45

Open tb2-sy opened 8 months ago

tb2-sy commented 8 months ago

Thanks for your nice work! I'm applying this model to a very long video trajectory, but I'm running into CUDA out-of-memory errors. As I understand the code, previously optimized, currently unused tensorf models are moved to CPU memory, which should greatly reduce GPU memory usage. In theory, does this mean training supports arbitrarily long video sequences? In practice, GPU memory is still exceeded during training. What could be the reason? Looking forward to your reply.
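For reference, my understanding of the offloading is roughly the following (a minimal sketch of the idea, not the actual code in this repository; the helper name is made up):

```python
import torch

def freeze_and_offload(tensorf: torch.nn.Module) -> None:
    # Hypothetical helper: once a local radiance field is no longer being
    # optimized, stop tracking gradients and move its parameters to CPU RAM,
    # so that only the currently active model occupies GPU memory.
    for p in tensorf.parameters():
        p.requires_grad_(False)
    tensorf.to("cpu")
    torch.cuda.empty_cache()  # release the cached blocks that held it
```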

ameuleman commented 8 months ago

Hi, yes, previously optimized models should be stored in CPU memory. I will double-check. What GPU are you using? How many frames are being considered?

tb2-sy commented 8 months ago

> Hi, yes, previously optimized models should be stored in CPU memory. I will double-check. What GPU are you using? How many frames are being considered?

Hi, I am using a 48 GB A40 and 1000 frames. Maybe my 1000 images cover too large a scene?

ameuleman commented 8 months ago

I have optimized longer sequences on 24 GB GPUs. Would you mind sharing the logs? At which point does it crash?

tb2-sy commented 8 months ago

> I have optimized longer sequences on 24 GB GPUs. Would you mind sharing the logs? At which point does it crash?

I added some MLP layers to the MLPRender class in the model. The specific error location is here. No error is raised while the first few tensorf models are trained; the error only appears after a few hundred frames.
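The change I made is along these lines (a simplified sketch; the real MLPRender in the repository differs, and the layer sizes here are only an example):

```python
import torch

class WiderMLPRender(torch.nn.Module):
    # Hypothetical variant: each extra hidden layer keeps its activations for
    # the backward pass, so peak GPU memory grows with every added layer times
    # the number of rays/samples in the batch.
    def __init__(self, in_dim=150, hidden=256, out_dim=3, extra_layers=2):
        super().__init__()
        layers = [torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(inplace=True)]
        for _ in range(extra_layers):
            layers += [torch.nn.Linear(hidden, hidden), torch.nn.ReLU(inplace=True)]
        layers += [torch.nn.Linear(hidden, out_dim)]
        self.mlp = torch.nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.mlp(x))
```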

ameuleman commented 8 months ago

That is odd. The only thing that should accumulate in GPU memory is the poses, which are tiny. Do you know if it crashes during training or testing? (It renders some test frames during optimization.)
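If it helps, you could log GPU memory around training steps and the periodic test renders to see where usage grows; a minimal sketch (the tags and call sites are only suggestions):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Print current and peak allocated GPU memory so it is clear whether
    # usage keeps growing as more frames (and poses) are added.
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated: {alloc:.2f} GiB, peak: {peak:.2f} GiB")

# e.g. call log_gpu_memory("before test render") and
# log_gpu_memory("after test render") around the test-frame rendering
```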

tb2-sy commented 8 months ago

The error occurs during the forward pass of training. I cannot rule out that it happens because the number of training frames, and therefore the number of pose parameters, keeps growing while GPU memory is already close to full. Maybe the unused pose parameters could also be placed on the CPU, so that no other factor contributes to the CUDA out-of-memory error.
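Something like this is what I had in mind (a sketch only; how the poses are stored in the repository may differ, and the helper is hypothetical):

```python
import torch

def offload_frozen_poses(pose_params, active_ids) -> None:
    # Hypothetical helper: keep only the poses of frames that are still being
    # optimized on the GPU, and move the already-converged ones to CPU memory.
    for i, p in enumerate(pose_params):
        if i not in active_ids:
            p.requires_grad_(False)
            p.data = p.data.cpu()
```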

ameuleman commented 8 months ago

OK, I'll recheck and make sure there is nothing unused in GPU memory this afternoon.

tb2-sy commented 8 months ago

> OK, I'll recheck and make sure there is nothing unused in GPU memory this afternoon.

Thank you so much!

ameuleman commented 8 months ago

I now delete optimizers and compute the alpha mask on CPU. Please let me know if the issue remains. See https://github.com/facebookresearch/localrf/commit/3905e3988e6f0e977a625b8f1f3710e90442f06b
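Roughly, the change does something like the sketch below (simplified; see the linked commit for the real change, as the attribute and method names here are not the actual ones):

```python
import torch

def free_finished_model(tensorf: torch.nn.Module) -> None:
    # Dropping the optimizer releases Adam's extra state buffers per parameter;
    # moving the model to CPU frees its GPU copy before the alpha-mask update.
    tensorf.optimizer = None  # assumes the model holds its own optimizer
    tensorf.to("cpu")
    with torch.no_grad():
        tensorf.update_alpha_mask()  # hypothetical: now runs on CPU tensors
    torch.cuda.empty_cache()
```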

tb2-sy commented 8 months ago

> I now delete optimizers and compute the alpha mask on CPU. Please let me know if the issue remains. See 3905e39

Okay, I'll test it now. One last question: will this change affect model performance?

ameuleman commented 8 months ago

No, it should be the same.