HuguesTHOMAS / KPConv-PyTorch

Kernel Point Convolution implemented in PyTorch
MIT License

Point Cloud Density and V-RAM OOM #199

Open TobiasMascetta opened 1 year ago

TobiasMascetta commented 1 year ago

Hi Hugues,

I have one further question about the flexibility of your model and its V-RAM usage.

[Question] I am encountering heavily varying V-RAM usage and even mid-training OOM errors on a 12 GB GPU. I am using my own dataset for a modified, previously unsupported task, and I want to make sure that I don't have a bug in my code.

From my understanding of KPConv, strongly varying point density in the network input leads to varying V-RAM demands. Is that correct?

Thank you for your help!

[Background] I have read the previous issues on this topic: #181, #145. I am using race-track data (i.e., larger point clouds with strongly varying density, especially along the vertical axis), and I feed a randomly subsampled point cloud as input (instead of your grid subsampling) for point cloud reconstruction and other tasks. Due to unfortunately overlapping hardware and task constraints, I also have to use a batch size of 1, so I turned off batch normalization, which did not affect my V-RAM issue. My new network tasks are working fine, but the V-RAM usage is odd and changes randomly (in my case between 5 GB and 11 GB after parameter tuning, for 25k points with feature size 3). The memory usage does not grow over time; it just fluctuates randomly. I now empty the CUDA cache after each training step to further investigate the problem, and from my understanding this is not a bug in my code but rather a consequence of the flexibility of KPConv itself, i.e., a desired side effect. I am basically just asking kindly for your opinion: can the described effect be caused by KPConv itself, or definitively not, so that I know whether I have to continue my (so far unfruitful) bug search?
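For reference, this is roughly how I monitor the memory per step (a minimal sketch; `step_fn` and `n_points` are placeholders for my own training loop, not part of KPConv-PyTorch):

```python
import torch

def log_step_memory(step_fn, n_points):
    """Run one training step and report its peak V-RAM usage.

    `step_fn` is a placeholder for whatever performs the forward/backward pass
    and optimizer update in my training loop; it is not repository code.
    """
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    # Peak memory allocated by tensors during this step (the allocator may
    # reserve more than this from the driver).
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"points: {n_points}, peak allocated: {peak_gb:.2f} GB")
    # Release cached blocks so the next step starts from a clean allocator state.
    torch.cuda.empty_cache()
```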

Best regards

HuguesTHOMAS commented 1 year ago

Hi Tobias,

Indeed, considering your data, it is not surprising that you see randomly varying V-RAM usage and the occasional OOM error. With a batch size of 1, you do not benefit from the varying batch size strategy that balances V-RAM usage, and with this type of data (varying number of points as input), you automatically get varying memory usage (in any network, not only KPConv, by the way).

Here are two ideas to help solve your issue:
1) Limit the number of points in the input point clouds: when you randomly subsample, just ensure you don't pick more than a certain number of points (see the sketch below).
2) Train on smaller parts of your data if you can. Removing part of the data can even end up being a good augmentation strategy. At test time, you can use the full point clouds, as the memory usage is considerably reduced.
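For idea 1), something along these lines would work (a rough sketch, not code from the repository; `points`, `features`, and `max_points` are illustrative names for your own dataset loader, with `points` an [N, 3] array and `features` an [N, C] array):

```python
import numpy as np

def random_subsample(points, features, max_points):
    """Randomly subsample a cloud, never keeping more than `max_points` points."""
    n = points.shape[0]
    if n > max_points:
        # Pick a fixed-size random subset so the input size (and thus V-RAM) is bounded.
        idx = np.random.choice(n, max_points, replace=False)
        points, features = points[idx], features[idx]
    return points, features
```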

Best, Hugues