HuguesTHOMAS / KPConv-PyTorch

Kernel Point Convolution implemented in PyTorch
MIT License

OOM Error after 1 epoch or in the middle of the epoch #181

Open working12 opened 2 years ago

working12 commented 2 years ago

Previously, when using this code, I was running into calibration issues and the like. But now, even though the calibration passes correctly, I am getting an OOM error in the middle of the epoch.


e001-i0135 => L=3.100 acc= 69% / t(ms):  15.1 1081.0 657.5)
error: Detected 1 oom-kill event(s) in StepId=889106.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
e001-i0136 => L=2.946 acc= 72% / t(ms):  15.0 1096.7 661.1)
srun: error: node11: task 0: Out Of Memory
error: Detected 1 oom-kill event(s) in StepId=889106.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
  1. Usually, the amount of memory (GPU memory and/or RAM) needed by the code in each epoch is fixed, so if epoch 1 passes correctly, the following epochs should not fail either. Is there any reason why that is not the case here?

    • Incorrect GPU usage led to the above error.
  2. Even with batch_num = 4 and dl = 0.02/0.03 I got this error (with close to 32G of memory on both the GPU and in RAM). I have epoch_steps = 500; do you think reducing it might help, since as per #180 it controls the total number of spheres seen by the model in each epoch? (See also the config sketch after this list.)

    • Same as 1
  3. What are the steps I should follow to solve this kind of issue?

  4. Also, in #180 we talked about ~10^5 points per cloud (20-21 classes). I am not sure how I can feed the entire cloud without subsampling, since it is quite small compared to S3DIS and other datasets.

    • Ideally we would skip the subsampling and simply use the whole .ply file as the subsampled ply file.
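
For reference, a minimal sketch of where these memory-related parameters live, assuming the repo's usual config layout (utils/config.py plus a dataset-specific config class in the train_*.py script); the values below are illustrative, not the actual defaults. Note that epoch_steps only sets how many batches make up an epoch, so reducing it shortens the epoch but should not lower the peak memory of a single batch unless memory is actually leaking across steps.

```python
# A minimal sketch, assuming the repo's config conventions; values are illustrative.
from utils.config import Config

class MyDatasetConfig(Config):
    in_radius = 1.5              # radius (m) of each input sphere; points per sphere grow quickly with it
    first_subsampling_dl = 0.03  # grid size (m) of the first subsampling; points grow as this shrinks
    conv_radius = 2.5            # convolution radius in units of dl; larger means more neighbors per point
    batch_num = 6                # target number of spheres per batch (enforced by the calibration)
    epoch_steps = 500            # optimizer steps per epoch; affects epoch length, not per-batch memory
```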

A bit more context: (this is the main question now)

  1. The dataset has on the order of 10^5 points per cloud, and the training set has roughly a hundred samples. I can't increase the subsampling dl, because the generated samples then become harder to interpret (at least visually), so a dl of ~0.02-0.03 seems reasonable. A lower radius (1.2, etc.) is also not performing well. That leaves two things I can change: increase the radius, or decrease batch_num (but not below 3, as mentioned somewhere in the repo issues). What kind of data works well with this model, and how can I tell in advance that it won't do well on a specific piece of data? With this in mind, if I increase the radius to ~3 while keeping dl fixed, I get an OOM error (see the rough estimate sketched after this list). I also played around a bit to see how I could simply skip the subsampling, but in that case I ran into an OOM error as well. My samples are small, so can I not just feed them directly to the network? Simply swapping the subsampled files for the original .ply files, while keeping everything else the same as the default S3DIS.py, also gave an error.
  2. I am trying to increase the radius because performance on the validation set is very low (0.20 IoU).
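
As a rough sanity check on the radius/dl trade-off above, a back-of-envelope sketch (plain Python, not part of the repo). It assumes surface-like scans, where grid subsampling at cell size dl leaves roughly one point per dl-by-dl patch of surface, so the point count inside an input sphere scales roughly with (in_radius / dl)^2.

```python
# Back-of-envelope estimate, not part of the repo. Assumes surface-like scans,
# so points per input sphere scale roughly as (in_radius / dl) ** 2 after
# grid subsampling with cell size dl.

def rel_points_per_sphere(in_radius, dl):
    """Relative (unitless) estimate of the point count inside one input sphere."""
    return (in_radius / dl) ** 2

base = rel_points_per_sphere(1.2, 0.03)   # setting that fits in memory
big = rel_points_per_sphere(3.0, 0.03)    # setting that runs out of memory
print(f"radius 3.0 vs 1.2 at dl = 0.03 -> ~{big / base:.2f}x more points per sphere")
# ~6.25x more points per sphere (with larger neighborhoods on top of that),
# which is consistent with the same batch_num fitting at radius 1.2 but not at 3.0.
```
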
HuguesTHOMAS commented 2 years ago

First, a question: is it a frame dataset like SemanticKitti or an area dataset like S3DIS? Can you post a picture of the data?

Here are some suggestions:

Everything is a question of trade-off between speed and performance.

Also, consider that there might be a bug in your code that makes the memory grow. This is always a possibility if you modified things for your new data. Does the memory on your GPU increase steadily during training, or does the failure happen at random?
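
To help answer that, a minimal monitoring sketch (plain PyTorch plus psutil, not part of the repo) that could be called once per training step to see whether memory grows steadily (a leak introduced by a modification) or jumps on specific batches (one unusually dense sphere). The cgroup/srun messages in the log above also suggest the kill came from host RAM rather than GPU memory, so the process RSS is worth watching too.

```python
# A monitoring sketch, not part of the repo: call log_memory(step) once per
# training iteration and watch how the numbers evolve over an epoch.
import os
import torch
import psutil

def log_memory(step):
    host_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3   # resident host RAM of this process
    if torch.cuda.is_available():
        gpu_gb = torch.cuda.memory_allocated() / 1024**3        # tensors currently allocated on the GPU
        peak_gb = torch.cuda.max_memory_allocated() / 1024**3   # peak allocation since the start of the run
    else:
        gpu_gb = peak_gb = 0.0
    print(f"step {step:05d} | host RAM {host_gb:.2f} GB | GPU {gpu_gb:.2f} GB (peak {peak_gb:.2f} GB)")
```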

Another possibility: maybe one part of your dataset is much denser than the rest, and that is what makes everything crash?
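
A quick way to check that is sketched below, assuming the training clouds are .ply files readable with this repo's utils/ply.py helper; the glob pattern is a placeholder for the actual data path, and points per bounding-box volume is only a crude density proxy.

```python
# Density scan sketch, not part of the repo: flag clouds that are much denser
# (or much larger) than the rest, since a single dense file can be enough to OOM.
from glob import glob
import numpy as np
from utils.ply import read_ply

for path in sorted(glob('path/to/train_plys/*.ply')):        # placeholder path
    data = read_ply(path)
    pts = np.vstack((data['x'], data['y'], data['z'])).T
    bbox = pts.max(axis=0) - pts.min(axis=0)
    volume = max(float(np.prod(bbox)), 1e-9)                 # avoid division by zero for flat clouds
    print(f"{path}: {len(pts):8d} points, ~{len(pts) / volume:10.1f} pts per unit volume")
```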

working12 commented 2 years ago

Memory issues in the middle of the epoch were due to one of my changes.

Regarding the performance, both of the changes mentioned would be bad (increasing dl and increasing conv_radius). Also, the dataset density is non-uniform: lots of points are near the front and few points are far from the camera, so a side view would look like this. @HuguesTHOMAS

lxzbg commented 2 years ago

@HuguesTHOMAS I also ran into the OOM error. Does the code support multi-GPU training?