working12 opened this issue 2 years ago
First, a question: is it a frame dataset like SemanticKitti or an area dataset like S3DIS? Can you post a picture of the data?
Here are some suggestions:
- Reduce `conv_radius`. This will surely mean a loss in performance though.
- Increase `dl` without losing too much performance. Try 0.03, maybe 0.032, 0.034... Everything is a question of trade-off between speed and performance.
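For reference, a minimal sketch of how these two knobs typically look in a KPConv-style config file. The attribute names (`first_subsampling_dl` for `dl`, `conv_radius`) and the default values are assumptions based on the standard S3DIS config, so adjust them to whatever your own `train_*.py` uses:

```python
# Minimal sketch (not the repo's actual config class) of the two
# memory/performance knobs discussed above. Attribute names are assumed
# from a typical KPConv S3DIS config; adjust to your own train_*.py.

class MemoryTunedConfig:
    # Larger dl -> coarser first subsampling -> fewer points per input
    # sphere -> less memory, usually with only a small performance drop.
    first_subsampling_dl = 0.032   # try 0.03, 0.032, 0.034, ...

    # Smaller convolution radius -> smaller neighborhoods -> less memory,
    # but this will cost some performance.
    conv_radius = 2.0              # e.g. down from a typical default of 2.5
```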
Also, consider there might be a bug in your code that makes the memory grow. This is always a possibility if you modified some stuff because of your new data. Does the memory on your GPU increase constantly during training or is it random?
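In case it helps to answer that, here is a small sketch of how one could log GPU memory during training with plain PyTorch:

```python
# Log allocated and peak CUDA memory every few hundred steps: a steady
# climb suggests a leak in the modified code (e.g. tensors kept around
# with their computation graphs), while isolated spikes suggest
# occasional oversized batches.
import torch

def log_gpu_memory(step):
    if torch.cuda.is_available():
        alloc = torch.cuda.memory_allocated() / 1024 ** 2
        peak = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f"step {step:6d}  allocated {alloc:8.1f} MB  peak {peak:8.1f} MB")

# Call log_gpu_memory(step) inside the training loop, e.g. every 100 steps.
```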
Another possibility: maybe a part of your dataset is particularly denser than the rest, and that makes everything crash?
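One rough way to check that hypothesis outside the training code is to estimate neighbour counts per cloud with a plain scipy KD-tree. The radius below is a stand-in for the input sphere radius and the loading of `points` is left to your own `.ply` reader, so treat this only as a sketch:

```python
# For a random subset of points in a cloud, count the neighbours within a
# given radius. Clouds whose maximum count is far above the others are the
# ones most likely to blow up the memory during training.
import numpy as np
from scipy.spatial import cKDTree

def density_stats(points, radius=1.5, n_queries=2000):
    # points: (N, 3) float array loaded from one of your clouds
    tree = cKDTree(points)
    sample = points[np.random.choice(len(points), size=min(n_queries, len(points)), replace=False)]
    counts = tree.query_ball_point(sample, r=radius, return_length=True)
    return int(counts.min()), int(np.median(counts)), int(counts.max())
```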
Memory issues in the middle of the epoch were due to one of my changes. Regarding the performance, both of the things mentioned would be bad (increasing `dl` and reducing `conv_radius`). Also, the dataset density is non-uniform (lots of points are in the front and few points are far from the camera, so on a side view it would look like this). @HuguesTHOMAS
@HuguesTHOMAS I also get the OOM error. Does the code support multi-GPU training?
Usually, when using this code, I was getting calibration issues and things like that. But now, after the calibration passes correctly, I notice an "OOM error" in the middle of the epoch.
1. Usually the amount of memory (GPU memory and/or RAM) needed by the code in each epoch is fixed, so if epoch 1 passed correctly it should not affect the next epochs. Is there any reason for this? Incorrect GPU usage led to the above error.
2. Even with `batch_num = 4` and `dl = 0.02/0.03` I got this error (I had close to 32G of memory in both GPU and RAM). I have `epoch_steps = 500`; do you think reducing this might be helpful (since, as per #180, this controls the total number of spheres seen by the model in each iteration)?
3. Same as 1: what are the steps I should follow to solve this kind of issue?

Also, in #180 we talked about 10^5 points in a cloud (20~21 classes). I am not sure how I can feed this entire thing without subsampling, since it is quite small compared to S3DIS and other datasets. Ideally we would stop the subsampling and just simply use the whole `.ply` file as the subsampled `.ply` file.
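As a point of reference for these questions, here is a sketch of where those knobs sit in a typical KPConv-style config. The names and the `in_radius` value are assumptions taken from the default S3DIS setup, not from this thread; my understanding is that `batch_num` and the sphere radius drive the per-step memory, while `epoch_steps` mostly sets how long an epoch is:

```python
# Sketch only: typical KPConv-style config attributes and what they tend
# to affect (assumed names and values; check your own config file).

class ContextConfig:
    in_radius = 1.5       # radius of each input sphere -> points per sphere -> memory
    batch_num = 4         # target spheres per batch -> memory per training step
    epoch_steps = 500     # steps per epoch -> epoch length, not peak memory per step
    first_subsampling_dl = 0.03   # the "dl" discussed above
```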
A bit more context (this is the main question now):
- `dl` ~ 0.02 - 0.03 would be reasonable, and a lower radius (1.2, etc.) is not performing well. So I have two things that I can change now: radius (increase) and `batch_num` (decrease, but not lower than 3, as mentioned somewhere in the repo issues). What kind of data would do well with this model? How can I infer that the model won't do well on a specific piece of data? With this in mind, if I increase the radius to ~3 while keeping `dl` fixed, it gives me an OOM error.
- I also tried to play around a bit to figure out how I can simply skip the subsampling, but in that case I ran into an OOM error as well. My samples are small, so can I not just provide them directly to the network? Apparently, just by changing the subsampled files to the original `.ply` files, keeping everything else the same as the default `S3DIS.py` file, I also ran into an error.
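Regarding skipping the subsampling: a rough, repo-independent way to check how many points a grid subsampling at a given `dl` would actually keep in one of these ~10^5-point clouds is to count the occupied voxels (one averaged point per voxel is what a grid subsampling keeps). This is only an approximation and the point loading is left to your own reader:

```python
# Approximate the point count after grid subsampling by counting occupied
# voxels of size dl. If the counts at dl = 0.02-0.03 are close to the
# original, the subsampling is nearly a no-op for these small clouds anyway.
import numpy as np

def points_after_grid_subsampling(points, dl):
    voxels = np.floor(points / dl).astype(np.int64)
    return len(np.unique(voxels, axis=0))

# points = ...  # (N, 3) array read from one of the original .ply files
# for dl in (0.02, 0.03, 0.04):
#     print(dl, points_after_grid_subsampling(points, dl), "of", len(points))
```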