trandangtrungduc opened 1 year ago
Hi!
We train on an A40 GPU with 48 GB of memory, using 2 cm voxelization. You can train on a 3090 with 5 cm voxelization, i.e. `data.voxel_size=0.05`. However, performance will be worse.
Best, Jonas
Thank you for your advice. I modified it to `data.voxel_size=0.05` and `data.batch_size=1`, but I still encountered "BATCH TOO BIG". After retracing the code, I think the problem comes from this snippet in `trainer.py`:

```python
def training_step(self, batch, batch_idx):
    data, target, file_names = batch
    if data.features.shape[0] > self.config.general.max_batch_size:
        print("data exceeds threshold")
        raise RuntimeError("BATCH TOO BIG")
```

With `max_batch_size=8`, `data.features.shape[0]` is a very large number. Do you think this is a logic error in the code? I would appreciate any advice.
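Editor's note: a minimal sketch (with hypothetical shapes and numbers, not the repo's actual data class) of why `data.features.shape[0]` is so large here. In sparse 3D pipelines the feature tensor stacks the per-point features of every scene in the batch, so its first dimension is a total point count, not a scene count:

```python
import numpy as np

# Hypothetical example: ~100k voxelized points with 3 RGB features each,
# all stacked into one array for the whole batch.
features = np.zeros((100_000, 3))

max_batch_size = 8
# shape[0] is the total point count, so it dwarfs a cap of 8.
print(features.shape[0] > max_batch_size)  # True -> "BATCH TOO BIG" is raised
```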
Moreover, when I comment out the above code, I can train, but the results I get are all 0 and nan, like this. Is this a problem, or should I wait a few more epochs?
```
################################################################
what           :      AP  AP_50%  AP_25%
################################################################
cabinet        :   0.000   0.000   0.000
bed            :     nan     nan     nan
chair          :   0.000   0.000   0.000
sofa           :     nan     nan     nan
table          :   0.000   0.000   0.000
door           :   0.000   0.000   0.000
window         :   0.000   0.000   0.000
bookshelf      :     nan     nan     nan
picture        :     nan     nan     nan
counter        :   0.000   0.000   0.000
desk           :     nan     nan     nan
curtain        :     nan     nan     nan
refrigerator   :   0.000   0.000   0.000
shower curtain :     nan     nan     nan
toilet         :     nan     nan     nan
sink           :   0.000   0.000   0.000
bathtub        :     nan     nan     nan
otherfurniture :   0.000   0.000   0.000
----------------------------------------------------------------
average        :   0.000   0.000   0.000
```
I'm facing the same problem as yours:

```
average        :   0.011   0.032   0.032
```

Have you already solved the problem or found what caused it? I would appreciate any advice.
Hi,

`max_batch_size=8` means that you only train on 8 points in total in one training iteration. Therefore all point clouds are filtered out and your network cannot learn anything. This parameter is very sensitive. Try to set it as large as possible until you run into memory issues.
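Editor's note: a minimal sketch (the function name is hypothetical) of what this cap means in practice. It bounds the total number of points per iteration, and any batch over the cap is rejected:

```python
def exceeds_point_cap(num_points: int, max_batch_size: int) -> bool:
    """True when a batch would be rejected as 'BATCH TOO BIG'."""
    return num_points > max_batch_size

# Even a single scene voxelized at 5 cm has tens of thousands of points,
# so a cap of 8 rejects every batch, while a very large cap accepts them all.
print(exceeds_point_cap(100_000, 8))           # True  -> batch rejected
print(exceeds_point_cap(100_000, 99_999_999))  # False -> batch trains
```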
Let me know if this helps :)
Best, Jonas
Thanks for your reply!

The parameter `max_batch_size` is set to a default value of 99999999 in the config file here, and I didn't modify it. But I did set `data.batch_size=1` in the .sh file to solve the memory issues. Now, following your advice, I set it larger and enlarge the parameter `data.voxel_size` instead. Do you think this modification is appropriate?
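Editor's note: for reference, the relationship between these knobs might look like this in the config (a sketch; the key names follow the overrides quoted in this thread, but the file layout is an assumption):

```yaml
general:
  max_batch_size: 99999999  # cap on total points per iteration; keep it large
data:
  batch_size: 1             # number of scenes per batch
  voxel_size: 0.05          # coarser voxels -> fewer points -> less memory
```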
Furthermore, I encountered another issue: I'm training on my custom dataset and have defined my own list of semantic classes. However, during the validation stage, the semantic classes listed in the AP table are always ceiling, floor, wall, beam (shown in my comment above), which differ from my custom class list. Could you please guide me on how to modify this?
I eagerly await your response!
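Editor's note on the class-name issue: in ScanNet-style evaluation scripts, the AP table's row labels typically come from hard-coded label lists in the benchmark code rather than from the dataset config, so those would need replacing as well. A minimal sketch with placeholder class names (all names below are hypothetical):

```python
# Hypothetical custom classes -- replace with your own list.
CLASS_LABELS = ["pipe", "valve", "pump"]
VALID_CLASS_IDS = [1, 2, 3]

# An evaluator typically builds an id->label map like this and uses it
# to print the rows of the AP table.
ID_TO_LABEL = {cid: name for cid, name in zip(VALID_CLASS_IDS, CLASS_LABELS)}
print(ID_TO_LABEL[2])  # valve
```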
I have adjusted the 'scannet_val.sh' file as follows:
I got the NaN result when I tried 'scannet_val.sh'. Besides, the error about the batch size being too large is also raised. Do you have any idea how to fix this? (I'm using a 3090 GPU with 24 GB of memory.) Thank you.