art-programmer / MASC

Segmentation fault #3

Closed by mhalber 5 years ago

mhalber commented 5 years ago

Hi, thank you for providing this code! I have successfully run the script to prepare the data for ScanNet, but when I attempt to run the training, I run into a segfault.

The console output before crash:

keyname=instance_normal_augment_2 task=train started
the number of images val 20
the number of images train 1201
the number of images 1201

Through some print statement abuse, I've managed to see that the code seems to be breaking in the function forward(self, coords, faces, colors, instances) in models/instance.py, at line 199.

Python, gcc, torch, cuda versions:
Python - 3.7.2
torch - 1.0.0
cuda - 9.0.176

I am attempting to run the code on a system with a Tesla K40c, with 12GB of memory.

I'd greatly appreciate help in trying to figure out what is going wrong.

Thanks!

chenliu-wustl commented 5 years ago

Could you please check the value range of all_coords (all_coords.min(0) and all_coords.max(0))? all_coords should have a shape of Nx4; all_coords.min(0)[:3] should be greater than 0, all_coords.max(0)[:3] should be smaller than 4096, and all_coords.min(0)[3] and all_coords.max(0)[3] should both equal 0.
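A minimal sketch of that range check, in case it helps. The function name check_coords and the limit parameter are hypothetical and not part of the MASC code; it takes plain rows so it works on all_coords.tolist() for either a NumPy array or a torch tensor, and it applies the bounds literally as stated in this comment.

```python
def check_coords(rows, limit=4096):
    """Return a list of problems found in an N x 4 coordinate table.

    `rows` is any iterable of 4-element rows, e.g. all_coords.tolist().
    The bounds follow the comment above: the first three columns must be
    greater than 0 and smaller than `limit`, the fourth must be all 0.
    """
    rows = list(rows)
    if not rows or any(len(r) != 4 for r in rows):
        return ["expected a non-empty N x 4 table"]

    cols = list(zip(*rows))              # transpose into four columns
    mins = [min(c) for c in cols]        # per-column minimum, like min(0)
    maxs = [max(c) for c in cols]        # per-column maximum, like max(0)

    problems = []
    if not all(m > 0 for m in mins[:3]):
        problems.append("min of first three columns not > 0: %s" % mins[:3])
    if not all(m < limit for m in maxs[:3]):
        problems.append("max of first three columns not < %d: %s" % (limit, maxs[:3]))
    if mins[3] != 0 or maxs[3] != 0:
        problems.append("fourth column not all 0: min=%s max=%s" % (mins[3], maxs[3]))
    return problems


if __name__ == "__main__":
    # Toy data only; replace with all_coords.tolist() from the real run.
    print(check_coords([[1, 2, 3, 0], [10, 20, 30, 0]]))
```

An empty list means every stated condition holds; otherwise each entry names the condition that failed and the offending values.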

mhalber commented 5 years ago

Hi - thank you for your reply.

It turns out the fault was on my side. I think the issue was caused by a Python version mismatch: the SparseConvNet GitHub page mentions Python 3.6.8, so I've switched to that version. Additionally, I noticed a mismatch between the nvcc version and the CUDA version in torch on my computer.

After these two changes, the network seems to be training without issues.
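For anyone hitting the same thing, here is a rough sketch of how the versions involved could be collected in one place. The helper name collect_versions is hypothetical; it assumes nvcc is on PATH if installed, and it simply reports None for anything it cannot find rather than failing.

```python
import importlib
import platform
import shutil
import subprocess


def collect_versions():
    """Gather the Python, torch, torch-CUDA and nvcc versions for comparison."""
    info = {"python": platform.python_version(), "torch": None,
            "torch_cuda": None, "nvcc": None}

    try:
        torch = importlib.import_module("torch")
        info["torch"] = torch.__version__
        info["torch_cuda"] = torch.version.cuda  # CUDA version torch was built with
    except (ImportError, OSError):
        pass  # torch not installed or not loadable; leave as None

    nvcc = shutil.which("nvcc")
    if nvcc:
        result = subprocess.run([nvcc, "--version"], stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT, universal_newlines=True)
        info["nvcc"] = result.stdout.strip() or None  # raw nvcc banner text
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print("%s: %s" % (name, value))
```

Comparing the torch_cuda entry with the release shown in the nvcc output should make a mismatch like the one described above visible.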

I think it would be nice if README.md mentioned the required CUDA/Python versions, as without the SparseConvNet page I'd have been lost.

Anyway, thanks again for the help and I will close the issue.