dvlab-research / PointGroup

PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation
Apache License 2.0

Error when training: "subprocess.CalledProcessError: Command '['ninja']' returned non-zero exit status 245." #57

Closed timsu1104 closed 2 years ago

timsu1104 commented 2 years ago

I set up the environment a couple of days ago and everything ran fine until I attempted to change the permissions of the data and anaconda3 so I could run my code from another account. I ran "chmod -R 777 /share/suzhengyuan/" (where I keep the ScanNet data and PointGroup's source code) and "chmod -R 777 /share/anaconda3/", and suddenly the code stopped working.

I have tried reinstalling spconv, ninja, ccimport, and pccm, but none of that fixed the problem. It seems to be related to spconv: I cloned spconv 2.2.21, overwrote spconv/pytorch with the modified files provided in PointGroup/lib/spconv, and then made some changes to the headers (e.g., spconv -> spconv.pytorch, import spconv.pytorch as spconv, etc.) to adapt them, as in the sketch below. This worked before I changed the permissions, so I really have no idea what is going on. Has anyone run into a similar problem? How can I fix it?
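(For reference, a hypothetical minimal example of what the spconv 1.x -> 2.x adaptation looks like; this is not the actual PointGroup model code, just the public spconv 2.x API used in the same style.)

```python
# Hypothetical minimal example of the spconv 2.x-style usage referred to above;
# spconv 2.x moved its PyTorch modules under the spconv.pytorch namespace.
import torch
import spconv.pytorch as spconv  # spconv 1.x used: import spconv

net = spconv.SparseSequential(
    spconv.SubMConv3d(3, 16, kernel_size=3, indice_key="subm1"),
    torch.nn.ReLU(),
).cuda()

# SparseConvTensor wraps per-voxel features plus int32 coords (batch, z, y, x).
feats = torch.randn(100, 3).cuda()
coords = torch.zeros(100, 4, dtype=torch.int32)
coords[:, 1] = torch.arange(100, dtype=torch.int32)  # unique voxel positions
x = spconv.SparseConvTensor(feats, coords.cuda(), spatial_shape=[128, 128, 128], batch_size=1)

print(net(x).features.shape)  # torch.Size([100, 16])
```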

My configuration: CUDA 11.1, torch 1.9.0+cu111, pccm 0.3.4, ccimport 0.3.7, ninja 1.10.2.3, cumm 0.3.0
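(Side note: a quick way to confirm which versions the active interpreter actually sees; the distribution names below are assumed to match the installed package names.)

```python
# Hypothetical version check for the packages listed above.
import pkg_resources
import torch

print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
for pkg in ("spconv", "pccm", "ccimport", "ninja", "cumm"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "not found in this environment")
```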

The error:

[2022-02-15 03:27:07,589 INFO train.py line 26 3272560] Namespace(TEST_NMS_THRESH=0.3, TEST_NPOINT_THRESH=100, TEST_SCORE_THRESH=0.09, batch_size=4, bg_thresh=0.25, block_reps=2, block_residual=True, classes=20, cluster_meanActive=50, cluster_npoint_thre=50, cluster_radius=0.03, cluster_shift_meanActive=300, config='/share/suzhengyuan/ScanNetv2/PointGroup/config/pointgroup_run1_scannet.yaml', data_root='dataset', dataset='scannetv2', dataset_dir='data/scannetv2_inst.py', epochs=384, eval=True, exp_path='exp/scannetv2/pointgroup/pointgroup_run1_scannet', fg_thresh=0.75, filename_suffix='_inst_nostuff.pth', fix_module=[], full_scale=[128, 512], ignore_label=-100, input_channel=3, loss_weight=[1.0, 1.0, 1.0, 1.0], lr=0.001, m=16, manual_seed=123, max_npoint=250000, mode=4, model_dir='model/pointgroup/pointgroup.py', model_name='pointgroup', momentum=0.9, multiplier=0.5, optim='Adam', prepare_epochs=128, pretrain='', pretrain_module=[], pretrain_path=None, save_freq=16, save_instance=False, save_pt_offsets=False, save_semantic=False, scale=50, score_fullscale=14, score_mode=4, score_scale=50, split='val', step_epoch=384, task='train', test_epoch=384, test_seed=567, test_workers=16, train_workers=16, use_coords=True, weight_decay=0.0001)
[2022-02-15 03:27:07,592 INFO train.py line 135 3272560] => creating model ...
Traceback (most recent call last):
  File "train.py", line 138, in <module>
    from model.pointgroup.pointgroup_orig import PointGroup as Network
  File "/share/suzhengyuan/ScanNetv2/PointGroup/model/pointgroup/pointgroup_orig.py", line 8, in <module>
    import spconv.pytorch as spconv
  File "/share/suzhengyuan/ScanNetv2/spconv/spconv/__init__.py", line 15, in <module>
    from . import build as _build
  File "/share/suzhengyuan/ScanNetv2/spconv/spconv/build.py", line 49, in <module>
    load_library=False)
  File "/share/anaconda3/envs/pointgroup/lib/python3.7/site-packages/pccm/builder/pybind.py", line 141, in build_pybind
    objects_folder=objects_folder)
  File "/share/anaconda3/envs/pointgroup/lib/python3.7/site-packages/ccimport/core.py", line 182, in ccimport
    linker_to_path=linker_to_path)
  File "/share/anaconda3/envs/pointgroup/lib/python3.7/site-packages/ccimport/buildtools/writer.py", line 997, in build_simple_ninja
    raise subprocess.CalledProcessError(proc.returncode, cmds)
subprocess.CalledProcessError: Command '['ninja']' returned non-zero exit status 245.
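(For anyone hitting the same traceback: a hypothetical way to narrow it down is to check that ninja itself still runs under the new account and that the spconv checkout the JIT build writes into is still accessible; the path below is taken from the traceback.)

```python
# Hypothetical diagnostic, not part of PointGroup or spconv.
import os
import subprocess

# Does ninja run at all for this account? Exit status 245 usually hides a
# more specific message that ninja or the compiler printed to stderr.
print(subprocess.run(["ninja", "--version"], capture_output=True, text=True))

# Does this account still have write/traverse access where the JIT build
# puts its objects? (Adjust the path to your own checkout as needed.)
build_root = "/share/suzhengyuan/ScanNetv2/spconv"
print("writable:", os.access(build_root, os.W_OK | os.X_OK))
```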

timsu1104 commented 2 years ago

Problem solved. There was something wrong with my spconv library. After I reinstalled it with "pip --no-cache-dir install" and modified functional.py, everything works again.
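(A minimal sanity check after a reinstall like this, assuming the package is importable from the active environment: importing spconv.pytorch re-runs, or reuses, the extension build that ninja was failing on.)

```python
# Rough post-reinstall check: a clean import and layer construction means the
# JIT/ninja build that previously crashed now succeeds (or its cache is valid).
import spconv.pytorch as spconv

layer = spconv.SubMConv3d(3, 8, kernel_size=3)
print(type(layer).__name__)  # SubMConv3d
```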