facebookresearch / DepthContrast

DepthContrast self-supervised learning for 3D

training crash #2

Closed dringlourious closed 3 years ago

dringlourious commented 3 years ago

Hello,

I tried to use the Waymo dataset to pretrain the model; however, I got the following error. Could you please check how to fix it? Thank you very much.

```
============================== Args ==============================
cfg                           configs/point_within_lidar_template.yaml
quiet                         False
world_size                    1
rank                          0
dist_url                      tcp://localhost:15475
dist_backend                  nccl
seed                          None
gpu                           0
ngpus                         1
multiprocessing_distributed   False
Traceback (most recent call last):
  File "main.py", line 190, in <module>
    main()
  File "main.py", line 70, in main
    main_worker(args.gpu, ngpus_per_node, args, cfg)
  File "main.py", line 81, in main_worker
    model = main_utils.build_model(cfg['model'], logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/utils/main_utils.py", line 142, in build_model
    return models.build_model(cfg, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/__init__.py", line 11, in build_model
    return BaseSSLMultiInputOutputModel(model_config, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 58, in __init__
    self.trunk = self._get_trunk()
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 275, in _get_trunk
    trunks.append(models.TRUNKS[self.config['arch_point']](...))
TypeError: 'NoneType' object is not callable
```

imisra commented 3 years ago

@zaiweizhang - Can you take a look?

zaiweizhang commented 3 years ago

For pretraining with lidar point clouds, we use models from OpenPCDet. In order to run the pretraining, you need to git clone OpenPCDet into third_party and install it. Did you git clone OpenPCDet into third_party and install it? (You can add a --recursive flag when cloning our repo; it will clone our forked version.) It looks like the error is because of that.

dringlourious commented 3 years ago

> For pretraining with lidar point clouds, we use models from OpenPCDet. In order to run the pretraining, you need to git clone OpenPCDet into third_party and install it. Did you git clone OpenPCDet into third_party and install it? (You can add a --recursive flag when cloning our repo; it will clone our forked version.) It looks like the error is because of that.

Thanks for answering. I installed pcdet 0.3.0 in my venv (using `python setup.py develop`, as in the repo README), but I still got the error. Does cloning OpenPCDet into third_party really matter? I actually did clone it again and reinstall pcdet, but the error is still there.

zaiweizhang commented 3 years ago

So in here, I added a try/except to avoid issues people may hit when running the script in different environments. Would you mind commenting out that try/except and running your script again? That should give me a sense of what's not working. Thanks!
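For readers following along, here is an illustrative sketch (not the actual DepthContrast code; the registry name `TRUNKS` comes from the traceback above, everything else is invented for demonstration) of why a swallowed ImportError surfaces later as `TypeError: 'NoneType' object is not callable`:

```python
# Illustrative sketch: a guarded import that silently fails leaves a
# registry entry unset, and the failure only surfaces much later as a
# confusing TypeError instead of the real ImportError.
TRUNKS = {"PointNet2MSG": None}  # default: backbone not registered

try:
    # In a bare environment this module does not exist, mimicking a
    # missing third_party/OpenPCDet install.
    from nonexistent_backbone_pkg import PointNet2MSG
    TRUNKS["PointNet2MSG"] = PointNet2MSG
except ImportError:
    pass  # swallowed -- the real cause of the later crash is hidden

trunk_cls = TRUNKS["PointNet2MSG"]
try:
    trunk = trunk_cls()  # calling None fails here, far from the import
except TypeError as e:
    print(e)  # 'NoneType' object is not callable
```

Commenting out the try/except makes the original ImportError visible at the point where it actually happens.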

dringlourious commented 3 years ago

Thanks for the answer. I commented out the try and except here, and got the following error:

```
============================== Args ==============================
cfg                           configs/point_within_lidar_template.yaml
quiet                         False
world_size                    1
rank                          0
dist_url                      tcp://localhost:15475
dist_backend                  nccl
seed                          None
gpu                           0
ngpus                         1
multiprocessing_distributed   False
Traceback (most recent call last):
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/third_party/pointnet2/pointnet2_utils.py", line 26, in <module>
    import pointnet2._ext as _ext
ModuleNotFoundError: No module named 'pointnet2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 190, in <module>
    main()
  File "main.py", line 70, in main
    main_worker(args.gpu, ngpus_per_node, args, cfg)
  File "main.py", line 81, in main_worker
    model = main_utils.build_model(cfg['model'], logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/utils/main_utils.py", line 142, in build_model
    return models.build_model(cfg, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/__init__.py", line 11, in build_model
    return BaseSSLMultiInputOutputModel(model_config, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 58, in __init__
    self.trunk = self._get_trunk()
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 271, in _get_trunk
    import models.trunks as models
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/trunks/__init__.py", line 9, in <module>
    from models.trunks.pointnet import PointNet
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/trunks/pointnet.py", line 20, in <module>
    from pointnet2_modules import PointnetSAModuleVotes, PointnetFPModule
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/third_party/pointnet2/pointnet2_modules.py", line 21, in <module>
    import pointnet2_utils
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/third_party/pointnet2/pointnet2_utils.py", line 30, in <module>
    "Could not import _ext module.\n"
ImportError: Could not import _ext module. Please see the setup instructions in the README: https://github.com/erikwijmans/Pointnet2_PyTorch/blob/master/README.rst
```

The pointnet2 import error occurs.

zaiweizhang commented 3 years ago

So for the four imports:

```python
#### Following two are for scannet+redwood pretraining
from models.trunks.pointnet import PointNet
from models.trunks.spconv.models.res16unet import Res16UNet34

#### Following two are for waymo pretraining
from models.trunks.pointnet2_backbone import PointNet2MSG
from models.trunks.spconv_unet import UNetV2_concat as UNetV2
```

You should not change the try/except for the first two imports. As for the error, it may be that you did not fully install OpenPCDet under third_party. If you did install OpenPCDet correctly, then just run the following:

```
cd third_party/pointnet2
python setup.py install
```
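As a quick, hedged sanity check (the module names below are taken from the tracebacks in this thread; adjust them to your layout), you can ask Python whether the compiled extension is importable without triggering the full model import:

```python
import importlib.util

def can_import(name: str) -> bool:
    """Return True if `name` resolves to an importable module."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # A missing parent package also counts as "not importable".
        return False

# Names seen in the tracebacks above.
for mod in ("pointnet2", "pointnet2._ext", "pcdet"):
    print(f"{mod}: {'ok' if can_import(mod) else 'missing'}")
```

If `pointnet2._ext` reports missing after `python setup.py install`, the CUDA extension did not build, which is exactly the condition the second traceback complains about.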

dringlourious commented 3 years ago

Thanks for the patient answers; I finally fixed the problem. I think there was a misunderstanding.

After I checked the installation of pcdet and pointnet2 (both installed via the `python setup.py` method), I found I had to change line 21 of the script here to `from pcdet.ops.pointnet2.pointnet2_batch import pointnet2_modules`, since `ops` is not on my PYTHONPATH when pcdet is installed via the setup.py method.
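A hedged sketch of a workaround along these lines (the module paths are the two that came up in this thread; your environment may differ) is to tolerate both install layouts with a fallback import:

```python
# Illustrative fallback import, not the upstream code: try the
# third_party layout first, then the pcdet package layout.
try:
    # Layout 1: third_party/pointnet2 is on sys.path.
    import pointnet2_modules
except ImportError:
    try:
        # Layout 2: pcdet installed via `python setup.py develop`.
        from pcdet.ops.pointnet2.pointnet2_batch import pointnet2_modules
    except ImportError:
        pointnet2_modules = None  # neither layout available

print("pointnet2_modules resolved:", pointnet2_modules is not None)
```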

Thank you for taking the time to support me, but I think a more detailed README would be beneficial for the watchers.

After fixing the problem, I initiated the training and got the following training log:

```
FP_modules.2.mlp.1.weight | Frozen | 512                | 512
FP_modules.2.mlp.1.bias   | Frozen | 512                | 512
FP_modules.2.mlp.3.weight | Frozen | 512 x 512 x 1 x 1  | 262144
FP_modules.2.mlp.4.weight | Frozen | 512                | 512
FP_modules.2.mlp.4.bias   | Frozen | 512                | 512
FP_modules.3.mlp.0.weight | Frozen | 512 x 1536 x 1 x 1 | 786432
FP_modules.3.mlp.1.weight | Frozen | 512                | 512
FP_modules.3.mlp.1.bias   | Frozen | 512                | 512
FP_modules.3.mlp.3.weight | Frozen | 512 x 512 x 1 x 1  | 262144
FP_modules.3.mlp.4.weight | Frozen | 512                | 512
FP_modules.3.mlp.4.bias   | Frozen | 512                | 512
head.clf.0.weight         | Frozen | 128 x 128          | 16384
head.clf.0.bias           | Frozen | 128                | 128
head.clf.2.weight         | Frozen | 128 x 128          | 16384
head.clf.2.bias           | Frozen | 128                | 128
```

```
WARNING:root:Distributed trainer not initialized. Not using the sampler and data will NOT be shuffled
```

```
============================== Train data ==============================
<datasets.depth_dataset.DepthContrastDataset object at 0x7ff70350a610>
{'loss_type': 'cross_entropy', 'name': 'NCELossMoco'}
============================== Epoch 0 ==============================
/home/shawn/.python_virtual_env/3dConstrastive/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:509: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  "please use `get_last_lr()`.", UserWarning)
LR: [0.12]

train: Epoch 0
2021-01-26 18:28:53.167325 | train [0][  1/311] Time 5.985 ( 5.985) Data 3.090 ( 3.090) Loss 1.942e+00 (1.942e+00) Loss_npid1 1.942e+00 (1.942e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:54.054138 | train [0][  5/311] Time 0.223 ( 1.374) Data 0.008 ( 0.620) Loss 3.717e+00 (3.323e+00) Loss_npid1 3.717e+00 (3.323e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:55.194466 | train [0][ 10/311] Time 0.233 ( 0.801) Data 0.001 ( 0.311) Loss 4.372e+00 (3.719e+00) Loss_npid1 4.372e+00 (3.719e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:56.525938 | train [0][ 15/311] Time 0.228 ( 0.623) Data 0.042 ( 0.213) Loss 4.771e+00 (4.018e+00) Loss_npid1 4.771e+00 (4.018e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
```

I would like to ask whether the training is going the right way? Thanks

zaiweizhang commented 3 years ago

Thank you for bearing with me also :) I will try to revise the Readme.

I think you might be using this command to run your code: `python main.py /path/to/cfg_file`. However, that should mainly be used for debugging purposes.

I would suggest that once your program is running (it currently looks fine to me), you try the following command to run the program in a distributed setting:

```
python main.py /path/to/cfg_file --multiprocessing-distributed --world-size 1 --rank 0 --ngpus number_of_gpus
```

Just set `--ngpus` to 1 for the single-GPU setting.

Otherwise, as the log already warned: `WARNING:root:Distributed trainer not initialized. Not using the sampler and data will NOT be shuffled`

Sorry for the confusion! I will try to revise the Readme.

dringlourious commented 3 years ago

Thanks for the advice. I think this issue can be closed.