Closed dringlourious closed 3 years ago
@zaiweizhang - Can you take a look?
For pretraining with lidar point clouds, we use models from OpenPCDet. To run the pretraining, you need to git clone OpenPCDet into third_party and install it. Did you do that? (You can add a --recursive flag when cloning our repo; it will clone our forked version.) It looks like the error comes from that.
Thanks for answering. I installed pcdet 0.3.0 in my venv (using python setup.py develop, as in the repo README), but I still got the error. Does cloning OpenPCDet into third_party really matter? I did clone it again and reinstall pcdet, but the error is still there.
So in here, I added a try/except to avoid issues people may hit when running the script in different environments. Would you mind commenting out that try/except and running your script again? That should give me a sense of what's not working. Thanks!
Thanks for the answer. I commented out the try and except here, and got the following error:
============================== Args ==============================
cfg configs/point_within_lidar_template.yaml
quiet False
world_size 1
rank 0
dist_url tcp://localhost:15475
dist_backend nccl
seed None
gpu 0
ngpus 1
multiprocessing_distributed False
Traceback (most recent call last):
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/third_party/pointnet2/pointnet2_utils.py", line 26, in <module>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 190, in <module>
The pointnet2 import error occurs.
So for the four imports:
#### Following two are for scannet+redwood pretraining
from models.trunks.pointnet import PointNet
from models.trunks.spconv.models.res16unet import Res16UNet34
#### Following two are for waymo pretraining
from models.trunks.pointnet2_backbone import PointNet2MSG
from models.trunks.spconv_unet import UNetV2_concat as UNetV2
You should not change the try/except for the first two imports. As for the error, it may be that you did not fully install OpenPCDet in third_party. If you did install OpenPCDet correctly, then just run the following:
cd third_party/pointnet2
python setup.py install
Thanks for the patient answer; I finally fixed the problem. I think there was a misunderstanding.
After I checked the installation of pcdet and pointnet2 (using the python setup.py method), I found I had to change the script here, line 21, to: from pcdet.ops.pointnet2.pointnet2_batch import pointnet2_modules, since ops is not in my PYTHONPATH in my environment when pcdet is installed via the setup.py method.
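For anyone hitting the same mismatch, a quick way to check which of the two candidate import paths actually resolves in a given environment is to probe them with the standard library before editing the script. The module names below are taken from this thread; the helper itself is just a diagnostic sketch, not part of the repo.

```python
import importlib.util

def resolves(module_name: str) -> bool:
    """True if the top-level package of module_name can be found on this environment's path."""
    top_level = module_name.split(".")[0]
    return importlib.util.find_spec(top_level) is not None

# The two candidate import paths discussed above: the first works when
# third_party/pointnet2 was installed directly, the second when pcdet
# was installed via `python setup.py develop`.
for name in ("pointnet2.pointnet2_utils",
             "pcdet.ops.pointnet2.pointnet2_batch.pointnet2_modules"):
    status = "importable" if resolves(name) else "missing"
    print(f"{name}: {status}")
```

Whichever path prints "importable" is the one the import line in pointnet2_utils.py should use.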
Thank you for your time and support, but I think a more detailed README would be more beneficial for the watchers.
After fixing the problem I started the training and got the following training log:
FP_modules.2.mlp.1.weight | Frozen | 512 | 512
FP_modules.2.mlp.1.bias | Frozen | 512 | 512
FP_modules.2.mlp.3.weight | Frozen | 512 x 512 x 1 x 1 | 262144
FP_modules.2.mlp.4.weight | Frozen | 512 | 512
FP_modules.2.mlp.4.bias | Frozen | 512 | 512
FP_modules.3.mlp.0.weight | Frozen | 512 x 1536 x 1 x 1 | 786432
FP_modules.3.mlp.1.weight | Frozen | 512 | 512
FP_modules.3.mlp.1.bias | Frozen | 512 | 512
FP_modules.3.mlp.3.weight | Frozen | 512 x 512 x 1 x 1 | 262144
FP_modules.3.mlp.4.weight | Frozen | 512 | 512
FP_modules.3.mlp.4.bias | Frozen | 512 | 512
head.clf.0.weight | Frozen | 128 x 128 | 16384
head.clf.0.bias | Frozen | 128 | 128
head.clf.2.weight | Frozen | 128 x 128 | 16384
head.clf.2.bias | Frozen | 128 | 128
WARNING:root:Distributed trainer not initialized. Not using the sampler and data will NOT be shuffled
============================== Train data ==============================
<datasets.depth_dataset.DepthContrastDataset object at 0x7ff70350a610>
{'loss_type': 'cross_entropy', 'name': 'NCELossMoco'}
============================== Epoch 0 ==============================
/home/shawn/.python_virtual_env/3dConstrastive/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:509: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  "please use `get_last_lr()`.", UserWarning)
LR: [0.12]
train: Epoch 0
2021-01-26 18:28:53.167325 | train [0][  1/311] Time 5.985 ( 5.985) Data 3.090 ( 3.090) Loss 1.942e+00 (1.942e+00) Loss_npid1 1.942e+00 (1.942e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:54.054138 | train [0][  5/311] Time 0.223 ( 1.374) Data 0.008 ( 0.620) Loss 3.717e+00 (3.323e+00) Loss_npid1 3.717e+00 (3.323e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:55.194466 | train [0][ 10/311] Time 0.233 ( 0.801) Data 0.001 ( 0.311) Loss 4.372e+00 (3.719e+00) Loss_npid1 4.372e+00 (3.719e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
2021-01-26 18:28:56.525938 | train [0][ 15/311] Time 0.228 ( 0.623) Data 0.042 ( 0.213) Loss 4.771e+00 (4.018e+00) Loss_npid1 4.771e+00 (4.018e+00) Loss_npid2 0.000e+00 (0.000e+00) Loss_cmc1 0.000e+00 (0.000e+00) Loss_cmc2 0.000e+00 (0.000e+00)
Could you confirm whether the training is proceeding correctly? Thanks.
Thank you for bearing with me as well :) I will try to revise the README.
I think you might be using this command to run your code:

python main.py /path/to/cfg_file

However, that is mainly intended for debugging purposes.
Once your program is running (it currently looks fine to me), I would suggest using the following command to run it in a distributed setting:

python main.py /path/to/cfg_file --multiprocessing-distributed --world-size 1 --rank 0 --ngpus number_of_gpus
Just set --ngpus to 1 for the single-GPU setting.
Otherwise you will keep seeing the warning here: WARNING:root:Distributed trainer not initialized. Not using the sampler and data will NOT be shuffled
Sorry for the confusion! I will try to revise the README.
Thanks for the advice. I think this issue can be closed.
Hello,
I tried to use the Waymo dataset to pretrain the model; however, I got the following error. Could you please check how to fix it? Thank you very much.
============================== Args ==============================
cfg configs/point_within_lidar_template.yaml
quiet False
world_size 1
rank 0
dist_url tcp://localhost:15475
dist_backend nccl
seed None
gpu 0
ngpus 1
multiprocessing_distributed False
Traceback (most recent call last):
  File "main.py", line 190, in <module>
    main()
  File "main.py", line 70, in main
    main_worker(args.gpu, ngpus_per_node, args, cfg)
  File "main.py", line 81, in main_worker
    model = main_utils.build_model(cfg['model'], logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/utils/main_utils.py", line 142, in build_model
    return models.build_model(cfg, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/__init__.py", line 11, in build_model
    return BaseSSLMultiInputOutputModel(model_config, logger)
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 58, in __init__
    self.trunk = self._get_trunk()
  File "/mnt/Titan/git_repos/open_repos/DepthContrast/models/base_ssl3d_model.py", line 275, in _get_trunk
    trunks.append(models.TRUNKS[self.config['arch_point']](...))
TypeError: 'NoneType' object is not callable
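The TypeError at the bottom of this traceback follows directly from the guarded imports discussed above: when an optional backbone fails to import, its registry entry ends up as None, and looking it up and calling it later raises exactly this error. A minimal stand-alone sketch of that failure mode (the registry and names below are hypothetical, with stdlib modules as stand-ins for the backbones):

```python
import importlib

TRUNKS = {}  # hypothetical backbone registry, mirroring the role of models.TRUNKS

def register_optional(name, module_path, attr):
    """Register attr from module_path; fall back to None if the import fails."""
    try:
        module = importlib.import_module(module_path)
        TRUNKS[name] = getattr(module, attr)
    except ImportError:
        TRUNKS[name] = None  # the swallowed error leaves a silent None entry

register_optional("available", "json", "loads")           # stand-in for a working backbone
register_optional("broken", "no_such_pkg_xyz", "Model")   # stand-in for an uninstalled pointnet2

# Looking up the broken entry and calling it reproduces the error in this issue:
try:
    TRUNKS["broken"]("some config")
except TypeError as exc:
    print(exc)  # 'NoneType' object is not callable
```

This is why commenting out the try/except, as suggested earlier in the thread, surfaces the real ImportError instead of the confusing NoneType message.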