Julie-tang00 / Point-BERT

[CVPR 2022] Pre-Training 3D Point Cloud Transformers with Masked Point Modeling
MIT License

segmentation fault while evaluating #13

Closed leeyegy closed 2 years ago

leeyegy commented 2 years ago

When validation starts during pre-training, a segmentation fault occurs:

......
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.transformer_config.return_all_tokens : False
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.dvae_config = edict()
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.dvae_config.group_size : 32
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.num_group : 64
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.encoder_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.num_tokens : 8192
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.tokens_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.decoder_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.ckpt : pretrain/dVAE.pth
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.total_bs : 128
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.step_per_update : 1
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.max_epoch : 300
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.consider_metric : CDL1
2021-12-27 11:32:16,068 - Point-BERT - INFO - Distributed training: False
2021-12-27 11:32:16,068 - Point-BERT - INFO - Set random seed to 0, deterministic: False
2021-12-27 11:32:16,073 - ShapeNet-55 - INFO - [DATASET] sample out 1024 points
2021-12-27 11:32:16,074 - ShapeNet-55 - INFO - [DATASET] Open file /mnt/cache/liyanjie/data/pointcloud/ShapeNet55-34/ShapeNet-55/train.txt
2021-12-27 11:32:16,101 - ShapeNet-55 - INFO - [DATASET] Open file /mnt/cache/liyanjie/data/pointcloud/ShapeNet55-34/ShapeNet-55/test.txt
2021-12-27 11:32:16,174 - ShapeNet-55 - INFO - [DATASET] 52470 instances were loaded
2021-12-27 11:32:16,191 - ModelNet - INFO - The size of test data is 2468
2021-12-27 11:32:16,192 - ModelNet - INFO - Load processed data from /mnt/cache/liyanjie/data/pointcloud/ModelNet/modelnet40_normal_resampled/modelnet40_test_8192pts_fps.dat...
2021-12-27 11:32:16,993 - ModelNet - INFO - The size of train data is 9843
2021-12-27 11:32:16,994 - ModelNet - INFO - Load processed data from /mnt/cache/liyanjie/data/pointcloud/ModelNet/modelnet40_normal_resampled/modelnet40_train_8192pts_fps.dat...
2021-12-27 11:32:19,511 - Point_BERT - INFO - [Point_BERT] build dVAE_BERT ...
2021-12-27 11:32:19,511 - Point_BERT - INFO - [Point_BERT] Point_BERT [NOT] calc the loss for all token ...
2021-12-27 11:32:19,511 - dVAE BERT - INFO - [Transformer args] {'mask_ratio': [0.25, 0.45], 'trans_dim': 384, 'depth': 12, 'drop_path_rate': 0.1, 'cls_dim': 512, 'replace_pob': 0.0, 'num_heads' : 6, 'moco_loss': False, 'dvae_loss': True, 'cutmix_loss': True, 'return_all_tokens': False}
2021-12-27 11:32:22,489 - dVAE BERT - INFO - [Encoder] Successful Loading the ckpt for encoder from pretrain/dVAE.pth
2021-12-27 11:32:22,521 - dVAE BERT - INFO - [Transformer args] {'mask_ratio': [0.25, 0.45], 'trans_dim': 384, 'depth': 12, 'drop_path_rate': 0.1, 'cls_dim': 512, 'replace_pob': 0.0, 'num_heads' : 6, 'moco_loss': False, 'dvae_loss': True, 'cutmix_loss': True, 'return_all_tokens': False}
2021-12-27 11:32:23,612 - Point_BERT - INFO - [dVAE] Successful Loading the ckpt for dvae from pretrain/dVAE.pth
2021-12-27 11:32:23,638 - Point_BERT - INFO - [Point_BERT Group] cutmix_BERT divide point cloud into G64 x S32 points ...
2021-12-27 11:32:34,314 - Point-BERT - INFO - [RESUME INFO] Loading model weights from ./experiments/Point-BERT/Mixup_models/pointBERT_pretrain/ckpt-last.pth...
2021-12-27 11:32:38,565 - Point-BERT - INFO - [RESUME INFO] resume ckpts @ 9 epoch( best_metrics = {'acc': 0.0})
2021-12-27 11:32:38,566 - Point-BERT - INFO - Using Data parallel ...
2021-12-27 11:32:38,591 - Point-BERT - INFO - [RESUME INFO] Loading optimizer from ./experiments/Point-BERT/Mixup_models/pointBERT_pretrain/ckpt-last.pth...
2021-12-27 11:32:39,640 - Point-BERT - INFO - [VALIDATION] Start validating epoch 10
error: Segmentation fault

Any suggestions would be greatly appreciated~

leeyegy commented 2 years ago

Environment: CUDA-10.0, GCC-7.3, Torch-1.8

leeyegy commented 2 years ago

The job runs on a single V100.

yuxumin commented 2 years ago

Hi, did the error occur in the linearSVM evaluation?

leeyegy commented 2 years ago

Yes, I believe so. When I comment out the linearSVM evaluation line, the segmentation fault doesn't happen.

yuxumin commented 2 years ago

It also happened when I trained the model on an Nvidia 3090, but everything works fine when I use an Nvidia 2080Ti.

leeyegy commented 2 years ago

ORZ. Does that mean only the 2080Ti is supported so far? That's unfortunate, because V100s are all I have.

yuxumin commented 2 years ago

Emmm. In fact, the SVM evaluation during MPM pre-training is only a rough indicator of whether the Transformer is training well (we use the linearSVM results to tune the pipeline). I think it is fine to pre-train Point-BERT without the linearSVM evaluation. Set --val_freq 500, which is larger than the max_epoch of 300 in your config, to skip evaluation during training and save the time for MPM training.
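
To illustrate why a large --val_freq skips the evaluation, here is a minimal sketch. It assumes the training loop validates whenever the epoch index is a non-zero multiple of val_freq. That gating rule and both function names are hypothetical and not taken from the Point-BERT code, so check the actual runner before relying on it.

```python
def should_validate(epoch: int, val_freq: int) -> bool:
    """Hypothetical gate: validate every `val_freq` epochs, never at epoch 0."""
    return epoch != 0 and epoch % val_freq == 0


def validation_epochs(max_epoch: int, val_freq: int, start_epoch: int = 0) -> list:
    """List the epochs in [start_epoch, max_epoch] at which validation would run."""
    return [e for e in range(start_epoch, max_epoch + 1)
            if should_validate(e, val_freq)]


if __name__ == "__main__":
    # With max_epoch = 300 (from the log above) and val_freq = 500,
    # no epoch is a multiple of 500, so validation never triggers.
    print(validation_epochs(max_epoch=300, val_freq=500))
    # With val_freq = 1 and a run resumed at epoch 10, validation starts
    # right away, which matches where the crash shows up in the log.
    print(validation_epochs(max_epoch=12, val_freq=1, start_epoch=10))
```

Under this assumption, any val_freq above max_epoch has the same effect as 500.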

leeyegy commented 2 years ago

That seems to be the quickest solution. Thanks a lot for your patience.

wulalala1999 commented 2 years ago

I use CUDA 11.2, torch 1.7.0+cu110 and torchvision 0.8.1+cu110 on an NVIDIA 3090, and the linearSVM evaluation works fine.

bhavyagoyal commented 2 years ago

@yuxumin I am also getting a segmentation fault during the linearSVM evaluation (I am using an A100 GPU with CUDA 11.3). Were you able to debug this issue?
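
One generic way to narrow this down, offered as a sketch and not a confirmed fix for this issue, is to run the suspect step in a child Python process. A segmentation fault then kills only the child, and the parent can report which signal ended it. The helper names run_isolated and describe_exit are hypothetical, and the snippet passed to the child is a stand-in for the real evaluation call.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> int:
    """Run a Python snippet in a separate interpreter and return its exit code."""
    proc = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return proc.returncode


def describe_exit(returncode: int) -> str:
    """Turn a child exit code into a short human-readable description."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by the signal with that number (e.g. -11 for SIGSEGV).
        return "killed by " + signal.Signals(-returncode).name
    return "exited with code %d" % returncode


if __name__ == "__main__":
    # Stand-in for the real evaluation; replace with the import and call
    # that crashes in your setup.
    rc = run_isolated("print('evaluation placeholder')")
    print(describe_exit(rc))
```

Starting the child with python -X faulthandler additionally prints the Python traceback at the point of the crash, which can show whether the fault comes from the SVM fitting call or from a compiled extension.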