Closed: leeyegy closed this issue 2 years ago
Environment: CUDA 10.0, GCC 7.3, PyTorch 1.8
This task runs on a single V100.
Hi, did the error occur in the linearSVM evaluation?
Yes, I believe so. When I comment out the linearSVM evaluation line, the segmentation fault doesn't happen.
It happened when I trained the model on an NVIDIA 3090, but everything went well when I used an NVIDIA 2080 Ti.
ORZ. Does that mean only the 2080 Ti is supported so far? That's unfortunate, because V100s are all I have.
Emmm. In fact, the linearSVM evaluation during MPM pre-training is only a hint of whether the Transformer is training well. (We use the linearSVM results to tune the pipeline.)
It's fine to pre-train Point-BERT without the linearSVM evaluation. Set --val_freq 500
to skip evaluation during training and save that time for MPM training.
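Concretely, that means passing a validation frequency larger than the total number of epochs (the log below shows config.max_epoch : 300), so validation never triggers. A hypothetical invocation, where the script name and other flags are illustrative and only --val_freq comes from this comment:

```shell
# Illustrative command -- check the repo's README for the exact entry point.
# --val_freq 500 > max_epoch (300), so the linearSVM validation never runs.
python main.py \
    --config cfgs/Point-BERT.yaml \
    --exp_name pointBERT_pretrain \
    --val_freq 500
```

You can still run the linearSVM evaluation once, offline, on a machine where it doesn't segfault.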
It seems this is the fastest solution. Thanks a lot for your patience.
I use CUDA 11.2, torch 1.7.0+cu110, and torchvision 0.8.1+cu110 on an NVIDIA 3090, and the linearSVM evaluation works fine.
@yuxumin I am also hitting a segmentation fault during the linearSVM evaluation (I am using an A100 GPU with CUDA 11.3). Were you able to debug this issue?
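For anyone unfamiliar with the probe being discussed: it freezes the pre-trained Transformer, extracts a feature vector per shape, and fits a linear SVM on ModelNet40 to measure feature quality. Below is a minimal NumPy sketch of that kind of linear probe (hypothetical code, not the repo's implementation; the segfault here comes from the GPU/CUDA setup, not from this logic):

```python
import numpy as np

def linear_svm_probe(train_x, train_y, test_x, test_y,
                     lr=0.1, reg=1e-4, epochs=100):
    """One-vs-rest linear SVM on frozen features, trained by
    full-batch gradient descent on the hinge loss.
    Returns test accuracy in [0, 1]."""
    n_samples, n_feat = train_x.shape
    n_classes = int(train_y.max()) + 1
    W = np.zeros((n_feat, n_classes))
    b = np.zeros(n_classes)
    # Targets in {-1, +1} for each one-vs-rest binary problem
    Y = -np.ones((n_samples, n_classes))
    Y[np.arange(n_samples), train_y] = 1.0
    for _ in range(epochs):
        scores = train_x @ W + b                      # (N, C)
        active = (Y * scores < 1).astype(train_x.dtype)  # hinge active
        grad_W = -(train_x.T @ (active * Y)) / n_samples + reg * W
        grad_b = -(active * Y).mean(axis=0)
        W -= lr * grad_W
        b -= lr * grad_b
    pred = (test_x @ W + b).argmax(axis=1)
    return float((pred == test_y).mean())
```

Because the probe only sees pre-extracted feature arrays, you can dump features on the GPU machine and run the SVM fit anywhere, which is one way to sidestep a crash in the in-loop evaluation.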
When validation starts during pre-training, a segmentation fault occurs:
......
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.transformer_config.return_all_tokens : False
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.dvae_config = edict()
2021-12-27 11:32:16,067 - Point-BERT - INFO - config.model.dvae_config.group_size : 32
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.num_group : 64
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.encoder_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.num_tokens : 8192
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.tokens_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.decoder_dims : 256
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.model.dvae_config.ckpt : pretrain/dVAE.pth
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.total_bs : 128
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.step_per_update : 1
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.max_epoch : 300
2021-12-27 11:32:16,068 - Point-BERT - INFO - config.consider_metric : CDL1
2021-12-27 11:32:16,068 - Point-BERT - INFO - Distributed training: False
2021-12-27 11:32:16,068 - Point-BERT - INFO - Set random seed to 0, deterministic: False
2021-12-27 11:32:16,073 - ShapeNet-55 - INFO - [DATASET] sample out 1024 points
2021-12-27 11:32:16,074 - ShapeNet-55 - INFO - [DATASET] Open file /mnt/cache/liyanjie/data/pointcloud/ShapeNet55-34/ShapeNet-55/train.txt
2021-12-27 11:32:16,101 - ShapeNet-55 - INFO - [DATASET] Open file /mnt/cache/liyanjie/data/pointcloud/ShapeNet55-34/ShapeNet-55/test.txt
2021-12-27 11:32:16,174 - ShapeNet-55 - INFO - [DATASET] 52470 instances were loaded
2021-12-27 11:32:16,191 - ModelNet - INFO - The size of test data is 2468
2021-12-27 11:32:16,192 - ModelNet - INFO - Load processed data from /mnt/cache/liyanjie/data/pointcloud/ModelNet/modelnet40_normal_resampled/modelnet40_test_8192pts_fps.dat...
2021-12-27 11:32:16,993 - ModelNet - INFO - The size of train data is 9843
2021-12-27 11:32:16,994 - ModelNet - INFO - Load processed data from /mnt/cache/liyanjie/data/pointcloud/ModelNet/modelnet40_normal_resampled/modelnet40_train_8192pts_fps.dat...
2021-12-27 11:32:19,511 - Point_BERT - INFO - [Point_BERT] build dVAE_BERT ...
2021-12-27 11:32:19,511 - Point_BERT - INFO - [Point_BERT] Point_BERT [NOT] calc the loss for all token ...
2021-12-27 11:32:19,511 - dVAE BERT - INFO - [Transformer args] {'mask_ratio': [0.25, 0.45], 'trans_dim': 384, 'depth': 12, 'drop_path_rate': 0.1, 'cls_dim': 512, 'replace_pob': 0.0, 'num_heads': 6, 'moco_loss': False, 'dvae_loss': True, 'cutmix_loss': True, 'return_all_tokens': False}
2021-12-27 11:32:22,489 - dVAE BERT - INFO - [Encoder] Successful Loading the ckpt for encoder from pretrain/dVAE.pth
2021-12-27 11:32:22,521 - dVAE BERT - INFO - [Transformer args] {'mask_ratio': [0.25, 0.45], 'trans_dim': 384, 'depth': 12, 'drop_path_rate': 0.1, 'cls_dim': 512, 'replace_pob': 0.0, 'num_heads': 6, 'moco_loss': False, 'dvae_loss': True, 'cutmix_loss': True, 'return_all_tokens': False}
2021-12-27 11:32:23,612 - Point_BERT - INFO - [dVAE] Successful Loading the ckpt for dvae from pretrain/dVAE.pth
2021-12-27 11:32:23,638 - Point_BERT - INFO - [Point_BERT Group] cutmix_BERT divide point cloud into G64 x S32 points ...
2021-12-27 11:32:34,314 - Point-BERT - INFO - [RESUME INFO] Loading model weights from ./experiments/Point-BERT/Mixup_models/pointBERT_pretrain/ckpt-last.pth...
2021-12-27 11:32:38,565 - Point-BERT - INFO - [RESUME INFO] resume ckpts @ 9 epoch( best_metrics = {'acc': 0.0})
2021-12-27 11:32:38,566 - Point-BERT - INFO - Using Data parallel ...
2021-12-27 11:32:38,591 - Point-BERT - INFO - [RESUME INFO] Loading optimizer from ./experiments/Point-BERT/Mixup_models/pointBERT_pretrain/ckpt-last.pth...
2021-12-27 11:32:39,640 - Point-BERT - INFO - [VALIDATION] Start validating epoch 10
error: Segmentation fault
Any suggestions would be deeply appreciated~