Custom dataset evaluate OOM

GondorFu commented 1 year ago

When I use this model on a custom dataset, training runs normally, but as soon as it enters the evaluation phase it always runs out of GPU memory.

What's the possible reason for this?

Gofinge commented 1 year ago

Hi, could you show me your custom dataset config (the whole exported config in the exp folder is better) and the OOM output?

GondorFu commented 1 year ago

Here are the config and the error log.

Config:

weight = None
resume = False
evaluate = True
test_only = False
seed = 318105
save_path =
num_worker = 16
batch_size = 8
batch_size_val = None
batch_size_test = 1
epoch = 100
eval_epoch = 100
save_freq = None
eval_metric = 'mIoU'
sync_bn = False
enable_amp = True
empty_cache = False
find_unused_parameters = False
max_batch_points = 100000000.0
mix_prob = 0.8
param_dicts = None
test = dict(type='SegmentationTest')
model = dict(
    type='ptv2m2',
    in_channels=6,
    num_classes=
    patch_embed_depth=2,
    patch_embed_channels=48,
    patch_embed_groups=6,
    patch_embed_neighbours=16,
    enc_depths=(2, 6, 2),
    enc_channels=(96, 192, 384),
    enc_groups=(12, 24, 48),
    enc_neighbours=(16, 16, 16),
    dec_depths=(1, 1, 1),
    dec_channels=(48, 96, 192),
    dec_groups=(6, 12, 24),
    dec_neighbours=(16, 16, 16),
    grid_sizes=(0.1, 0.2, 0.4),
    attn_qkv_bias=True,
    pe_multiplier=False,
    pe_bias=True,
    attn_drop_rate=0.0,
    drop_path_rate=0.3,
    enable_checkpoint=False,
    unpool_backend='interp')
optimizer = dict(type='AdamW', lr=0.006, weight_decay=0.05)
scheduler = dict(type='MultiStepLR', milestones=[0.6, 0.8], gamma=0.1)
dataset_type = 'AutoScenesDataset'
data_root = '
data = dict(
    num_classes=
    ignore_label=255,
    names=['
    train=dict(
        type=
        split='train',
        data_root='
        transform=[
            dict(type='CenterShift', apply_z=True),
            dict(type='RandomScale', scale=[0.9, 1.1]),
            dict(type='RandomFlip', p=0.5),
            dict(type='RandomJitter', sigma=0.005, clip=0.02),
            dict(type='ChromaticAutoContrast', p=0.2, blend_factor=None),
            dict(type='ChromaticTranslation', p=0.95, ratio=0.05),
            dict(type='ChromaticJitter', p=0.95, std=0.05),
            dict(
                type='Voxelize',
                voxel_size=0.04,
                hash_type='fnv',
                mode='train',
                keys=('coord', 'color', 'label'),
                return_discrete_coord=True),
            dict(type='SphereCrop', point_max=100000, mode='random'),
            dict(type='CenterShift', apply_z=False),
            dict(type='NormalizeColor'),
            dict(type='ToTensor'),
            dict(
                type='Collect',
                keys=('coord', 'discrete_coord', 'label'),
                feat_keys=['coord', 'color'])
        ],
        test_mode=False,
        loop=1),
    val=dict(
        type='
        split='val',
        data_root='
        transform=[
            dict(type='CenterShift', apply_z=True),
            dict(
                type='Copy',
                keys_dict=dict(coord='origin_coord', label='origin_label')),
            dict(
                type='Voxelize',
                voxel_size=0.04,
                hash_type='fnv',
                mode='train',
                keys=('coord', 'color', 'label'),
                return_discrete_coord=True),
            dict(type='CenterShift', apply_z=False),
            dict(type='NormalizeColor'),
            dict(type='ToTensor'),
            dict(
                type='Collect',
                keys=('coord', 'discrete_coord', 'label'),
                offset_keys_dict=dict(offset='coord'),
                feat_keys=['coord', 'color'])
        ],
        test_mode=False),
    test=dict(
        type='
        split='test',
        data_root='
        transform=[
            dict(type='CenterShift', apply_z=True),
            dict(type='NormalizeColor')
        ],
        test_mode=True,
        test_cfg=dict(
            voxelize=dict(
                type='Voxelize',
                voxel_size=0.04,
                hash_type='fnv',
                mode='test',
                keys=('coord', 'color'),
                return_discrete_coord=True),
            crop=None,
            post_transform=[
                dict(type='CenterShift', apply_z=False),
                dict(type='ToTensor'),
                dict(
                    type='Collect',
                    keys=('coord', 'discrete_coord', 'index'),
                    feat_keys=('coord', 'color'))
            ],
            aug_transform=[
                [{'type': 'RandomScale', 'scale': [0.9, 0.9]}],
                [{'type': 'RandomScale', 'scale': [0.95, 0.95]}],
                [{'type': 'RandomScale', 'scale': [1, 1]}],
                [{'type': 'RandomScale', 'scale': [1.05, 1.05]}],
                [{'type': 'RandomScale', 'scale': [1.1, 1.1]}],
                [{'type': 'RandomScale', 'scale': [0.9, 0.9]}, {'type': 'RandomFlip', 'p': 1}],
                [{'type': 'RandomScale', 'scale': [0.95, 0.95]}, {'type': 'RandomFlip', 'p': 1}],
                [{'type': 'RandomScale', 'scale': [1, 1]}, {'type': 'RandomFlip', 'p': 1}],
                [{'type': 'RandomScale', 'scale': [1.05, 1.05]}, {'type': 'RandomFlip', 'p': 1}],
                [{'type': 'RandomScale', 'scale': [1.1, 1.1]}, {'type': 'RandomFlip', 'p': 1}]
            ])))
criteria = [dict(type='CrossEntropyLoss', loss_weight=1.0, ignore_index=255)]
num_worker_per_gpu = 2
batch_size_per_gpu = 1
batch_size_val_per_gpu = 1

Start Evaluation >>>>>>>>>>>>>>>>
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
  len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
  len(cache))
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 28 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "tools/train.py", line 34, in <module>
    main()
  File "tools/train.py", line 29, in main
    cfg=(cfg,),
  File "PointTransformerV2/pcr/engines/launch.py", line 84, in launch
    daemon=False,
  File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "PointTransformerV2/pcr/engines/launch.py", line 183, in _distributed_worker
    main_func(*cfg)
  File "PointTransformerV2/tools/train.py", line 16, in main_worker
    trainer.train()
  File "PointTransformerV2/pcr/engines/defaults.py", line 216, in train
    self.after_epoch()
  File "PointTransformerV2/pcr/engines/defaults.py", line 321, in after_epoch
    self.eval()
  File "PointTransformerV2/pcr/engines/defaults.py", line 334, in eval
    output = self.model(input_dict)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "python3.7/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 517, in forward
    points = self.patch_embed(points)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 411, in forward
    return self.blocks([coord, feat, offset])
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 209, in forward
    points = block(points, reference_index)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 157, in forward
    if not self.enable_checkpoint else checkpoint(self.attn, feat, coord, reference_index)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 108, in forward
    peb = self.linear_p_bias(pos)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "PointTransformerV2/pcr/models/point_transformer2/point_transformer_v2m2_base.py", line 37, in forward
    return self.norm(input.transpose(1, 2).contiguous()).transpose(1, 2).contiguous()
  File "python3.7/site-packages/sampler/utils/wrapper.py", line 21, in run
    res = func(*args, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "python3.7/site-packages/torch/nn/modules/batchnorm.py", line 179, in forward
    self.eps,
  File "python3.7/site-packages/torch/nn/functional.py", line 2283, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 5.38 GiB (GPU 3; 31.75 GiB total capacity; 28.67 GiB already allocated; 1.47 GiB free; 28.80 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Gofinge commented 1 year ago

The config looks good. Is it possible that the validation point clouds are significantly larger than the training point clouds? What is data_dict["coord"].shape? (It would be helpful if you could log it right before the OOM.)

GondorFu commented 1 year ago

I split train and val randomly, so there shouldn't be much difference between them. As you can see, right before the OOM the code tries to allocate 5.38 GiB. In my custom dataset each point cloud is only about 100M, and there is just one per GPU. What do you think is the reason it needs to allocate such a large amount of memory?

Gofinge commented 1 year ago

I am sorry about this issue; I have never encountered a similar problem. I notice that the validation batch size per GPU is identical to the training batch size per GPU (both 1), so the memory consumption of the evaluation process should be much lower than that of the training process.

To debug this, I suggest printing the input shape right before feeding it into the model.
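A minimal sketch of such a shape log is shown here. The helper name `describe_batch` is hypothetical and not part of the repository; it only assumes that the batch is a dict whose tensor values expose a `.shape` attribute. Judging from the traceback, one place to call it would be just before `output = self.model(input_dict)` in the `eval` method of `pcr/engines/defaults.py`.

```python
from types import SimpleNamespace


def describe_batch(input_dict):
    """Summarise the shape of every array-like value in a batch dict."""
    parts = []
    for key, value in input_dict.items():
        shape = getattr(value, "shape", None)  # tensors and arrays expose .shape
        if shape is not None:
            parts.append(f"{key}={tuple(shape)}")
    return ", ".join(parts)


# Stand-in batch for illustration only; a real batch holds torch tensors.
fake_batch = {"coord": SimpleNamespace(shape=(100000, 3)), "name": "scene_0"}
print(describe_batch(fake_batch))
```

In the evaluation loop the call would be `print(describe_batch(input_dict))`; entries without a shape, such as strings, are skipped.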

GondorFu commented 1 year ago

These are the eval sizes:

Start Evaluation >>>>>>>>>>>>>>>>
val size >>>>>>>>>>>>>>>: 655712
val size >>>>>>>>>>>>>>>: 872887
val size >>>>>>>>>>>>>>>: 871667
val size >>>>>>>>>>>>>>>: 1273970
val size >>>>>>>>>>>>>>>: 1541887
val size >>>>>>>>>>>>>>>: 1918415
val size >>>>>>>>>>>>>>>: 1695826
val size >>>>>>>>>>>>>>>: 2831842
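For a rough sense of scale, these sizes can be turned into a memory estimate. The sketch below rests on an assumption that is not confirmed anywhere in the thread: that the failing allocation is a single float32 tensor of shape (points, neighbours, channels), using `patch_embed_neighbours=16` and `patch_embed_channels=48` from the config. The function name `activation_gib` is hypothetical. For comparison, the training transform crops each sample to `point_max=100000` with `SphereCrop`, while the val transform in the config has no crop.

```python
GIB = 1024 ** 3


def activation_gib(num_points, neighbours=16, channels=48, bytes_per_value=4):
    """Size in GiB of one (num_points, neighbours, channels) tensor.

    Assumption: float32 values (4 bytes each) and the patch-embed settings
    from the posted config. This is an estimate, not a measurement.
    """
    return num_points * neighbours * channels * bytes_per_value / GIB


# 100000 is the training crop size; the others are logged val sizes.
for n in (100_000, 655_712, 1_918_415, 2_831_842):
    print(f"{n:>9} points -> {activation_gib(n):.2f} GiB")
```

Under this assumption a cropped training sample needs about 0.29 GiB for one such tensor, while the largest validation cloud needs about 8.10 GiB, and an allocation of 5.38 GiB would correspond to roughly 1.9 million points.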

Gofinge commented 1 year ago

Hi, that is a very large number of points for a point cloud after voxelization. Could you check whether the validation point clouds were voxelized successfully?
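One way to run this check outside the pipeline is to count the occupied grid cells of a raw validation cloud and compare the result with the logged val size. The sketch below is a simplified stand-in written in plain Python, not the repository's `Voxelize` transform (which uses an fnv hash); the function name `count_voxels` is hypothetical, and `voxel_size=0.04` is taken from the config.

```python
import math


def count_voxels(coords, voxel_size=0.04):
    """Count occupied cells of a regular grid with edge length voxel_size.

    coords is any iterable of (x, y, z) tuples. Points that fall into the
    same cell are counted once, which is what grid voxelization keeps.
    """
    cells = {
        tuple(math.floor(c / voxel_size) for c in point)
        for point in coords
    }
    return len(cells)


# Tiny illustration: the first two points share a cell, the third does not.
points = [(0.00, 0.00, 0.00), (0.01, 0.01, 0.01), (0.05, 0.00, 0.00)]
print(count_voxels(points))
```

If the count for a raw validation cloud is far below the logged val size, the voxelization step is probably not being applied to that split. If the two numbers are close, the clouds really do contain that many occupied cells at this grid size. For clouds with millions of points a vectorized version would be much faster than this pure-Python loop.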

Pointcept / PointTransformerV2

Custom dataset evaluate OOM #15