Closed: yxchng closed this 2 months ago
It can be caused by an OOM (out-of-memory) problem. What kind of GPU do you use? Maybe you can try reducing the batch size in the config (such as here).
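For reference, a minimal sketch of what lowering the per-GPU batch size looks like in an mmengine-style config (the `8xb12` in the grounding config name suggests a per-GPU batch size of 12; `train_dataloader.batch_size` is the standard mmengine field, and the concrete values below are illustrative assumptions):

```python
# Sketch only: in the training config, halve the per-GPU batch size to test
# the OOM hypothesis. Only `batch_size` needs to change; keep the rest of the
# dataloader settings as they are in the original config.
train_dataloader = dict(
    batch_size=6,   # assumed baseline of 12 (from the `8xb12` config name)
    num_workers=4,  # illustrative; keep your original value
)
```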
@Tai-Wang H100 80GB. It only occurs sometimes. Does memory usage sometimes shoot up above 80GB?
OK, that is a little strange. We use A100 GPUs to train the model but typically observe ~30 GB of memory usage. You may still try reducing the batch size to check whether it is caused by OOM first.
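If you want to confirm or rule out OOM, PyTorch's built-in peak-memory counters are a cheap check; a minimal sketch (the helper below is hypothetical, not part of the repo, and where you call it, e.g. once per epoch, is up to you):

```python
import torch

def log_peak_gpu_memory(tag: str = '') -> None:
    """Print peak allocated/reserved GPU memory since the last counter reset."""
    alloc = torch.cuda.max_memory_allocated() / 1024 ** 3
    reserved = torch.cuda.max_memory_reserved() / 1024 ** 3
    print(f'[{tag}] peak allocated: {alloc:.1f} GiB, peak reserved: {reserved:.1f} GiB')

# Example: call at the end of each epoch, then reset for the next one.
# log_peak_gpu_memory('epoch 6')
# torch.cuda.reset_peak_memory_stats()
```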
When I ran it, I encountered the same problem, but at epoch 6.
04/02 05:42:05 - mmengine - INFO - Epoch(train) [6][150/501] base_lr: 5.0000e-04 lr: 5.0000e-04 eta: -1 day, 21:18:02 time: 3.6373 data_time: 0.4038 memory: 29048 grad_norm: 19.5125 loss: 8.7404 loss_cls: 0.9937 loss_bbox: 0.4470 d0.loss_cls: 1.0453 d0.loss_bbox: 0.4450 d1.loss_cls: 1.0229 d1.loss_bbox: 0.4460 d2.loss_cls: 1.0076 d2.loss_bbox: 0.4439 d3.loss_cls: 1.0010 d3.loss_bbox: 0.4465 d4.loss_cls: 0.9956 d4.loss_bbox: 0.4460
Traceback (most recent call last):
File "tools/train.py", line 133, in <module>
main()
File "tools/train.py", line 129, in main
runner.train()
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1777, in train
model = self.train_loop.run() # type: ignore
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 96, in run
self.run_epoch()
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 112, in run_epoch
self.run_iter(idx, data_batch)
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/runner/loops.py", line 128, in run_iter
outputs = self.runner.model.train_step(
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
losses = self._run_forward(data, mode='loss')
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
results = self(**data, mode=mode)
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 666, in forward
return self.loss(inputs, data_samples, **kwargs)
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/detectors/sparse_featfusion_grounder.py", line 507, in loss
losses = self.bbox_head.loss(**head_inputs_dict,
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 637, in loss
losses = self.loss_by_feat(*loss_inputs)
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 668, in loss_by_feat
losses_cls, losses_bbox = multi_apply(
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 711, in loss_by_feat_single
cls_reg_targets = self.get_targets(cls_scores_list,
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 258, in get_targets
pos_inds_list, neg_inds_list) = multi_apply(self._get_targets_single,
File "/mnt/lustre/huangchenxi/anaconda3/envs/visual/lib/python3.8/site-packages/mmdet/models/utils/misc.py", line 219, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/dense_heads/grounding_head.py", line 398, in _get_targets_single
assign_result = self.assigner.assign(
File "/mnt/petrelfs/huangchenxi/EmbodiedScan/embodiedscan/models/task_modules/assigners/hungarian_assigner.py", line 119, in assign
cost = cost.detach().cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
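As the last line of the traceback suggests, forcing synchronous kernel launches makes the reported stack trace point at the op that actually faulted; a minimal sketch (the variable must be set before CUDA is initialized, so export it in the shell or set it at the very top of tools/train.py):

```python
# Sketch only: must run before torch initializes CUDA, e.g. at the very top
# of tools/train.py. Shell equivalent:
#   CUDA_LAUNCH_BLOCKING=1 python tools/train.py <your_config>
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
```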
@mrsempress I also only encounter this in the middle of training, randomly. Sometimes it runs to completion without error, but the random crashes are very annoying.
Thanks for your feedback. It might be related to #29 as well. We welcome more feedback about such problems from the community and may collect more cases to analyze the possible causes.
I also encounter this problem more frequently when using data infos with more complex prompts. There are several workarounds that may alleviate it. They may not completely eliminate such crashes, but they should reduce how often you encounter them.
Another trick that I have tried is to reduce num_queries, for example, to 100. It can also significantly reduce the burden of matching and computing the costs.
Will it cause any performance drops?
It has limited influence on performance. My AP@0.25 increases, and AP@0.5 decreases slightly with num_queries=100 and max_text_length=512 (vs. our provided baseline num_queries=256 and max_text_length=256).
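For anyone trying this, a hedged sketch of the two overrides discussed above (where num_queries and max_text_length actually live in the EmbodiedScan configs is not shown in this thread, so the field paths below are assumptions, not the repo's real layout):

```python
# Sketch only: reduce the number of object queries and raise the text length,
# matching the num_queries=100 / max_text_length=512 setting reported above.
# The exact config keys are assumptions; check the grounding config for the
# real field names before applying.
model = dict(
    num_queries=100,  # down from the provided baseline of 256
)
max_text_length = 512  # the provided baseline uses 256
```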
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
Reproduces the problem - code sample
-
Reproduces the problem - command or script
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py configs/grounding/mv-grounding_8xb12_embodiedscan-vg-9dof.py
Reproduces the problem - error message
Additional information
I sometimes run into `CUDA error: an illegal memory access was encountered`. Do you happen to know what might be the cause?