Open ExtReMLapin opened 7 months ago
I'm trying to debug it; I added a debug print inside paste_mask from the stack:
```python
print('x0_int, y0_int, x1_int, y1_int:', x0_int, y0_int, x1_int, y1_int)
```

And it prints:

```
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2000) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       3000)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)
x0_int, y0_int, x1_int, y1_int: 0 0 Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       4288) Tensor(shape=[], dtype=int32, place=Place(gpu:0), stop_gradient=True,
       2848)
```
Before crashing, shouldn't the resolution be capped down because of the EvalReader Resize?
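For reference, the eval pipeline in these configs is supposed to cap the long side. Here is a minimal sketch of what the relevant section of a PaddleDetection reader config (e.g. cascade_mask_fpn_reader.yml) typically looks like; the exact transform values below are assumptions, check your own file:

```yaml
EvalReader:
  sample_transforms:
  - Decode: {}
  # keep_ratio: True caps the short side at 800 and the long side at 1333
  - Resize: {interp: 2, target_size: [800, 1333], keep_ratio: True}
  - NormalizeImage: {is_scale: true, mean: [0.485, 0.456, 0.406], std: [0.229, 0.224, 0.225]}
  - Permute: {}
  batch_size: 1
```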
Alright, after debugging a little more, it looks like the image is correctly capped when sent to the GPU but later scaled back up, which is odd and fills GPU memory.
I can now confirm it's caused by the raw image resolutions; it's most likely trying to scale the predicted mask up to the original image resolution.
After resizing all my dataset images down to 1333 px, the issue is gone. Still a bug, though.
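For anyone else hitting this, a minimal sketch of that workaround, assuming a flat folder of JPEGs (the paths and the MAX_SIDE constant are mine, and if you use COCO-style annotations they must be rescaled by the same factor):

```python
# Workaround sketch: cap the long side of every dataset image at 1333 px.
# Paths are hypothetical; rescale your annotations by the same factor.
from pathlib import Path
from PIL import Image

MAX_SIDE = 1333

for path in Path("dataset/images").glob("*.jpg"):
    img = Image.open(path)
    scale = MAX_SIDE / max(img.size)
    if scale < 1.0:  # only downscale, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img.resize(new_size, Image.BILINEAR).save(path)
```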
You can try adding paddle.device.cuda.empty_cache() after each training epoch.
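That is, releasing Paddle's cached GPU blocks between epochs (paddle.device.cuda.empty_cache() is a real Paddle 2.x API; note it only frees cached, unreferenced blocks):

```python
import paddle

# After each training epoch: hand cached, unreferenced GPU blocks back
# to the allocator. This does not free tensors that are still alive.
paddle.device.cuda.empty_cache()
```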
You misunderstand the issue; it's the validation process ITSELF that causes the OOM error.
With a batch size of 1 it should NOT use more than a gig of VRAM, yet here it OOMs because of the original resolution of the validation images (even though they are scaled down by the augmentations). Padding won't help, as it adds pixels.
After detecting at the scaled-down resolution, for some reason the mask is scaled UP to the original image resolution, and that is where CUDA OOMs.
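A quick back-of-the-envelope check makes that plausible, using the 4288x2848 resolution from the debug prints; the mask count below is my assumption (a common post-NMS detection cap), not something from the logs:

```python
# Memory needed to paste predicted masks at the ORIGINAL image resolution.
# h, w come from the debug prints above; num_masks is an assumed cap.
num_masks = 100
h, w = 2848, 4288
gib = num_masks * h * w * 4 / 1024**3  # float32 = 4 bytes per element
print(f"~{gib:.1f} GiB for a single image's pasted masks")  # ~4.5 GiB
```

A single allocation like that, on top of the model and its intermediate buffers, is consistent with the jump from ~7 GB to 20-38 GB described below.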
Search before asking
Bug Component
Validation
Describe the Bug
Training on detection works fine with my own dataset; however, when switching from cascade_rcnn_r50_vd_fpn_ssld_2x to cascade_mask_rcnn_r50_vd_fpn_ssld_2x, it hits a CUDA OOM on the first eval. All the batch sizes are set as low as possible to reduce VRAM usage; it uses only 7/40 GB of VRAM during training.
When the first epoch finishes, VRAM usage goes from ~7 GB to 20-38 GB and then CUDA OOMs.
cascade_mask_fpn_reader.yml
cascade_mask_rcnn_r50_vd_fpn_ssld_2x_fp_zones.yml
Environment
paddlepaddle-gpu==2.5.2
Bug description confirmation
Are you willing to submit a PR?