IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Strange behavior of detrex with respect to GPU memory overflow. #347

Closed: ujjwalnur closed this issue 1 month ago

ujjwalnur commented 1 month ago

Hello,

I am training on the COCO dataset using a slightly modified version of DETR. To put my question simply:

Is it possible for GPU utilization to remain nominal, with everything running fine, and then for an OOM error to occur suddenly?

For additional detail: when I start training, I see good GPU core utilization and nominal GPU memory utilization. I am running the code on GPU IDs 0, 1 and 3 (so please ignore the output for GPU 2).

[Screenshot from 2024-05-31 09-07-06]

Then, all of a sudden, there is an OOM error:

[Screenshot from 2024-05-31 09-08-05]

I do not understand why an allocation of 264 MB suddenly becomes a problem. There is no indication of it in the nvtop graphs. The model ran fine for several iterations, with no sign that GPU memory was filling up.
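One way to check whether a short-lived spike is being missed by a sampling monitor such as nvtop is to log PyTorch's own allocator counters once per iteration. The following is a minimal sketch, not detrex code: the helper names `to_mib`, `memory_report` and `log_gpu_memory` are hypothetical, and it assumes the training loop can call a function at each step. It relies only on `torch.cuda.memory_allocated`, `torch.cuda.memory_reserved` and `torch.cuda.max_memory_allocated`.

```python
from typing import Optional


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return num_bytes / (1024 ** 2)


def memory_report(step: int, allocated: int, reserved: int, peak: int) -> str:
    """Format one log line from raw byte counts (hypothetical helper)."""
    return (
        f"step {step}: allocated={to_mib(allocated):.0f} MiB, "
        f"reserved={to_mib(reserved):.0f} MiB, "
        f"peak={to_mib(peak):.0f} MiB"
    )


def log_gpu_memory(step: int, device: int = 0) -> Optional[str]:
    """Print the allocator counters for one device; return the line or None.

    Returns None when PyTorch or CUDA is unavailable, so the call is safe
    to leave in place on a CPU-only machine.
    """
    try:
        import torch  # imported lazily so the helpers above work without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    line = memory_report(
        step,
        torch.cuda.memory_allocated(device),      # bytes held by live tensors
        torch.cuda.memory_reserved(device),       # bytes held by the caching allocator
        torch.cuda.max_memory_allocated(device),  # peak since start or last reset
    )
    print(line)
    return line
```

Calling `log_gpu_memory(step)` at the end of each iteration, optionally followed by `torch.cuda.reset_peak_memory_stats(device)` to get a per-iteration peak, would show whether the peak climbs on particular batches even while the sampled graph looks flat.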

Could you help with this?