RangiLyu / nanodet

NanoDet-Plus⚡Super fast and lightweight anchor-free object detection model. 🔥Only 980 KB (int8) / 1.8 MB (fp16) and runs at 97 FPS on a cellphone🔥
Apache License 2.0

Memory error when I start training #527

Closed nijatmursali closed 9 months ago

nijatmursali commented 10 months ago

Hello,

I have the COCO and VOC datasets on my local machine, I installed this repository, and I was able to run the demo locally.

My COCO dataset is laid out like this:

coco
├── annotations
└── images
    ├── train2017
    └── val2017

When I run the training script:

python tools/train.py config/nanodet-plus-m_320.yml

it gives:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 6.00 GiB total capacity; 6.48 GiB already allocated
; 0 bytes free; 6.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Also, where does the checkpoint file go (which folder) once I train the model?
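On the fragmentation hint in the error message: `max_split_size_mb` can be passed to the CUDA caching allocator via the `PYTORCH_CUDA_ALLOC_CONF` environment variable, which must be set before PyTorch makes its first CUDA allocation. A minimal sketch (the 128 MiB value is an arbitrary example, not a recommendation):

```python
import os

# Must be set before the first CUDA allocation; setting it after
# torch has already touched the GPU has no effect.
# "max_split_size_mb:128" is an example value only.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# ...then import torch and launch training as usual, e.g.:
#   python tools/train.py config/nanodet-plus-m_320.yml
```

Alternatively, export the variable in the shell before invoking `tools/train.py`. Note this only mitigates fragmentation; if the model plus batch genuinely exceed 6 GiB, the batch size still has to come down.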

mirgunde commented 10 months ago

Check the batch size in the config.

Sanath1998 commented 10 months ago

Hi @nijatmursali ,

Just try decreasing the batch size in the .yml file:

    device:
      gpu_ids: [0]
      workers_per_gpu: 10
      batchsize_per_gpu: 4
      precision: 32 # set to 16 to use AMP training

You need to adjust batchsize_per_gpu to whatever fits your GPU.
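Since activation memory grows roughly linearly with batch size, a back-of-the-envelope scaling rule can suggest a starting point. The sketch below is illustrative only (the function name and the linear-scaling assumption are mine, not part of NanoDet; fixed overheads like weights and the CUDA context make this an approximation):

```python
def fit_batch_size(known_batch, known_mem_gib, budget_gib):
    """Scale a batch size that used known_mem_gib down to a budget_gib budget.

    Assumes memory grows linearly with batch size -- only a rough
    heuristic, since model weights and CUDA context are fixed overhead.
    """
    return max(1, int(known_batch * budget_gib / known_mem_gib))
```

For example, if a batch of 96 needed roughly 12 GiB on a larger card, a 6 GiB GPU suggests trying around `fit_batch_size(96, 12.0, 6.0)`, then adjusting empirically.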

nijatmursali commented 10 months ago

Thank you both for the suggestions. I played with the parameters, and a batch size of 32 seems to work on my system, but each epoch takes quite a long time to train.

My system is an RTX 3060 with 6 GB VRAM and 40 GB RAM. I had to set workers to 4 (my CPU has 12 threads, but with more workers I get a memory error).

Is there any trained model (weights or checkpoint) I can download that was trained for 300 epochs? I'm working on my thesis project and can't train the model locally.

mirgunde commented 10 months ago

@nijatmursali you can utilise workers_per_gpu=12. Just do a trial training with batch_size=1 and epochs=50.

If that works, then you can train for all 300 epochs.
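A quick way to run such a trial is to copy the config file and override just these fields (key names follow the layout of nanodet-plus-m_320.yml; the values are the trial settings suggested above):

```yaml
device:
  batchsize_per_gpu: 1
schedule:
  total_epochs: 50
```

Then point `tools/train.py` at the copied config.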

nijatmursali commented 10 months ago

Is there any trained model I can download (Checkpoint) with 300 epochs?

mirgunde commented 10 months ago

You can check the docs of NanoDet.

Sanath1998 commented 10 months ago

Is there any trained model I can download (Checkpoint) with 300 epochs?

Do you know how to save the checkpoints for every epoch, and which epoch's weights model_best keeps? Where can I check those?
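On the earlier question of where checkpoints land: they are written under the `save_dir` path set in the config (a `workspace/...` folder), and the trainer keeps a `model_best` copy of the best-scoring epoch. NanoDet's Lightning-based trainer handles this internally; as a plain-Python illustration of the "save each epoch, refresh best on metric improvement" pattern (all names and the file layout here are illustrative, not NanoDet's actual code):

```python
import os
import shutil

def save_checkpoints(save_dir, epoch, metric, best=None):
    """Write a per-epoch checkpoint; copy it to model_best/ when metric improves.

    Returns the best metric seen so far. open(...).close() stands in
    for torch.save(state_dict, path) to keep the sketch dependency-free.
    """
    os.makedirs(os.path.join(save_dir, "model_best"), exist_ok=True)
    ckpt = os.path.join(save_dir, f"epoch_{epoch}.ckpt")
    open(ckpt, "w").close()  # stand-in for torch.save(...)
    if best is None or metric > best:
        best = metric
        shutil.copy(ckpt, os.path.join(save_dir, "model_best", "model_best.ckpt"))
    return best
```

With a pattern like this, every epoch leaves an `epoch_N.ckpt` behind, while `model_best/` only tracks the epoch with the highest validation metric.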