OpenDriveLab / UniAD

[CVPR 2023 Best Paper Award] Planning-oriented Autonomous Driving
Apache License 2.0

GPU memory is not released after training process is stopped #180

Closed. thunguyenth closed this issue 4 months ago.

thunguyenth commented 5 months ago

Hi, thank you for sharing your work ^^ I have the following problem:

The GPU memory is not released when I forcibly stop the training process (by pressing Ctrl-C in the terminal).

Config:

Actions:

Step 1. Train stage 2 on the nuScenes dataset, _v1.0-mini_ version:

  ./tools/uniad_dist_train.sh ./projects/configs/stage2_e2e/base_e2e.py 1

=> The training process runs normally.

Step 2. Stop the training after a few iterations of the 1st epoch by pressing Ctrl-C in the terminal.

Step 3. Re-run the training command from Step 1 => out-of-memory error!

I checked the GPU state with nvidia-smi, as shown in the screenshot below (taken a few minutes after the training had been stopped): the GPU memory used by the training process in Step 1 was not released (17323MiB / 24259MiB).
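
As a side note, a quick way to check which processes are still holding the memory (assuming a standard Linux/CUDA setup; sudo may be needed for fuser):

  # List the compute processes still holding GPU memory, with their PIDs
  nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

  # Alternatively, show every process that has an open handle on the NVIDIA devices
  sudo fuser -v /dev/nvidia*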

This issue can easily be worked around by releasing the GPU memory manually, but I wonder whether it happens to everyone or only in my setup (I couldn't find a similar issue reported in this repo), and why the GPU memory is not released even though the training has stopped. I would appreciate it if you could help me clarify this.

Best regards.

[Screenshot: nvidia-smi output showing 17323MiB / 24259MiB still allocated after the training was stopped]

YTEP-ZHI commented 5 months ago

Hi @thunguyenth. To terminate the process in Linux, you can use the command pkill -9 python.
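
A note in case it helps others: pkill -9 python kills every Python process owned by the current user, not only the stuck training workers. A sketch of a more targeted cleanup (the match pattern is an assumption; run ps first and match on something unique to this run, such as the config name):

  # Inspect the surviving training workers first
  ps -ef | grep python

  # Kill only the processes whose command line mentions this run
  # (the pattern is an assumption -- adjust it to what ps actually shows)
  pkill -9 -f base_e2e

  # Or kill the exact PIDs reported by nvidia-smi / ps
  kill -9 <PID>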

thunguyenth commented 4 months ago

Thank you, @YTEP-ZHI, for your reply ^^

I would appreciate it if someone could explain why the GPU memory is not released after the training is forcibly stopped.
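
A likely explanation, as far as I understand it (this is an assumption about the launcher, not confirmed here): uniad_dist_train.sh starts the training through a distributed launcher that spawns one Python worker per GPU. Ctrl-C delivers SIGINT to the foreground launcher, but the spawned workers can survive it, for example while they are blocked inside a collective communication call. As long as those workers are alive, their CUDA contexts and thus the allocated GPU memory stay resident; the memory is only reclaimed once the worker processes actually exit, which is why killing them (as suggested above) frees it. Surviving workers are easy to spot because they get re-parented after the launcher dies:

  # Orphaned training workers typically show PPID 1 (or a sub-reaper) once the launcher is gone
  ps -eo pid,ppid,cmd | grep -i python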