OpenBMB / MiniCPM-V

MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone

Clear the torch cuda cache after response #301

Open RandomGitUser321 opened 1 week ago

RandomGitUser321 commented 1 week ago

If a user is running MiniCPM-Llama3-V 2.5 with int4, it just barely fits into 8 GB of VRAM (an extremely common VRAM size) without touching shared memory. If you change the settings and switch from sampling to beam search with its defaults, the GPU will need more than 8 GB of VRAM and roll over into the shared pool, which drastically slows things down.

If the user then switches back to sampling mode to regain the lost speed, VRAM usage still includes the leftovers from beam search, because PyTorch's caching allocator keeps the freed blocks reserved instead of returning them to the driver, so generation stays slow.
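
For reference, the leftover reserved memory can be observed with PyTorch's built-in memory stats (an illustrative snippet, not part of this PR):

```python
import torch

# After a beam-search response finishes, "allocated" drops back down,
# but "reserved" stays high because the caching allocator holds on to
# the freed blocks instead of returning them to the driver.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```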

This PR simply purges the CUDA cache after each response, so if you switch between decoding settings, you don't get stuck with leftover allocations in VRAM that keep you at reduced speed.

EDIT: Updated to only run the cache clear if device == "cuda".
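
A minimal sketch of the idea (the helper name and the surrounding generation call are illustrative, not the exact diff):

```python
import torch

def clear_cuda_cache(device: str) -> None:
    # Release the cached blocks held by PyTorch's caching allocator so the
    # driver reclaims the VRAM freed after the previous response. Skipped on
    # non-CUDA devices (e.g. "cpu" or "mps"), where there is no CUDA cache.
    if device == "cuda":
        torch.cuda.empty_cache()

# Usage, right after generating a reply:
# answer = model.chat(...)  # existing generation call in the demo
# clear_cuda_cache(device)
```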