If a user runs 2.5 quantized to int4, it just barely fits into 8 GB of VRAM (an extremely common VRAM size) without touching shared memory. Switching from sampling to beam search with the default settings pushes GPU memory use past 8 GB, rolling over into the shared pool and slowing generation dramatically.
If the user then switches back to sampling to regain the lost speed, the VRAM still holds the leftover allocations from beam search, so generation stays slow.
This PR purges the CUDA cache after each response, so switching between settings no longer leaves stale allocations in VRAM that keep you stuck at reduced speed.
EDIT: Updated to only run the command if device == "cuda"
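The change amounts to something like the sketch below (the helper name and where it gets hooked in are illustrative, not the exact diff): after a response finishes, release PyTorch's cached allocator blocks, but only when actually running on CUDA.

```python
import torch

def purge_cuda_cache(device: str) -> None:
    """Release cached allocator blocks after a response.

    Hypothetical helper mirroring this PR's behavior: without it,
    buffers left over from beam search keep VRAM pinned above the
    8 GB budget even after switching back to sampling.
    """
    # Guard added in the EDIT: only meaningful (and only safe) on CUDA.
    if device == "cuda":
        torch.cuda.empty_cache()
```

Note that `empty_cache()` does not free live tensors; it only returns memory the caching allocator is holding in reserve, which is exactly the "garbage" left behind after a beam-search run.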