Problem description: In some cases, code completion lags and returns no completion content for a long time, sometimes more than 10 seconds.
Known information: The llama-server process's CPU usage can reach 100%, while average GPU utilization is around 20% and peaks at no more than 40%.
Machine configuration: Single GPU, NVIDIA V100*8, 64C/256G
Are there any deployment configuration changes that would improve performance?
Thanks!