Open Bhuvanesh09 opened 2 weeks ago
@Tracin Could you please take a look? Thanks.
BTW, @Bhuvanesh09 Could you please try the main branch? Or the 0.10.0 release branch?
@Bhuvanesh09 I think kv_cache_reuse is orthogonal to AWQ quantization. To narrow down the issue, could you try with a full-precision model?
System Info
Who can help?
@Tracin , @kaiyux , @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Quantized with the command:
Started the model with arguments:
How to get the error:
When tested with a semaphore of 10 (ensuring 10 requests are always pending at the server), we get the error after a few successful predictions:
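The load pattern described above can be sketched with a small asyncio client. This is a minimal, hypothetical sketch of the test harness, not the reporter's actual script: `send_request` is a stand-in that sleeps instead of calling the real Triton endpoint, so the example runs without a server.

```python
import asyncio
import random

# Keep at most 10 requests in flight at once, mirroring the
# "semaphore of 10" load test from the report. send_request() is a
# hypothetical stand-in for the real HTTP call to the Triton server.
MAX_IN_FLIGHT = 10

peak = 0       # highest number of concurrent in-flight "requests" observed
in_flight = 0

async def send_request(i: int, sem: asyncio.Semaphore) -> None:
    global peak, in_flight
    async with sem:                 # blocks while 10 requests are already pending
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(random.uniform(0.01, 0.03))  # stand-in for the inference call
        in_flight -= 1

async def main() -> None:
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    await asyncio.gather(*(send_request(i, sem) for i in range(100)))

asyncio.run(main())
print(f"peak concurrency: {peak}")  # never exceeds MAX_IN_FLIGHT
```

With a real endpoint, the body of `send_request` would issue the HTTP inference call; the semaphore guarantees the server always sees up to 10 pending requests.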
Expected behavior
The model should continue to serve requests without any issues.
Actual behavior
We get the following error in the Triton server:
Additional notes
Information which might help in debugging:
The requests get dropped and the server stops working only once the initial KV cache is entirely full. The server fails to evict the least-recently-used KV cache blocks in paged attention as it is supposed to. This is confirmed by the fact that the server runs without any issues when
enable_kv_cache_reuse
is disabled.
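The eviction behavior the report expects can be illustrated with a toy LRU block pool. This is purely an illustrative sketch of LRU eviction, not TensorRT-LLM's actual KV-cache block manager; all names here are hypothetical.

```python
from collections import OrderedDict

# Toy model of the expected behavior: a fixed-size pool of KV-cache
# blocks that evicts the least-recently-used block when full, rather
# than rejecting new requests. Not TensorRT-LLM's real implementation.
class LruBlockPool:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)         # mark block as most recently used
        return self.blocks[key]

    def put(self, key: str, block: bytes) -> None:
        if key in self.blocks:
            self.blocks.move_to_end(key)
        elif len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the LRU block instead of failing
        self.blocks[key] = block

pool = LruBlockPool(capacity=3)
for k in ("a", "b", "c"):
    pool.put(k, b"kv")
pool.get("a")               # "a" becomes most recently used
pool.put("d", b"kv")        # pool is full: "b" (the LRU block) is evicted
print(sorted(pool.blocks))  # -> ['a', 'c', 'd']
```

The bug reported here is that, with enable_kv_cache_reuse on, the server behaves as if the eviction step above never runs once the pool fills.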