TrelisResearch / one-click-llms

One-click templates for running inference on language models

Mixtral Instruct AWQ vLLM API #2

Closed. csolheim closed this issue 3 months ago.

csolheim commented 5 months ago

Template: Mixtral Instruct AWQ vLLM API by Trelis, using the vllm/vllm-openai:latest image.

Runpod: 1 x A100 80GB, 16 vCPU, 125 GB RAM, 50 GB disk, 150 GB pod volume.

The container log fills with these errors:

2024-01-23T03:26:45.560602035-05:00 /usr/bin/python3: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm')
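For reference, the failing import can be checked directly inside the container with a small sketch (assuming the default python3 of the vllm/vllm-openai:latest image; it mirrors the module-spec lookup shown in the log above):

```python
# Hypothetical diagnostic sketch: verify whether the vllm package is visible to
# the interpreter the container entrypoint uses. If the spec is None,
# `python3 -m vllm.entrypoints.openai.api_server` fails exactly as in the log.
import importlib.util

spec = importlib.util.find_spec("vllm")
print("vllm importable:", spec is not None)

if spec is not None:
    import vllm
    print("vllm version:", vllm.__version__)
```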

RonanKMcGovern commented 5 months ago

Thanks for the issue. I assume you are referring to the Runpod Mixtral vLLM Template in the one-click-llms repo.

The issue is that Runpod does not have the same CUDA drivers on all GPUs, and vLLM is currently not able to handle that dynamically.

So, sometimes the pod will work, sometimes not. I get the sense that I have a higher success rate with an A6000.
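If you want to check what a given pod actually exposes before launching vLLM, something like the rough sketch below can help (not part of the template; it just compares the CUDA runtime the image's torch build expects against the host driver version):

```python
# Rough diagnostic sketch: compare the CUDA runtime the image's torch build
# was compiled against with the driver version on the host Runpod assigned.
import subprocess

import torch

print("torch version:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Host driver version; vLLM tends to fail when this driver is older than the
# CUDA runtime baked into the image.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("host driver version:", driver)
```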

I've now updated the one-click-llms repo to explain this further. I'm pasting the text from that here:

Note: vLLM sometimes runs into issues if the pod does not have the correct CUDA drivers. Unfortunately there is no way to know in advance when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query it in OpenAI style, guide here). TGI is faster than vLLM and is recommended in general.
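For illustration, querying a TGI pod in OpenAI style looks roughly like this (a sketch only; the pod URL is a placeholder, and it assumes a TGI version that exposes the OpenAI-compatible /v1 chat completions endpoint):

```python
# Hedged sketch of an OpenAI-style request against a TGI endpoint.
# YOUR_POD_ID and the port are placeholders for the Runpod proxy URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_POD_ID-8080.proxy.runpod.net/v1",  # placeholder
    api_key="unused",  # TGI does not validate the key by default
)

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; this name is not used for routing
    messages=[{"role": "user", "content": "Summarise what Mixtral Instruct is in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```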

In short, I recommend TGI.