TrelisResearch / one-click-llms

One-click templates for running inference on language models

Mixtral Instruct AWQ vLLM API #2

Closed. csolheim closed this issue 3 months ago.

csolheim commented 5 months ago

Template: Mixtral Instruct AWQ vLLM API by Trelis, using the vllm/vllm-openai:latest image.

Runpod: 1 x A100 80GB, 16 vCPU, 125 GB RAM, 50 GB disk, 150 GB pod volume.

The container log fills with these errors:

2024-01-23T03:26:45.560602035-05:00 /usr/bin/python3: Error while finding module specification for 'vllm.entrypoints.openai.api_server' (ModuleNotFoundError: No module named 'vllm')
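For reference, the failing import can be checked directly inside the container with a small sketch (assuming the default python3 of the vllm/vllm-openai:latest image; it mirrors the module-spec lookup shown in the log above):

```python
# Hypothetical diagnostic sketch: verify whether the vllm package is visible to
# the interpreter the container entrypoint uses. If the spec is None,
# `python3 -m vllm.entrypoints.openai.api_server` fails exactly as in the log.
import importlib.util

spec = importlib.util.find_spec("vllm")
print("vllm importable:", spec is not None)

if spec is not None:
    import vllm
    print("vllm version:", vllm.__version__)
```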

RonanKMcGovern commented 5 months ago

Thanks for the issue. I assume you are referring to the Runpod Mixtral vLLM Template in the one-click-llms repo.

The issue is that Runpod does not have the same CUDA drivers on all GPUs, and vLLM is currently not able to handle that dynamically.

So, sometimes the pod will work, sometimes not. I get the sense that I have a higher success rate with an A6000.
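If you want to check what a given pod actually exposes before launching vLLM, something like the rough sketch below can help (not part of the template; it just compares the CUDA runtime the image's torch build expects against the host driver version):

```python
# Rough diagnostic sketch: compare the CUDA runtime the image's torch build
# was compiled against with the driver version on the host Runpod assigned.
import subprocess

import torch

print("torch version:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Host driver version; vLLM tends to fail when this driver is older than the
# CUDA runtime baked into the image.
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print("host driver version:", driver)
```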

I've now updated the one-click-llms repo to explain this further. I'm pasting the text from that here:

Note: vLLM sometimes runs into issues if the pod does not have the correct CUDA drivers. Unfortunately there is no way to know in advance when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query it in OpenAI style, guide here). TGI is faster than vLLM and is recommended in general.
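For illustration, querying a TGI pod in OpenAI style looks roughly like this (a sketch only; the pod URL is a placeholder, and it assumes a TGI version that exposes the OpenAI-compatible /v1 chat completions endpoint):

```python
# Hedged sketch of an OpenAI-style request against a TGI endpoint.
# YOUR_POD_ID and the port are placeholders for the Runpod proxy URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_POD_ID-8080.proxy.runpod.net/v1",  # placeholder
    api_key="unused",  # TGI does not validate the key by default
)

response = client.chat.completions.create(
    model="tgi",  # TGI serves a single model; this name is not used for routing
    messages=[{"role": "user", "content": "Summarise what Mixtral Instruct is in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```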

In short, I recommend TGI.