These one-click templates allow you to quickly boot up an API for a given language model.
Advanced inference scripts (including for function calling) are available for purchase here.
Note: vLLM sometimes runs into issues if the pod template does not have the correct CUDA drivers, and unfortunately there is no way to know in advance when picking a GPU. An issue has been raised here. As an alternative, you can run TGI (and even query it in OpenAI style; guide here). TGI is faster than vLLM and is recommended in general.
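As a minimal sketch of querying TGI in OpenAI style, the snippet below hits TGI's OpenAI-compatible `/v1/chat/completions` endpoint using only the standard library. The base URL is an assumption (a pod exposing TGI on port 8080); swap in your pod's endpoint.

```python
import json
import urllib.request

# Assumption: a TGI server is reachable at this address (e.g. your pod's proxy URL).
TGI_BASE_URL = "http://localhost:8080"


def build_chat_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat completion payload for TGI."""
    return {
        "model": "tgi",  # TGI serves one model; this field is a placeholder
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query_tgi(prompt: str) -> str:
    """POST the payload to TGI's OpenAI-compatible endpoint and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{TGI_BASE_URL}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```

The same endpoint also works with the official `openai` Python client by pointing its `base_url` at the TGI server.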
To support the Trelis Research YouTube channel, you can sign up for an account with this link. Trelis is also supported by a 1-2% commission by your use of one-click templates.
Add the flag `--quantize eetq` to run with under 15 GB of VRAM (e.g. on an A4000).

Note: The vLLM image has compatibility issues with certain CUDA drivers, leading to issues on certain pods. An A6000 Ada is typically an option that works.
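For context, a sketch of how the flag fits into a TGI launch command; the model ID and volume path are placeholders, not values from this document:

```
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id <your-model-id> \
  --quantize eetq
```

EETQ quantizes the weights to int8 at load time, which roughly halves VRAM use compared to fp16 without needing a pre-quantized checkpoint.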
Post a new issue if you would like other templates to be added.
One-click templates for function-calling are located on the HuggingFace model cards. Check out the collection here.
Feb 16 2024:
Jan 21 2024:
Jan 9 2024:
Dec 30 2023:
Dec 29 2023: