This repository is a quick walkthrough of the steps required to deploy large language models (LLMs) on Red Hat OpenShift (OCP). It is an adapted guide, inspired by the docs, guides, and examples provided on ai-on-openshift.io.
This guide provides the steps for an OCP 4.13.x instance deployed on AWS; however, the steps are similar for other cloud providers and for on-prem deployments of the OCP platform. A video guide of the environment setup walkthrough is available here. Please note that there are two additional videos describing how to experiment with the LLM to chat with your documentation (described later in this document), as well as how to add a UI for the chatbot. Links to these additional videos are provided in the respective sections of this document.
This guide will deploy the Mistral-7B-Instruct-v0.2 model. Instructions on how to retrieve the model files are provided on the Hugging Face website where the model is hosted.
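For reference, one possible way to pull the files locally is with the `huggingface_hub` CLI; the commands below are only a sketch (the target folder name simply matches what is used later in this guide, and a login may or may not be required depending on the model's access settings):

```bash
# Install the Hugging Face CLI and (if required by the model) log in with your token
pip install -U "huggingface_hub[cli]"
huggingface-cli login

# Download the model weights into a local folder
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir ./Mistral-7B-Instruct-v0.2
```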
Hosting an LLM on OCP can be done with or without a GPU. For performance reasons, it is recommended to use a GPU. You may deploy without a GPU; however, the responsiveness of the model will be much slower. The first part of this guide assumes that a GPU is used (it is a requirement of one of the containers deployed from Quay). The second part describes options for CPU-based deployments and also includes a walkthrough video.
The OCP platform requires a number of operators to be installed and available before the actual deployment can be performed, as depicted in the picture below (names and versions as used for this demo):
Install the operators in the following order:
This repo provides the step-by-step information for quickly adding a vLLM serving runtime.
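Before moving on, it can be useful to confirm that the operators from the previous step finished installing. A possible check with `oc` (the grep pattern is only an example; match it to the operators shown in the picture above):

```bash
# Operators are fully installed once their ClusterServiceVersion reports the "Succeeded" phase
oc get csv -A | grep -Ei 'nfd|gpu-operator|serverless|servicemesh|rhods|rhoai'
```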
In our case, we used a g5.xlarge machine with an A10 GPU to deploy our model. Generally, the host VM needs at least as much RAM as the GPU has VRAM. In our case, however, this condition is not satisfied, as the g5.xlarge machines have 16 GiB of RAM while the A10 card has 24 GiB of VRAM. Therefore, we need to tweak the arguments of the vLLM container deployment and add parameters that keep the engine within the available memory: `--dtype float16` and a reduced maximum context length via `--max-model-len`. The `vLLM.yaml` should then look like below:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    openshift.io/display-name: vLLM
  labels:
    opendatahub.io/dashboard: "true"
  name: vllm
spec:
  builtInAdapter:
    modelLoadingTimeoutMillis: 90000
  containers:
    - args:
        - --model
        - /mnt/models/
        - --dtype
        - float16
        - --max-model-len
        - "5900"
        - --download-dir
        - /models-cache
        - --port
        - "8080"
      image: quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.3.1
      name: kserve-container
      ports:
        - containerPort: 8080
          name: http1
          protocol: TCP
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: pytorch
```
Deploy this YAML inside the Serving Runtimes configurations of RHOAI as described here.
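If you prefer the CLI to the dashboard, the same `ServingRuntime` can also be created directly in the namespace of your data science project with `oc`; a sketch (the file and project names are placeholders), keeping in mind that a runtime added this way may not show up as a reusable template in the RHOAI dashboard:

```bash
# Create the serving runtime in your data science project's namespace
oc apply -f vllm.yaml -n <your-data-science-project>

# Confirm it exists
oc get servingruntime -n <your-data-science-project>
```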
Once you have locally downloaded the Mistral model files, you need to place them on the S3 storage that will be used as a data connection inside your RHOAI project. Head over to the RHOAI console and:

- Create a data connection in your data science project pointing to the S3 bucket that will store the model files.
- Make sure you have in your local `.aws` folder a credentials profile for the bucket you created, so that you can upload the files using the `aws` CLI tool.
- Next, move out (or delete) the git cache files you downloaded (the `.gitignore` file and `.git` folder from where your model was downloaded) and then upload the model files to the bucket. Make sure you do not upload directly to the root of the bucket; the files need to be placed in a folder other than the root, i.e. `Mistral-7B-Instruct-v0.2`. The command to upload the files should look like:
```bash
aws s3 sync --profile=llm-bucket --endpoint-url=<your OCP S3 route endpoint> ./Mistral-7B-Instruct-v0.2 s3://<llm-bucket-name>/Mistral-7B-Instruct-v0.2/
```
Note the /Mistral-7B-Instruct-v0.2/ after the name of the bucket.
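The `llm-bucket` profile used above is assumed to already exist in your local `.aws` configuration; one way to create it is with `aws configure set` (access key, secret key, bucket, and endpoint are placeholders):

```bash
# Store the bucket credentials under a named profile
aws configure set aws_access_key_id <your-access-key> --profile llm-bucket
aws configure set aws_secret_access_key <your-secret-key> --profile llm-bucket

# Sanity check: list the bucket contents through your S3 endpoint
aws s3 ls --profile llm-bucket --endpoint-url=<your OCP S3 route endpoint> s3://<llm-bucket-name>/
```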
Next, within the RHOAI dashboard, select your data science project and deploy a single-model serving instance under the 'Models and model servers' section of the project, using the vLLM serving runtime you added in the earlier step. When deploying the instance, ensure you select a GPU, use a custom deployment size, and define the limits depending on your available resources. For example, if using a g5.xlarge, set memory limits between 8 GiB and 14 GiB of RAM and CPU limits between 2 and 3 cores.
Next, the model deployment will automatically kick off from the defined data connection. Please be patient here, it will take some time for the model to deploy and the KServe pods and services to become available (approximately 15-20 minutes). As soon as your model is deployed, you will see the inference endpoint available in the RHOAI dashboard.
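While waiting, you can also follow the rollout from the CLI; for example (project and model names are placeholders):

```bash
# The InferenceService reports READY=True once the predictor is up
oc get inferenceservice -n <your-data-science-project>

# Watch the predictor pod start and, if needed, follow the vLLM startup logs
oc get pods -n <your-data-science-project> -w
oc logs -f -n <your-data-science-project> -l serving.kserve.io/inferenceservice=<model-name> -c kserve-container
```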
Once deployed, you can test your model either directly with `curl`-based commands or via code. You can find an example notebook that uses LangChain in this repository.
For a quick CLI test, you can issue the following:
```bash
curl -k <inference_endpoint_from_RHOAI>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/mnt/models/",
        "prompt": "Describe Paris in 100 words or less.",
        "max_tokens": 100,
        "temperature": 0
      }'
```
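If the call fails, a quick sanity check is to list the models served by the endpoint; the vLLM image used here exposes the OpenAI-compatible `/v1/models` route, which also confirms the exact model name to pass in the request body:

```bash
curl -k <inference_endpoint_from_RHOAI>/v1/models
```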
As explained here, in order to make a generic LLM perform a specific task you can fine-tune it (i.e., retrain it) or use RAG (retrieval-augmented generation). While fine-tuning is more effective, it may also be more expensive to perform. A less expensive way of enhancing the LLM's knowledge is RAG, which we present next.
The video walkthrough is available here.
The procedure is as follows:
To control and secure access to the LLM, an interfacing application that exposes a user-friendly interface is recommended. An example of such an application is provided in the chatbot-ui folder of this repository. Please note that, alongside the source code, the folder also contains the files necessary to build a container as well as the configurations to run the application on Red Hat OpenShift.
NOTE: The video mentions using `v1.1` for the chatbot image. Please use `v1.2` instead, as it appears that a token is required to use a specific model with `langchain.embeddings.huggingface.HuggingFaceEmbeddings`; to avoid that, we use the default instance.
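If you would rather build the chatbot image yourself instead of pulling the prebuilt container from quay.io, a build along these lines should work (the registry, organization, and tag are placeholders):

```bash
# Build the UI image from the chatbot-ui folder of this repository and push it to your own registry
podman build -t quay.io/<your-org>/chatbot-ui:v1.2 ./chatbot-ui
podman push quay.io/<your-org>/chatbot-ui:v1.2
```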
The deployment walkthrough is provided here.
The steps are as follows:
- Create a new project for the chatbot, e.g. `chatbot`.
- Add `service.yaml` to the project to create the service for the deployment.
- Add `deployment.yaml` to the project. Please note the deployment uses a ready-made container stored in quay.io and uses version `v1.2` of the container. The deployment is scaled down to zero by default, so ensure you scale it up after you verify the environment parameters (see the CLI sketch below).
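The same steps can be scripted with `oc`; a sketch, assuming the project and the deployment are both named `chatbot` and the manifests live in the `chatbot-ui` folder:

```bash
# Create the project and apply the provided manifests
oc new-project chatbot
oc apply -f chatbot-ui/service.yaml
oc apply -f chatbot-ui/deployment.yaml

# Review the environment parameters, then scale up from the default of zero replicas
oc set env deployment/chatbot --list
oc scale deployment/chatbot --replicas=1

# If the repository does not already include a route manifest, expose the service to reach the UI
oc expose service/chatbot
```

Once loaded, the UI should look like below: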
The UI application provided here is a very basic one; nevertheless, it provides sufficient context to understand which elements should be present in a real, production-ready UI.
This was the final part of the GPU based demo. I hope you enjoyed it.
Deploying an LLM using a CPU-only setup is possible; however, there are certain limitations to consider when doing so, especially regarding the speed and responsiveness of the LLM.
Red Hat OpenShift AI comes with several options for CPU-only serving of LLMs (Caikit TGIS, TGIS standalone, OpenVINO Model Server). Additionally, other serving runtimes such as Ollama and vLLM may be added (as presented earlier in the demo, where we added a vLLM serving runtime). Note that vLLM requires CPUs with the AVX-512 instruction set in order to work; a quick way to check for it is shown below.
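One way to verify AVX-512 support is to inspect the CPU flags on one of your worker nodes, for example (the node name is a placeholder):

```bash
# From a node debug shell, print the avx512 CPU flags; no output means the CPU lacks AVX-512
oc debug node/<worker-node-name> -- chroot /host grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u
```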
While Ollama is significantly faster than the other serving runtimes, note that the models packaged by Ollama are "altered" versions of the originals published by the model creators and, most importantly, only models that already exist in the Ollama repository can be deployed as-is. In other words, to deploy with Ollama a custom model that you specialized with your own data, you need to go through a model transformation phase as described on the Ollama website (sketched below). This limitation does not apply to the other serving runtimes.
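As an illustration of that transformation phase (the model and file names below are hypothetical), a custom model is typically converted to GGUF first, e.g. with llama.cpp tooling, and then registered with Ollama through a Modelfile:

```bash
# 1. Convert the fine-tuned Hugging Face model to GGUF beforehand (e.g. with llama.cpp's
#    conversion script), producing a file such as my-custom-model.gguf

# 2. Describe the model for Ollama in a Modelfile
cat > Modelfile <<'EOF'
FROM ./my-custom-model.gguf
PARAMETER temperature 0.2
EOF

# 3. Register and test the model with the Ollama server
ollama create my-custom-model -f Modelfile
ollama run my-custom-model "Hello"
```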
The resources needed to configure the cluster for CPU-based LLM serving are available in this repository, and the video guide on how to perform the deployment is here.
This concludes the setup for the CPU based demo.