Is it possible to deploy LFAI with several models?
Say, for instance, I want to have Phi, Llama, and Minitron (NVIDIA) variants available as options and be able to switch between any of the models at will.
How might I do this?
@jxtngx there is not a great way today. You could configure and run multiple vLLM backends with different models, as long as they have different names so that the API config listener registers them as two different models. This would still require fairly intimate knowledge of the LeapfrogAI backend to accomplish, though.
Currently, the "easiest way" to use different models exists in this issue's branch: #835. There is not yet a hot-swap capability.
And if I simply want to use a model other than the default – do I only need to update the zarf.yaml for either vLLM or llama-cpp-python? And then redeploy the cluster & bundle?
In #854, you only need to modify the zarf-config.yaml and then perform a package create and package install (example commands follow the config below).
Below is an example of things you might change, called out in comments. You can research other params in the vLLM docs or on Hugging Face. The params below are not a comprehensive, 1:1 listing of everything available in vLLM or on Hugging Face; we are working on building that out.
```yaml
package:
  create:
    set:
      # x-release-please-start-version
      image_version: "0.13.0"
      # x-release-please-end
      model_repo_id: "TheBloke/Synthia-7B-v2.0-GPTQ" # change this
      model_revision: "gptq-4bit-32g-actorder_True" # change this
      model_path: "/data/.model/"
      name_override: "vllm"
  deploy:
    set:
      # vLLM runtime configuration (usually influenced by .env in local development)
      trust_remote_code: "True" # change this (based on whether there is custom layer, config or inferencing code)
      tensor_parallel_size: "1" # change this (based on # of GPU instances for sharding, attn_heads, etc.)
      enforce_eager: "False"
      gpu_memory_utilization: "0.90"
      worker_use_ray: "True"
      engine_use_ray: "True"
      quantization: "None"
      load_format: "auto"
      # LeapfrogAI SDK runtime configuration (usually influenced by config.yaml in development)
      max_context_length: "32768" # change this
      stop_tokens: "</s>, <|im_end|>, <|endoftext|>" # change this
      prompt_format_chat_system: "SYSTEM: {}\n" # change this
      prompt_format_chat_user: "USER: {}\n" # change this
      prompt_format_chat_assistant: "ASSISTANT: {}\n" # change this
      temperature: "0.1"
      top_p: "1.0"
      top_k: "0"
      repetition_penalty: "1.0"
      max_new_tokens: "8192"
      # Pod deployment configuration
      gpu_limit: "1" # change this (based on tensor parallel size, and # of GPU instances)
      gpu_runtime: "nvidia"
      pvc_size: "15Gi" # change this (if model is larger or smaller)
      pvc_access_mode: "ReadWriteOnce"
      pvc_storage_class: "local-path"
```
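For completeness, the create-and-install step mentioned above would look roughly like the following. The directory layout and package filename are assumptions about the repo, and "package install" corresponds here to zarf package deploy; exact paths and flags may vary by Zarf version.

```bash
# Hypothetical paths -- adjust to wherever the vLLM package lives in your checkout.
cd packages/vllm

# Build the Zarf package using the values from zarf-config.yaml in this directory.
zarf package create . --confirm

# Deploy ("install") the resulting package into the cluster.
zarf package deploy zarf-package-vllm-*.tar.zst --confirm
```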
@justinthelaw I am on CPU and am using llama-cpp-python. Should there also be a zarf-config.yaml added to llama-cpp-python in #854's dev branch so that the two inference packages mirror each other?
Unfortunately, llama.cpp has not been refactored to reflect the changes made in the vLLM package yet. See issue #1098.
In llama.cpp's case, you will need to modify both the config.yaml and the zarf.yaml; a rough sketch of the config.yaml fields involved is below.
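To make that concrete, here is a rough, unverified sketch of the kinds of fields you would be touching in llama-cpp-python's config.yaml, mirroring the LeapfrogAI SDK runtime parameters from the vLLM example above. The key names and nesting are placeholders (assumptions), so check the actual file in the package for the real schema.

```yaml
# Hypothetical config.yaml sketch -- only the parameter ideas mirror the vLLM example above;
# the real key names and structure live in the llama-cpp-python package in the repo.
model:
  source: "/data/.model/" # local path to the downloaded model weights (assumed key)
max_context_length: 32768
stop_tokens:
  - "</s>"
  - "<|im_end|>"
  - "<|endoftext|>"
prompt_format:
  chat:
    system: "SYSTEM: {}\n"
    user: "USER: {}\n"
    assistant: "ASSISTANT: {}\n"
defaults:
  temperature: 0.1
  top_p: 1.0
  top_k: 0
  repetition_penalty: 1.0
  max_new_tokens: 8192
```

The zarf.yaml side would then need to point at the new model artifact (repo, revision, and file name), analogous to model_repo_id and model_revision in the vLLM config.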
Describe what should be investigated or refactored
Currently, the data rate for injecting a model into the model PVC is very slow (in relative terms). I observed a data rate of about 66 MB/s despite all of this network traffic occurring within the same machine.
I know that the gzip thread is completely pinned while the data injection step is happening, which indicates a bottleneck to me (the official gzip binary also has no multithreading). @Racer159 told me you can turn off gzip compression for the data injection, so that may be a good first step toward at least increasing the data transfer speed.
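For reference, the compression toggle lives on the data injection definition in the package's zarf.yaml. A minimal sketch, assuming Zarf's dataInjections schema, with illustrative component and selector names (not copied from the repo):

```yaml
# Hypothetical zarf.yaml fragment -- component, selector, and path names are illustrative.
components:
  - name: vllm-model
    dataInjections:
      - source: .model/ # local model files staged at package create time
        target:
          namespace: leapfrogai
          selector: app=lfai-vllm # pod selector for the backend (illustrative)
          container: data-loader
          path: /data/.model
        compress: false # skip gzip so the injection is not CPU-bound on a single thread
```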
Ideally in the future, there would be an opinionated, easy, reliable, fast workflow for uploading a new model to the system.