-
I have been informed that while Flash Attention is present, it is not actually being used -
https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-2031180332
The post has a link to what has …
-
### Model description
https://github.com/ModelTC/lightllm/pull/266
Will there be vision LLM support in LoRAX soon?
### Open source status
- [X] The model implementation is available
- [X] The mo…
-
### How would you like to use vllm
I want to run inference on [TheBloke/Llama-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ), but I don't know how to use it with vLLM.
I try t…
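For questions like this, a minimal sketch of vLLM's offline API with a GPTQ-quantized checkpoint would look roughly as follows (assumes `vllm` is installed and a CUDA GPU is available; the prompt text is just an illustration):

```python
from vllm import LLM, SamplingParams

# Load the GPTQ-quantized checkpoint; the quantization flag tells vLLM
# to use its GPTQ kernels instead of treating the weights as fp16.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is the capital of France?"], params)

for out in outputs:
    print(out.outputs[0].text)
```

This is a sketch, not a verified repro of the issue; depending on the vLLM version, GPTQ may also be auto-detected from the checkpoint's config without the explicit flag.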
-
https://sites.google.com/view/medusa-llm
-
I have faced an error with the vLLM framework when I tried to run inference on an Unsloth fine-tuned Llama-3-8B model...
### Error:
(venv) ubuntu@ip-192-168-68-10:~/ans/vllm-server$ python -O -u -m vl…
-
Wondering if the statement in the README is correct - "drop-in replacement for Whisper on English speech recognition" - does this mean even the large-v2 model is English-only? Thanks!
-
I thought of a way to speed up inference by using batches. This assumes that you can run a batch of 2 much faster than you can run 2 passes, so it will work with GPUs with a lot of compute cores or mu…
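The equivalence this batching idea relies on can be sketched with a toy NumPy example (an illustration, not actual inference code): stacking 2 inputs and doing one matrix multiply gives the same results as 2 separate passes, while letting the hardware work on both at once.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))   # stand-in for a model weight matrix
x1 = rng.standard_normal(8)       # request 1
x2 = rng.standard_normal(8)       # request 2

# Two separate passes
y1, y2 = W @ x1, W @ x2

# One batched pass: stack the inputs as rows, multiply once
X = np.stack([x1, x2])            # shape (2, 8)
Y = X @ W.T                       # shape (2, 4)

assert np.allclose(Y[0], y1) and np.allclose(Y[1], y2)
```

The batched pass does the same arithmetic, but as a single larger kernel launch, which is where the speedup on compute-rich GPUs comes from.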
-
For modules in LLaMA, we currently have:
LLaMA - the LLaMA model with the LM head
LLaMAStack - the decoder layers + dec_norm
LLaMABlock - self-attention + feed-forward
I am wondering if maybe we should change t…
-
### The model to consider.
I am trying to run the vLLM Docker image for gemma-2-27b-it, but I am facing an "architectures not recognized" error.
error:
ValueError: The checkpoint you are trying to load has …
-
Hi, I tried using your [deployment.yaml](https://github.com/rh-aiservices-bu/llm-on-openshift/blob/main/llm-servers/vllm/gitops/deployment.yaml); however, while the single-GPU instance works, multi GP…