-
Hi, I would like to evaluate the following capabilities of BigDL LLM in offline PySpark CPU jobs:
- Generating embeddings for queries and documents.
- Generating text using prompts and/or chai…
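For jobs like these, the usual shape is to load the model once per partition and stream records through it with `mapPartitions`. Below is a minimal sketch of that pattern; the stub embedder and all names are illustrative assumptions standing in for the actual BigDL LLM model, not BigDL API:

```python
from typing import Iterator, List

def load_embedder():
    # Stand-in for loading a BigDL LLM embedding model on the executor
    # (hypothetical stub; a real job would load the model here, once per partition).
    return lambda text: [float(len(text))]  # dummy one-dimensional "embedding"

def embed_partition(rows: Iterator[str]) -> Iterator[List[float]]:
    # Load the model once, then stream every row in the partition through it.
    embed = load_embedder()
    for text in rows:
        yield embed(text)

# With PySpark this function would be passed to rdd.mapPartitions(embed_partition);
# locally it can be exercised on any iterator of strings:
vectors = list(embed_partition(iter(["query one", "a longer document"])))
```

Keeping model construction inside the partition function matters on CPU clusters: it avoids serializing the model with the closure and amortizes load time over all rows in the partition.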
-
We have seen a significant performance drop with the env created from the latest repo for vLLM serving of the neural-chat model, compared to the old env built from the old repo. With …
-
Hi, we have tried to run the speculative inference process on OPT-13B and Llama2-70B-chat, but ran into some issues. Specifically, for Llama2-70B-chat we obtained performance worse than vLLM, which seem…
-
The current version requires an Internet connection to download the models the first time it is used after deployment.
Will a future version add a way to deploy without Internet access? This would make LibrePhoto…
-
Use a pre-trained summarization model to create summaries "offline" on your computer, eliminating API costs. However, researching, implementing, and getting this to work could be challenging. There are vari…
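As a rough, dependency-free illustration of the "offline, no API cost" idea, the sketch below scores sentences by word frequency and keeps the top ones. This is a toy extractive baseline, not the pre-trained model approach itself; a real setup would swap in a summarization model in its place:

```python
import re
from collections import Counter

def extractive_summary(text: str, n_sentences: int = 1) -> str:
    # Split into sentences at terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Score each sentence by the corpus-wide frequency of its words.
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: -sum(freq[w] for w in re.findall(r"\w+", s.lower())),
    )
    # Keep the top-scoring sentences, preserving their original order.
    keep = set(scored[:n_sentences])
    return " ".join(s for s in sentences if s in keep)
```

Example: `extractive_summary("Cats purr. Cats sleep a lot. Dogs bark.", 1)` keeps the sentence whose words are most frequent overall.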
-
I have been running the scripts from [https://docs.vllm.ai/en/latest/models/spec_decode.html](https://docs.vllm.ai/en/latest/models/spec_decode.html ) on how to do speculative decoding with vLLM.
H…
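For context, greedy speculative decoding boils down to: the draft model proposes k tokens, the target model verifies them, and the longest matching prefix plus one corrected (or bonus) target token is accepted. A toy sketch of that accept loop over token ids, with both "models" as stub callables rather than vLLM APIs:

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft: Callable[[List[int]], int],
                     target: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # Draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target model verifies position by position: accept while they agree,
    # emit the target's own token at the first mismatch, else one bonus token.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        expect = target(ctx)
        if expect == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expect)
            return accepted
    accepted.append(target(ctx))
    return accepted
```

The point of the trick is that all k verifications can run in one batched target-model forward pass, so each step emits between 1 and k+1 tokens for roughly the cost of a single target step.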
-
I use multi-LoRA for offline inference:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

sql_lora_path = "/home/zyn/models/slot_lora_gd"
llm = LLM(model="/ho…
```
-
When JabRef with the AI PR is run for the first time, `djl` downloads files in the background in order to work with embedding models (I guess the PyTorch backend and the embedding model).
JabRef already has some issues…
-
First of all, amazing project!
We've started experimenting with the project in an on-premise offline environment, and so far it works great!
We need our extensions to send metrics and events to a centra…
-
### Your current environment
```text
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Alibaba Group Enterprise Linux Serv…