-
`torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 15.25 MiB is free. Including non-PyTorch memory, this process has 44.51 G…
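For context, a minimal sketch (assuming PyTorch 2.x; the allocator knob and helper below are illustrative, not taken from this report) of how to inspect the allocator when an OOM like this fires:
```python
import os

# Assumption: set before CUDA is initialized; expandable_segments reduces
# fragmentation-driven OOMs on recent PyTorch builds.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def report_gpu_memory(device: int = 0) -> None:
    """Print allocator stats to compare against the numbers in the error."""
    free, total = torch.cuda.mem_get_info(device)
    print(f"free: {free / 2**20:.2f} MiB of {total / 2**30:.2f} GiB total")
    print(f"allocated by tensors: {torch.cuda.memory_allocated(device) / 2**30:.2f} GiB")
    print(f"reserved by allocator: {torch.cuda.memory_reserved(device) / 2**30:.2f} GiB")

# Releases cached-but-unused blocks back to the driver; it cannot free
# tensors that are still referenced, so a true leak will persist.
torch.cuda.empty_cache()
report_gpu_memory()
```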
-
/kind feature
**Describe the solution you'd like**
There are different directions:
- extend existing API for referencing multiple …
-
We occasionally encounter errors:
```
+ python3 -m vllm.entrypoints.openai.api_server --host xxxxx --port 8003 --served-model-name qwen1.5-72b-chat-int4 --model /home/vllm/model/Qwen1.5-72B-Chat-GPT…
```
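Once the server is up, a minimal client sketch against the OpenAI-compatible route (assuming it is reachable on port 8003; the host in the command above is redacted, so `localhost` here is a placeholder):
```python
import requests

# The model field must match --served-model-name from the launch command.
resp = requests.post(
    "http://localhost:8003/v1/chat/completions",
    json={
        "model": "qwen1.5-72b-chat-int4",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```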
-
## Expected Behavior
When importing a list of devices from the provided CSV template file, the portal should check the type of device authentication before creating it, and only set values fo…
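A sketch of the expected check, with hypothetical column names (`auth_type`, `psk`, and `certificate` are illustrative; the real CSV template's schema may differ):
```python
import csv

def load_devices(path: str) -> list[dict]:
    """Branch on the auth type *before* creating each device, and only
    keep the fields relevant to that type."""
    devices = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            auth_type = row.get("auth_type", "").strip().lower()
            if auth_type == "psk":
                device = {"name": row["name"], "auth_type": "psk",
                          "psk": row["psk"]}
            elif auth_type == "certificate":
                device = {"name": row["name"], "auth_type": "certificate",
                          "certificate": row["certificate"]}
            else:
                raise ValueError(f"unknown auth type {auth_type!r} for {row.get('name')}")
            devices.append(device)
    return devices
```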
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue y…
-
### 🚀 The feature, motivation and pitch
I have finetuned the linear layers of Pixtral on my own dataset and would like to host the LoRA adapters, as is possible for Mistral. It would be great if this wou…
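For reference, a minimal sketch of the existing LoRA flow in vLLM's offline API that this request asks to extend to Pixtral (base model, adapter name, and adapter path are placeholders):
```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder base model and adapter path; this mirrors how LoRA adapters
# are attached for already-supported architectures such as Mistral.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_lora=True)
outputs = llm.generate(
    ["Summarize: ..."],
    SamplingParams(max_tokens=64),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```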
-
L:\ComfyUI_windows_portable>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
Total VRAM 12287 MB, total RAM 32726 MB
pytorch version: 2.3.0+cu121
Set vram state to: NORMAL_…
-
WIP project roadmap for LoRAX. We'll continue to update this over time.
# v0.10
- [ ] Speculative decoding adapters
- [ ] AQLM
# v0.11
- [ ] Prefix caching
- [ ] BERT support
- [ ] Embe…
-
python -m lightllm.server.api_server --model_dir /root/autodl-tmp/Qwen2-7B-Instruct --host 0.0.0.0 --port 8000 --trust_remote_code --model_name Qwen2-7B-Instruct --data_type=bfloat16 --eos_id 151…
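Assuming the server comes up on port 8000 as launched above, a minimal client sketch against lightllm's TGI-style generate route (the route and field names are an assumption; adjust to your lightllm version):
```python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "inputs": "What is AI?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```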
-
### How would you like to use vllm
Hi,
I want to attach a LoRA adapter using a docker command:
docker run --runtime nv…
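For what it's worth, once the container runs the OpenAI server with LoRA enabled (vLLM's `--enable-lora` plus `--lora-modules name=/path` flags), the adapter is selected per request by passing its registered name as the model. A sketch with placeholder names and port:
```python
import requests

# Assumption: the container publishes port 8000 and an adapter was
# registered at startup as "my-lora=/path/to/adapter" via --lora-modules.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "my-lora", "prompt": "Hello", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```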