-
```
root@ad966f70d032:/workspace/upvllama/VideoLLaMA2# sh scripts/custom/finetune_lora.sh
[2024-07-08 09:54:08,665] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda …
```
-
- [x] Extract model costs (per M request token + per M response token + per request + per response) and write them into CSV reports (sketched below)
- [x] Check other API providers too (@bauersimon knows about that e.g. M…
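A minimal sketch of the first item, assuming a hypothetical price table (the model name and figures below are placeholders, not real quotes; actual prices come from each provider's docs):

```python
import csv

# Hypothetical price table: USD per M request tokens, USD per M response
# tokens, USD per request, USD per response. Placeholder numbers only.
PRICES = {
    "example-provider/example-model": (2.50, 10.00, 0.0, 0.0),
}

def cost_usd(model, req_tokens, resp_tokens, n_requests=1, n_responses=1):
    per_m_in, per_m_out, per_req, per_resp = PRICES[model]
    return (req_tokens / 1e6 * per_m_in
            + resp_tokens / 1e6 * per_m_out
            + n_requests * per_req
            + n_responses * per_resp)

with open("model_costs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "request_tokens", "response_tokens", "cost_usd"])
    writer.writerow(["example-provider/example-model", 1200, 350,
                     cost_usd("example-provider/example-model", 1200, 350)])
```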
-
Any help on getting multi-GPU support running? vLLM fails to load with `tensor_parallel_size=2`.
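For reference, the documented way to enable tensor parallelism in vLLM is the `tensor_parallel_size` argument; a minimal sketch, assuming two visible GPUs and a placeholder model name:

```python
from vllm import LLM, SamplingParams

# Shards the model across 2 GPUs; requires 2 visible CUDA devices.
# The model name is a placeholder, not the one from the report.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", tensor_parallel_size=2)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

If this fails to load, confirming that both devices are actually visible (e.g. via `CUDA_VISIBLE_DEVICES`) is a common first check.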
-
Congrats on Flash Attention in the latest version, or, to be precise, on having your storage limit on PyPI.org increased so you could upload the release that was ready weeks ago. Here are some benchmarks fo…
-
### Your current environment
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC …
```
-
**Description:**
I'm encountering an error while trying to merge models using the `merge.py` script. The process loads the models and processes the layers correctly, but when it attempts to save the m…
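The `merge.py` script itself isn't shown, so as a stand-in, here is a minimal sketch of a merge-then-save flow (hypothetical filenames, and a plain linear average in place of whatever merge logic the script actually uses) that isolates the step where the reported failure occurs:

```python
import torch

# Load the two checkpoints onto CPU to keep GPU memory out of the picture.
a = torch.load("model_a.bin", map_location="cpu")
b = torch.load("model_b.bin", map_location="cpu")

# Toy merge: average every tensor the two state dicts share.
merged = {k: (a[k] + b[k]) / 2
          for k in a if k in b and a[k].shape == b[k].shape}

# Saving is where the reported error appears; insufficient free disk space
# or an unwritable path are two common culprits at this step.
torch.save(merged, "merged.bin")
```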
-
## 🐛 Bug
## To Reproduce
- After running the server, wait for a period of time.
- model: mistral-large-instruct-2407-q4f16_1
- "tensor_parallel_shards": 4,
```
(mlcllm) a@aserver:~$ mlc…
```
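For context, `tensor_parallel_shards` is a field in the compiled model's `mlc-chat-config.json`; a minimal sketch of setting it, assuming a hypothetical path to the compiled model directory:

```python
import json

# Hypothetical path; the config lives in the compiled model's directory.
cfg_path = "dist/mistral-large-instruct-2407-q4f16_1-MLC/mlc-chat-config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["tensor_parallel_shards"] = 4  # shard across 4 GPUs, as in the report

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```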
-
### 🐛 Describe the bug
I have a small script to reproduce how a toy model and the following three features lead to an error when combined (a sketch of the first two appears after the list):
1. torch.compile
2. FSDP1 with CPU offloading
3. PyTorch …
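The repro script isn't included above and the third item is cut off, so the following is only a minimal sketch of combining the first two features, on a single rank so it runs standalone (`use_orig_params=True` is the setting usually recommended when mixing FSDP with `torch.compile`):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# Single-rank process group so the snippet runs without a launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("nccl" if torch.cuda.is_available() else "gloo",
                        rank=0, world_size=1)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
).to(device)

# Feature 2: FSDP1 with CPU offloading of parameters.
model = FSDP(model,
             cpu_offload=CPUOffload(offload_params=True),
             use_orig_params=True)

# Feature 1: torch.compile on top of the FSDP-wrapped module.
model = torch.compile(model)

out = model(torch.randn(4, 16, device=device))
out.sum().backward()
dist.destroy_process_group()
```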
-
### OS
Linux
### GPU Library
CUDA 12.x
### Python version
3.11
### Describe the bug
When running exllamav2's inference_speculative.py example with llama 3.1 8B 2.25bpw as draft and 70B 4.5bpw a…
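The report is cut off above; for readers unfamiliar with the draft/target setup, here is a library-free toy of one common greedy speculative-decoding scheme (draft proposes, target verifies). The two `*_logits_fn` callables are stand-ins, not exllamav2's API, and this is not necessarily the exact scheme `inference_speculative.py` uses:

```python
import torch

def greedy_speculative_step(target_logits_fn, draft_logits_fn, ctx, k=4):
    """One round of greedy speculative decoding over token ids in `ctx`.

    Both *_logits_fn(ids) -> FloatTensor of shape (len(ids), vocab_size);
    they stand in for the real draft (small) and target (large) models.
    """
    # 1. Draft model proposes k tokens autoregressively (greedy argmax).
    draft = ctx.clone()
    for _ in range(k):
        nxt = draft_logits_fn(draft)[-1].argmax()
        draft = torch.cat([draft, nxt.view(1)])
    proposed = draft[len(ctx):]

    # 2. Target model scores the whole proposed block in one forward pass;
    #    logits at position i predict token i + 1.
    tgt_logits = target_logits_fn(draft)
    tgt_choice = tgt_logits[len(ctx) - 1 : len(draft) - 1].argmax(dim=-1)

    # 3. Accept the longest prefix where draft and target agree, then take
    #    the target's own token at the first disagreement (or one bonus
    #    token if the whole block was accepted).
    n_ok = 0
    while n_ok < k and proposed[n_ok] == tgt_choice[n_ok]:
        n_ok += 1
    if n_ok < k:
        accepted = torch.cat([proposed[:n_ok], tgt_choice[n_ok].view(1)])
    else:
        accepted = torch.cat([proposed, tgt_logits[-1].argmax().view(1)])
    return torch.cat([ctx, accepted])

# Toy usage: the same fixed random projection serves as both "models", so
# draft and target always agree and the whole block is accepted.
torch.manual_seed(0)
emb = torch.randn(100, 100)
logits_fn = lambda ids: emb[ids]
print(greedy_speculative_step(logits_fn, logits_fn, torch.tensor([1, 2, 3])))
```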
-