-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4…
```
-
### Feature request
Fu et al. propose a novel decoding technique that accelerates greedy decoding on Llama 2 and Code-Llama by 1.5-2x across various parameter sizes, without a draft model. This meth…
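For intuition, here is a minimal toy sketch (hypothetical `greedy_next` and `verify_ngram` helpers, not the paper's implementation) of the verification idea such draft-free methods rely on: candidate n-grams are accepted only as long as they match what greedy decoding would have produced anyway, so the output is unchanged while several tokens can be accepted per step.

```python
# Toy sketch of draft-free verification: a stand-in "model" plus a verify
# step. `greedy_next` and `verify_ngram` are hypothetical names.

def greedy_next(prefix: tuple) -> int:
    """Stand-in for one greedy LLM decoding step (a real implementation
    scores all candidate positions in a single batched forward pass)."""
    return (sum(prefix) + len(prefix)) % 50  # toy deterministic "model"

def verify_ngram(prefix: list, candidate: list) -> list:
    """Accept the longest leading run of candidate tokens that matches
    what plain greedy decoding would have produced anyway."""
    accepted = []
    for tok in candidate:
        if greedy_next(tuple(prefix + accepted)) != tok:
            break
        accepted.append(tok)
    return accepted

# Example: a candidate whose first two tokens agree with greedy decoding.
prefix = [3, 7]
t1 = greedy_next(tuple(prefix))
t2 = greedy_next(tuple(prefix + [t1]))
print(verify_ngram(prefix, [t1, t2, 999]))  # -> [t1, t2]; 999 is rejected
```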
-
Hello all, I keep scratching my head over why I can sometimes deploy everything on the list, yet other models I find run into issues.
Anyway, these are my logs; I am just trying to use this repo https://huggingface.co/mistralai/Mis…
-
### Your current environment
```text
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (U…
```
-
### 🚀 The feature, motivation and pitch
I got this error when trying speculative decoding with two RTX 4090s:
* https://github.com/vllm-project/vllm/issues/4358
And it looks like that was fixed/added…
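For reference, a minimal sketch of how speculative decoding is typically launched across two GPUs with vLLM's offline API; the model ids and token count below are placeholders, and the exact keyword arguments may differ between vLLM versions:

```python
# Hedged sketch: speculative decoding with the target model tensor-parallel
# across two GPUs via vLLM's offline API. Model ids and numbers are
# placeholders; kwarg names follow vLLM releases from around this issue.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",                  # target model (placeholder)
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # draft model (placeholder)
    num_speculative_tokens=5,
    tensor_parallel_size=2,        # shard the target model across both 4090s
    use_v2_block_manager=True,     # required by spec decode in some versions
)

params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy sampling
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```

Greedy sampling (`temperature=0.0`) is the lossless setting speculative decoding is usually validated with.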
-
## ❓ General Questions
This is the error I am getting - **TVMError: Check failed: token_tree_parent_ptr[j] == j - verify_start (0 vs. 1) : CPU sampler only supports chain-style draft tokens.**
T…
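My reading of that check (an illustration, not MLC's actual code) is that the CPU sampler requires the draft tokens to form a linear chain, where each position's parent pointer inside the verify window is strictly sequential, rather than a branching token tree:

```python
# Illustration (my reading of the error, not MLC's actual code): chain-style
# drafts have strictly sequential parent pointers inside the verify window.

def is_chain_style(token_tree_parent_ptr, verify_start):
    """Mirror of the failing check: token_tree_parent_ptr[j] == j - verify_start
    must hold for every draft position j in the verify window."""
    return all(ptr == j - verify_start
               for j, ptr in enumerate(token_tree_parent_ptr, start=verify_start))

print(is_chain_style([0, 1, 2, 3], verify_start=0))  # True: linear chain
print(is_chain_style([0, 0, 1, 1], verify_start=0))  # False: branching tree
```

If this reading is right, restricting the draft to a single chain (no tree branching) when sampling on CPU should avoid the error.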
-
### Context
This task concerns enabling tests for **mpt-7b-chat**. You can find more details in the openvino_notebooks [LLM chatbot README.md](https://github.com/openvinotoolkit/openvino_notebooks/tree…
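A minimal smoke test along these lines might look like the sketch below; the model id, the use of optimum-intel's `OVModelForCausalLM`, and the prompt are assumptions rather than the notebook's actual test code:

```python
# Hedged sketch of a smoke test for mpt-7b-chat under OpenVINO; the model id,
# helper choices, and prompt are assumptions, not the notebook's test code.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "mosaicml/mpt-7b-chat"  # assumed Hugging Face id for the model under test
tokenizer = AutoTokenizer.from_pretrained(model_id)
# export=True converts the checkpoint to OpenVINO IR on the fly;
# trust_remote_code may be needed on older transformers versions.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, trust_remote_code=True)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
assert output_ids.shape[-1] > inputs["input_ids"].shape[-1]  # something was generated
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```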
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N…
```
-
When I run the demo provided in the README.md and reach `output_ids = model.generate(**inputs, max_new_tokens=128)`, I get an error: RuntimeError: Expected all tensors to be on the same device, but found at least tw…
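A common cause (assuming a standard transformers setup) is that the model was placed on the GPU while the tokenized inputs stayed on the CPU; moving the inputs to the model's device before calling `generate` usually resolves it:

```python
# Common fix, assuming a standard transformers setup: keep the tokenized
# inputs on the same device as the model before calling generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; use the demo's model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map needs accelerate
)

# .to(model.device) is the key line; without it the inputs stay on the CPU
# while the weights sit on the GPU, producing exactly this RuntimeError.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```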
-
A single GPU works fine, but the system hangs when I use multiple GPUs. Can someone help solve this? Thanks.
```text
python build.py --model_dir meta-llama/Llama-2-7b-chat-hf \
                --dtype float16 \
                …
```
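One frequent cause of exactly this symptom is launching a multi-rank engine without MPI. Below is a hedged sketch of a two-GPU tensor-parallel build and run; the flag names (`--world_size`, `run.py`'s options) vary across TensorRT-LLM versions, so treat them as assumptions:

```bash
# Hedged sketch; flag names vary across TensorRT-LLM versions. Build the
# engine with a tensor-parallel world size of 2, then launch one MPI rank
# per GPU (running a multi-rank engine without mpirun commonly hangs).
python build.py --model_dir meta-llama/Llama-2-7b-chat-hf \
                --dtype float16 \
                --world_size 2 \
                --output_dir ./llama_tp2_engine

mpirun -n 2 python run.py --engine_dir ./llama_tp2_engine \
                          --tokenizer_dir meta-llama/Llama-2-7b-chat-hf \
                          --max_output_len 64
```

If it still hangs under `mpirun`, NCCL peer-to-peer is another common culprit on consumer GPUs; setting `NCCL_P2P_DISABLE=1` is a frequently suggested workaround.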