-
Hello, I encountered the following error when running `blend.py`.
I changed the model to `Llama-3-8B-Instruct` since I don't have access to the Mixtral models. Could that be the cause of the error?
Log:
```
$ python example…
```
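Swapping the checkpoint alone usually isn't the problem as long as the loader accepts a Hugging Face model id, though Mixtral is an MoE model, so any Mixtral-specific code paths in the script would need to change. A minimal sketch of the substitution, assuming `blend.py` builds its model through vLLM's `LLM` class (the script's actual loading code may differ):

```python
# Hypothetical substitution: Llama-3-8B-Instruct in place of a Mixtral
# checkpoint, assuming a vLLM-based loader as in blend.py.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", dtype="bfloat16")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```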
-
Release date: Aug 8 2024
Branch cut: Aug 2 2024
## [Developer Facing API](https://github.com/pytorch/ao/issues/391)
- [x] static quantization flow example @jerryzh168
- [ ] QAT refactor to gener…
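For context on the quantization items above, the developer-facing entry point is a single `quantize_` call; a minimal sketch using int8 weight-only as the illustration (the static quantization flow example in the linked issue adds its own calibration steps):

```python
# Sketch of torchao's one-call quantization API, shown here with int8
# weight-only; the static flow additionally requires a calibration pass.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)
quantize_(model, int8_weight_only())  # swaps Linear weights for int8 tensor subclasses
x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
print(model(x).shape)  # torch.Size([1, 1024])
```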
-
### Your current environment
H100 40GB
### Model Input Dumps
_No response_
### 🐛 Describe the bug
```
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=MIG-2ea01c20-8…
```
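Not part of the report, but a quick in-container sanity check that the MIG slice passed via `--gpus` is actually visible to PyTorch before vLLM starts:

```python
# Verify the MIG device mapped into the container is visible to CUDA.
import torch

print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```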
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 09-23 09:07:16 _custom_ops.py:18] Failed to import from vllm._C with …
```
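That `Failed to import from vllm._C` warning usually means the compiled extension is missing or was built against a different torch/CUDA than what is installed; a quick way to surface the underlying error (a diagnostic sketch, not from the report):

```python
# Import the compiled extension directly to see the real ImportError
# that vllm's _custom_ops.py swallows into a warning.
try:
    import vllm._C  # noqa: F401
    print("vllm._C imported successfully")
except ImportError as exc:
    print("vllm._C failed to import:", exc)
```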
-
I followed the instructions from https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html on a bare metal server from the Intel Dev Cloud, specifically this instance:
…
-
Hi,
I'm testing the llama3-70b model with SmoothQuant on a node with 4 x RTX 4090 GPUs. Due to memory constraints, I used the `host_cache_size` parameter to offload the KV cache to the host. Then I hit two issues:…
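`host_cache_size` matches TensorRT-LLM's `KvCacheConfig`; assuming that is the stack here, a minimal sketch of the host-offload setup (field names and units vary between versions, so treat this as illustrative, with a placeholder checkpoint):

```python
# Illustrative only: enabling host (CPU) offload for KV cache blocks via
# host_cache_size, assuming TensorRT-LLM's LLM API.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cfg = KvCacheConfig(
    free_gpu_memory_fraction=0.85,
    host_cache_size=32 * 1024**3,  # 32 GiB of host memory for offloaded blocks
)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=4,
    kv_cache_config=kv_cfg,
)
```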
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…
-
### Your current environment
Running in Kubernetes on H100 in vllm/vllm-openai:v0.4.0
### 🐛 Describe the bug
Seems like there have been some weird dependency issues since v0.2.7. We would love to u…
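A diagnostic sketch (not from the report) for auditing what the image actually ships, using only the standard library:

```python
# Print the installed versions of the packages most often involved in
# vLLM dependency conflicts; runs inside any container unchanged.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "torch", "transformers", "xformers"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```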
-
### What happened?
Offloading 31 of the 33 layers of an 8B model produces correct results; with 32 layers, the response is incoherent.
33 or more offloaded layers cause the instruction to be…
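The layer count here corresponds to the GPU-offload setting (`-ngl` in the llama.cpp CLI); a minimal sketch of the same knob through the llama-cpp-python bindings, with a placeholder model path:

```python
# n_gpu_layers mirrors the CLI's -ngl flag: 31 of the model's 33 layers
# go to the GPU here. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./model-8b.Q4_K_M.gguf", n_gpu_layers=31)
print(llm("Say hello.", max_tokens=16)["choices"][0]["text"])
```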
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch…
```
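A companion check (not from the report) that the `2.3.0+cu121` build above actually sees a GPU at runtime:

```python
# Confirm the installed torch build and its CUDA runtime visibility.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```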