-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N…
-
Hi, I am having problems with memory allocation warnings (that lead to crashes) when using LlamaCppEmbeddings on an M1 Mac. I am running llama-cpp-python v0.1.84 on a MacBook Pro with 16GB of RAM, wh…
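Before loading a model on a 16GB machine, it can help to sanity-check the memory budget: roughly, the quantized weights plus the fp16 KV cache (`2 * n_layers * n_ctx * n_embd * 2 bytes`) must fit alongside the OS. A minimal stdlib-only sketch, using illustrative (not measured) numbers for a 7B-class model; the function names and the 30% headroom figure are assumptions, not part of llama-cpp-python:

```python
# Rough memory-budget check before loading a model with llama-cpp-python.
# Figures are illustrative: ~4 GiB for Q4-quantized 7B weights, and the
# standard 2 * n_layers * n_ctx * n_embd * 2-byte estimate for an fp16
# KV cache (key + value tensors for every layer at full context).

def kv_cache_bytes(n_layers: int, n_ctx: int, n_embd: int,
                   bytes_per_elem: int = 2) -> int:
    """fp16 key and value tensors for every layer at full context."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_elem

def fits_in_ram(model_bytes: int, n_layers: int, n_ctx: int, n_embd: int,
                ram_bytes: int, headroom: float = 0.7) -> bool:
    """Leave ~30% of RAM free for the OS and other processes (an assumption)."""
    needed = model_bytes + kv_cache_bytes(n_layers, n_ctx, n_embd)
    return needed <= ram_bytes * headroom

GiB = 1024 ** 3
# Llama-7B-like shape: 32 layers, 4096-dim embeddings, 4096-token context.
print(fits_in_ram(model_bytes=4 * GiB, n_layers=32, n_ctx=4096,
                  n_embd=4096, ram_bytes=16 * GiB))
```

If this check fails for the desired context length, lowering `n_ctx` (or using a smaller quantization) is usually the first thing to try before blaming the machine.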
-
### Your current environment
The output of `python collect_env.py`
```text
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: C…
-
### What happened?
Hi,
When I use llama.cpp to deploy a pruned llama3.1-8b model, an unbearable performance degradation appears:
We used a structured pruning method (LLM-Pruner) to prune llama3.1-8b, w…
-
### What happened?
```
INFO [ main] build info | tid="255085751848992" timestamp=1726024154 build=3726 commit="b34e0234"
INFO [ main] system info | tid="255085…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A…
-
Hello, I ran into the following error when running `blend.py`.
I changed the model to `Llama-3-8B-Instruct` since I have no access to the Mixtral models. Could that be causing the error?
Log:
```
$ python example…
-
Release date: Aug 8 2024
Branch cut: Aug 2 2024
## [Developer Facing API](https://github.com/pytorch/ao/issues/391)
- [x] static quantization flow example @jerryzh168
- [ ] QAT refactor to gener…
-
### Your current environment
```text
PyTorch version: 2.3.0a0+ebedce2
Is debug build: False
CUDA used to build PyTorch: 12.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
…
-
I followed the instructions from https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html on a bare metal server from the Intel Dev Cloud, specifically this instance:
…