-
I am getting the following for Llama-LLM:
```bash
2024-06-28 21:57:20 INFO openai - message='OpenAI API response' path=https://api.openai.com/v1/embeddings processing_ms=15 request_id=req_5533…
```
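For context, a sketch of the kind of call that emits this log line; it assumes the legacy openai-python 0.x client (whose logger prints `message='OpenAI API response'` at INFO level), and the model name is only an example.

```python
# Hypothetical reproduction (assumes openai-python 0.x, whose logger
# emits message='OpenAI API response' at INFO level; model name is an
# example, not from the report).
import logging
import openai

logging.basicConfig(level=logging.INFO)
openai.api_key = "sk-..."  # placeholder

resp = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="example text to embed",
)
print(len(resp["data"][0]["embedding"]))
```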
-
### 🐛 Describe the bug
512M parameters.
Mostly vanilla LM transformer. FlashAttention 2.4.2, PyTorch 2.2.0. Uses both FA and FlashRotary.
Dtype: bf16
NVIDIA A40, single GPU.
Unfused: 85 TFLOPS
F…
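Since the numbers compare fused and unfused attention throughput, here is a minimal sketch of the kind of micro-benchmark that yields such TFLOPS figures; the shapes, iteration counts, and direct use of `flash_attn_func` are assumptions for illustration, not the reporter's actual script.

```python
# Hypothetical micro-benchmark (not the reporter's script): times the
# fused flash-attn kernel in bf16 and converts to TFLOPS. Assumes
# flash-attn 2.x and a CUDA GPU; shapes are arbitrary examples.
import time
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 8, 2048, 16, 64
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

def bench(fn, iters=50, warmup=5):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

# Nominal FLOPs for full attention: two matmuls of 2*B*H*S^2*D each
# (a causal mask roughly halves the real work).
flops = 4 * batch * nheads * seqlen ** 2 * headdim
sec = bench(lambda: flash_attn_func(q, k, v, causal=True))
print(f"fused attention: {flops / sec / 1e12:.1f} TFLOPS")
```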
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 11-05 06:10:50 _custom_ops.py:19] Failed to import from vllm._C with Mo…
```
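One way to narrow down that warning (a suggestion, not something from the report) is to import the compiled extension directly, which surfaces the root-cause `ImportError` instead of vLLM's fallback message:

```python
# Diagnostic sketch: import the compiled extension directly to see the
# underlying ImportError (e.g. a missing shared library or ABI mismatch).
try:
    import vllm._C  # noqa: F401
    print("vllm._C imported OK")
except ImportError as exc:
    print(f"vllm._C failed to import: {exc}")
```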
-
## 🚀 Feature
Restructure the function `multi_head_attention_forward` in [nn.functional](https://github.com/pytorch/pytorch/blob/23b2fba79a6d2baadbb528b58ce6adb0ea929976/torch/nn/functional.py#L357…
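For reference, a minimal sketch of the current monolithic call whose restructuring is proposed; the shapes and weights below are arbitrary illustrative values.

```python
# Illustrative call to the monolithic function (arbitrary shapes/weights).
# The functional API expects (seq_len, batch, embed_dim) inputs.
import torch
import torch.nn.functional as F

L, N, E, H = 10, 2, 64, 4
q = k = v = torch.randn(L, N, E)
in_proj_w, in_proj_b = torch.randn(3 * E, E), torch.zeros(3 * E)
out_proj_w, out_proj_b = torch.randn(E, E), torch.zeros(E)

out, attn_weights = F.multi_head_attention_forward(
    q, k, v, E, H,
    in_proj_w, in_proj_b,
    None, None, False,   # bias_k, bias_v, add_zero_attn
    0.0,                 # dropout_p
    out_proj_w, out_proj_b)
print(out.shape)  # torch.Size([10, 2, 64])
```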
-
### What is the issue?
Scenario one:
Through an AI agent that calls a public cloud-based LLM, two documents exceeding 2,000 words each are uploaded, and the input question is: Analyze the differe…
-
### What is the issue?
I am using Open WebUI v0.3.30, and when I try to analyze an image using the llama3.2-vision:latest model I get no output.
In the ollama service log I see the following:
…
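To isolate whether the model or the UI is at fault (a suggested repro, not part of the report), one could bypass Open WebUI and send an image straight to Ollama's REST API; the file name and prompt are placeholders:

```python
# Hypothetical repro against Ollama's /api/generate endpoint on its
# default port; "test.png" and the prompt are placeholders.
import base64
import json
import urllib.request

with open("test.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = json.dumps({
    "model": "llama3.2-vision:latest",
    "prompt": "Describe this image.",
    "images": [img_b64],
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```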
-
### Your current environment
```text
The output of `python collect_env.py`
```
### 🐛 Describe the bug
### On the Tesla T4 the model "hangs" after loading (the VRAM usage spikes normal…
-
### Checklist
- [x] 1. I have searched related issues but could not find the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related iss…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
```
-
Docker container version: ipex-llm-serving-xpu:2.2.0-b2
Start shell script:
```bash
model="/llm/models/Qwen/Qwen2.5-32B-Instruct-AWQ"
served_model_name="Qwen2.5-32B-Instruct-AWQ"
export CCL_WORKER_…
```
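Since the start script is truncated, here is a minimal smoke test under the assumption that the container exposes a vLLM-style OpenAI-compatible server on port 8000; adjust the host and port to match the actual script.

```python
# Hypothetical smoke test; base_url and port are assumptions. The model
# name matches served_model_name from the start script above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="Qwen2.5-32B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```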