-
### What is the issue?
I am using Open WebUI v0.3.30, and when I try to analyze an image with the llama3.2-vision:latest model, I get no response.
In the ollama service log I see the following:
…
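To narrow down whether the problem is in Open WebUI or in ollama itself, a minimal sketch like the following exercises the vision model directly against ollama's documented `/api/generate` endpoint (the image filename is a placeholder; everything else follows ollama's REST API shape):

```python
# Minimal repro against ollama's REST API, bypassing Open WebUI.
# Assumes ollama is serving on its default port 11434; "test.jpg" is a placeholder.
import base64
import json
import urllib.request

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "llama3.2-vision:latest",
    "prompt": "Describe this image.",
    "images": [image_b64],  # ollama expects base64-encoded image data here
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

If this also comes back empty, the problem is on the ollama side rather than in Open WebUI.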
-
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model?
cc: @Tracin
-
The training command is as follows:
```
CUDA_VISIBLE_DEVICES=0,1 python train.py
```
The error message is as follows:
```
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /chatglm2-dev/train.py:122 in …
```
-
### Motivation
The library https://github.com/mit-han-lab/qserve introduces the W4A8KV4 quantization method (4-bit weights, 8-bit activations, 4-bit KV cache), called QoQ in the paper (https://arxiv.org/abs/2405.04532), which **delivers performance g…
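For context, here is a rough numeric sketch of what the W4A8 part of that format means: per-group symmetric int4 weight quantization plus per-token int8 activation quantization. This is illustrative only, not QServe's kernels; the group size of 128 is an assumed common choice.

```python
# Illustrative only: the numeric format behind "W4A8" — int4 weights quantized
# per group of columns, int8 activations quantized per token. Not QServe code.
import numpy as np

def quantize_weights_int4(w: np.ndarray, group_size: int = 128):
    """Quantize a (rows, cols) weight matrix to int4 per column group."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale

def quantize_activations_int8(x: np.ndarray):
    """Quantize a (tokens, features) activation matrix to int8 per token."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(16, 256).astype(np.float32)
x = np.random.randn(4, 256).astype(np.float32)
qw, sw = quantize_weights_int4(w)
qx, sx = quantize_activations_int8(x)
print(qw.dtype, qx.dtype)  # int8 containers holding int4/int8 values
```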
-
### Checklist
- [X] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.…
-
### Describe the bug
Inference fails after prompt evaluation with the llama-cpp backend, with the error:
```
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute …
```
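Since the failure is inside `ggml_backend_cuda_graph_compute`, one thing worth trying (an assumption on my part, not a confirmed fix) is disabling ggml's CUDA graph path via the `GGML_CUDA_DISABLE_GRAPHS` environment variable, which recent llama.cpp CUDA builds check. A sketch using the llama-cpp-python bindings, with a placeholder model path:

```python
# Hedged workaround sketch: turn off CUDA graph capture before the library
# loads, then run a short generation to see whether the error persists.
# "model.gguf" is a placeholder; the env var must be set before import.
import os

os.environ["GGML_CUDA_DISABLE_GRAPHS"] = "1"

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)
out = llm("Hello", max_tokens=8)
print(out["choices"][0]["text"])
```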
-
### What is the issue?
Scenario one:
A public cloud-based LLM is called through an AI agent. Two documents, each exceeding 2000 words, are uploaded, and the input question is: Analyze the differe…
-
### Discussed in https://github.com/ggerganov/llama.cpp/discussions/9228
Originally posted by **bulaikexiansheng** August 29, 2024
I tried to use the speculative decoding script; the command is …
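For reference, the core draft-then-verify loop that speculative decoding implements looks roughly like the toy sketch below. This is not llama.cpp's script; the two "models" are stand-in functions, and real implementations compare probabilities rather than greedy tokens.

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k tokens,
# the target model verifies them and keeps the longest agreeing prefix.
VOCAB = list("abcdefgh")

def draft_next(ctx: str) -> str:
    # Hypothetical draft model: deterministic toy rule.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)]

def target_next(ctx: str) -> str:
    # Hypothetical target model: agrees with the draft most of the time.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)] if len(ctx) % 5 else VOCAB[0]

def speculative_decode(prompt: str, n_tokens: int, k: int = 4) -> str:
    ctx = prompt
    while len(ctx) - len(prompt) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, tmp = [], ctx
        for _ in range(k):
            t = draft_next(tmp)
            proposal.append(t)
            tmp += t
        # 2) Target model verifies position by position: accept the agreeing
        #    prefix, then emit the target's own token on the first mismatch.
        accepted = 0
        for t in proposal:
            if target_next(ctx) == t:
                ctx += t
                accepted += 1
            else:
                ctx += target_next(ctx)
                break
        if accepted == k:  # whole draft accepted: target adds a bonus token
            ctx += target_next(ctx)
    return ctx

print(speculative_decode("seed:", 16))
```

The speedup comes from step 2: one target-model pass can validate several draft tokens at once, so accepted tokens cost roughly one target forward pass per batch instead of one per token.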
-
### What happened?
When running:
```
.\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --host 0.0.0.0 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --promp…
```
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 11-05 06:10:50 _custom_ops.py:19] Failed to import from vllm._C with Mo…
```
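The warning about `vllm._C` usually means the compiled extension failed to build or does not match the installed wheel. A quick diagnostic (my addition, not part of the report) is to import the module directly so the full error behind the truncated warning is visible:

```python
# Quick check (not from the original report): import vllm's compiled
# extension directly to surface the complete ImportError traceback.
import importlib
import traceback

try:
    importlib.import_module("vllm._C")
    print("vllm._C imported successfully")
except Exception:
    traceback.print_exc()
```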