-
### What is the issue?
I am using Open WebUI v0.3.30, and when I try to analyze an image with the llama3.2-vision:latest model, I get no response.
In the ollama service log I see the following:
…
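To narrow down whether the problem is in Open WebUI or in ollama itself, a minimal sketch like the following exercises the vision model directly against ollama's documented `/api/generate` endpoint (the image filename is a placeholder; everything else follows ollama's REST API shape):

```python
# Minimal repro against ollama's REST API, bypassing Open WebUI.
# Assumes ollama is serving on its default port 11434; "test.jpg" is a placeholder.
import base64
import json
import urllib.request

with open("test.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "llama3.2-vision:latest",
    "prompt": "Describe this image.",
    "images": [image_b64],  # ollama expects base64-encoded image data here
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

If this also comes back empty, the problem is on the ollama side rather than in Open WebUI.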
-
Could you share a rough timeline for FP8 quantization support for the Mixtral (MoE) model?
cc: @Tracin
-
The training command is as follows:
```
CUDA_VISIBLE_DEVICES=0,1 python train.py
```
The error message is as follows:
```
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /chatglm2-dev/train.py:122 in …
```
-
### Motivation
The library https://github.com/mit-han-lab/qserve introduces the W4A8KV4 quantization method (4-bit weights, 8-bit activations, 4-bit KV cache), called QoQ in the paper (https://arxiv.org/abs/2405.04532), which **delivers performance g…
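For context, here is a rough numeric sketch of what the W4A8 part of that format means: per-group symmetric int4 weight quantization plus per-token int8 activation quantization. This is illustrative only, not QServe's kernels; the group size of 128 is an assumed common choice.

```python
# Illustrative only: the numeric format behind "W4A8" — int4 weights quantized
# per group of columns, int8 activations quantized per token. Not QServe code.
import numpy as np

def quantize_weights_int4(w: np.ndarray, group_size: int = 128):
    """Quantize a (rows, cols) weight matrix to int4 per column group."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scale

def quantize_activations_int8(x: np.ndarray):
    """Quantize a (tokens, features) activation matrix to int8 per token."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.random.randn(16, 256).astype(np.float32)
x = np.random.randn(4, 256).astype(np.float32)
qw, sw = quantize_weights_int4(w)
qx, sx = quantize_activations_int8(x)
print(qw.dtype, qx.dtype)  # int8 containers holding int4/int8 values
```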
-
### Checklist
- [X] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.…
-
### Describe the bug
Inference fails after prompt evaluation with the llama-cpp backend, with the error:
```
CUDA error: invalid argument
current device: 1, in function ggml_backend_cuda_graph_compute …
```
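Since the failure is inside `ggml_backend_cuda_graph_compute`, one thing worth trying (an assumption on my part, not a confirmed fix) is disabling ggml's CUDA graph path via the `GGML_CUDA_DISABLE_GRAPHS` environment variable, which recent llama.cpp CUDA builds check. A sketch using the llama-cpp-python bindings, with a placeholder model path:

```python
# Hedged workaround sketch: turn off CUDA graph capture before the library
# loads, then run a short generation to see whether the error persists.
# "model.gguf" is a placeholder; the env var must be set before import.
import os

os.environ["GGML_CUDA_DISABLE_GRAPHS"] = "1"

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)
out = llm("Hello", max_tokens=8)
print(out["choices"][0]["text"])
```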
-
### What is the issue?
Scenario one:
A public cloud-based LLM is called through an AI agent. Two documents, each exceeding 2000 words, are uploaded, and the input question is: Analyze the differe…
-
### Discussed in https://github.com/ggerganov/llama.cpp/discussions/9228
Originally posted by **bulaikexiansheng** August 29, 2024
I tried to use the speculative decoding script; the command is …
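For reference, the core draft-then-verify loop that speculative decoding implements looks roughly like the toy sketch below. This is not llama.cpp's script; the two "models" are stand-in functions, and real implementations compare probabilities rather than greedy tokens.

```python
# Toy sketch of speculative decoding: a cheap draft model proposes k tokens,
# the target model verifies them and keeps the longest agreeing prefix.
VOCAB = list("abcdefgh")

def draft_next(ctx: str) -> str:
    # Hypothetical draft model: deterministic toy rule.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)]

def target_next(ctx: str) -> str:
    # Hypothetical target model: agrees with the draft most of the time.
    return VOCAB[(len(ctx) * 3) % len(VOCAB)] if len(ctx) % 5 else VOCAB[0]

def speculative_decode(prompt: str, n_tokens: int, k: int = 4) -> str:
    ctx = prompt
    while len(ctx) - len(prompt) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        proposal, tmp = [], ctx
        for _ in range(k):
            t = draft_next(tmp)
            proposal.append(t)
            tmp += t
        # 2) Target model verifies position by position: accept the agreeing
        #    prefix, then emit the target's own token on the first mismatch.
        accepted = 0
        for t in proposal:
            if target_next(ctx) == t:
                ctx += t
                accepted += 1
            else:
                ctx += target_next(ctx)
                break
        if accepted == k:  # whole draft accepted: target adds a bonus token
            ctx += target_next(ctx)
    return ctx

print(speculative_decode("seed:", 16))
```

The speedup comes from step 2: one target-model pass can validate several draft tokens at once, so accepted tokens cost roughly one target forward pass per batch instead of one per token.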
-
### What happened?
When running:
```
.\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --host 0.0.0.0 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --promp…
```
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 11-05 06:10:50 _custom_ops.py:19] Failed to import from vllm._C with Mo…
```
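The warning about `vllm._C` usually means the compiled extension failed to build or does not match the installed wheel. A quick diagnostic (my addition, not part of the report) is to import the module directly so the full error behind the truncated warning is visible:

```python
# Quick check (not from the original report): import vllm's compiled
# extension directly to surface the complete ImportError traceback.
import importlib
import traceback

try:
    importlib.import_module("vllm._C")
    print("vllm._C imported successfully")
except Exception:
    traceback.print_exc()
```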