-
I only changed the thread count to 6 instead of 4; thread=4 and thread=5 both work well for this model, but setting thread=6 always triggers the problem on my Xiaomi 14 Pro (SM8650, Snapdragon 8 Gen 3).
Please take a look and resolve it.
thanks~
…
-
I would like to use this library for in-browser web ML inference because, with the upcoming CPU support, it is better than
1. ggml.cpp (llama.cpp/whisper.cpp) - as it supports both CPU and GPU and can u…
-
llama_model_loader: loaded meta data with 32 key-value pairs and 219 tensors from /data/huggingface/hub/models--city96--t5-v1_1-xxl-encoder-gguf/snapshots/005a6ea51a7d0b84d677b3e633bb52a8c85a83d9/./t5…
-
### Your current environment
```text
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NV…
-
I just tested launching LLMs using only the CPU; however, only 4 CPUs of the VMware VM are busy at 100%, while the others stay at 0%.
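For reference, a quick way to check whether the process has been restricted to a subset of cores is sketched below (assuming Linux; the actual cause in this report has not been confirmed, and the environment variables shown are just common suspects):
```python
import os

# Logical CPUs visible to the OS vs. the set this process may run on
# (cgroups, taskset, or the hypervisor can shrink the allowed set).
print("logical CPUs:", os.cpu_count())
print("allowed CPUs:", sorted(os.sched_getaffinity(0)))  # Linux-only

# Common thread-count knobs that can cap CPU-backend parallelism.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
    print(var, "=", os.environ.get(var))
```
Running this inside the serving process (or comparing against `nproc` on the host) shows whether the 4-core ceiling comes from CPU affinity or from a thread-count setting.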
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently engaged in training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
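For context, here is a minimal sketch of the setup described above, assuming a standard torchrun launch and FSDP's FULL_SHARD strategy; the tiny Sequential model is a placeholder for an LLM, not the reporter's actual code:
```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Launched via torchrun; one process per GPU.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

for _ in range(2):  # train several models sequentially on the same GPUs
    model = torch.nn.Sequential(  # placeholder for an LLM
        torch.nn.Linear(4096, 4096), torch.nn.Linear(4096, 4096)
    ).cuda()
    model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
    # ... training loop for this model ...
    del model
    torch.cuda.empty_cache()  # release cached memory before the next model

dist.destroy_process_group()
```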
-
### Your current environment
Collecting environment information...
/home/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error …
-
### What is the issue?
I have deployed ollama using the docker image 0.3.10. Loading "big" models fails.
llama3.1 and other "small" models (e.g. codestral) fit into one GPU and work fine. llama3.1…
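For reproduction context, loading a model through ollama's HTTP API looks roughly like the sketch below (the large model tag is hypothetical, since the excerpt is cut off before naming the failing model):
```python
import requests

# Hypothetical "big" model tag; the report doesn't say which one fails.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Hello", "stream": False},
    timeout=600,
)
print(resp.status_code, resp.text[:200])
```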
-
### What happened?
If you pass the `tfs_z` param to the server, it sometimes crashes.
Starting the server:
```
~/test/llama.cpp/llama-server -m /opt/models/text/gemma-2-27b-it-Q8_0.gguf --verbose
`…
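At the time of this report, llama-server's `/completion` endpoint accepted a `tfs_z` sampling field (tail-free sampling), so a request that could trigger the crash presumably looked something like this sketch (the prompt and parameter values are made up, not taken from the report):
```python
import requests

# Hypothetical payload; the report doesn't show the exact request.
resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "Hello", "n_predict": 16, "tfs_z": 0.95},
    timeout=120,
)
print(resp.status_code, resp.json())
```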