-
The stack tool cannot load large models in the .pth format downloaded from Meta; it throws an error at runtime. Does it have to use models downloaded from Hugging Face? Is this setup unreaso…
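If only Hugging Face-format checkpoints are accepted, one workaround is to fetch the HF-format weights directly (Meta's consolidated `.pth` checkpoints can also be converted with the `convert_llama_weights_to_hf.py` script that ships with `transformers`). A minimal sketch, where the repo id and target directory are assumptions:

```python
# Sketch: download the Hugging Face-format checkpoint instead of the Meta .pth files.
# Assumes the meta-llama/Meta-Llama-3-8B-Instruct repo id and that you are logged in
# (`huggingface-cli login`) with access to the gated repo; the local path is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="./Meta-Llama-3-8B-Instruct",
)
print("Checkpoint downloaded to", local_dir)
```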
-
### System Info
Ubuntu, CPU only, Conda, Python 3.10
### Information
- [x] The official example scripts
- [ ] My own modified scripts
### 🐛 Describe the bug
I am running a single node stack with …
-
### What is the issue?
I have deployed Ollama using the Docker image 0.3.10. Loading "big" models fails.
llama3.1 and other "small" models (e.g. codestral) fit on one GPU and work fine. llama3.1…
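For reference, a minimal reproduction sketch against the Ollama HTTP API; the port and model tag are assumptions, since the report is truncated:

```python
# Sketch: trigger a load/generation of a large model through the Ollama HTTP API.
# Assumes the default port 11434 and the llama3.1:70b tag (a placeholder for the failing model,
# pulled beforehand with `ollama pull llama3.1:70b`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Hello", "stream": False},
    timeout=600,
)
print(resp.status_code, resp.json())
```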
-
### Your current environment
I'm attempting to run a multi-node, multi-GPU inference setup using vLLM with pipeline parallelism.
However, I'm encountering an error related to the number of a…
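A minimal sketch of the intended parallel layout (the model name, parallel sizes, and Ray backend are assumptions, since the report is truncated); vLLM forwards `pipeline_parallel_size` as an engine argument, and depending on the version pipeline parallelism may only be available through the OpenAI-compatible server:

```python
# Sketch: 2-way pipeline parallelism x 4-way tensor parallelism across two nodes.
# Model name, parallel sizes, and the Ray backend are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,             # GPUs per pipeline stage
    pipeline_parallel_size=2,           # number of pipeline stages (one per node here)
    distributed_executor_backend="ray", # multi-node execution typically goes through Ray
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```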
-
### System Info
TGI Docker Image: ghcr.io/huggingface/text-generation-inference:sha-11d7af7-rocm
MODEL: meta-llama/Llama-3.1-405B-Instruct-FP8
Hardware used:
Intel® Xeon® Platinum 8…
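Once that TGI container is up, the endpoint can be exercised from Python; a minimal sketch, assuming the server is exposed on localhost:8080:

```python
# Sketch: query a running text-generation-inference endpoint.
# The localhost:8080 address is an assumption about how the container's port is mapped.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
print(client.text_generation("What is deep learning?", max_new_tokens=64))
```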
-
### Discussed in https://github.com/ggerganov/llama.cpp/discussions/9960
Originally posted by **SteelPh0enix** October 20, 2024
I've been using llama.cpp w/ ROCm 6.1.2 on latest Windows 11 for…
-
### Jan version
0.5.7
### Describe the Bug
Using Jan v0.5.7 on a Mac with an M1 processor, running Llama 3.2 3B instruct q8 via the API. Occasionally, the server stops responding to POST requ…
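A minimal sketch of the kind of POST request involved; the port and model id are assumptions (Jan's local server exposes an OpenAI-compatible API, commonly on localhost:1337):

```python
# Sketch: POST to Jan's OpenAI-compatible chat completions endpoint.
# Port 1337 and the model id are assumptions; adjust to your local server settings.
import requests

resp = requests.post(
    "http://localhost:1337/v1/chat/completions",
    json={
        "model": "llama3.2-3b-instruct",  # placeholder id for the Llama 3.2 3B Instruct q8 model
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```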
-
I currently have an LLM engine built on TensorRT-LLM and am trying to evaluate different setups and the gains from each type.
I was trying to deploy the Llama model on a multi-GPU setup, whereby between the 4 GPUs I would hav…
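For the 4-GPU split, a minimal sketch with 4-way tensor parallelism, assuming a TensorRT-LLM release that ships the high-level `tensorrt_llm.LLM` API (the model path is a placeholder; an engine built through the lower-level convert/build workflow would differ):

```python
# Sketch: shard one Llama checkpoint across 4 GPUs with tensor parallelism.
# Assumes the high-level LLM API is available; the model path is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder; point at your checkpoint or engine
    tensor_parallel_size=4,                       # split weights across the 4 GPUs
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```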
-
```python
data_url = data_url_from_image("dog.jpg")
print("The obtained data url is", data_url)
iterator = client.inference.chat_completion(
    model=model,
    messages=[
        {
            "role": "…
```
-
Hi, thanks for your wonderful work.
I am struggling to use my LoRA-tuned model.
I followed these steps:
1. Fine-tuning with LoRA
- base model: Undi95/Meta-Llama-3-8B-Instruct-hf
- llama3 …
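A common follow-up when a LoRA-tuned model misbehaves at inference time is to load the adapter on top of the base model and merge it into the weights; a minimal sketch with PEFT, where the adapter and output paths are placeholders:

```python
# Sketch: attach a LoRA adapter to the base model and merge it into the weights.
# The adapter and output paths are placeholders; the base model matches the one named above.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Undi95/Meta-Llama-3-8B-Instruct-hf")
model = PeftModel.from_pretrained(base, "./my-lora-adapter")  # placeholder adapter dir
merged = model.merge_and_unload()  # bake the LoRA deltas into the base weights

tokenizer = AutoTokenizer.from_pretrained("Undi95/Meta-Llama-3-8B-Instruct-hf")
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```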