-
**Describe the bug**
Following #2547, I tried to run the model gpt-neoxt-chat-base-20b, which I believe is a GPT-NeoX-20B derivative, so I think it should work.
Inference works if the model is loaded the n…
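A minimal sanity check, assuming the checkpoint is `togethercomputer/GPT-NeoXT-Chat-Base-20B` on Hugging Face (the report only gives the short name), is to inspect its config with transformers:
```python
# Hedged sanity check (not from the original report): confirm the checkpoint
# is a GPT-NeoX derivative before converting or serving it.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("togethercomputer/GPT-NeoXT-Chat-Base-20B")
print(config.model_type)     # "gpt_neox" for NeoX-family models
print(config.architectures)  # e.g. ["GPTNeoXForCausalLM"]
```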
-
I have 4 GPUs and 3 models, called small, medium, and large. I want to deploy the small model on GPU 0, the medium model on GPU 1, and the large model on GPUs 2 and 3 with tensor_para_size=2, because the large model is…
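A hypothetical launcher sketch of that placement (the entry point, flags, and model names below are illustrative, not from the issue): pin each serving process to its GPUs via `CUDA_VISIBLE_DEVICES`.
```python
import os
import subprocess

# Illustrative placement plan matching the description above.
deployments = {
    "small":  ("0", 1),    # small model  -> GPU 0
    "medium": ("1", 1),    # medium model -> GPU 1
    "large":  ("2,3", 2),  # large model  -> GPUs 2 and 3, tensor_para_size=2
}

for name, (gpus, tp_size) in deployments.items():
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    # "serve_model.py" is a placeholder for whatever script starts one model.
    subprocess.Popen(
        ["python", "serve_model.py",
         "--model", name,
         "--tensor-para-size", str(tp_size)],
        env=env,
    )
```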
-
Steps to reproduce:
1. Download https://huggingface.co/EleutherAI/gpt-neox-20b
2. Convert the model and attempt to use it:
```
$ TMPDIR=/var/tmp ./convert-gptneox-hf-to-gguf.py gpt-neox-20b 1 --ou…
```
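For step 1, one option (an assumption; the report does not say how the download was performed) is to fetch the repository with `huggingface_hub`:
```python
# Download the full model repository into a local directory so the converter
# can read it from "gpt-neox-20b".
from huggingface_hub import snapshot_download

snapshot_download("EleutherAI/gpt-neox-20b", local_dir="gpt-neox-20b")
```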
-
**LocalAI version:**
#895
**Environment, CPU architecture, OS, and Version:**
sh-5.2$ uname -a
MSYS_NT-10.0-19045 DESKTOP-S7HQITA 3.4.7-ea781829.x86_64 2023-07-05 12:05 UTC x86_64 Msys
…
-
### System Info
- CPU architecture: x86_64
- CPU/Host memory size: 126G
- GPU properties
  - GPU name: L4
  - GPU memory size: 24GB
- Libraries
  - TensorRT-LLM branch or tag (e.g., main, v0.…
-
### What happened?
When attempting to quantize [Qwen2 7B Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) to IQ2_XS, I get the following assert:
```
GGML_ASSERT: ggml-quants.c:12083: gri…
```
-
**LocalAI version:**
1.22.0
**Environment, CPU architecture, OS, and Version:**
WSL Ubuntu via VSCode
Intel x86 i5-10400
Nvidia GTX 1070
Windows 10 21H1
uname -a output:
Linux DESKTO…
-
I converted Astrid-1B-CPU (https://huggingface.co/PAIXAI/Astrid-1B-CPU) to GGUF and quantized it. Then I tried to run it with `main -m 1B/ggml-model-q4_1.gguf -n 128` and got this error:
error loa…
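One hedged debugging idea, assuming the `gguf` Python package that ships with llama.cpp: dump the metadata keys the converter wrote, since a load error often means the file's metadata is not what `main` expects.
```python
# Inspect the GGUF metadata (e.g. general.architecture) of the quantized file.
from gguf import GGUFReader

reader = GGUFReader("1B/ggml-model-q4_1.gguf")
for name in reader.fields:
    print(name)  # e.g. general.architecture, <arch>.context_length, ...
```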
-
Hi, thanks for the great work!
I want to use your code to build a `PipelineModule` object from Llama 2. Here is my code:
```python
def load_model(neox_args):
    config = transformers.AutoConfig.…
```
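For reference, a generic sketch of the DeepSpeed pipeline pattern (the layer classes, sizes, and stage count below are placeholders, not the gpt-neox or Llama 2 internals); `PipelineModule` takes a flat list of `LayerSpec`s and must run under a distributed launcher such as `deepspeed`:
```python
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class EmbeddingStage(nn.Module):
    """Placeholder first pipeline stage: token embedding."""
    def __init__(self, vocab_size, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, tokens):
        return self.embed(tokens)

class BlockStage(nn.Module):
    """Placeholder stand-in for one decoder block."""
    def __init__(self, hidden):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states):
        return self.proj(hidden_states)

def build_pipeline(vocab_size=32000, hidden=4096, n_layers=32, num_stages=4):
    # LayerSpec defers construction so each stage is built on its own rank.
    specs = [LayerSpec(EmbeddingStage, vocab_size, hidden)]
    specs += [LayerSpec(BlockStage, hidden) for _ in range(n_layers)]
    return PipelineModule(layers=specs, num_stages=num_stages)
```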
-
### System Info
- `transformers` version: 4.40.1
- Platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.13
- Huggingface_hub version: 0.22.2
- Safetensors ver…