-
### System Info
CPU Architecture: x86_64
CPU/Host memory size: 1024Gi (1.0Ti)
GPU properties:
GPU name: NVIDIA GeForce RTX 4090
GPU mem size: 24GB…
-
### What happened?
I am using llama.cpp + SYCL to perform inference on a multi-GPU server. However, I get a segmentation fault when using multiple GPUs. The same model can produce inference output…
-
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16 GB of memory.
I have applied INT8 weight-only quantization, so the size of the engine I…
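A rough back-of-envelope calculation shows why INT8 weight-only quantization matters here (illustrative numbers only, not measured from TensorRT-LLM; actual engines add activation and KV-cache overhead on top of the weights):

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GiB for a given parameter count and precision."""
    return n_params * bytes_per_param / 2**30

fp16 = weight_memory_gib(7e9, 2)   # FP16 weights: ~13.0 GiB, too tight for a 16 GiB V100
int8 = weight_memory_gib(7e9, 1)   # INT8 weight-only: ~6.5 GiB
per_gpu = int8 / 2                 # ~3.3 GiB per GPU if split with tensor parallelism of 2
print(f"{fp16:.1f} {int8:.1f} {per_gpu:.1f}")
```

So the quantized weights alone fit comfortably, and the remaining headroom goes to activations and the KV cache.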
-
I just upgraded to the latest ollama to verify the issue, and it is still present on my hardware.
I am running version 0.1.25 and trying to run the falcon model.
Warning: could not connect to a ru…
-
Hi,
I encounter the following error message when trying to enable flash attention with the command below. Is flash attention supported?
``command: ./main -m $model -n 128 --prompt …
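For reference, recent llama.cpp builds expose a dedicated flag for this; a minimal sketch, assuming a build new enough to have it (check `./main --help` for your build, since older binaries lack the option):

```shell
# Enable flash attention explicitly via the --flash-attn (-fa) flag.
# $model is a placeholder for your GGUF model path, as in the command above.
./main -m "$model" -n 128 --flash-attn --prompt "Hello"
```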
-
I've been using unstructured for a while on a 100% CPU machine. I've noticed a lot of nvidia files (over 2 GB) in my venv folder coming from PyTorch (possibly one of unstructured's dependencies).
Can I in…
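One common way to avoid those files is to install PyTorch from its CPU-only wheel index before (or instead of) the default CUDA-enabled wheels; a sketch, assuming pip and that unstructured accepts whichever torch build is already present:

```shell
# Reinstall torch from the CPU-only index so the multi-gigabyte
# nvidia-* CUDA packages are not pulled in as dependencies.
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cpu
```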
-
I was trying to migrate from MLC-LLM to onnxruntime to run Phi-3 on an Orange Pi 5, but I realized that among all your execution providers there isn't a single one that takes advantage of the GPU or NPU…
-
Trying to do inference on an Arc GPU machine, I have followed these guidelines:
```
https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Pipeline-Parallel-Inference
and run_mi…
-
### System Info
- CPU architecture: x86_64
- GPU properties
- GPU name: NVIDIA A100
- GPU memory size: 40GB
- Libraries
- TensorRT-LLM branch or tag: main
- TensorRT-LLM commit: 5d8ca2…
-
### What is the issue?
After running for a while, the model still returns gibberish:
```
[12:59:39] [INFO] [Part of Speech Determination] [Fixed] JSON string: Since you did not provide specific con…