-
I'm trying to understand whether this could be used with a local LLM via llama.cpp in interactive mode. Is this possible? I would very much like to try it out.
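A minimal interactive-loop sketch, assuming the llama-cpp-python bindings rather than the llama.cpp CLI itself; the model path is hypothetical:

```python
# Sketch only: uses llama-cpp-python (an assumption), with a hypothetical
# GGUF model path. Reads a prompt, streams the completion back, repeats.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

while True:
    user = input("> ")
    if user.strip().lower() in {"exit", "quit"}:
        break
    for chunk in llm.create_chat_completion(
        messages=[{"role": "user", "content": user}],
        stream=True,
    ):
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)
    print()
```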
-
### Your current environment
docker: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
branch: habana_main
### 🐛 Describe the bug
I attempted to use the off…
-
Hi, thanks for your wonderful work.
I am struggling to use my LoRA-tuned model.
I followed these steps (a loading sketch follows the list):
1. Fine-tuning with LoRA
- base model: Undi95/Meta-Llama-3-8B-Instruct-hf
- llama3 …
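For reference, a minimal sketch of attaching and merging such an adapter with Hugging Face PEFT; this assumes a PEFT-format adapter, and the adapter and output paths are hypothetical:

```python
# Hedged sketch: load the base model named above, attach a LoRA adapter
# (hypothetical path), and merge it so the result serves like a plain checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Undi95/Meta-Llama-3-8B-Instruct-hf")
tokenizer = AutoTokenizer.from_pretrained("Undi95/Meta-Llama-3-8B-Instruct-hf")

model = PeftModel.from_pretrained(base, "./my-lora-adapter")  # hypothetical path
merged = model.merge_and_unload()  # fold LoRA deltas into the base weights

merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")
```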
-
Will you consider supporting the llama.cpp server API for inference?
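For context, a hedged sketch of what a client call against that API looks like: llama.cpp's llama-server exposes an OpenAI-compatible `/v1/chat/completions` endpoint, and the host and port below are the server defaults:

```python
# Sketch only: POST a chat completion to a locally running llama-server.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```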
-
### System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3
Versions:
https://github.com/NVIDIA/TensorRT-LLM.git (ab49b93718b906030bcec0c817b10ebb373d4179)
https://github.com/triton-…
-
- [ ] I checked the [documentation](https://docs.ragas.io/) and related resources and couldn't find an answer to my question.
**Your Question**
> WARNING:ragas.llms.output_parser:Failed to parse …
-
Hello, FlexFlow team!
Thank you for your outstanding work! I am attempting to reproduce the experimental results from the paper "SpecInfer: Accelerating Generative Large Language Model Serving with…
-
### What happened?
Inference fails with this cryptic error.
This happens with both CPU and Vulkan engines.
What might be causing this?
### Name and Version
llama-cpp-3538
ollama-0.3.4
### W…
-
Hello, I'm not sure whether multi-GPU is supported yet. I didn't find any parameters for tensor parallelism, and the "num_device_layers" parameter doesn't seem to work. Please let me know whether it is supported or there are plans to…
-
### What happened?
I am trying to run inference with the RPC example. When running llama-cli with the RPC feature against a single rpc-server on localhost, inference throughput is only 1.9 tok/sec for lla…
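For reproduction, a hedged sketch of the setup under test, assuming a local llama.cpp build; the binary locations, model path, and port are illustrative:

```python
# Sketch only: launch a single rpc-server on localhost, then point llama-cli
# at it with --rpc. Paths and port are hypothetical.
import subprocess, time

server = subprocess.Popen(["./build/bin/rpc-server", "-p", "50052"])
time.sleep(2)  # give the server a moment to start listening

subprocess.run([
    "./build/bin/llama-cli",
    "-m", "./models/model.gguf",  # hypothetical model path
    "--rpc", "localhost:50052",   # route offloaded layers through the RPC backend
    "-ngl", "99",                 # offload all layers
    "-p", "Hello",
    "-n", "64",
])
server.terminate()
```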