-
Dropping the prompt from the model output is necessary to correctly retrieve an output from the Prediction object, but this is only done in HFModel when a ValueError is thrown upon mode…
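For context, a minimal sketch of the prompt-dropping step being described; `strip_prompt` is a hypothetical helper, not the actual HFModel code:

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    # Causal LMs echo the prompt at the start of their output;
    # drop it so that only the completion is returned.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text
```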
-
Hi,
Is there a way to change the frequency_penalty or logit_bias when sending a completion request?
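If the server speaks the OpenAI completions protocol (as the OpenAI-compatible endpoints of TGI and vLLM do), both knobs can be set per request. A minimal sketch with the `openai` Python client; the model name and token ID are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY; pass base_url=... for a self-hosted endpoint

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # placeholder model name
    prompt="Once upon a time",
    max_tokens=50,
    frequency_penalty=0.8,           # -2.0 .. 2.0; penalizes tokens by how often they appear
    logit_bias={"50256": -100},      # token ID -> bias in -100 .. 100 (ID is illustrative)
)
print(response.choices[0].text)
```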
-
Hi,
I was trying to use the command below to download VarDict-1.8.3, but I get HTML files instead of zip files:
wget https://github.com/AstraZeneca-NGS/VarDictJava/releases/tag/v1.8.3/VarDict-1.8.3.tar
…
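In case it helps: GitHub serves release assets from the `releases/download` path, while `releases/tag` URLs return the release's HTML page, which would match the symptom. Assuming the tarball is attached to the v1.8.3 release under that name, the download URL would be `https://github.com/AstraZeneca-NGS/VarDictJava/releases/download/v1.8.3/VarDict-1.8.3.tar`.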
-
I have 8 Tesla V100 32 GB GPUs and set tensor_parallel_size to 8, which should be enough to run meta-llama/Llama-2-70b-chat-hf, but I am getting an out-of-memory error:
```
RuntimeError: CUDA error: out of memory
CUDA …
```
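Not a definitive fix, but a minimal sketch of the engine options that often resolve this on V100s, assuming the vLLM Python API: V100s lack bfloat16 support, so force float16, and cap `max_model_len` to shrink the KV-cache reservation made at start-up.

```python
from vllm import LLM, SamplingParams

# Sketch under assumptions: vLLM backend, 8 visible V100 GPUs.
llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",
    tensor_parallel_size=8,
    dtype="float16",              # V100 (compute capability 7.0) has no bfloat16
    max_model_len=2048,           # smaller context -> smaller KV-cache reservation
    gpu_memory_utilization=0.85,  # leave headroom for activations
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```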
-
I built and installed a custom API at https://7b80-103-253-89-37.ngrok-free.app/api/generate
Everything works fine.
But when I change the endpoint and config template in the LLM VS Code extension to this endpoi…
-
I started an `inf2.48xlarge` EC2 instance, pulled, and got into the [TGI-Neuron DLC with optimum-neuron 0.0.17 installed](https://github.com/aws/deep-learning-containers/releases/tag/v1.0-hf-tgi-0.0.17-pt-1.13.1-inf-n…
-
Hi, thanks for publishing this example.
With Mixtral + TGI, is it actually required to fit the full model in VRAM? Or is it possible to opt for 100 GB+ of system memory with lower GPU capacity?
…
-
We are using Triton Inference Server for model inference and are currently facing throughput bottlenecks with LLM inference. I saw in a public video that NVIDIA has optimized LLM serving by supporting `In…
-
➜ aiac --version
aiac version 5.2.1
We are using a local backend provided by Hugging Face TGI:
```toml
[backends.phi3]
type = "openai"
default_model = "Phi-3"
url = "https://phi3.ourcluster/…
```
-
When starting GPU inference mode with Docker, can an INT4-quantized model be used? Startup seems to fail with an error.