-
Hi there :-)
Is there a way to configure multiple users / concurrent request sessions?
I'd like to simulate how the different backends behave when there is not just 1 user but e.g. 8 users concurrently a…
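In the meantime, concurrency can be simulated from the client side. A minimal sketch, assuming a TGI-style `/generate` endpoint; the URL and payload are placeholders:

```python
import concurrent.futures

import requests

URL = "http://localhost:8080/generate"  # placeholder endpoint
PAYLOAD = {"inputs": "Hello", "parameters": {"max_new_tokens": 32}}

def one_user(i: int) -> float:
    # One simulated user: send a request, return its latency in seconds.
    r = requests.post(URL, json=PAYLOAD, timeout=120)
    r.raise_for_status()
    return r.elapsed.total_seconds()

# 8 users firing concurrently instead of 1.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(one_user, range(8)))

print(f"avg latency over 8 concurrent users: {sum(latencies) / len(latencies):.2f}s")
```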
-
**Is your feature request related to a problem? Please describe.**
Modules that process spatio-temporal data often use date and time input. It would be useful to have some standard parser options for…
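To make the ask concrete, a minimal sketch of the kind of parsing such an option could standardize, using Python's stdlib `datetime` with a `dateutil` fallback; `parse_timestamp` is a hypothetical helper, not an existing API:

```python
from datetime import datetime

from dateutil import parser as du_parser  # pip install python-dateutil

def parse_timestamp(value: str) -> datetime:
    """Hypothetical helper: try strict ISO 8601 first, then fuzzy parsing."""
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return du_parser.parse(value)

print(parse_timestamp("2021-06-01T12:00:00"))   # ISO 8601
print(parse_timestamp("June 1st 2021, 12:00"))  # free-form fallback
```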
-
I have two 2070 Supers and would love to be able to use them in parallel. Would it be possible to enable memory pooling? I know it is in theory supported by PyTorch. Any chance it can be added here so that…
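Not specific to this project, but for reference, a minimal sketch of pooling both cards' memory with transformers' `device_map="auto"` (requires the accelerate package); the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigscience/bloom-3b"  # placeholder model id

# device_map="auto" shards the weights across every visible GPU,
# pooling the memory of both cards instead of using just one.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.float16
)
tok = AutoTokenizer.from_pretrained(MODEL_ID)

inputs = tok("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))
```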
-
Based on watsonx requirements, we should make available at least these metrics (a sketch of exposing them follows the list):
- # of inference requests over a defined time period
- Avg. response time over a defined time period
- # of successf…
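A minimal sketch of exposing such metrics with the standard `prometheus_client` package, assuming Prometheus handles the time-window aggregation; the metric and function names here are illustrative, not a confirmed schema:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names, not a confirmed schema.
REQUESTS = Counter("inference_requests_total", "Inference requests received")
SUCCESSES = Counter("inference_success_total", "Inference requests that succeeded")
LATENCY = Histogram("inference_response_seconds", "Response time in seconds")

def handle(run_inference, payload):
    REQUESTS.inc()
    with LATENCY.time():  # observes the response time of each request
        result = run_inference(payload)
    SUCCESSES.inc()
    return result

# Prometheus scrapes this port; rate() over the counters and histogram
# then yields counts and average response time over any time window.
start_http_server(9090)
```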
-
I'm trying to deploy Llama3 8B on GKE using optimum but I'm running into some trouble.
I'm following the instructions here: https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference. I bu…
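Once the pod is up, a quick smoke test against TGI's `/generate` route can help separate deployment problems from server problems. A minimal sketch, assuming the service is port-forwarded to localhost:8080:

```python
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # assumes kubectl port-forward to the service
    json={"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 32}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```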
-
Request
The ask is to introduce an OpenAI text generation API compatibility layer (chat completions endpoint) in kserve/TGIS.
Why
Having an OpenAI API compatibility layer will allow more open sourc…
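For concreteness, the request shape such a layer would accept is the standard OpenAI chat completions payload. A minimal sketch against a hypothetical kserve/TGIS host; the host and model name are placeholders:

```python
import requests

resp = requests.post(
    "http://tgis.example.com/v1/chat/completions",  # hypothetical host
    json={
        "model": "llama-2-7b-chat",  # placeholder model name
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```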
-
Hi guys, love your project!
I was wondering if you could add support for Mistral via one of the following (a short vLLM sketch follows the list):
- [TGI](https://github.com/huggingface/text-generation-inference)
- [vllm](https://github.com/vllm-project/vllm)…
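For reference, a minimal sketch of what serving Mistral through vLLM's offline Python API looks like, assuming a vLLM version with Mistral support; the model id is as published on the HF Hub:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")  # pulls from the HF Hub
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain mixture-of-experts in one sentence."], params)
print(outputs[0].outputs[0].text)
```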
-
I am trying to run TGI on Docker using 8 GPUs with 16 GB each, using the following command:
```
docker run --gpus all --name tgi --shm-size 1g --cpus="5.0" --rm --runtime=nvidia -e HUGGING_FACE_HUB_TOKEN=…
```
-
Background:
TGI is being adapted to lightllm. When loading a model across multiple GPUs, one process is spawned per GPU used, and each process loads the entire model into host memory.
When the model files are very large, e.g. models of 65B or above, loading on 8 GPUs requires 8 × 130 GB of memory, which is clearly unreasonable and leads to OOM.
Proposed solution:
lightllm could provide a load_from_weight_dict(weight_dict) interface. The TGI layer would pass in a weight dict…
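A minimal sketch of what such an interface could look like, paired with safetensors' lazy `safe_open` so each rank only keeps the tensors for its own shard; the body of `load_from_weight_dict` and the dim-0 sharding rule are hypothetical:

```python
import torch
from safetensors import safe_open

def load_from_weight_dict(model: torch.nn.Module, weight_dict: dict) -> None:
    """Hypothetical lightllm hook: consume a caller-provided weight dict."""
    model.load_state_dict(weight_dict, strict=False)

def shard_for_rank(path: str, rank: int, world_size: int) -> dict:
    # safe_open memory-maps the checkpoint; tensors are only materialized
    # when get_tensor() is called, so no rank holds the full state dict.
    shard = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            # Hypothetical rule: split every weight along dim 0.
            shard[name] = f.get_tensor(name).chunk(world_size, dim=0)[rank]
    return shard
```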
-
Hello,
I have a Space that worked until yesterday; it was created with the standard chat-ui Docker image. Since today, whenever it is built, I get the following error:
```
--> RUN npm run build
…
```