-
I have the following problem:
```
model=Honkware/openchat_8192-GPTQ
text-generation-launcher --model-id $model --num-shard 1 --quantize gptq --port 8080
```
```
Traceback (most recent call las…
```
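For reference, once the launcher does come up, a request like the following exercises it (a sketch only; the prompt and parameters are illustrative, and TGI serves `POST /generate` on the configured port):

```
import requests

# Hypothetical smoke-test request against the server launched above.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is GPTQ quantization?",
        "parameters": {"max_new_tokens": 64},
    },
)
print(resp.json())
```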
-
### What is the issue?
While testing llava-llama3 on an agentic task of interpreting an image and generating an action, I specified the role of the environment messages as 'environment'. This leads to ollama…
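For context, the request shape involved looks roughly like this (a minimal sketch against ollama's `/api/chat` endpoint; the message contents are illustrative, and 'environment' is the non-standard role in question):

```
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llava-llama3",
        "messages": [
            {"role": "user", "content": "What should I do next?"},
            # Non-standard role: ollama documents system/user/assistant/tool,
            # so this message is where the behavior diverges.
            {"role": "environment", "content": "The screen shows a login form."},
        ],
        "stream": False,
    },
)
print(resp.json())
```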
-
### What happened?
Configuring the text encoders (TEs) as follows:
```
"text_encoder": {
"train": false,
"learning_rate": 2e-8,
"layer_skip": 0,
"weight_dtype": "FLOAT_32",
"stop_trainin…
-
```
data_url = data_url_from_image("dog.jpg")
print("The obtained data url is", data_url)
iterator = client.inference.chat_completion(
    model=model,
    messages=[
        {
            "role": "…
```
-
Hi team, QQ: does `lightseq` support the following?
- Convert HuggingFace BERT/RoBERTa models to `int8` precision directly
- If yes, can the converted model be exported to ONNX format directly? (See the sketch after this list for the non-quantized baseline.)
- …
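Not lightseq-specific, but for reference, the plain `torch.onnx.export` route for a HuggingFace BERT in FP32 looks like this (the int8 conversion asked about above is exactly what this does not cover; the model and axis names are illustrative):

```
import torch
from transformers import BertModel, BertTokenizer

# torchscript=True makes the model return tuples, which the ONNX tracer expects.
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval()
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("hello world", return_tensors="pt")

torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
)
```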
-
### Your current environment
```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --max-model-len 8192 --served-model-name chat-v2.0 --model /workspace/chat-v2.0 --enforce-eager --tensor-paral…
```
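Once that server is up, it speaks the OpenAI-compatible API, so a request like the following exercises it (a sketch assuming the default port 8000, since `--port` is not shown; the prompt is illustrative):

```
from openai import OpenAI

# Points at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="chat-v2.0",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```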
-
### System Info
I am working on the benchmarking suite on the vLLM team, and am now trying to run TensorRT-LLM for comparison. I am relying on this GitHub repo (https://github.com/neuralmagic/tensorrt-demo)…
-
**Is your feature request related to a problem? Please describe.**
This issue is similar to the one mentioned here: https://github.com/triton-inference-server/server/issues/7287. I'd like to file an …
-
Currently I'm using an LLM to generate streaming responses, and I found that Triton only supports streaming output through the gRPC protocol. [https://docs.nvidia.com/deeplearning/triton-inference-server/…
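For context, the gRPC streaming path with `tritonclient` looks roughly like this (the model name, tensor names, and shapes are illustrative assumptions; a decoupled model delivers one callback per streamed chunk):

```
import numpy as np
import tritonclient.grpc as grpcclient

def on_stream(result, error):
    # Decoupled models invoke this once per streamed chunk.
    if error is not None:
        print("error:", error)
    else:
        print("chunk:", result.as_numpy("text_output"))

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=on_stream)

prompt = np.array([["Tell me a short story."]], dtype=object)
text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(prompt)

client.async_stream_infer(model_name="my_llm", inputs=[text_input])
client.stop_stream()  # blocks until in-flight responses are delivered
```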
-
## Problem description
## Reproduction
1. Were you able to run the provided [tutorials](https://github.com/PaddlePaddle/PaddleX/tree/develop/tutorials) successfully?
Yes, they ran normally.
2. Did you modify the tutorial code? If so, please provide the code you ran.
No.
3. Which dataset did you use?
One I annotated myself.
4. Please provide the error message and relevant logs.
[2024/10/1…