-
Hello,
When using the attention sink with Qwen-14B, I get the following error: `TypeError: 'NoneType' object is not subscriptable`.
My script is as follows:
```python
import torch
from transformers import AutoToken…
```
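For reference, here is a minimal repro sketch along the lines the truncated script suggests; the model id, dtype, and generation call are assumptions, not taken from the original script:

```python
# Minimal sketch (assumed setup): load Qwen-14B and run a short generation,
# which exercises the KV-cache path where the reported TypeError surfaces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-14B"  # assumed checkpoint; Qwen requires trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```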
-
Please solve the problem in this code:
```python
import torch
import uvicorn
import gc
import asyncio
import argparse
import io
from fastapi import FastAPI, WebSocket, Depends
from fastapi.responses …
```
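Since the snippet is cut off before the actual logic, here is a minimal self-contained sketch of the kind of FastAPI WebSocket app those imports suggest; the route name and echo handling are placeholders, not the original code:

```python
# Minimal runnable FastAPI WebSocket app (placeholder logic, assumed route).
import uvicorn
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        text = await websocket.receive_text()
        # Placeholder: real handling (e.g. model inference) would go here.
        await websocket.send_text(f"received: {text}")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```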
-
The onnx_export.py script fails to export the v2 model:
```shell
python onnx_export.py
```
Output:
```text
G:\GPT-SoVITS\.venv\Lib\site-packages\gradio_client\documentation.py:103: UserWarning: Could not get documentation grou…
```
-
Any chance you could support [this](https://github.com/mustache/spec/pull/75) proposal?
Mustache.php implemented it [nicely](https://github.com/bobthecow/mustache.php/wiki/BLOCKS-pragma).
-
How can I integrate the Llama 2 7B model with this streaming LLM? The model is an already-pretrained version; will it work here?
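For what it's worth, a pretrained checkpoint is exactly what StreamingLLM expects, since it only changes KV-cache handling at inference time. A rough sketch of wiring up Llama 2 7B (the import path and `enable_streaming_llm` signature follow the repo's example script; treat them as assumptions):

```python
# Sketch: wrap a pretrained Llama 2 7B with StreamingLLM's attention-sink cache.
# Helper import path and signature assumed from the repo's example script.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from streaming_llm.enable_streaming_llm import enable_streaming_llm

model_id = "meta-llama/Llama-2-7b-hf"  # any pretrained Llama 2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Keep the first 4 "sink" tokens plus a sliding window of recent tokens.
kv_cache = enable_streaming_llm(model, start_size=4, recent_size=2000)
```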
-
**Is your feature request related to a problem? Please describe.**
We are building a serving solution for DL logic using PyTriton at work. We would like to separate the client stubs from …
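For context, a standalone PyTriton client stub today looks roughly like this (the model name, URL, and input shape below are placeholders, not from our setup):

```python
# Sketch of a standalone PyTriton client stub; model name, URL, and input
# shape are placeholders.
import numpy as np
from pytriton.client import ModelClient

with ModelClient("localhost", "my_model") as client:
    batch = np.random.rand(2, 16).astype(np.float32)
    # Inputs are passed as numpy batches; results come back as a dict
    # keyed by the model's output names.
    result = client.infer_batch(batch)
    print(result)
```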
-
### Question Validation
- [X] I have searched both the documentation and discord for an answer.
### Question
I was using 3900 tokens before while using ChatMemoryBuffer from LlamaIndex.
Facing i…
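For reference, this is the usual way the buffer's token limit is configured (the import path assumes a recent llama-index layout; older releases expose it under `llama_index.memory`):

```python
# Sketch: cap ChatMemoryBuffer at 3900 tokens, matching the value above.
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3900)

# The buffer is then handed to a chat engine, e.g.:
# chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)
```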
-
server:
```
export CUDA_VISIBLE_DEVICES="3,4,5,6"
python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --tokenizer-path lmms-lab/llavanext-qwen-tokenizer --port=30010 --host="0.0.0.0…
```
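Client side, a minimal sketch that talks to that server via sglang's frontend API (the prompt, image path, and sampling settings are placeholders):

```python
# Sketch: query the llava-next-72b server above via sglang's frontend.
# Prompt, image path, and max_tokens are placeholders.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30010"))

@sgl.function
def describe(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

state = describe.run(image_path="example.jpg", question="Describe this image.")
print(state["answer"])
```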
-
### System Info
- GPU: nvidia A30
- TensorRT-LLM: commit [32ed92e](https://github.com/chiendb97/TensorRT-LLM/commit/32ed92e4491baf2d54682a21d247e1948cca996e)
- Nvidia driver: 535.86.10
- Ubuntu 22.04…
-
I have used the `nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3` Docker image, and I use an engine built in a TensorRT-LLM container (`tensorrt_llm/release:latest`) by
```
python build.py --model_…
```