-
When I tried to serve LLaMA on a `v3_8` TPU as suggested in the example script, I encountered some errors.
**Environment**
* TPU: `v3-8`
* Software: `tpu-vm-base`
**Command**
```
$ git clone https:…
-
## Describe the bug
Have a look :-)
https://github.com/user-attachments/assets/321dbb21-2403-4330-9ce1-091902298888
## Latest commit or version
0.22
MBP M3 Max
-
**Is your feature request related to a problem? Please describe.**
Currently, when the tokenized string is shorter than max_length, the output is padded with 0s. So if `max(tokenized string lengths)` <…
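The behavior being requested appears to be dynamic padding (pad only to the longest sequence in the batch rather than to a fixed `max_length`). A minimal sketch in plain Python, assuming pad id 0 as in the report:

```python
# Sketch of fixed vs. dynamic padding; pad_id=0 mirrors "padded with 0s".

def pad_batch(token_id_lists, max_length, pad_id=0):
    """Pad every sequence in the batch out to max_length with pad_id."""
    return [ids + [pad_id] * (max_length - len(ids)) for ids in token_id_lists]

def pad_to_longest(token_id_lists, pad_id=0):
    """Dynamic padding: pad only to the longest sequence in the batch."""
    longest = max(len(ids) for ids in token_id_lists)
    return pad_batch(token_id_lists, longest, pad_id)

batch = [[5, 7, 9], [3, 4]]
print(pad_batch(batch, 6))    # -> [[5, 7, 9, 0, 0, 0], [3, 4, 0, 0, 0, 0]]
print(pad_to_longest(batch))  # -> [[5, 7, 9], [3, 4, 0]]
```

When `max(tokenized string lengths)` is well below `max_length`, the dynamic variant avoids computing over rows of padding.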
-
Is it possible to stream each token of the output as soon as it is generated by the model? I assume this depends on the Hugging Face Transformers classes and methods used. Is there a solution for this?
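The pattern being asked for is essentially a generator that yields one token per decode step instead of returning the full output. A minimal sketch, with `fake_generate` as a hypothetical stand-in for a real model's decode loop:

```python
# Sketch of token-by-token streaming via a Python generator.
# fake_generate is a placeholder "model"; a real decode loop would run one
# forward pass per step and yield the sampled token id.

from typing import Iterator, List

def fake_generate(prompt_ids: List[int], steps: int = 3) -> Iterator[int]:
    """Yield one new token id per decode step instead of the full output."""
    token = sum(prompt_ids)
    for _ in range(steps):
        token = (token * 31 + 7) % 1000  # placeholder for a model step
        yield token  # caller sees each token as soon as it exists

for tok in fake_generate([1, 2, 3]):
    print(tok)
```

In Hugging Face Transformers this role is played by the `streamer` argument of `generate` (e.g. `TextIteratorStreamer`, typically with `generate` running in a background thread while the main thread iterates over the streamer).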
-
How do I add an EOS token?
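At the token-id level this usually amounts to appending the EOS id to the encoded sequence. A minimal sketch, assuming `eos_id=2` (a common but not universal choice; with Hugging Face tokenizers the real value is `tokenizer.eos_token_id`):

```python
# Sketch of appending an EOS token id to an encoded sequence.
# eos_id=2 is an assumption for illustration; use your tokenizer's value.

def add_eos(token_ids, eos_id=2):
    """Append the EOS id, but only if it is not already the last token."""
    if token_ids and token_ids[-1] == eos_id:
        return token_ids
    return token_ids + [eos_id]

print(add_eos([10, 11, 12]))  # -> [10, 11, 12, 2]
print(add_eos([10, 2]))       # -> [10, 2] (already terminated)
```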
-
## Brief Overview
Downloading, saving, and preprocessing large datasets from the `datasets` library can often result in [performance bottlenecks](https://github.com/huggingface/datasets/issues/3735).…
-
I deployed the converted StarCoder model to Triton with a world size of 2 and enabled streaming inference with `streaming=True`. However, I encountered an issue where the rank 1 model is unable to retri…
-
Please help me fix the problem in this code:
import torch
import uvicorn
import gc
import asyncio
import argparse
import io
from fastapi import FastAPI, WebSocket, Depends
from fastapi.responses …
-
### System Info
CPU: x86_64; memory: 1024 GB; GPU: 8× A6000 (48 GB each); TensorRT-LLM version: 0.9.0.DEV20240226; NVIDIA driver version: 535.171.04; CUDA version: 12.2; OS: Ubuntu 22.04
### Who can hel…
-
Hi, I am interested in evaluating OpenChat (https://github.com/evalplus/evalplus/issues/60, https://github.com/evalplus/evalplus/issues/61) and want to understand what could be a minimal and self-cont…