-
I am planning to use Triton's Python backend to serve an LLM model in PyTorch; more specifically, I want to implement token streaming, and hence, based on the suggestions I read here [https://github.…
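The core of token streaming in a decoupled backend is sending one response per generated token plus a final-flag response. Below is a minimal, framework-free sketch of that pattern; `FakeResponseSender` is a hypothetical stand-in for the real sender that Triton's Python backend provides (in the actual backend, something like `request.get_response_sender()` and `pb_utils.InferenceResponse` would play this role):

```python
import queue

class FakeResponseSender:
    """Hypothetical stand-in for Triton's decoupled response sender.

    In a real Python-backend model.py, the sender comes from the request
    object and responses are pb_utils.InferenceResponse instances; here we
    just record (token, final) pairs to illustrate the streaming shape.
    """
    def __init__(self):
        self.sent = queue.Queue()

    def send(self, token=None, final=False):
        self.sent.put((token, final))

def stream_tokens(sender, tokens):
    # Emit one response per generated token, then an empty response
    # carrying only the "final" flag, mirroring the decoupled pattern.
    for tok in tokens:
        sender.send(token=tok)
    sender.send(final=True)

sender = FakeResponseSender()
stream_tokens(sender, ["Hello", ",", " world"])
```

The key design point is that the client can start rendering after the very first `send`, instead of waiting for the whole sequence.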
-
With the rise of APIs that use server-sent events (SSE), such as ChatGPT, it is becoming increasingly common to want to load-test and measure time-to-first-byte (TTFB).
For example, TTFB can be a prox…
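TTFB for a streaming response can be measured by starting a clock before the request and stopping it when the first chunk arrives. A minimal, transport-agnostic sketch (the `slow_stream` generator is an assumption standing in for a real SSE response body, e.g. the iterator a streaming HTTP client would return):

```python
import time

def measure_ttfb(chunks):
    """Return (ttfb_seconds, full_body) for an iterable of byte chunks."""
    start = time.monotonic()
    ttfb = None
    body = bytearray()
    for chunk in chunks:
        if ttfb is None:
            # First byte observed: this is the TTFB.
            ttfb = time.monotonic() - start
        body.extend(chunk)
    return ttfb, bytes(body)

def slow_stream():
    # Simulated SSE stream: first event arrives after ~50 ms.
    time.sleep(0.05)
    yield b"data: hello\n\n"
    yield b"data: world\n\n"

ttfb, body = measure_ttfb(slow_stream())
```

Note that `time.monotonic()` is used rather than `time.time()` so the measurement is immune to wall-clock adjustments.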
-
Cohere's new Command-R-Plus model reportedly features a 128k context window. However, testing with progressively longer prompts reveals it begins producing nonsensical output (e.g., "\\...") after 819…
-
Use this thread for general discussion and debate regarding the Character Card Spec V2. **Anyone** may freely use this thread to discuss the spec. However, if you are an owner or representative for a …
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
### Describe the bug
I want to use two 4090 GPUs to deploy Qwen/…
-
The GPU naming in the [SCS Flavor Naming Standard](https://github.com/SovereignCloudStack/standards/blob/main/Standards/scs-0100-v3-flavor-naming.md#optional-gpu-support) needs further refinement. The following …
-
I have encountered an issue when attempting to run the `vllm_inference.py` script from the Modal Examples repository. Below are the steps I followed and the error I encountered:
### Steps to Reprod…
-
### Bug Description
After building the flow and using the Playground to test it, the API call returns the last response from the Playground.
### Reproduction
1. Create the flow.
2. Run test in playground.…
-
It seems that, for now, MLC is trying to load all weights onto a single GPU card. After convert_weight/gen_config/compile, it reports an error when it is ready to serve:
```
AssertionError: Cannot estimat…
-
### Your current environment
```text
The output of `python collect_env.py`
```
```
:128: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', bu…