-
### Self Checks
- [X] This template is only for bug reports. For questions, please visit [Discussions](https://github.com/fishaudio/fish-speech/discussions).
- [X] I have thoroughly reviewed the proj…
-
!pip install -U airllm
!pip install -U bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/ac…
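AirLLM's headline technique is layer-sharded inference: only one transformer layer's weights need to be resident on the GPU at a time. A back-of-the-envelope sketch of why that matters for VRAM (all model dimensions below are illustrative assumptions for a Llama-style 70B model, not measurements from airllm itself):

```python
# Rough VRAM estimate for layer-sharded inference.
# The 70e9-parameter / 80-layer shape is an assumption for illustration.

def weight_bytes(params: float, bytes_per_param: int = 2) -> float:
    """Bytes needed to hold `params` parameters at fp16 (2 bytes each)."""
    return params * bytes_per_param

def full_model_gib(total_params: float) -> float:
    """VRAM (GiB) to hold every layer at once."""
    return weight_bytes(total_params) / 2**30

def layered_gib(total_params: float, n_layers: int) -> float:
    """VRAM (GiB) when only one layer is resident at a time."""
    return weight_bytes(total_params / n_layers) / 2**30

full = full_model_gib(70e9)        # ~130 GiB: beyond any single consumer GPU
per_layer = layered_gib(70e9, 80)  # ~1.6 GiB: fits on a small card
print(f"full model: {full:.1f} GiB, one layer: {per_layer:.1f} GiB")
```

This ignores activations and the KV cache, but it captures why the layered approach makes very large models loadable at all.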
-
I am using WSL with CUDA 12.4 installed. I ran: `pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation`
Then the command `insanely-fast-whisper --model-name "openai/whisper-large-v3…
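When debugging this kind of setup, a small dependency-guarded script can confirm whether `flash_attn` and a CUDA-enabled `torch` are actually visible from the interpreter being used (the helper name is mine; the checks are deliberately guarded so the script runs even where neither package is installed). Note that pipx creates its own venv, so this needs to run with the interpreter inside that venv, not the system Python:

```python
import importlib.util

def env_report() -> dict:
    """Report whether flash_attn and CUDA-enabled torch are importable.

    Purely diagnostic: uses find_spec so missing packages do not raise.
    """
    report = {
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch
            report["cuda_available"] = torch.cuda.is_available()
        except Exception:
            # A broken torch install still yields a report instead of a crash.
            pass
    return report

print(env_report())
```

If `flash_attn_installed` is False here but the `pipx runpip` install succeeded, the two commands are almost certainly talking to different environments.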
-
I noticed that the current integration uses Faster Whisper for transcription. I would like to suggest replacing it with “Insanely Fast Whisper” for improved performance, especially in GPU-based enviro…
-
### 🚀 The feature, motivation and pitch
Right now, the SDPA backend is selected from a static priority-order list:
```cpp
std::array priority_order(sdp_params const& params) {
constexpr…
```
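In Python terms, the dispatch this C++ implements amounts to walking a fixed list and returning the first backend whose preconditions hold. A toy sketch of that pattern (backend names echo PyTorch's `SDPBackend` choices, but the constraint checks are simplified stand-ins, not the real eligibility rules):

```python
from dataclasses import dataclass

@dataclass
class SDPParams:
    """Simplified stand-in for sdp_params: only the fields the checks read."""
    head_dim: int
    dtype: str          # "float16" | "bfloat16" | "float32"

def can_use_flash(p: SDPParams) -> bool:
    # Stand-in constraint: flash kernels want fp16/bf16 and a modest head_dim.
    return p.dtype in ("float16", "bfloat16") and p.head_dim <= 256

def can_use_efficient(p: SDPParams) -> bool:
    # Memory-efficient attention is more permissive in this toy model.
    return p.head_dim <= 512

def select_backend(p: SDPParams) -> str:
    """Return the first backend in the static priority order whose check passes."""
    priority_order = [
        ("flash_attention", can_use_flash),
        ("efficient_attention", can_use_efficient),
        ("math", lambda _p: True),  # math fallback always applies
    ]
    for name, check in priority_order:
        if check(p):
            return name
    raise RuntimeError("unreachable: math backend always applies")

print(select_backend(SDPParams(head_dim=128, dtype="float16")))   # flash_attention
print(select_backend(SDPParams(head_dim=1024, dtype="float32")))  # math
```

The point of the feature request follows directly from this shape: because the list is fixed at compile time, there is no hook for a caller to reorder it per workload.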
-
### System Info
Hi!
I'm running a TRT-LLM engine with speculative execution and a generation length of 4 or 5, and I noticed that FP8 KV-cache attention runs slower than FP16 KV-cache attention. Would be grea…
-
I have just read your paper "LINA-SPEECH: GATED LINEAR ATTENTION IS A FAST AND PARAMETER-EFFICIENT LEARNER FOR TEXT-TO-SPEECH SYNTHESIS" and I must say, I am truly amazed by the effectiveness of your …
-
From my understanding, flex attention (using `block_mask`) gets faster when the number of empty blocks is larger. If the inputs (Q, K, V) do not represent sequences, but graphs with local connectivity…
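That intuition can be made concrete by counting nonempty blocks: flex attention only launches work for tiles the block mask marks as nonempty, so a graph whose local connectivity touches few tiles should see a roughly proportional win. A pure-Python sketch of the counting (the block size and masks are toy assumptions, not `create_block_mask` itself):

```python
def nonempty_block_fraction(mask, block=4):
    """Fraction of (block x block) tiles of a boolean attention mask that
    contain at least one True entry — a proxy for flex-attention work."""
    n = len(mask)
    tiles = nonempty = 0
    for i in range(0, n, block):
        for j in range(0, n, block):
            tiles += 1
            if any(mask[r][c]
                   for r in range(i, min(i + block, n))
                   for c in range(j, min(j + block, n))):
                nonempty += 1
    return nonempty / tiles

n = 16
dense = [[True] * n for _ in range(n)]
# Band mask: each node attends only to neighbors within distance 1,
# mimicking a graph with local connectivity.
band = [[abs(i - j) <= 1 for j in range(n)] for i in range(n)]

print(nonempty_block_fraction(dense))  # 1.0: every tile does work
print(nonempty_block_fraction(band))   # 0.625: over a third of tiles skipped
```

The skipped fraction grows with sequence length for a fixed bandwidth, which is why graph-structured inputs with local connectivity look like a good fit for block masks.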
-
Hello lyuwenyu,
First of all, thank you for your amazing work on RT-DETR! I’ve just started learning about object detection models, and I truly appreciate the innovations that make RT-DETR both fas…
-
### System Info
DGX H100
### Who can help?
When building the engine with:
```sh
trtllm-build --fast-build --model_config $model_cfg
```
and then benchmarking with gptManagerBenchmark, it repo…