NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

LLM in TTS #2342

Status: Open. CallmeZhangChenchen opened this issue 1 month ago.

CallmeZhangChenchen commented 1 month ago

https://github.com/FunAudioLLM/CosyVoice/blob/main/cosyvoice/llm/llm.py

for i in range(max_len):
    # one decoder step over the current input, reusing attention / conv caches
    y_pred, att_cache, cnn_cache = self.llm.forward_chunk(
        lm_input, offset=offset, required_cache_size=-1,
        att_cache=att_cache, cnn_cache=cnn_cache,
        att_mask=torch.tril(torch.ones((1, lm_input.shape[1], lm_input.shape[1]),
                                       device=lm_input.device)).to(torch.bool))
    logp = self.llm_decoder(y_pred[:, -1]).log_softmax(dim=-1)
    top_ids = self.sampling_ids(logp.squeeze(dim=0), out_tokens, sampling,
                                ignore_eos=True if i < min_len else False).item()
    # stop once the end-of-speech token is sampled
    if top_ids == self.speech_token_size:
        break
    # in stream mode, yield token one by one
    yield top_ids
    out_tokens.append(top_ids)
    offset += lm_input.size(1)
    lm_input = self.speech_embedding.weight[top_ids].reshape(1, 1, -1)

For LLM code in a TTS model like this, is TensorRT-LLM suitable to use? Which demo would you advise referring to?

Superjomn commented 1 month ago

We have the LLM API for end-to-end generation; you can give it a try. Here are the demos for it.
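
For context, a minimal sketch of that high-level LLM API (an illustration, not taken from this issue; the model name and sampling values are placeholders, and exact parameter names depend on your TensorRT-LLM version):

from tensorrt_llm import LLM, SamplingParams

# Build/load an engine from a Hugging Face model name or a local checkpoint (placeholder path).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Sampling settings are illustrative; check the LLM API examples for the options your version supports.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# End-to-end generation: prompts in, decoded text out.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)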

CallmeZhangChenchen commented 1 month ago

@Superjomn The LLM API doesn't feel like the right fit to use directly, so I'm now working in the source code to add support for the TTS model. I ran into an operator, RelPositionMultiHeadedAttention, which has no implementation in TensorRT-LLM; could you consider adding it?

That said, I'm not sure whether there is any existing code that could be reused, e.g. set_rel_attn_table or precompute_relative_attention_bias.
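
For reference, a rough plain-PyTorch sketch (not TensorRT-LLM code) of the score that RelPositionMultiHeadedAttention computes in the WeNet/ESPnet-style modules CosyVoice builds on, i.e. the Transformer-XL-style content and position terms; the rel_shift step that some variants apply is omitted, and the shapes follow the usual convention of those implementations:

import math
import torch

def rel_pos_scores(q, k, p, pos_bias_u, pos_bias_v):
    # q, k: (batch, head, time, d_k) projected queries and keys
    # p:    (batch, head, time, d_k) projected relative position embeddings
    # pos_bias_u, pos_bias_v: (head, d_k) learned biases ("u" and "v" in Transformer-XL)
    matrix_ac = torch.matmul(q + pos_bias_u.unsqueeze(1), k.transpose(-2, -1))  # content term (q + u)^T k
    matrix_bd = torch.matmul(q + pos_bias_v.unsqueeze(1), p.transpose(-2, -1))  # position term (q + v)^T p
    return (matrix_ac + matrix_bd) / math.sqrt(q.size(-1))

(The helpers mentioned above, set_rel_attn_table and precompute_relative_attention_bias, appear to target the T5-style bucketed relative attention bias used by the encoder-decoder models, which is a different parameterization from the term sketched here.)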

github-actions[bot] commented 1 day ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.