OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

Is there any way to reduce latency? #29

kaen2891 closed this issue 1 month ago

kaen2891 commented 1 month ago

Thank you for sharing your work.

When I reduced max_token_len to 100 or 200 (the default is 500), the budget was too small to hold all of the generated speech tokens, so the waveform could not be synthesized.

Moreover, generating speech tokens takes a long time, which makes evaluation slow. If we want to reduce the latency of speech-token generation, which part of the code should we modify? Links to the relevant code would be very helpful.
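
The truncation described above can be checked with a quick estimate before running generation. The sketch below is a back-of-the-envelope illustration, not code from AnyGPT: the function names are made up for this example, and the token rate and text-prefix length are hypothetical placeholders that should be replaced with values measured from your own outputs.

```python
import math


def required_token_budget(speech_seconds, tokens_per_second, text_prefix_tokens=0):
    """Estimate how many generated tokens a full speech reply needs.

    All three inputs are assumptions supplied by the caller; nothing here
    is read from the AnyGPT tokenizer or config.
    """
    if speech_seconds < 0 or tokens_per_second <= 0 or text_prefix_tokens < 0:
        raise ValueError("durations and counts must be non-negative, rate must be positive")
    return text_prefix_tokens + math.ceil(speech_seconds * tokens_per_second)


def is_truncated(budget, speech_seconds, tokens_per_second, text_prefix_tokens=0):
    """Return True if the generation budget cannot hold the whole reply."""
    needed = required_token_budget(speech_seconds, tokens_per_second, text_prefix_tokens)
    return budget < needed


# Hypothetical numbers: a 5-second reply at 50 speech tokens per second,
# preceded by 20 text tokens.
needed = required_token_budget(5.0, 50, text_prefix_tokens=20)
too_small = is_truncated(200, 5.0, 50, text_prefix_tokens=20)
large_enough = not is_truncated(500, 5.0, 50, text_prefix_tokens=20)
```

Under these placeholder numbers a budget of 200 cuts the reply short while 500 does not, which matches the behaviour reported above. The practical consequence is that lowering the token limit shortens or breaks the output rather than speeding up each token.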

JunZhan2000 commented 1 month ago

Hello, we have not tried this, but in theory mature solutions should already exist. We use the Llama 2 architecture directly, so you should be able to apply existing Llama 2 quantization or compression methods as they are.
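
As one possible reading of this suggestion, the sketch below loads a Llama-architecture checkpoint with 4-bit weights through the Hugging Face `transformers` integration with `bitsandbytes`. It is untested against AnyGPT: that the AnyGPT checkpoint loads through `AutoModelForCausalLM` is an assumption, and `load_llama_4bit` and `time_call` are hypothetical helper names. 4-bit loading mainly reduces memory use, and whether it also reduces latency depends on the hardware and kernels, so a small timing helper is included for measuring before and after.

```python
import time


def load_llama_4bit(model_path):
    """Load a Llama-architecture checkpoint with 4-bit quantized weights.

    Requires torch, transformers, accelerate, bitsandbytes and a CUDA GPU.
    Assumption: the checkpoint at `model_path` is loadable by
    AutoModelForCausalLM; this has not been verified for AnyGPT.
    """
    # Imported lazily so the timing helper below works without these packages.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    return AutoModelForCausalLM.from_pretrained(
        model_path,
        quantization_config=quant_config,
        device_map="auto",
    )


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

To compare, wrap the same generation call for the original and the quantized model in a zero-argument function and pass each to `time_call`, keeping the prompt and generation settings identical.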

kaen2891 commented 1 month ago

Thank you for the reply. If we want to generate only text tokens (ignoring the speech tokens from AnyGPT), how can we do that?

JunZhan2000 commented 1 month ago

That is straightforward, because our training data also included plain-text dialogue. This type of data begins with a specific prompt, and you can use that same prompt to hold a text-only dialogue. See https://github.com/OpenMOSS/AnyGPT/blob/main/anygpt/src/m_utils/prompter.py#L19