Improving Long-Form Generation with Customizable Chunking Methods and Output Verification

Is your feature request related to a problem? Please describe.
The current chunking method can split text into parts that are too long for the model, leading to reduced quality, skipped words, or hallucinations.

Describe the solution you'd like
- Provide additional chunking methods, such as splitting on every sentence, every comma, or other customizable delimiters.
- Add an output verification feature. This could display each chunk's generated audio below the corresponding text, so that each part can be listened to and reviewed individually. If a particular chunk sounds unnatural or contains errors, we could re-generate just that chunk rather than the entire output.
- Consider using pre-generated reference voice embeddings to speed up re-generation of specific parts of the output.
- Before the split audio parts are merged into the final output, offer an option to insert a configurable silence (X seconds or milliseconds) between parts.
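To make the first and last points more concrete, here is a rough sketch of the behaviour I have in mind. It is only an illustration: the function names `split_text` and `join_with_silence`, their parameters, and the use of plain Python lists for audio samples are my own assumptions and are not part of CosyVoice.

```python
import re
from typing import List, Sequence


def split_text(text: str, delimiters: str = ".!?", max_chars: int = 200) -> List[str]:
    """Split text after any delimiter character, then pack adjacent pieces
    into chunks of at most max_chars characters (hypothetical helper).

    With max_chars=0 every delimiter-terminated piece becomes its own chunk.
    """
    if not delimiters:
        raise ValueError("at least one delimiter character is required")
    # Zero-width split right after each delimiter, so the delimiter stays
    # attached to the piece it terminates (needs Python 3.7+).
    pattern = "(?<=[" + re.escape(delimiters) + "])"
    pieces = [p.strip() for p in re.split(pattern, text) if p.strip()]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + " " + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than max_chars is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


def join_with_silence(
    parts: Sequence[Sequence[float]], sample_rate: int, silence_ms: int
) -> List[float]:
    """Concatenate per-chunk audio, inserting silence_ms of zeros between
    consecutive parts (audio is assumed to be mono float samples)."""
    gap = [0.0] * (sample_rate * silence_ms // 1000)
    merged: List[float] = []
    for index, part in enumerate(parts):
        if index > 0:
            merged.extend(gap)
        merged.extend(part)
    return merged


if __name__ == "__main__":
    text = "Hello, world. How are you? Fine."
    print(split_text(text, delimiters=".?", max_chars=0))
    print(split_text(text, delimiters=",.?", max_chars=0))
    print(split_text(text, delimiters=".?", max_chars=30))
```

Because each chunk is a separate list entry, a single bad chunk could be synthesized again and swapped into `parts` before calling `join_with_silence`, without touching the others.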

Describe alternatives you've considered
One possible alternative is to use batch generation with text files or custom formatting, which could address the issues mentioned above, although it would require more manual effort from users.

Additional context
Thank you so much for your hard work!

FunAudioLLM / CosyVoice, issue #599