Plachtaa / seed-vc

zero-shot voice conversion & singing voice conversion with in-context learning
GNU General Public License v3.0
260 stars 27 forks

Generated audio duration issue #13

Open ji1360677347 opened 1 week ago

ji1360677347 commented 1 week ago

For target audio longer than 30 seconds, the generated result is capped at 30 seconds. I hope this can be improved to support longer outputs. Thank you for your hard work!

Drakni01 commented 6 days ago

Hello,

First of all, thank you for the recent modifications regarding the audio length. I tested the new feature that allows generating audio longer than 30 seconds, and I noticed that while it works, the quality of the resulting audio seems to degrade when exceeding 30 seconds. The output sounds a bit off, as if the quality drops noticeably after that point.

I was wondering if it might be a good idea to handle the source audio by splitting it into blocks — for example, 30 seconds and then the remainder (e.g., 20 seconds for a 50-second clip). However, I'm not sure if the transition between these blocks would be smooth. On the other hand, the reference audio could continue using the current modification, where the duration can be chosen to run continuously without splitting.

I just wanted to share my experience and feedback in case it's helpful. Thank you again for your great work!

Plachtaa commented 6 days ago

> Hello,
>
> First of all, thank you for the recent modifications regarding the audio length. I tested the new feature that allows generating audio longer than 30 seconds, and I noticed that while it works, the quality of the resulting audio seems to degrade when exceeding 30 seconds. The output sounds a bit off, as if the quality drops noticeably after that point.
>
> I was wondering if it might be a good idea to handle the source audio by splitting it into blocks — for example, 30 seconds and then the remainder (e.g., 20 seconds for a 50-second clip). However, I'm not sure if the transition between these blocks would be smooth. On the other hand, the reference audio could continue using the current modification, where the duration can be chosen to run continuously without splitting.
>
> I just wanted to share my experience and feedback in case it's helpful. Thank you again for your great work!

Good point, this is implemented in the latest commit, and crossfading is added between chunks for a smooth transition. Please check whether it meets your needs.
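The crossfading between chunks that the author describes can be pictured as blending the tail of one chunk with the head of the next over a short overlap. The following is a minimal illustrative sketch using a linear crossfade, not the repository's actual implementation (function name, fade length, and overlap handling are all assumptions):

```python
import numpy as np

def crossfade_concat(chunks, sr, fade_seconds=0.04):
    """Concatenate audio chunks, linearly crossfading each join.

    Assumes each chunk is a 1-D float array at sample rate `sr` and that
    every chunk is longer than the fade region.
    """
    fade_len = int(sr * fade_seconds)
    fade_out = np.linspace(1.0, 0.0, fade_len)  # gain ramp for the old tail
    fade_in = 1.0 - fade_out                    # complementary ramp for the new head
    out = np.asarray(chunks[0], dtype=np.float64)
    for chunk in chunks[1:]:
        chunk = np.asarray(chunk, dtype=np.float64)
        # blend the tail of the accumulated audio with the head of the next chunk
        out[-fade_len:] = out[-fade_len:] * fade_out + chunk[:fade_len] * fade_in
        out = np.concatenate([out, chunk[fade_len:]])
    return out
```

Because the two ramps sum to one, a constant signal passes through the join unchanged, which is why the transitions sound seamless.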

Drakni01 commented 5 days ago

Hi, First of all, I’d like to thank you for the recent update. The chunk processing works really well, and the crossfade function is smooth. Great job!

However, I noticed that the maximum length of the chunks I’m getting is around 4 seconds, and I wanted to check if this is the expected behavior, or if the chunks should be closer to 30 seconds. I came across this part of the code:

`max_source_window = max_context_window - mel2.size(2)`

and I’m wondering if this is why my chunks are limited to 4 seconds. If this was intentional, no problem at all! I still think the results in terms of quality are quite good.
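That line would indeed explain short chunks: the reference mel (`mel2`) occupies part of the model's context window, so whatever is left over bounds each source chunk. The arithmetic below is purely illustrative; the sample rate, hop size, context length, and reference length are assumed values, not the repository's actual configuration:

```python
# Illustrative only: all numbers below are assumptions, not seed-vc's config.
sr = 22050                                   # sample rate (assumed)
hop_length = 256                             # mel hop size (assumed)
frames_per_second = sr / hop_length          # ~86 mel frames per second

max_context_window = int(30 * frames_per_second)  # a 30 s total context window
ref_frames = int(26 * frames_per_second)          # a 26 s reference clip (mel2)

# mirrors: max_source_window = max_context_window - mel2.size(2)
max_source_window = max_context_window - ref_frames
print(max_source_window / frames_per_second)      # roughly 4 s left for the source
```

Under these assumed numbers, a long reference clip leaves only about 4 seconds of context per source chunk, which matches the behavior described above; a shorter reference would allow longer chunks.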

I also wanted to mention something I noticed while using Gradio. I noticed that while streaming the audio, it’s not possible to download it directly. I believe the intention behind the current implementation is that each chunk gets exported automatically as it's processed, which works well for streaming longer audios. I saw that this is handled using yield, and it allows the stream to grow as chunks are processed.

However, I was wondering if it would be possible to unify all chunks into a single audio file once the processing is finished, removing the stream and providing the user with the complete audio file. Currently, because it’s streamed chunk by chunk, there are small silences between chunks and the crossfades don’t work well in the streamed version.

That being said, I tested the crossfade function and found it to be very effective. By modifying the code to progressively unify the chunks before passing them to yield, I was able to confirm that the crossfade works smoothly. The resulting MP3 file had a seamless transition between chunks. I’m attaching an image with three spectrograms to illustrate this:

Crossfade spectrograms:
- Top: the audio captured directly from the Gradio stream.
- Middle: the individual chunks placed side by side without crossfades (which results in audible clicks between them).
- Bottom: the final unified audio with the crossfade applied (this worked perfectly and produced a smooth transition).
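The "progressively unify the chunks before passing them to yield" idea can be sketched as a generator that keeps a running crossfaded buffer and yields the whole buffer after each chunk, instead of yielding isolated chunks. This is a hypothetical sketch (names and fade length assumed), not the modification actually made:

```python
import numpy as np

def stream_unified(chunk_iter, sr, fade_seconds=0.04):
    """Yield the full audio-so-far after each chunk, crossfading the joins.

    Yielding the growing buffer rather than raw chunks avoids the small
    silences and clicks heard when chunks are streamed back to back.
    """
    fade_len = int(sr * fade_seconds)
    fade_out = np.linspace(1.0, 0.0, fade_len)
    buffer = None
    for chunk in chunk_iter:
        chunk = np.asarray(chunk, dtype=np.float64)
        if buffer is None:
            buffer = chunk
        else:
            # crossfade the buffer tail into the new chunk's head
            joined = buffer[-fade_len:] * fade_out + chunk[:fade_len] * (1.0 - fade_out)
            buffer = np.concatenate([buffer[:-fade_len], joined, chunk[fade_len:]])
        yield buffer  # the unified audio so far
```

The final yielded value is then the complete, seamlessly joined result, which can be written out as a single file.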

Lastly, I wanted to ask if it would be possible to output the final unified audio as a WAV file instead of MP3. I noticed that MP3 introduces some quality loss, especially above 10kHz, and I believe a WAV file would better preserve the audio quality.
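Writing the unified result as WAV instead of MP3 avoids lossy encoding entirely and needs nothing beyond the standard library. A minimal 16-bit PCM sketch (mono output and the sample rate are assumptions):

```python
import numpy as np
import wave

def write_wav(path, audio, sr):
    """Write a float array in [-1, 1] as a 16-bit mono PCM WAV file."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16-bit samples
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())
```

Unlike MP3, this preserves the full spectrum up to the Nyquist frequency, so nothing above 10 kHz is lost to the codec.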

Thanks again for your work!

AIFSH commented 5 days ago

https://github.com/RVC-Boss/GPT-SoVITS/blob/main/tools%2Fslicer2.py
I use this tool to split long vocal audio into a series of 30-second segments, and it works well! You can find a demo here:

[SeedVC-ComfyUI bypasses the original's 30-second limit and now supports voice conversion of entire songs; the bundled package is in the comments! - Bilibili] https://b23.tv/Z1JZx6m
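Slicer-style tools like the one linked above generally split audio at silent regions so that cuts fall between phrases rather than mid-word. The following is a generic, simplified silence-based splitter for illustration; it is not the linked slicer2.py, which additionally enforces minimum segment lengths and keeps some silence at the boundaries:

```python
import numpy as np

def split_on_silence(audio, frame_len=2048, threshold=1e-3):
    """Split a 1-D float array into voiced segments, cutting at silent frames.

    A frame is treated as silent when its RMS falls below `threshold`.
    """
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms >= threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len                     # segment begins
        elif not v and start is not None:
            segments.append(audio[start:i * frame_len])  # segment ends at silence
            start = None
    if start is not None:
        segments.append(audio[start:n_frames * frame_len])
    return segments
```

Each segment can then be converted independently and the results concatenated, which is effectively the workflow described above.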

Plachtaa commented 4 days ago

> Hi, First of all, I’d like to thank you for the recent update. The chunk processing works really well, and the crossfade function is smooth. Great job!
>
> However, I noticed that the maximum length of the chunks I’m getting is around 4 seconds, and I wanted to check if this is the expected behavior, or if the chunks should be closer to 30 seconds. I came across this part of the code:
>
> `max_source_window = max_context_window - mel2.size(2)`
>
> and I’m wondering if this is why my chunks are limited to 4 seconds. If this was intentional, no problem at all! I still think the results in terms of quality are quite good.
>
> I also wanted to mention something I noticed while using Gradio. I noticed that while streaming the audio, it’s not possible to download it directly. I believe the intention behind the current implementation is that each chunk gets exported automatically as it's processed, which works well for streaming longer audios. I saw that this is handled using yield, and it allows the stream to grow as chunks are processed.
>
> However, I was wondering if it would be possible to unify all chunks into a single audio file once the processing is finished, removing the stream and providing the user with the complete audio file. Currently, because it’s streamed chunk by chunk, there are small silences between chunks and the crossfades don’t work well in the streamed version.
>
> That being said, I tested the crossfade function and found it to be very effective. By modifying the code to progressively unify the chunks before passing them to yield, I was able to confirm that the crossfade works smoothly. The resulting MP3 file had a seamless transition between chunks. I’m attaching an image with three spectrograms to illustrate this:
>
> Crossfade spectrograms: Top: the audio captured directly from the Gradio stream. Middle: the individual chunks placed side by side without crossfades (which results in audible clicks between them). Bottom: the final unified audio with the crossfade applied (this worked perfectly and produced a smooth transition).
>
> Lastly, I wanted to ask if it would be possible to output the final unified audio as a WAV file instead of MP3. I noticed that MP3 introduces some quality loss, especially above 10kHz, and I believe a WAV file would better preserve the audio quality.
>
> Thanks again for your work!

Thanks for your exhaustive feedback. Following your advice, I just added dual outputs to the web UI for both streaming and non-streaming output; please check whether it satisfies your needs.

Drakni01 commented 4 days ago

Hi,

Thank you for implementing the recent updates; everything is working really well now! I noticed one small thing that I’m not sure about: the pitch seems to be slightly shifted upwards. I found that lowering it by -1.5 semitones helped correct it:

```python
def adjust_f0_semitones(f0_sequence, n_semitones):
    corrected_semitones = n_semitones - 1.5
    factor = 2 ** (corrected_semitones / 12)
    return f0_sequence * factor
```

I’m not sure why the pitch is shifted, but this adjustment worked for me. Also, the switch to BigVGAN has greatly improved the quality! Thank you for your hard work!
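For reference, shifting F0 by n semitones multiplies it by 2^(n/12), so the -1.5 semitone correction above scales every F0 value by roughly 0.917. A quick sanity check:

```python
def semitone_factor(n_semitones):
    """Multiplicative F0 factor for a pitch shift of n semitones."""
    return 2 ** (n_semitones / 12)

print(semitone_factor(-1.5))  # ~0.917, a slight downward shift
print(semitone_factor(12))    # 2.0, one full octave up
```

So the observed upward bias corresponds to the model producing F0 about 9% too high, which is consistent with a small systematic error in training rather than a transposition.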

Plachtaa commented 4 days ago

> Hi,
>
> Thank you for implementing the recent updates; everything is working really well now! I noticed one small thing that I’m not sure about: the pitch seems to be slightly shifted upwards. I found that lowering it by -1.5 semitones helped correct it:
>
> ```python
> def adjust_f0_semitones(f0_sequence, n_semitones):
>     corrected_semitones = n_semitones - 1.5
>     factor = 2 ** (corrected_semitones / 12)
>     return f0_sequence * factor
> ```
>
> I’m not sure why the pitch is shifted, but this adjustment worked for me. Also, the switch to BigVGAN has greatly improved the quality! Thank you for your hard work!

Thanks for the feedback on the pitch shift. I just checked: it was a bug in my training code. I will use a temporary inference-time workaround for now and fix it in a future model update.

ApocalypsezZ commented 4 days ago

> https://github.com/RVC-Boss/GPT-SoVITS/blob/main/tools%2Fslicer2.py
> I use this tool to split long vocal audio into a series of 30-second segments, and it works well! You can find a demo here:
>
> [SeedVC-ComfyUI bypasses the original's 30-second limit and now supports voice conversion of entire songs; the bundled package is in the comments! - Bilibili] https://b23.tv/Z1JZx6m

As the author mentioned, chunking is supported in the web UI, and crossfading is added between chunks for a smooth transition. Did you change it to a ComfyUI plugin?

AIFSH commented 3 days ago

> https://github.com/RVC-Boss/GPT-SoVITS/blob/main/tools%2Fslicer2.py
> I use this tool to split long vocal audio into a series of 30-second segments, and it works well! You can find a demo here:
>
> [SeedVC-ComfyUI bypasses the original's 30-second limit and now supports voice conversion of entire songs; the bundled package is in the comments! - Bilibili] https://b23.tv/Z1JZx6m
>
> As the author mentioned, chunking is supported in the web UI, and crossfading is added between chunks for a smooth transition. Did you change it to a ComfyUI plugin?

Yes, I uploaded it.