RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License
36.16k stars · 4.13k forks

Consultation on evaluating GPT-SoVITS performance metrics on the VCTK dataset #1710

Closed CamellIyquitous closed 1 month ago

CamellIyquitous commented 1 month ago

Hello everyone, I am a student researching TTS. I would like to evaluate GPT-SoVITS on the VCTK dataset, measuring metrics such as generated speech quality.

My plan is as follows: I have prepared a large number of VCTK ground-truth (GT) audio files (/.cache/huggingface/datasets/downloads/extracted/3872ee9e6539cb83e580670f1108d3c0a5492c5b5944ef45edb35cf44cdd01f0/wav48_silence_trimmed/) and the corresponding GT transcripts (/.cache/huggingface/datasets/downloads/extracted/3872ee9e6539cb83e580670f1108d3c0a5492c5b5944ef45edb35cf44cdd01f0/txt). Using the official GPT-SoVITS v2 pre-trained models, I want to repeat the following inference step over the whole dataset (i.e. batch inference): take a GT audio file as the reference audio (e.g. Please call Stella_GT.wav), use its GT transcript as both the reference text and the target text (e.g. 'Please call Stella.'), and run zero-shot inference to generate the target audio (e.g. Please call Stella_Predict.wav). Once I have the generated audio, I can compare it against the original reference audio (e.g. audio similarity), which completes the quality evaluation; then the loop moves on to the next utterance until every file has been processed. If my VCTK set contains 20,000 GT .wav files, the batch run should yield 20,000 corresponding Predicted .wav files.

Is there a way to carry out this kind of batch inference? Thank you very much!

PS: My environment is a Windows machine connected to a remote Linux server. I tried `python api.py`, but with that approach I have to set the reference audio and reference text manually before every inference. I also have to paste the target text into a browser URL (e.g. http://127.0.0.1:9880?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_language=zh) before the server runs inference, and I do not know how to save the generated target audio to a path of my choosing.
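For the single-request workflow described in the PS, the browser is unnecessary: api.py takes the synthesis parameters as a plain GET request and returns the audio bytes, so a few lines of stdlib Python can both send the request and save the result to any chosen path. A minimal sketch, assuming api.py is listening on 127.0.0.1:9880 and accepts the query fields `refer_wav_path`, `prompt_text`, `prompt_language`, `text`, and `text_language` (verify these names against your checkout of api.py; `refer_wav_path` must be a path on the machine where api.py runs):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://127.0.0.1:9880"  # api.py endpoint from the issue; adjust host/port


def build_params(ref_wav: str, ref_text: str, target_text: str,
                 lang: str = "en") -> dict:
    """Query parameters for one api.py inference request.

    Field names are assumptions based on api.py's GET interface;
    check your local api.py if the server rejects them.
    """
    return {
        "refer_wav_path": ref_wav,   # server-side path to the reference wav
        "prompt_text": ref_text,     # transcript of the reference audio
        "prompt_language": lang,
        "text": target_text,         # text to synthesize
        "text_language": lang,
    }


def synthesize(ref_wav: str, ref_text: str, target_text: str,
               out_path: str) -> None:
    """Run one inference and save the returned audio where you want it."""
    url = API + "/?" + urlencode(build_params(ref_wav, ref_text, target_text))
    with urlopen(url) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Because the reference audio and reference text ride along with every request, nothing has to be reconfigured manually between inferences, and `out_path` is entirely under your control.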

Chi8wah commented 1 month ago

Isn't there api_v2.py? Once it is running, you can write a Python script on your local Windows machine that fills in the request fields in batch and sends concurrent requests to the /tts endpoint on the remote Linux server. It may just take a while.
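The suggestion above can be sketched as a small client script. This is a sketch under several assumptions: api_v2.py is running on the Linux server at port 9880 (`<linux-server>` is a placeholder host), the VCTK wav and txt files live on that same server, the /tts JSON fields are `text`, `text_lang`, `ref_audio_path`, `prompt_text`, `prompt_lang`, and `media_type` (confirm against your version of api_v2.py), and each transcript shares its stem with its wav file (VCTK's real naming may differ, e.g. mic suffixes, so adjust the pairing):

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen

TTS_URL = "http://<linux-server>:9880/tts"  # api_v2.py endpoint; replace host


def tts_request(text: str, ref_wav: str) -> dict:
    """JSON body for one zero-shot /tts call: the GT transcript serves as
    both prompt text and target text. Field names assumed from api_v2.py."""
    return {
        "text": text,
        "text_lang": "en",
        "ref_audio_path": ref_wav,  # server-side path to the GT reference wav
        "prompt_text": text,
        "prompt_lang": "en",
        "media_type": "wav",
    }


def batch_infer(wav_dir: str, txt_dir: str, out_dir: str) -> None:
    """Loop over all GT wavs, synthesize each, and save *_Predict.wav files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        txt = Path(txt_dir) / f"{wav.stem}.txt"
        if not txt.exists():  # skip utterances without a transcript
            continue
        body = json.dumps(tts_request(txt.read_text().strip(), str(wav)))
        req = Request(TTS_URL, data=body.encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            (out / f"{wav.stem}_Predict.wav").write_bytes(resp.read())
```

The loop here runs the 20,000 requests sequentially, which is the safest starting point; if the server keeps up, the requests can be parallelized with `concurrent.futures.ThreadPoolExecutor`.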

CamellIyquitous commented 1 month ago

> Isn't there api_v2.py? Once it is running, you can write a Python script on your local Windows machine that fills in the request fields in batch and sends concurrent requests to the /tts endpoint on the remote Linux server. It may just take a while.

Yes, exactly. I opened two terminals on the Linux server and sent the requests from the second one, and it worked.

Thanks a lot!

xipingL commented 1 month ago

Hello, could you share your related technical investigation? I am not a student, but I am also doing related research, just not as thoroughly as you are.