RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License
36.16k stars · 4.13k forks

Consultation on evaluating GPT-SoVITS performance metrics on the VCTK dataset #1710

Closed CamellIyquitous closed 1 month ago

CamellIyquitous commented 1 month ago

Hello everyone, I am a student researching TTS. I would like to evaluate GPT-SoVITS on the VCTK dataset, measuring metrics such as generated speech quality.

My plan is as follows: I have prepared a large number of VCTK ground-truth (GT) audio files (/.cache/huggingface/datasets/downloads/extracted/3872ee9e6539cb83e580670f1108d3c0a5492c5b5944ef45edb35cf44cdd01f0/wav48_silence_trimmed/) and the corresponding GT transcripts (/.cache/huggingface/datasets/downloads/extracted/3872ee9e6539cb83e580670f1108d3c0a5492c5b5944ef45edb35cf44cdd01f0/txt). Using the official GPT-SoVITS v2 pre-trained models, I want to repeat the following inference step over the whole dataset (i.e. batch inference): take a GT audio file as the reference audio (e.g. Please call Stella_GT.wav), use its GT transcript as both the reference text and the target text (e.g. 'Please call Stella.'), and run zero-shot inference to generate the target audio (e.g. Please call Stella_Predict.wav). Once I have the generated audio, I can compare it against the original reference audio (e.g. audio similarity), which completes the quality evaluation; then the loop moves on to the next utterance until every file has been processed. If my VCTK set contains 20,000 GT .wav files, the batch run should yield 20,000 corresponding Predicted .wav files.

Is there a way to carry out this kind of batch inference? Thank you very much!

PS: My environment is a Windows machine connected to a remote Linux server. I tried `python api.py`, but with that approach I have to set the reference audio and reference text manually before every inference. I also have to paste the target text into a browser URL (e.g. http://127.0.0.1:9880?text=先帝创业未半而中道崩殂,今天下三分,益州疲弊,此诚危急存亡之秋也。&text_language=zh) before the server runs inference, and I do not know how to save the generated target audio to a path of my choosing.
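For the single-request workflow described in the PS, the browser is unnecessary: api.py takes the synthesis parameters as a plain GET request and returns the audio bytes, so a few lines of stdlib Python can both send the request and save the result to any chosen path. A minimal sketch, assuming api.py is listening on 127.0.0.1:9880 and accepts the query fields `refer_wav_path`, `prompt_text`, `prompt_language`, `text`, and `text_language` (verify these names against your checkout of api.py; `refer_wav_path` must be a path on the machine where api.py runs):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API = "http://127.0.0.1:9880"  # api.py endpoint from the issue; adjust host/port


def build_params(ref_wav: str, ref_text: str, target_text: str,
                 lang: str = "en") -> dict:
    """Query parameters for one api.py inference request.

    Field names are assumptions based on api.py's GET interface;
    check your local api.py if the server rejects them.
    """
    return {
        "refer_wav_path": ref_wav,   # server-side path to the reference wav
        "prompt_text": ref_text,     # transcript of the reference audio
        "prompt_language": lang,
        "text": target_text,         # text to synthesize
        "text_language": lang,
    }


def synthesize(ref_wav: str, ref_text: str, target_text: str,
               out_path: str) -> None:
    """Run one inference and save the returned audio where you want it."""
    url = API + "/?" + urlencode(build_params(ref_wav, ref_text, target_text))
    with urlopen(url) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Because the reference audio and reference text ride along with every request, nothing has to be reconfigured manually between inferences, and `out_path` is entirely under your control.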

Chi8wah commented 1 month ago

Isn't there api_v2.py? Once it is running, you can write a Python script on your local Windows machine that fills in the request fields in batch and sends concurrent requests to the /tts endpoint on the remote Linux server. It may just take a while.
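The suggestion above can be sketched as a small client script. This is a sketch under several assumptions: api_v2.py is running on the Linux server at port 9880 (`<linux-server>` is a placeholder host), the VCTK wav and txt files live on that same server, the /tts JSON fields are `text`, `text_lang`, `ref_audio_path`, `prompt_text`, `prompt_lang`, and `media_type` (confirm against your version of api_v2.py), and each transcript shares its stem with its wav file (VCTK's real naming may differ, e.g. mic suffixes, so adjust the pairing):

```python
import json
from pathlib import Path
from urllib.request import Request, urlopen

TTS_URL = "http://<linux-server>:9880/tts"  # api_v2.py endpoint; replace host


def tts_request(text: str, ref_wav: str) -> dict:
    """JSON body for one zero-shot /tts call: the GT transcript serves as
    both prompt text and target text. Field names assumed from api_v2.py."""
    return {
        "text": text,
        "text_lang": "en",
        "ref_audio_path": ref_wav,  # server-side path to the GT reference wav
        "prompt_text": text,
        "prompt_lang": "en",
        "media_type": "wav",
    }


def batch_infer(wav_dir: str, txt_dir: str, out_dir: str) -> None:
    """Loop over all GT wavs, synthesize each, and save *_Predict.wav files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        txt = Path(txt_dir) / f"{wav.stem}.txt"
        if not txt.exists():  # skip utterances without a transcript
            continue
        body = json.dumps(tts_request(txt.read_text().strip(), str(wav)))
        req = Request(TTS_URL, data=body.encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            (out / f"{wav.stem}_Predict.wav").write_bytes(resp.read())
```

The loop here runs the 20,000 requests sequentially, which is the safest starting point; if the server keeps up, the requests can be parallelized with `concurrent.futures.ThreadPoolExecutor`.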

CamellIyquitous commented 1 month ago

> Isn't there api_v2.py? Once it is running, you can write a Python script on your local Windows machine that fills in the request fields in batch and sends concurrent requests to the /tts endpoint on the remote Linux server. It may just take a while.

Yes, exactly. I opened two terminals on the Linux server and sent the requests from the second one, and it worked.

Thanks a lot!

xipingL commented 1 month ago

Hello, could you share your related technical investigation? I am not a student, but I am also doing related research, just not as thoroughly as you are.