RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License

Stack overflow occurs intermittently when `pack_ogg` in `api.py` converts large audio data #1199

Closed: AkagawaTsurunaki closed this issue 3 months ago

AkagawaTsurunaki commented 3 months ago

When I run inference in OGG streaming mode and the synthesized audio is large, a stack overflow occurs intermittently while the NumPy array is being converted to an OGG audio file.

Next, I'll show you how to reproduce the problem.

The command used to start the API is:

python.exe api.py -sm n -d cuda -p 11014 -a 127.0.0.1 -mt ogg

The HTTP request body (JSON) is:

{
    "text": "二次元,原指“二维世界”,包含长度和宽度的二维空间。后来成为了ACGN亚文化圈专门用语,特意用“二次元”来指代。由二维图像的动画、漫画、游戏等作品构成,简单来讲就是在纸面或屏幕等平面上所呈现的动画、游戏等平面视觉作品,里面的角色都是图像形式,区别于真人饰演的影视剧,因此被称为“纸片人”。通过这些载体创造的虚拟世界被动漫爱好者称为“二次元世界”,简称“二次元”。在某种意义上讲,二次元还意指喜爱它的人就像生活在一个平面世界中一样。“二次元”是一个平面媒体所表达的“异次元”,因其二维空间本质被称为“二次元”。“二次元”词义逐渐脱离原本的空间属性,并派生出与主流文化相对独立的次文化体系,也就是“二次元”文化。",
    "text_language": "zh",
    "refer_wav_path": "D:/AkagawaTsurunaki/Dataset/Audio/Prompts/[zh]喜欢游戏的人和擅长游戏的人有很多不一样的地方,老师属于哪一种呢?.wav",
    "prompt_text": "喜欢游戏的人和擅长游戏的人有很多不一样的地方,老师属于哪一种呢?",
    "prompt_language": "zh"
}
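For completeness, here is a client-side sketch that sends this body using only the Python standard library. It assumes (this is not stated above) that `api.py` accepts the JSON as a `POST` to `/` on the host and port from the start command and streams the encoded audio back in the response body. `build_tts_request` and `save_streamed_audio` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: api.py was started with -a 127.0.0.1 -p 11014 and accepts POST "/".
API_URL = "http://127.0.0.1:11014/"


def build_tts_request(payload: dict, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the JSON payload as UTF-8 (Chinese text kept as-is)."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_streamed_audio(req: urllib.request.Request, out_path: str, chunk_size: int = 65536) -> int:
    """Send the request, write the streamed audio bytes to out_path, and return the byte count."""
    total = 0
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            total += len(chunk)
    return total
```

Usage: `save_streamed_audio(build_tts_request(payload), "out.ogg")`, where `payload` is the dictionary above. Reading in fixed-size chunks keeps the client's memory flat while the server streams.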

The log is as follows:

D:\AkagawaTsurunaki\WorkSpace\PycharmProjects\RVC-Boss\GPT-SoVITS\api.py:648: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(f"未指定SoVITS模型路径, fallback后当前值: {sovits_path}")
WARNING:  未指定SoVITS模型路径, fallback后当前值: GPT_SoVITS/pretrained_models/s2G488k.pth
D:\AkagawaTsurunaki\WorkSpace\PycharmProjects\RVC-Boss\GPT-SoVITS\api.py:651: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(f"未指定GPT模型路径, fallback后当前值: {gpt_path}")
WARNING:  未指定GPT模型路径, fallback后当前值: GPT_SoVITS/pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt
INFO:     未指定默认参考音频
INFO:     半精: True
INFO:     流式返回已开启
INFO:     编码格式: ogg
...
DEBUG:jieba_fast:Loading model cost 0.444 seconds.
DEBUG:jieba_fast:Prefix dict has been built succesfully.
  0%|          | 0/1500 [00:00<?, ?it/s]D:\AkagawaTsurunaki\Models\GPT-SoVITS\GPT_SoVITS\AR\modules\patched_mha_with_cache.py:452: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
  attn_output = scaled_dot_product_attention(
 81%|████████  | 1215/1500 [00:15<00:03, 76.61it/s]
T2S Decoding EOS [198 -> 1414]
D:\AkagawaTsurunaki\ProgramFiles\Anaconda\envs\GPTSoVits\lib\site-packages\torch\functional.py:660: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:879.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]

Process finished with exit code -1073741571 (0xC00000FD)

The program exits abnormally. Exit code -1073741571 (0xC00000FD, `STATUS_STACK_OVERFLOW` on Windows) indicates a stack overflow.
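Windows process exit codes are unsigned 32-bit values. The negative number shown by the IDE is those same 32 bits read as a signed integer. A small sketch (the function names are illustrative) that maps either form back to the NTSTATUS hex code:

```python
# Windows NTSTATUS code for a stack overflow.
STATUS_STACK_OVERFLOW = 0xC00000FD


def exit_code_to_ntstatus(code: int) -> str:
    """Reinterpret an exit code as an unsigned 32-bit value and format it as hex.

    Works for both the signed form (-1073741571) and the unsigned form (3221225725).
    """
    return "0x{:08X}".format(code & 0xFFFFFFFF)


def is_stack_overflow(code: int) -> bool:
    """Return True if the exit code is STATUS_STACK_OVERFLOW."""
    return (code & 0xFFFFFFFF) == STATUS_STACK_OVERFLOW


print(exit_code_to_ntstatus(-1073741571))  # 0xC00000FD
print(is_stack_overflow(-1073741571))      # True
```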

My solution is based on bastibe/python-soundfile issue #396, "Stack Overflow while writing OGG" (github.com).

This is the original code of pack_ogg in api.py:

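import soundfile as sf


def pack_ogg(audio_bytes, data, rate):
    # Encode `data` as mono OGG (Vorbis) into the file-like object `audio_bytes`.
    with sf.SoundFile(audio_bytes, mode='w', samplerate=rate, channels=1, format='ogg') as audio_file:
        audio_file.write(data)

    return audio_bytes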

and I changed it to:

import threading

import soundfile as sf


def pack_ogg(audio_bytes, data, rate):
    # Author: AkagawaTsurunaki
    # Issue:
    #   A stack overflow intermittently occurs
    #   when the function `sf_writef_short` of `libsndfile_64bit.dll` is called
    #   through the Python library `soundfile`.
    # Note:
    #   This is an issue in `libsndfile`, not in this project itself.
    #   It happens when you generate a large audio tensor (about 499804 frames on my PC)
    #   and try to convert it to an OGG file.
    # Related:
    #   https://github.com/RVC-Boss/GPT-SoVITS/issues/1199
    #   https://github.com/libsndfile/libsndfile/issues/1023
    #   https://github.com/bastibe/python-soundfile/issues/396
    # Suggestion:
    #   Alternatively, split the audio data into smaller segments to avoid the stack overflow?

    errors = []

    def handle_pack_ogg():
        try:
            with sf.SoundFile(audio_bytes, mode='w', samplerate=rate, channels=1, format='ogg') as audio_file:
                audio_file.write(data)
        except Exception as e:
            # join() does not re-raise exceptions from the worker thread,
            # so keep them and re-raise in the caller below.
            errors.append(e)

    # See: https://docs.python.org/3/library/threading.html
    # The stack size must be 0 (platform default) or at least 32768 bytes.
    # If the stack overflow still occurs, increase `stack_size`.
    # stack_size = n * 4096, where n should be a positive integer.
    # Here we choose n = 4096 (16 MiB).
    stack_size = 4096 * 4096
    try:
        # stack_size() affects every thread created afterwards and returns the previous value.
        old_stack_size = threading.stack_size(stack_size)
    except RuntimeError as e:
        # If changing the thread stack size is unsupported, a RuntimeError is raised.
        print("RuntimeError: {}".format(e))
        print("Changing the thread stack size is unsupported.")
        old_stack_size = None
    except ValueError as e:
        # If the specified stack size is invalid, a ValueError is raised and the stack size is unmodified.
        print("ValueError: {}".format(e))
        print("The specified stack size is invalid.")
        old_stack_size = None

    if old_stack_size is None:
        # Fall back to writing on the current thread with its default stack.
        handle_pack_ogg()
    else:
        try:
            pack_ogg_thread = threading.Thread(target=handle_pack_ogg)
            pack_ogg_thread.start()
        finally:
            # Restore the previous size so threads created later are unaffected.
            threading.stack_size(old_stack_size)
        pack_ogg_thread.join()

    if errors:
        raise errors[0]

    return audio_bytes

With this change, converting large audio data to OGG should no longer cause a stack overflow.

This works because the overflow happens inside `libsndfile_64bit.dll` when `soundfile` calls it, and running the write on a thread with a larger stack gives the native code enough room. As far as I know, adjusting the stack size is currently the simplest workable fix.
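The `Suggestion` comment in the patch mentions a second option: splitting the audio into smaller segments. A minimal sketch of that idea follows. It writes consecutive slices through one open `SoundFile` instead of passing everything to a single `write` call. Whether smaller writes actually avoid the overflow inside libsndfile is an untested assumption. `write_in_chunks`, `pack_ogg_chunked`, and the 65536-frame default are hypothetical.

```python
# Hypothetical chunk size; nothing in this issue establishes a safe value.
DEFAULT_FRAMES_PER_CHUNK = 65536


def write_in_chunks(audio_file, data, frames_per_chunk=DEFAULT_FRAMES_PER_CHUNK):
    """Write `data` to an already-open writer in consecutive slices.

    `audio_file` is anything with a `write(frames)` method, e.g. an
    `sf.SoundFile` opened with mode='w'. Returns the number of write calls.
    """
    if frames_per_chunk <= 0:
        raise ValueError("frames_per_chunk must be a positive integer")
    calls = 0
    for start in range(0, len(data), frames_per_chunk):
        audio_file.write(data[start:start + frames_per_chunk])
        calls += 1
    return calls


def pack_ogg_chunked(audio_bytes, data, rate):
    """Hypothetical variant of pack_ogg that writes the audio in slices."""
    import soundfile as sf  # imported here so write_in_chunks works without soundfile

    with sf.SoundFile(audio_bytes, mode='w', samplerate=rate, channels=1, format='ogg') as audio_file:
        write_in_chunks(audio_file, data)
    return audio_bytes
```

Every slice goes to the same open file, so the result is still a single OGG stream. Only the number of frames passed to libsndfile per call changes.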

I have reported this issue to libsndfile and am waiting for a fix.

For more details, please see:

libsndfile/libsndfile issue #1023, "Stack overflow probabilistically occurs when the function sf_writef_short of libsndfile_64bit.dll is called using the Python library soundfile" (github.com)

AkagawaTsurunaki commented 3 months ago

I submitted a PR. I tested it with the largest audio GPT-SoVITS can synthesize, converted it to OGG, and the program worked fine.