Compatibility issue with the audio output format

waylonwang commented 4 months ago

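Feeding the AUDIO output into the Audio to legacy VHS_AUDIO node of ComfyUI-VideoHelperSuite raises an error. I checked nodes.py: the final output simply wraps the tensor in a Python list instead of returning a 3D Tensor: audio = {"waveform": [output['tts_speech']],"sample_rate":target_sr}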

Debugging shows that CosyVoice-ComfyUI outputs: {'waveform': [tensor([[1.5498e-05, 7.1867e-06, 8.9236e-06, ..., 8.0358e-03, 9.0343e-03, 9.4499e-03]])], 'sample_rate': 22050}

Loading audio with LoadAudio from ComfyUI-VideoHelperSuite, by contrast, outputs: {'waveform': tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 3.0518e-05, 6.1035e-05, 1.5259e-04]]]), 'sample_rate': 22050}

Loading audio with ComfyUI's built-in LoadAudio gives the same structure: {'waveform': tensor([[[0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 3.0518e-05, 6.1035e-05, 1.5259e-04]]]), 'sample_rate': 22050}
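The difference between the two structures can be checked directly. The sketch below is only an illustration: the zero-filled tensors stand in for real audio, and `describe` is a hypothetical helper, not part of either project.

```python
import torch

# Stand-in for output['tts_speech']: a 2D tensor of shape (channels, samples).
tts_speech = torch.zeros(1, 22050)

# Structure reported for CosyVoice-ComfyUI: a Python list wrapping a 2D tensor.
cosyvoice_audio = {"waveform": [tts_speech], "sample_rate": 22050}

# Structure reported for both LoadAudio nodes: one 3D tensor
# of shape (batch, channels, samples).
loadaudio_audio = {"waveform": torch.zeros(1, 1, 22050), "sample_rate": 22050}


def describe(audio):
    """Return the container kind and the shape of the waveform it holds."""
    waveform = audio["waveform"]
    if isinstance(waveform, torch.Tensor):
        return ("tensor", tuple(waveform.shape))
    # Not a tensor: report the container type and the shape of its first item.
    return (type(waveform).__name__, tuple(waveform[0].shape))


cosy_desc = describe(cosyvoice_audio)
load_desc = describe(loadaudio_audio)
print(cosy_desc)
print(load_desc)
```

A downstream node that calls tensor methods on "waveform" would presumably fail on the list variant, which matches the error described above.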

I asked ChatGPT:

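For a single audio clip, use a 2D Tensor ((channels, samples)).
For batch processing of multiple audio clips, use a 3D Tensor ((batch_size, channels, samples)).
A Tensor is recommended over a Python list for efficiency, compatibility, and consistency.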
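As a rough illustration of the two layouts, the sketch below stacks two clips into one batch. The clip length and channel count are made-up values, and stacking this way assumes every clip has the same number of channels and samples.

```python
import torch

# Two hypothetical mono clips, each a 2D tensor of shape (channels, samples).
clip_a = torch.zeros(1, 16000)
clip_b = torch.ones(1, 16000)

# Stacking along a new leading dimension gives (batch_size, channels, samples).
batch = torch.stack([clip_a, clip_b])
batch_shape = tuple(batch.shape)

# Indexing the batch dimension recovers a single 2D clip.
second_clip = batch[1]
second_shape = tuple(second_clip.shape)

print(batch_shape, second_shape)
```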

I therefore suggest that CosyVoice-ComfyUI change its audio output to a 3D Tensor to improve compatibility.

waylonwang commented 4 months ago

A conversion like this is enough: audio = {"waveform": torch.stack([output['tts_speech']]),"sample_rate":target_sr}
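A minimal sketch of what this conversion does to the shape, assuming output['tts_speech'] is a 2D tensor of shape (1, samples) as in the debug output above. The random tensor is a stand-in for real model output. For a single tensor, unsqueeze(0) gives the same result as torch.stack.

```python
import torch

# Stand-in for output['tts_speech'], assumed to be (channels, samples).
tts_speech = torch.randn(1, 22050)
target_sr = 22050

# torch.stack on a one-element list adds a leading batch dimension.
stacked = torch.stack([tts_speech])

# Equivalent for a single tensor: insert a dimension at position 0.
unsqueezed = tts_speech.unsqueeze(0)

audio = {"waveform": stacked, "sample_rate": target_sr}
waveform_shape = tuple(audio["waveform"].shape)
print(waveform_shape)
```

Either form yields a (batch, channels, samples) tensor with the same layout as the LoadAudio outputs quoted earlier in the thread.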

AIFSH commented 4 months ago

Thanks for the suggestion. This has been fixed.

AIFSH / CosyVoice-ComfyUI

Compatibility issue with the audio output format #15