FunAudioLLM / CosyVoice

Multilingual large voice generation model with full-stack support for inference, training, and deployment.
https://funaudiollm.github.io/
Apache License 2.0
6.47k stars, 698 forks

Is sort a necessary processing step in data_pipeline? #636

Open shenlou11 opened 2 weeks ago

shenlou11 commented 2 weeks ago

Can this method be removed? Would removing it affect performance (speed), or would it affect output quality?

def sort(data, sort_size=500, mode='train'):
    """ Sort the data by feature length.
        Sort is used after shuffle and before batch, so we can group
        utts with similar lengths into a batch, and sort_size should
        be less than shuffle_size

        Args:
            data: Iterable[{key, feat, label}]
            sort_size: buffer size for sort

        Returns:
            Iterable[{key, feat, label}]
    """

    buf = []
    for sample in data:
        buf.append(sample)
        if len(buf) >= sort_size:
            buf.sort(key=lambda x: x['speech_feat'].size(0))
            for x in buf:
                yield x
            buf = []
    # The sample left over
    buf.sort(key=lambda x: x['speech_feat'].size(0))
    for x in buf:
        yield x

aluminumbox commented 1 week ago

Don't remove it. It improves speed.
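
The thread does not say why sorting helps speed. One plausible reading, based on the docstring's note about grouping utterances of similar length into a batch, is that it reduces padding. The sketch below is only an illustration of that idea and is not code from the repository: `buffered_sort` and `padded_frames` are hypothetical helpers, the utterance lengths are random numbers, and a fixed batch size of 32 is assumed for simplicity.

```python
import random


def buffered_sort(lengths, sort_size=500):
    """Sort lengths inside fixed-size buffers, mirroring the sort() quoted above."""
    buf = []
    for n in lengths:
        buf.append(n)
        if len(buf) >= sort_size:
            yield from sorted(buf)
            buf = []
    # Flush whatever is left in the final, partial buffer.
    yield from sorted(buf)


def padded_frames(lengths, batch_size):
    """Total frames processed when each batch is padded to its longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


random.seed(0)
# Hypothetical utterance lengths in frames; not taken from any real dataset.
lengths = [random.randint(50, 1500) for _ in range(10_000)]

real = sum(lengths)
unsorted_cost = padded_frames(lengths, batch_size=32)
sorted_cost = padded_frames(list(buffered_sort(lengths, sort_size=500)), batch_size=32)

print(f"real frames:           {real}")
print(f"padded, no sort:       {unsorted_cost}")
print(f"padded, buffered sort: {sorted_cost}")
```

Under these assumptions the buffered sort leaves far fewer padded frames per batch, which would mean less wasted computation. Because sorting only happens inside each buffer, the data stays shuffled at the buffer level.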