Inference with socket example have unexpected high speaking rate

moseshu commented 3 weeks ago

Checks

[X] This template is only for question, not feature requests or bug reports.
[X] I have thoroughly reviewed the project documentation and read the related paper(s).
[X] I have searched for existing issues, including closed ones, no similar questions.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Question details

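Following the script (python socket_server.py), I call the server with the code below. The generated audio is spoken very fast and differs greatly from the offline result. Is this a configuration problem, or something else?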

import socket
import numpy as np
import asyncio
import pyaudio

async def listen_to_voice(text, server_ip='localhost', server_port=7777):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client_socket.connect((server_ip, server_port))

    async def play_audio_stream():
        buffer = b''
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32,
                        channels=1,
                        rate=24000,  # Ensure this matches the server's sampling rate
                        output=True,
                        frames_per_buffer=2048)

        try:
            while True:
                chunk = await asyncio.to_thread(client_socket.recv, 1024)
                if not chunk:  # End of stream
                    break
                if b"END_OF_AUDIO" in chunk:
                    buffer += chunk.replace(b"END_OF_AUDIO", b"")
                    if buffer:
                        audio_array = np.frombuffer(buffer, dtype=np.float32).copy()  # Make a writable copy
                        stream.write(audio_array.tobytes())
                    break
                buffer += chunk
                if len(buffer) >= 4096:
                    audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy()  # Make a writable copy
                    stream.write(audio_array.tobytes())
                    buffer = buffer[4096:]
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

    try:
        # Send only the text to the server
        await asyncio.to_thread(client_socket.sendall, text.encode('utf-8'))
        await play_audio_stream()
        print("Audio playback finished.")
    except Exception as e:
        print(f"Error in listen_to_voice: {e}")
    finally:
        client_socket.close()

# Example usage: Replace this with your actual server IP and port
async def main():
    await listen_to_voice("春天的江潮水势浩荡，与大海连成一片，一轮明月从海上升起，好像与潮水一起涌出来。月光照耀着春江，随着波浪闪耀千万里，所有地方的春江都有明亮的月光！江水曲曲折折地绕着花草丛生的原野流淌，月光照射着开遍鲜花的树林好像细密的雪珠在闪烁。月色如霜，所以霜飞无从觉察，洲上的白沙和月色融合在一起，看不分明。江水、天空成一色，没有一点微小灰尘，明亮的天空中只有一轮孤月高悬空中。江边上什么人最初看见月亮？江上的月亮哪一年最初照耀着人？人生一代代地无穷无尽，只有江上的月亮一年年地总是相像。不知江上的月亮等待着什么人，只见长江不断地一直运输着流水。游子像一片白云缓缓地离去，只剩下思妇站在离别的青枫浦不胜忧愁。哪家的游子今晚坐着小船在漂流？", server_ip='localhost', server_port=32023)

# Run the main async function
if __name__ == '__main__':

    import nest_asyncio
    nest_asyncio.apply()
    asyncio.run(main())

SWivid commented 3 weeks ago

using English to submit this report in order to facilitate communication.

Translation of the question: Calling the server with the following code, the generated speech is very fast and differs greatly from the offline result. Is this a configuration problem, or something else?

We haven't checked this closely; it may be related to the sampling rate or similar settings. @kunci115 might be able to help.

amabilee commented 3 weeks ago

It could be one of two things: the sampling rate and audio format used by the server and the client aren't consistent, or the buffer size is too small, causing the audio to be played faster than intended.
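
To illustrate the first possibility: if the server produced samples at one rate and the client opened its output stream at a higher rate, playback would be sped up by the ratio of the two. The sketch below is only arithmetic on example values, not a diagnosis of this issue. The helper names and the 48000 Hz figure are hypothetical; 24000 Hz is the rate the client code above passes to PyAudio.

```python
def playback_speedup(generated_rate_hz: int, playback_rate_hz: int) -> float:
    """Return how many times faster than intended the audio plays."""
    return playback_rate_hz / generated_rate_hz


def duration_seconds(num_bytes: int, rate_hz: int, bytes_per_sample: int = 4) -> float:
    """Duration of a mono buffer (float32 by default) when played at rate_hz."""
    return num_bytes / bytes_per_sample / rate_hz


# Matching rates: normal speed.
print(playback_speedup(24000, 24000))  # 1.0
# Hypothetical mismatch: audio generated at 24 kHz but played at 48 kHz.
print(playback_speedup(24000, 48000))  # 2.0
# 96000 bytes of float32 audio is 24000 samples, i.e. one second at 24 kHz.
print(duration_seconds(96000, 24000))  # 1.0
```

Comparing the duration computed this way with the duration of the offline result for the same text would show whether the client is playing the expected number of samples.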

kunci115 commented 3 weeks ago

I have only debugged this briefly. From what I have checked so far, it only happens with Chinese text longer than 50 tokens. As an immediate workaround, keep each request short by splitting the text into chunks, since I don't really understand the Chinese preprocessing.

import socket
import numpy as np
import asyncio
import pyaudio
import re

def chunk_text(text, max_length=50):
    """
    Splits the input text into smaller chunks based on punctuation and length.
    Adjust max_length to control chunk size.
    """
    sentences = re.split(r'(?<=[。！？])', text)  # Adjust for Chinese punctuation
    chunks = []
    current_chunk = ""

    for sentence in sentences:
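        # Flush only a non-empty chunk, so an over-long first sentence never produces an empty chunk
        if current_chunk and len(current_chunk) + len(sentence) > max_length: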
            chunks.append(current_chunk)
            current_chunk = sentence
        else:
            current_chunk += sentence

    if current_chunk:
        chunks.append(current_chunk)

    return chunks

async def listen_to_voice(text, server_ip='localhost', server_port=9998):
    client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client_socket.connect((server_ip, server_port))

    async def play_audio_stream():
        buffer = b''
        p = pyaudio.PyAudio()
        stream = p.open(format=pyaudio.paFloat32,
                        channels=1,
                        rate=24000,
                        output=True,
                        frames_per_buffer=2048)

        try:
            while True:
                chunk = await asyncio.get_event_loop().run_in_executor(None, client_socket.recv, 1024)
                if not chunk:  # End of stream
                    break
                if b"END_OF_AUDIO" in chunk:
                    buffer += chunk.replace(b"END_OF_AUDIO", b"")
                    if buffer:
                        audio_array = np.frombuffer(buffer, dtype=np.float32).copy()
                        stream.write(audio_array.tobytes())
                    break
                buffer += chunk
                if len(buffer) >= 4096:
                    audio_array = np.frombuffer(buffer[:4096], dtype=np.float32).copy()
                    stream.write(audio_array.tobytes())
                    buffer = buffer[4096:]
        finally:
            stream.stop_stream()
            stream.close()
            p.terminate()

    try:
        # Split text into chunks
        text_chunks = chunk_text(text)

        # Send each chunk, waiting for playback to finish before proceeding
        for chunk in text_chunks:
            await asyncio.get_event_loop().run_in_executor(None, client_socket.sendall, chunk.encode('utf-8'))
            await play_audio_stream()  # Play the current chunk fully before sending the next
            print(f"Finished playing chunk: {chunk}")

        print("Audio playback finished.")

    except Exception as e:
        print(f"Error in listen_to_voice: {e}")

    finally:
        client_socket.close()

# Example usage
async def main():
    await listen_to_voice(
        "春天的江潮水势浩荡，与大海连成一片，一轮明月从海上升起，好像与潮水一起涌出来。月光照耀着春江，随着波浪闪耀千万里，所有地方的春江都有明亮的月光！"
        "江水曲曲折折地绕着花草丛生的原野流淌，月光照射着开遍鲜花的树林好像细密的雪珠在闪烁。月色如霜，所以霜飞无从觉察，洲上的白沙和月色融合在一起，看不分明。"
        "江水、天空成一色，没有一点微小灰尘，明亮的天空中只有一轮孤月高悬空中。江边上什么人最初看见月亮？江上的月亮哪一年最初照耀着人？人生一代代地无穷无尽，只有江上的月亮一年年地总是相像。"
        "不知江上的月亮等待着什么人，只见长江不断地一直运输着流水。游子像一片白云缓缓地离去，只剩下思妇站在离别的青枫浦不胜忧愁。哪家的游子今晚坐着小船在漂流？",
        server_ip='localhost', server_port=9998
    )

# Run the main async function
asyncio.run(main())
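
One fragile spot in both clients, independent of text length: recv returns arbitrary slices of a TCP stream, so the END_OF_AUDIO marker can arrive split across two calls, and a slice can end in the middle of a 4-byte float32 sample. Below is a minimal sketch of a buffer that handles both cases. `AudioStreamBuffer` and its `feed` method are hypothetical names, and the sketch assumes the server sends raw float32 samples followed by the literal marker, as the client code implies. Whether this is related to the speaking-rate problem is not established here.

```python
END_MARKER = b"END_OF_AUDIO"
SAMPLE_SIZE = 4  # bytes per float32 sample


class AudioStreamBuffer:
    """Accumulates received bytes and returns playable, sample-aligned audio."""

    def __init__(self):
        self._pending = b""
        self.finished = False

    def feed(self, data: bytes) -> bytes:
        """Add received bytes; return the audio bytes that are safe to play now."""
        if self.finished:
            return b""
        self._pending += data
        marker_at = self._pending.find(END_MARKER)
        if marker_at != -1:
            # Marker fully received: everything before it is audio.
            audio = self._pending[:marker_at]
            self._pending = b""
            self.finished = True
            # Drop a trailing partial sample, if any.
            return audio[: len(audio) - len(audio) % SAMPLE_SIZE]
        # Hold back enough bytes that a partially received marker is never played.
        safe = max(0, len(self._pending) - (len(END_MARKER) - 1))
        safe -= safe % SAMPLE_SIZE
        audio, self._pending = self._pending[:safe], self._pending[safe:]
        return audio


# Demo: the marker is split across two simulated recv calls.
buf = AudioStreamBuffer()
samples = bytes(range(16))  # stands in for four float32 samples
played = buf.feed(samples + b"END_OF")
played += buf.feed(b"_AUDIO")
print(played == samples, buf.finished)
```

In the playback loop, each recv result would go through feed, and whatever it returns would be written to the PyAudio stream until finished becomes true.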

ZhikangNiu commented 14 hours ago

Since this issue has been inactive for a long time, it will be closed. Feel free to reopen this issue and ask questions at any time.

SWivid / F5-TTS

Inference with socket example have unexpected high speaking rate #402
