erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI, but it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with 3rd party software via JSON calls.

(Support) Streaming to Unity #145

Closed: bollerdominik closed this issue 5 months ago

bollerdominik commented 5 months ago

First thanks for this great tool.

I am trying to integrate the API into Unity with streaming mode. I copied this gist. I tried the code with an example chunked MP3 that I found, and it worked. Unfortunately, I was not able to use it with the AllTalk streaming API. I always get an error from Unity:

Audio clip "" could not be played. FMOD Error: An invalid parameter was passed to this function.

So I don't know if this is a Unity issue or an issue with the chunked WAV file from the streaming API. If anyone has any suggestions, I would greatly appreciate it.

erew123 commented 5 months ago

Hi @bollerdominik

I'm no expert on Unity or FMOD. However, I'm kind of wondering if there could be some mismatch with this: [Min(1024)] public int bytesToDownloadBeforePlaying = 100000; (from the link you sent). Maybe it's not getting enough data before it starts playing? Don't quote my maths here, but based on everything below, I think AllTalk sends out 48,000 bytes per second, while the code snippet above waits for 100,000 bytes before it starts playing.

Perhaps it's as simple as the FMOD code not returning a "yes, I got that, thanks; now you can send the next block over" until it has received 100,000 bytes, and maybe AllTalk is waiting for that response before it sends the next 48,000 bytes over.

So maybe try a lower figure in that code block, e.g. [Min(1024)] public int bytesToDownloadBeforePlaying = 24000;, and see what happens.
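
Rough numbers behind that guess (a quick back-of-envelope sketch in Python; the format values are taken from the tts_server.py snippet further down):

    # Byte rate of the stream, from the WAV parameters AllTalk sets below
    sample_rate = 24000   # Hz, vfout.setframerate(24000)
    sample_width = 2      # bytes per sample, vfout.setsampwidth(2)
    channels = 1          # mono, vfout.setnchannels(1)

    bytes_per_second = sample_rate * sample_width * channels
    print(bytes_per_second)            # 48000

    # Audio you have to wait for before playback starts:
    print(100_000 / bytes_per_second)  # ~2.08 seconds buffered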

Alternatively, there are a couple of things you could try in the AllTalk code. Here is a detailed breakdown of what AllTalk does to generate the response and what the settings are, so you can check whether FMOD will play this kind of file and handle this kind of response. In tts_server.py you can find the following:

    if streaming:
        # Streaming-specific operations (excerpt; relies on io, wave,
        # numpy as np and torch already being imported in tts_server.py)
        file_chunks = []
        wav_buf = io.BytesIO()
        with wave.open(wav_buf, "wb") as vfout:
            vfout.setnchannels(1)      # mono
            vfout.setsampwidth(2)      # 16-bit samples
            vfout.setframerate(24000)  # 24 kHz sample rate
            vfout.writeframes(b"")     # no frames yet - header only
        wav_buf.seek(0)
        yield wav_buf.read()           # send the WAV header first

        for i, chunk in enumerate(output):
            file_chunks.append(chunk)
            if isinstance(chunk, list):
                chunk = torch.cat(chunk, dim=0)           # merge sub-chunks
            chunk = chunk.clone().detach().cpu().numpy()  # tensor -> ndarray
            chunk = chunk[None, : int(chunk.shape[0])]    # ensure 2D shape
            chunk = np.clip(chunk, -1, 1)                 # clamp to [-1, 1]
            chunk = (chunk * 32767).astype(np.int16)      # scale to 16-bit PCM
            yield chunk.tobytes()                         # stream raw PCM bytes

and it is returned with audio/wav as the media type:

return StreamingResponse(response, media_type="audio/wav")

You certainly can't play with vfout.setframerate(24000), as the model only outputs at 24,000 Hz; changing this would just speed up or slow down the way the sound plays.

But the standard return would be a 16-bit, mono audio file with a sample rate of 24 kHz.
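
If you want to rule Unity out entirely, you could capture the stream with a short Python script and check whether the saved file plays locally. A minimal sketch; the URL, port, and parameter names here are assumptions, so adjust them to match your own AllTalk install:

    import requests

    # Assumed endpoint and parameters - check your AllTalk configuration
    url = "http://localhost:7851/api/tts-generate-streaming"
    params = {"text": "Hello from AllTalk", "voice": "female_01.wav"}

    with requests.get(url, params=params, stream=True) as resp:
        resp.raise_for_status()
        with open("stream_test.wav", "wb") as f:
            for block in resp.iter_content(chunk_size=4096):
                f.write(block)  # WAV header first, then raw 16-bit PCM chunks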

As for the code that generates the streaming response:

file_chunks = []: Initializes an empty list to hold parts of the audio data (chunks) that will be streamed.

wav_buf = io.BytesIO(): Creates a buffer in memory using BytesIO. This buffer acts as a temporary storage for the WAV file being constructed.

Setting Up WAV File Parameters: the WAV format settings are specified using the wave module:

Opening WAV Buffer for Writing: with wave.open(wav_buf, "wb") as vfout: opens the memory buffer for writing audio data in WAV format. The 'wb' mode indicates writing in binary format.

Number of Channels (setnchannels(1)): Specifies that the audio will be mono, meaning there is only one audio channel. Stereo audio, for example, would use two channels.

Sample Width (setsampwidth(2)): Sets the sample width to 2 bytes (16 bits). This indicates each audio sample is represented by 16 bits, typical for CD-quality audio.

Frame Rate (setframerate(24000)): Sets the sample rate (or frame rate) to 24,000 Hz. This is the number of samples per second and affects the audio's pitch and quality. A rate of 24,000 Hz is lower than CD quality (44,100 Hz) but is sufficient for clear speech.

Initializing Frames: vfout.writeframes(b"") writes an empty frame payload to the WAV file. This can be seen as preparing the file for data to be added.

After setting up the WAV file parameters, the code prepares to stream the audio data:

Resetting Buffer Position: wav_buf.seek(0) resets the position in the buffer to the beginning, making it ready for reading.

Yielding Initial Data: yield wav_buf.read() sends the initial WAV file header (and any other setup data written so far) to the consumer of this generator. This is the first piece of data streamed to the client.

Processing and Streaming Chunks: The loop for i, chunk in enumerate(output): iterates over output, which contains the audio data to be streamed, in chunks.

Chunk Aggregation (if applicable): If a chunk is a list (potentially of smaller chunks or tensors), it's concatenated into a single tensor using torch.cat(chunk, dim=0).

The audio data is detached from any computation graph (clone().detach()), moved to CPU memory if not already (cpu()), and converted to a NumPy array (numpy()).

The data is then ensured to be in a 2D shape (chunk[None, :]) for consistency.

The audio samples are clipped to the range [-1, 1] to ensure no values exceed this range, avoiding distortion.

The clipped samples are scaled to the 16-bit integer range and converted to 16-bit integers: (chunk * 32767).astype(np.int16). This is because the WAV format expects samples to be in this format.

Yielding Processed Audio: yield chunk.tobytes() converts the processed audio chunk to bytes and yields it for streaming. This step is repeated for each chunk in the output, allowing the audio to be streamed in parts as it is processed.
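
One thing worth knowing if a strict decoder rejects the stream: the header yielded at the start was finalized before any frames were written, so its length fields describe an empty file. A consumer can always rebuild a well-formed WAV once the stream has finished. Here is a minimal sketch, assuming the standard 44-byte PCM header that Python's wave module writes:

    import io
    import wave

    def rebuild_wav(stream_bytes: bytes, header_size: int = 44) -> bytes:
        """Re-wrap streamed audio as a WAV file with correct length fields.

        stream_bytes is assumed to hold everything the endpoint sent:
        the initial header followed by the raw 16-bit PCM chunks.
        """
        pcm = stream_bytes[header_size:]  # drop the zero-length header
        buf = io.BytesIO()
        with wave.open(buf, "wb") as out:
            out.setnchannels(1)           # mono
            out.setsampwidth(2)           # 16-bit samples
            out.setframerate(24000)       # the model's fixed output rate
            out.writeframes(pcm)          # header now reflects real length
        return buf.getvalue()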

Hopefully that gives you something to go on.

Thanks

bollerdominik commented 5 months ago

Hi, thanks for your answer. Unfortunately, I could not get it to work with bytesToDownloadBeforePlaying = 24000; or other values. Since it is most likely not an issue with alltalk_tts and more something with Unity, I will close the issue (and keep your issue tracker nice & small ;) )