collabora / WhisperSpeech

An Open Source text-to-speech system built by inverting Whisper.
https://collabora.github.io/WhisperSpeech/
MIT License

Output audio "format not recognized" #57

Closed. Josephrp closed this issue 9 months ago.

Josephrp commented 9 months ago

I'm making a demo for WhisperSpeech and ran into an error. See the discussion here and feel free to make a PR:

https://huggingface.co/spaces/Tonic/laion-whisper/discussions/1

This is the error:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 164, in thread_wrapper
    res = future.result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/app/app.py", line 54, in whisper_speech_demo
    sf.write(tmp_file_name, audio_np, 24000)
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 343, in write
    with SoundFile(file, 'w', samplerate, channels,
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 658, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/user/.local/lib/python3.10/site-packages/soundfile.py", line 1216, in _open
    raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
    output = await route_utils.call_process_api(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/route_utils.py", line 232, in call_process_api
    output = await app.get_blocks().process_api(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1561, in process_api
    result = await self.call_function(
  File "/home/user/.local/lib/python3.10/site-packages/gradio/blocks.py", line 1179, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/.local/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
    return await future
  File "/home/user/.local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/home/user/.local/lib/python3.10/site-packages/gradio/utils.py", line 678, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/spaces/zero/wrappers.py", line 118, in gradio_handler
    raise res.value
soundfile.LibsndfileError: Error opening '/tmp/tmp69sgx7yk.wav': Format not recognised.

Here is the code:

https://huggingface.co/spaces/Tonic/laion-whisper/blob/main/app.py

import spaces
import tempfile
import gradio as gr
import os
import torch
import soundfile as sf
import numpy as np
import torch.nn.functional as F
from whisperspeech.languages import LANGUAGES
from whisperspeech.pipeline import Pipeline
from whisperspeech.utils import resampler

title = """# 🙋🏻‍♂️ Welcome to🌟Tonic's🌬️💬📝WhisperSpeech

You can use this ZeroGPU Space to test out the current model [🌬️💬📝collabora/whisperspeech](https://huggingface.co/collabora/whisperspeech). 🌬️💬📝collabora/whisperspeech is An Open Source text-to-speech system built by inverting Whisper. Previously known as spear-tts-pytorch. It's like Stable Diffusion but for speech – both powerful and easily customizable.
You can also use 🌬️💬📝WhisperSpeech by cloning this space. 🧬🔬🔍 Simply click here: <a style="display:inline-block" href="https://huggingface.co/spaces/Tonic/laion-whisper?duplicate=true"><img src="https://img.shields.io/badge/-Duplicate%20Space-blue?labelColor=white&style=flat&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAAXNSR0IArs4c6QAAAP5JREFUOE+lk7FqAkEURY+ltunEgFXS2sZGIbXfEPdLlnxJyDdYB62sbbUKpLbVNhyYFzbrrA74YJlh9r079973psed0cvUD4A+4HoCjsA85X0Dfn/RBLBgBDxnQPfAEJgBY+A9gALA4tcbamSzS4xq4FOQAJgCDwV2CPKV8tZAJcAjMMkUe1vX+U+SMhfAJEHasQIWmXNN3abzDwHUrgcRGmYcgKe0bxrblHEB4E/pndMazNpSZGcsZdBlYJcEL9Afo75molJyM2FxmPgmgPqlWNLGfwZGG6UiyEvLzHYDmoPkDDiNm9JR9uboiONcBXrpY1qmgs21x1QwyZcpvxt9NS09PlsPAAAAAElFTkSuQmCC&logoWidth=14" alt="Duplicate Space"></a></h3> 
Join us : 🌟TeamTonic🌟 is always making cool demos! Join our active builder's🛠️community 👻  [![Join us on Discord](https://img.shields.io/discord/1109943800132010065?label=Discord&logo=discord&style=flat-square)](https://discord.gg/GWpVpekp) On 🤗Huggingface: [TeamTonic](https://huggingface.co/TeamTonic) & [MultiTransformer](https://huggingface.co/MultiTransformer) On 🌐Github: [Polytonic](https://github.com/tonic-ai) & contribute to 🌟 [Poly](https://github.com/tonic-ai/poly) 🤗Big thanks to Yuvi Sharma and all the folks at huggingface for the community grant 🤗
"""

@spaces.GPU
def whisper_speech_demo(text, lang, speaker_audio, mix_lang, mix_text):
    pipe = Pipeline()
    speaker_url = None

    if speaker_audio is not None:
        speaker_url = speaker_audio

    if mix_lang and mix_text:
        mixed_langs = lang.split(',') + mix_lang.split(',')
        mixed_texts = [text] + mix_text.split(',')
        stoks = pipe.t2s.generate(mixed_texts, lang=mixed_langs)
        audio_data = pipe.generate(stoks, speaker_url, lang=mixed_langs[0])
    else:
        audio_data = pipe.generate(text, speaker_url, lang)

    resample_audio = resampler(newsr=24000)
    audio_data_resampled = next(resample_audio([{'sample_rate': 22050, 'samples': audio_data.cpu()}]))['samples_24k']

    # Normalize and write to a WAV file
    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
        tmp_file_name = tmp_file.name
        audio_np = audio_data_resampled.numpy()  # Convert to numpy array

        # Normalize if necessary
        if audio_np.max() > 1.0 or audio_np.min() < -1.0:
            audio_np = audio_np / np.max(np.abs(audio_np))

        # Ensure the audio data is 2D (num_samples, num_channels)
        if audio_np.ndim == 1:
            audio_np = np.expand_dims(audio_np, axis=1)

        # Write the file
        sf.write(tmp_file_name, audio_np, 24000)

    return tmp_file_name

with gr.Blocks() as demo:
    gr.Markdown(title)

    with gr.Tabs():
        with gr.TabItem("🌬️💬📝Standard TTS"):
            with gr.Row():
                text_input_standard = gr.Textbox(label="Enter text")
                lang_input_standard = gr.Dropdown(choices=list(LANGUAGES.keys()), label="Language")
                speaker_input_standard = gr.Audio(label="Upload or Record Speaker Audio (optional)", sources=["upload", "microphone"], type="filepath")
                placeholder_mix_lang = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_mix_text = gr.Textbox(visible=False)  # Placeholder, hidden
                generate_button_standard = gr.Button("Generate Speech")
            output_audio_standard = gr.Audio(label="🌬️💬📝WhisperSpeech")

            generate_button_standard.click(
                whisper_speech_demo,
                inputs=[text_input_standard, lang_input_standard, speaker_input_standard, placeholder_mix_lang, placeholder_mix_text],
                outputs=output_audio_standard
            )

        with gr.TabItem("🌬️💬📝Mixed Language TTS"):
            with gr.Row():
                placeholder_text_input = gr.Textbox(visible=False)  # Placeholder, hidden
                placeholder_lang_input = gr.Dropdown(choices=[], visible=False)  # Placeholder, hidden
                placeholder_speaker_input = gr.Audio(visible=False)  
                mix_lang_input_mixed = gr.CheckboxGroup(choices=list(LANGUAGES.keys()), label="Select Languages")
                mix_text_input_mixed = gr.Textbox(label="Enter mixed language text", placeholder="e.g., Hello, Cześć")
                generate_button_mixed = gr.Button("Generate Mixed Speech")
            output_audio_mixed = gr.Audio(label="Mixed🌬️💬📝WhisperSpeech")

            generate_button_mixed.click(
                whisper_speech_demo,
                inputs=[placeholder_text_input, placeholder_lang_input, placeholder_speaker_input, mix_lang_input_mixed, mix_text_input_mixed],
                outputs=output_audio_mixed
            )

demo.launch()
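
For reference, soundfile treats a 2-D array as (frames, channels), so a (1, N) array coming straight from a single-channel torch tensor would be interpreted as N channels, which libsndfile can reject. A minimal defensive-write sketch, assuming the resampled audio arrives as a (1, N) float32 array (the placeholder data and names below are assumptions, not the app's actual values):

import numpy as np
import soundfile as sf

# Placeholder for the resampled output; a (1, N) array is a common
# shape for a single-channel torch tensor converted with .numpy().
audio_np = np.random.randn(1, 24000).astype(np.float32)

# soundfile expects (frames, channels); transpose if channels came first.
if audio_np.ndim == 2 and audio_np.shape[0] < audio_np.shape[1]:
    audio_np = audio_np.T

sf.write('out.wav', audio_np, 24000)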

Would love some direction on resolving this one :-)

jpc commented 9 months ago

Hey, I don’t see what’s wrong with this from looking at the code. Maybe you could try torchaudio? pipeline.py has an example of saving the result to a file, which I tested earlier.
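
A minimal sketch of that torchaudio route, assuming pipe.generate returns a float waveform tensor and the output rate is 24 kHz (check pipeline.py for the exact rate; the names and placeholder data here are assumptions):

import torch
import torchaudio

# Placeholder for the tensor returned by pipe.generate(...).
audio = torch.randn(24000)

# torchaudio.save expects a CPU tensor shaped (channels, frames).
audio = audio.cpu()
if audio.ndim == 1:
    audio = audio.unsqueeze(0)

torchaudio.save('output.wav', audio, 24000)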

Josephrp commented 9 months ago

I think I got it, but I had used some custom resampling that didn't work. For the latest code base, check the Space here: https://huggingface.co/spaces/tonic/laion-whisper (to be renamed!)

😉🙏🏻🚀
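
For anyone landing here later: a minimal resampling sketch that avoids custom code by using torchaudio, assuming a (1, N) float waveform at an input rate in_sr (rates and names below are assumptions):

import torch
import torchaudio.functional as AF

in_sr, out_sr = 22050, 24000   # assumed rates; check pipeline.py for the real output rate
audio = torch.randn(1, in_sr)  # placeholder for the generated waveform

resampled = AF.resample(audio, orig_freq=in_sr, new_freq=out_sr)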

Josephrp commented 9 months ago

fixed :-)