YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

API access to Gradio demos broken? #2

Closed jpgard closed 1 year ago

jpgard commented 1 year ago

Hi, congratulations on the great work! Thanks for the open-source demo(s). Really looking forward to trying it out, as it could be really useful.

I have a somewhat large set of files I'd like to run through the LTU-2 demo (around 1000 files). It seems the best way to do this is by using the API endpoint via the Hugging Face space. However, when I try to query the API using the sample code provided, it hangs indefinitely and never returns a response.

I'm using the code copied verbatim from the usage example (by "Use via API" in the footer of the demo page).

Here's the exact code:

from gradio_client import Client

client = Client("https://yuangongfdu-ltu-2.hf.space/")
result = client.predict(
                "https://github.com/gradio-app/gradio/raw/main/test/test_files/audio_sample.wav",   # str (filepath or URL to file)
                "Howdy!",   # str in 'Edit the textbox to ask your own questions!' Textbox component
                api_name="/predict"
)
print(result)

The code hangs permanently and never returns. If I interrupt the process while it's hanging, it looks like the client is waiting for a result.

Can you provide any guidance on how to interact with the API -- in particular, a working example of how to do so using a local audio file? I know the v2 model is in beta; if that's the issue, can you give an example of a working command to query the v1 API (which is not on Hugging Face Spaces)? I also tried the demo code for the v1 API (by clicking the interactive demo and "Use via API"), but the Python code example in that demo also returns an error rather than a valid response.

I am using Python 3.10, transformers 4.28.0, and gradio 3.23.0.

YuanGongND commented 1 year ago

hi Josh,

Thanks so much for the kind words.

I am checking the issues you mentioned and will give an update soon.

Before I solve this issue (it seems to be an HF one), I guess you can do this:

https://huggingface.co/spaces/yuangongfdu/ltu-2/blob/main/app.py

Copy the script to your local machine, remove the main function, and iteratively call predict(audio_path, question). Would that work?
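
For example, a minimal sketch of that batch workflow (here `predict` is a stand-in for the function from the copied app.py, and `run_batch` is just an illustrative helper, not part of the repo):

```python
from pathlib import Path

def run_batch(audio_dir, question, predict):
    """Call predict(audio_path, question) on every .wav file under
    audio_dir, collecting the answers keyed by file path."""
    results = {}
    for wav in sorted(Path(audio_dir).glob("*.wav")):
        results[str(wav)] = predict(str(wav), question)
    return results
```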

-Yuan

jpgard commented 1 year ago

Thanks for the reply!

I followed your suggestion. It seems like perhaps the file upload is failing? After a while, the upload_audio() function returns the response 'Please upload an audio file.' I am providing a path to a valid 10-second, 16 kHz mono WAV file. (The file is only 320 KB in size.)

YuanGongND commented 1 year ago

I just checked - the server needs a reboot. Will let you know once it is done.

YuanGongND commented 1 year ago

hi Josh,

The reason for your V2 issue was that our server was down. You can now use the HF Space API.

I made a quick demo at:

https://github.com/YuanGongND/ltu/blob/main/ltu2_api_demo.py

I briefly tested that and it worked for me. Please let me know if that works.

V1 seems to be a different problem; I saw an error message in the log. Could you try your audio with the GUI version first, and if it succeeds, switch to the API? I know someone has used that API. The problem with V1 is that it is not an HF Space, so it has a 3-day limit and I need to restart it every 3 days.

-Yuan

jpgard commented 1 year ago

Great! Thanks @YuanGongND! Confirmed it is working on my end now. Thanks for looking at this.

If the v2 model is up, stable, and also the version you recommend for general audio tasks, then I won't plan to use v1. (However, I wasn't sure what to make of the warning on the v2 demo that the model is under construction and may be buggy, which made me think v1 might be the preferred stable option.)

I will make an effort not to overload the API with requests, since it sounds like you are manually hosting the backend. Let me know if you have any idea of what a reasonable rate is; I probably won't be making more than 1 request every 5 seconds.
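
On my side, a simple client-side throttle like the sketch below would enforce that pacing (the 5-second interval and the `Throttle` class are my own choices, not anything documented by the API):

```python
import time

class Throttle:
    """Ensure at least min_interval seconds between successive calls."""
    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        # Sleep only if the previous call was less than min_interval ago.
        delay = self.min_interval - (time.monotonic() - self._last)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```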

YuanGongND commented 1 year ago

V2 and V1 differ dramatically in text understanding and behave quite differently in many cases. It would be nice to compare them.

You can use as high a load as you like! But the model is a bit slow (~10 s for a new 10 s audio, ~2-3 s if you ask a different question about a cached audio).

Just to provide more info for your experiment: the V2 model only considers the first 10 s of the audio even if you input a longer one; however, if the input contains text (speech), it does take the text of the entire audio clip as input. So if the input is longer, the processing time is also longer, and if the input is very long, there could be an OOM issue. We mainly test on 10 s audios. If your audio is just a few seconds longer than 10 s, it is safe to input to the system (the system trims it automatically); otherwise, you could consider trimming your audio to 10 s before inputting it to LTU.
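
If you want to trim locally first, something like this sketch using only the standard-library wave module should work for plain WAV files (the trim_wav helper is just an illustration, not part of our code):

```python
import wave

def trim_wav(src, dst, max_seconds=10):
    """Copy the first max_seconds of a WAV file from src to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        keep = min(reader.getnframes(),
                   int(reader.getframerate() * max_seconds))
        frames = reader.readframes(keep)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # header frame count is fixed up on close
        writer.writeframes(frames)
```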

-Yuan

jpgard commented 1 year ago

This is super helpful information, thank you.

Just to make sure I understand: if there is no text input, the v2 model will truncate the audio to 10s. But if there is text input, the v2 model will consider the whole clip (but may OOM)?

YuanGongND commented 1 year ago

Just to make sure I understand: if there is no text input, the v2 model will truncate the audio to 10s. But if there is text input, the v2 model will consider the whole clip (but may OOM)?

By text, I mean "spoken text" in the audio file, not the "text" question

Sample:

A 20-second music clip where the first 10 seconds are guitar and the second 10 seconds are piano, with lyrics throughout the entire 20 seconds.

The information LTU-2 would get: the guitar sounds and the entire lyrics. The piano sound would be ignored.

-Yuan

YuanGongND commented 1 year ago

@jpgard

Hi there, just a friendly reminder -

Our server uses the filename as the cache key for audios, so please use a different filename for each distinct audio; otherwise the result might be based on a different audio.

Sorry for the inconvenience.

-Yuan