elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Transcription "failed" for small file (both tiny and base models tried) #377

Closed noozo closed 1 week ago

noozo commented 1 week ago

When using a whisper serving with batched_run (or even just run), we get really bad results for the attached sample mp3 (and any other small audio files we give it):

[
  %{
    text: " My thought I",
    start_timestamp_seconds: 0.0,
    end_timestamp_seconds: nil
  }
]

When trying to run the same transcription model in python, for instance, the results are much better:

Segmented Transcription:
[0.00s - 3.02s]  My thought I have nobody by a beauty and will as you poured.
[3.86s - 9.82s]  Mr. Rochester is sub in that so-don't find simplest, and devoted about, to let might in
[9.82s - 9.94s]  a

What could be the problem here? Some configuration we are missing, for example?

Our serving is defined as follows:

defp initialize_whisper_tiny_serving do
  Nx.default_backend(EXLA.Backend)

  {:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-base"})
  {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-base"})
  {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-base"})
  {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-base"})

  Bumblebee.Audio.speech_to_text_whisper(
    whisper,
    featurizer,
    tokenizer,
    generation_config,
    compile: [batch_size: 4],
    defn_options: [compiler: EXLA],
    chunk_num_seconds: 20,
    # context_num_seconds: 5,
    timestamps: :segments
  )
end

We also observed that passing a really low chunk_num_seconds (like 2) gives better results, but we wonder how taxing that will really be.

[
  %{
    text: " My thought I have nobody by a beauty and will as you twod. Thank you, Pord. Mr. Rochester is sub and that's so don't and devoted about. about to what might in a row.",
    start_timestamp_seconds: 0.0,
    end_timestamp_seconds: 10.053625000000002
  }
]
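
To get a rough sense of how taxing a small chunk_num_seconds is, a back-of-the-envelope calculation shows how quickly the number of model invocations per file grows as the chunk shrinks. The overlap fraction below is an illustrative assumption (chunked long-form decoding overlaps adjacent chunks on each side), not a value taken from Bumblebee:

```python
# Rough cost estimate: model invocations per audio file as a function of
# chunk length. overlap_fraction is an illustrative assumption; the real
# serving's overlap may differ.
import math

def num_chunks(audio_seconds, chunk_seconds, overlap_fraction=1/6):
    # Each chunk advances by its length minus the overlapped portions.
    step = chunk_seconds * (1 - 2 * overlap_fraction)
    return max(1, math.ceil(audio_seconds / step))

for chunk in (2, 20, 30):
    print(chunk, num_chunks(120, chunk))
```

Under these assumptions, a 2-minute recording needs 90 forward passes with 2s chunks but only 6 with 30s chunks, so very small chunks are dramatically more expensive.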

Our use case is to have a video call with many participants and we can record individual audio files for each, but would like to then be able to break each of these down into small chunks so that we segment them and combine everything into a timeline of people talking (in turns). Maybe our approach is naive, as there are services and models to do diarization, but since we have access to each speaker's audio individually, it seemed like a good initial approach to the problem.
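The per-speaker timeline idea above can be sketched independently of the transcription backend: collect each speaker's segments and sort them into one conversation timeline by start time. The simplified segment keys and speaker names below are made up for illustration; the real serving returns maps with start_timestamp_seconds / end_timestamp_seconds:

```python
# Merge per-speaker transcription segments into one timeline ordered by
# start time. Keys and speaker names are simplified/illustrative.

def build_timeline(per_speaker_segments):
    timeline = []
    for speaker, segments in per_speaker_segments.items():
        for seg in segments:
            timeline.append({"speaker": speaker, **seg})
    # Segments with a nil/None start timestamp sort last.
    return sorted(timeline, key=lambda s: (s["start"] is None, s["start"] or 0.0))

alice = [{"start": 0.0, "end": 2.0, "text": "Hello!"}]
bob = [{"start": 1.0, "end": 3.0, "text": "Hi there."}]
timeline = build_timeline({"alice": alice, "bob": bob})
for seg in timeline:
    print(f'[{seg["start"]}s] {seg["speaker"]}: {seg["text"]}')
```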

noozo commented 1 week ago

sample.mp3.zip

josevalim commented 1 week ago

Can you please provide a small notebook that reproduces the failure? You can use the smart cell if necessary to help you bootstrap it.

noozo commented 1 week ago

untitled_notebook.livemd.zip

noozo commented 1 week ago

The Livebook assumes the sample.mp3 file is at /tmp/room_1/sample.mp3 and that ffmpeg is installed on the system. Thanks for looking into this :)

jonatanklosko commented 1 week ago

There is something wrong with batching.

@noozo you can try compile: [batch_size: 1] until this is fixed.

jonatanklosko commented 1 week ago

That was a regression on main, fixed in 3b0eb08716911d97a7ff1cbfcfbf357eaf5e95e3!

noozo commented 1 week ago

Hi @jonatanklosko, thanks for looking into this. That seems to fix the chunks issue, although I have to say that Bumblebee's results feel worse than Python's with the same (base) model. Here's an example generated in both with the same audio file:

Python:

Segmented Transcription:
[1.08s - 2.02s]  Hello!
[3.36s - 13.36s]  And now, recording, hopefully, tell me, what was the last time you went cycling?
[18.70s - 22.76s]  And was that a big run?
[27.94s - 28.80s]  Well, that's good.
[30.42s - 32.44s]  That's what? Two hours? One hour?
[41.00s - 46.66s]  No, it's not. Yeah.
[48.92s - 54.72s]  Well, depends on the type of track that you do. If you do downhill and stuff like that, it will be harder.
[65.54s - 71.62s]  Yeah. How about you, because I was the last exercise?
[75.12s - 81.48s]  Okay. Lifting weights and stuff. Very good.
[83.26s - 87.14s]  Doing some protein, creatine, or not.
[87.14s - 93.16s]  But, just pure lifting.
[94.84s - 96.90s]  Roaching and spinach, right?
[99.60s - 100.88s]  Yeah, through.
[101.76s - 102.60s]  Through, through, through.
[104.26s - 104.74s]  Cool.
[106.22s - 107.14s]  Cool.
[112.74s - 115.68s]  Cool. All right. Thank you.
[116.50s - 119.06s]  And see you. Bye-bye.
[119.52s - 120.20s]  Bye-bye.

Bumblebee:

[
  %{
    text: " Hello. Hello! Thank you. recording recording hopefully Thank you. Tell me what was What was the last time you... you and cycling. Please. Thank you. And... I was very happy to be here. Was that a big? A big run. Bye. Thank you. Let's That's fine. That's about two hours, one hour. Thank you. Bye bye. Thank you.",
    start_timestamp_seconds: 0.0,
    end_timestamp_seconds: 41.0
  },
  %{
    text: " Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. Thank you. No, it's not, yeah. Not even. Not easy, well depends. Well depends, depends on the type of track that you do downhill and stuff like that will be harder. Thank you. Thank you. you you Thank you. Yeah. How about you, Kazan? Thank you for watching, I was with you. So that's exercise and get it. Thank you. Well. Okay. Okay, lifting weights and stuff. Very nice. Yes, very good. doing some, you know protein creates team, create team or not. Thank you. Pure Just pure... just pure lifting. Right. Roos, Roos, je kan je en spin het. and spinach is right. Thank you. Yeah. Yeah, through. Thank you. Cool. Thank you. Yeah. Hmm. Cool. All right. Thank you. and the Bye bye.",
    start_timestamp_seconds: 41.0,
    end_timestamp_seconds: 120.07
  },
  %{
    text: " Good job.",
    start_timestamp_seconds: 120.07,
    end_timestamp_seconds: 120.8437499999996
  }
]

Bumblebee seems to repeat a lot of text with chunk_num_seconds: 2.

If I change that to 20s instead, I still get:

[
  %{text: " Hello!", start_timestamp_seconds: 0.0, end_timestamp_seconds: 2.0},
  %{
    text: " And now, recording, hopefully, tell me what was the last time you went cycling. And was that a big run?",
    start_timestamp_seconds: 2.0,
    end_timestamp_seconds: 23.33
  },
  %{
    text: " That's good. That's about two hours, one hour.",
    start_timestamp_seconds: 23.33,
    end_timestamp_seconds: 33.67
  },
  %{
    text: " No, it's not easy. Well, depends on the type of track that you do.",
    start_timestamp_seconds: 49.28,
    end_timestamp_seconds: 52.08
  },
  %{
    text: " If you do downhill and stuff like that, it will be harder.",
    start_timestamp_seconds: 52.08,
    end_timestamp_seconds: 68.69
  },
  %{
    text: " How about you cousin? I was the last exercise. You did.",
    start_timestamp_seconds: 71.67,
    end_timestamp_seconds: 72.67
  },
  %{
    text: " What.",
    start_timestamp_seconds: 72.67,
    end_timestamp_seconds: 75.67
  },
  %{
    text: " Okay, lifting weights and stuff.",
    start_timestamp_seconds: 75.67,
    end_timestamp_seconds: 79.67
  },
  %{
    text: " Very, very good.",
    start_timestamp_seconds: 79.67,
    end_timestamp_seconds: 82.67
  },
  %{
    text: " Doing some, you know, protein, creatine or not pure lifting.",
    start_timestamp_seconds: 82.67,
    end_timestamp_seconds: 93.6
  },
  %{
    text: " Roaching and spinach, right. Yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, yeah, Yeah. Cool. All right. Thank you. And the see you. Bye bye.",
    start_timestamp_seconds: 93.6,
    end_timestamp_seconds: 119.14699999999998
  }
]

noozo commented 1 week ago

For clarity, those "yeah yeah yeah" repetitions are not in the audio; it seems like a repetition bug of some kind, akin to what happens with chunk_num_seconds: 2.

noozo commented 1 week ago

Python's version, however, is almost on point (I didn't say "through", I said "true", but that's basically just it not recognizing my poor Portuguese accent :))

jonatanklosko commented 1 week ago

2s chunks are too short. The chunking is primarily meant for long-form transcription: the model can handle 30s of audio at most, so for longer audio we need to chunk. I would try chunk_num_seconds: 30, timestamps: :segments.

FTR which Python implementation do you use, transformers or openai-whisper?

noozo commented 1 week ago

Good question. Here's the script I'm using:

import whisper
import os

# Set the path to the audio file
# audio_path = "/tmp/room_1/recording.mp3"
audio_path = "/tmp/room_2/2c77dead-454d-44fc-8669-54f88193a49f_1720028217.mp3"
# audio_path = "/tmp/room_2/eab8e01c-26fe-408e-91e2-a938be493632_1720028216.mp3"
# audio_path = "/tmp/room_2/eb5677eb-dc2f-469d-a25b-d27c748c667f_1720028216.mp3"

# Check if the file exists
if not os.path.exists(audio_path):
    print(f"Error: The file {audio_path} does not exist.")
    exit(1)

# Load the Whisper model
model = whisper.load_model("base")

# Transcribe the audio with word-level timestamps
print("Transcribing audio...")
result = model.transcribe(audio_path, word_timestamps=True)

# Print the segmented transcription with timestamps
print("\nSegmented Transcription:")
for segment in result["segments"]:
    start_time = segment["start"]
    end_time = segment["end"]
    text = segment["text"]
    print(f"[{start_time:.2f}s - {end_time:.2f}s] {text}")

# Optionally, print word-level timestamps
# print("\nWord-level Timestamps:")
# for segment in result["segments"]:
#     for word in segment["words"]:
#         print(f"[{word['start']:.2f}s - {word['end']:.2f}s] {word['word']}")

noozo commented 1 week ago

Seems to be https://pypi.org/project/openai-whisper/

noozo commented 1 week ago

Ok, so 30s seems to provide results in line with Python's (i.e. no crazy "yeah yeah" repetition):

[
  %{text: " Hello!", start_timestamp_seconds: 0.0, end_timestamp_seconds: 2.0},
  %{
    text: " And now, recording, hopefully, tell me, what was the last time you went cycling?",
    start_timestamp_seconds: 2.0,
    end_timestamp_seconds: 18.48
  },
  %{
    text: " And was that a big run?",
    start_timestamp_seconds: 18.48,
    end_timestamp_seconds: 28.0
  },
  %{
    text: " Oh, that's good. That's what, two hours, one hour?",
    start_timestamp_seconds: 30.0,
    end_timestamp_seconds: 33.0
  },
  %{
    text: " No, it's not easy. Well, depends depends on the type of track that you do. If you do downhill and stuff like that, it will be harder. Yeah, how about you, because then I was the last exercise.",
    start_timestamp_seconds: 52.96,
    end_timestamp_seconds: 75.0
  },
  %{
    text: " Okay, lifting weights and stuff. Very good. Doing some protein, creatine or not pure, just pure lifting.",
    start_timestamp_seconds: 75.0,
    end_timestamp_seconds: 93.48
  },
  %{
    text: " Roaching and spinach, right? Yeah, through, through through. Cool.",
    start_timestamp_seconds: 93.48,
    end_timestamp_seconds: 109.6
  },
  %{
    text: " Yeah. Cool. All right. Thank you. And see you. Bye bye.",
    start_timestamp_seconds: nil,
    end_timestamp_seconds: 119.0
  }
]

noozo commented 1 week ago

Follow-up question: why does Python segment it more than Bumblebee? (8 segments in Bumblebee vs 19 in Python)

jonatanklosko commented 1 week ago

Nice. The bigger the chunk we can feed into the model, the better, especially in cases where there are periods of silence, and 30s is the maximum that Whisper handles.

> Follow up question, why does python segment it more than bumblebee? (8 in bumblebee vs 19 in python)

The openai-whisper implementation uses a different approach to chunking and merging. We went with the approach that transformers uses, which allows chunks to be processed in parallel. The technique involves overlapping audio chunks, so there is overlap in the text, which we deduplicate when merging. One consequence is that we need to skip some of the segment markers that the model outputs; I think that's why there is less segmentation.
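
A toy illustration of the overlap-and-deduplicate idea (not Bumblebee's actual implementation, which operates on token ID sequences rather than words): adjacent chunks share a slice of audio, so their transcripts share text, and the merge drops the repeated words.

```python
# Toy word-level version of overlap deduplication between transcripts of
# adjacent, overlapping audio chunks. The real implementations work on
# token IDs; plain words are used here for readability.

def merge_overlapping(left, right):
    lw, rw = left.split(), right.split()
    # Find the longest suffix of `left` that is also a prefix of `right`.
    for size in range(min(len(lw), len(rw)), 0, -1):
        if lw[-size:] == rw[:size]:
            return " ".join(lw + rw[size:])
    # No overlap found: just concatenate.
    return " ".join(lw + rw)

merged = merge_overlapping(
    "tell me what was the last time",
    "the last time you went cycling",
)
print(merged)
```

Because each chunk can be decoded independently before this merge step, the chunks batch cleanly across the serving, which is the parallelism benefit mentioned above.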

noozo commented 1 week ago

Fair enough, but I just ran into another "problem". Another audio sample we have generates this using openai-whisper (Python):

Segmented Transcription:
[0.00s - 0.56s]  Hello.
[2.48s - 3.38s]  Hello.
[7.32s - 8.22s]  Hello.
[8.54s - 9.38s]  Hello, hello, hello, hello.
[9.38s - 11.38s]  Hello.
[41.20s - 42.82s]  50 kilometers a lot.
[42.96s - 44.28s]  Now I have an idea.
[47.06s - 47.80s]  Is it easy?
[72.72s - 73.88s]  Yes, today.
[74.94s - 77.18s]  I just went to the gym yesterday.
[81.06s - 81.46s]  Yeah.
[86.94s - 88.40s]  No, no.
[92.68s - 94.52s]  Just raw chicken.
[97.40s - 97.96s]  Yeah.
[104.64s - 106.24s]  I mean, I have protein.
[106.36s - 108.22s]  I put it in yogurt, but that's it.
[110.30s - 111.04s]  It's good.
[117.56s - 118.28s]  Bye bye.

But Bumblebee seems to hallucinate some Cyrillic characters:

[
  %{text: " Hello.", start_timestamp_seconds: 0.0, end_timestamp_seconds: 2.0},
  %{text: " Hello.", start_timestamp_seconds: 2.0, end_timestamp_seconds: 4.0},
  %{
    text: " Hello, hello, hello, hello. Исифтики ламитр золот на векijk, evanidee.",
    start_timestamp_seconds: 8.0,
    end_timestamp_seconds: 45.0
  },
  %{
    text: " Is it easy? Wszyscy nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w tym, że nie są w Yeah.",
    start_timestamp_seconds: 47.0,
    end_timestamp_seconds: 87.0
  },
  %{
    text: " No, no.",
    start_timestamp_seconds: 89.0,
    end_timestamp_seconds: 93.0
  },
  %{
    text: " Just raw chicken.",
    start_timestamp_seconds: 95.0,
    end_timestamp_seconds: 98.0
  },
  %{
    text: " Yeah. I mean, I have protein. I put it in yogurt, but that's it.",
    start_timestamp_seconds: 101.0,
    end_timestamp_seconds: 110.8
  },
  %{
    text: " It's good.",
    start_timestamp_seconds: 110.8,
    end_timestamp_seconds: 111.8
  },
  %{
    text: " Bye-bye.",
    start_timestamp_seconds: 111.8,
    end_timestamp_seconds: 118.8
  }
]

jonatanklosko commented 1 week ago

I'm pretty sure the hallucination happens because some of the chunks are basically silent. Perhaps we could incorporate some silence detection into the current algorithm. I will open an issue to explore this in the future.
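
One simple shape such silence detection could take (an illustrative assumption, not a description of what Bumblebee might actually adopt) is an RMS-energy gate that drops near-silent chunks before they ever reach the model:

```python
# Minimal RMS-energy silence gate over fixed-length chunks of PCM samples.
# The threshold is an arbitrary illustrative value; real recordings would
# need tuning, and noisy audio would need something smarter than raw RMS.
import math

def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0

def non_silent_chunks(samples, chunk_len, threshold=0.01):
    for i in range(0, len(samples), chunk_len):
        chunk = samples[i:i + chunk_len]
        if rms(chunk) >= threshold:
            yield i, chunk

speech = [0.2, -0.3, 0.25, -0.2]
silence = [0.0001, -0.0002, 0.0001, 0.0]
kept = list(non_silent_chunks(speech + silence, 4))
print(len(kept))
```

Skipping such chunks would deny the decoder the all-silence inputs on which it tends to free-run and hallucinate.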