YuanGongND / whisper-at

Code and Pretrained Models for Interspeech 2023 Paper "Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong Audio Event Taggers"
BSD 2-Clause "Simplified" License
Can this be used to mute non speech parts of an audio? #27

Open orionflame opened 7 months ago

orionflame commented 7 months ago

Hi,

I recorded a lot of narration for a tutorial I made, and I am trying to clean up the audio files by removing everything that is not speech, which is mostly throat clearing and similar noises. Here is a very short sample:

https://www.dropbox.com/scl/fi/kotmse874x4rsi86kr8f8/voice3.mp3?rlkey=l5m56g5axort1ru70goo3rvch&dl=1

I haven't been able to install this library locally yet because of some dependency errors, so I used the Hugging Face version (time res = 1.6) and got this:

0.0s-6.9s: pretty much everything you could want that occur around the normal vector not
6.9s-13.3s: along it. Keenan Crane is one of the leading
13.3s-17.2s: researchers in computational geometry.

The first thing I noticed is that I said "Keenan" three times in the recording. The first two were retakes, so normally only the last one should remain, and only one shows up in the transcript. You can hear the retakes in the audio. Does this library also de-duplicate words?

For tags I got these:

0.0s-1.6s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
1.6s-3.2s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
3.2s-4.8s: Speech, Inside, small room, Clicking, Speech synthesizer, Narration, monologue
4.8s-6.4s: Speech, Narration, monologue, Speech synthesizer, Male speech, man speaking
6.4s-8.0s: Speech, Narration, monologue, Clicking, Speech synthesizer, Inside, small room
8.0s-9.6s: Speech, Clicking, Inside, small room
9.6s-11.2s: Speech, Clicking, Inside, small room, Narration, monologue, Male speech, man speaking
11.2s-12.8s: Speech, Speech synthesizer
12.8s-14.4s: Sine wave
14.4s-16.0s: Sine wave, Hum, Chime, White noise, Boiling

How can I use these tags to keep only the speech? I already wrote code that uses timestamps to mute the gaps between words. I tried Whisper, but it still kept the coughing and throat-clearing parts.
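One way to turn per-window tags into a mute mask is sketched below. It is a minimal illustration, not part of whisper-at: the `(start, end, tags)` window tuples, the `SPEECH_TAGS` set, and the function names `speech_intervals` and `mute_outside` are all assumptions made for this example, and the audio is treated as a plain list of mono samples.

```python
# Tags treated as "speech". This set is an assumption; tune it for your recordings.
SPEECH_TAGS = {"Speech", "Male speech, man speaking", "Narration, monologue"}


def speech_intervals(windows, speech_tags=SPEECH_TAGS):
    """Merge back-to-back windows that carry at least one speech tag.

    `windows` is a list of (start_sec, end_sec, [tag, ...]) tuples,
    a hypothetical structure built from the tagger's per-window output.
    """
    intervals = []
    for start, end, tags in windows:
        if not speech_tags.intersection(tags):
            continue  # no speech tag in this window: leave it out
        if intervals and abs(intervals[-1][1] - start) < 1e-6:
            # This window starts where the previous kept one ended: extend it.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


def mute_outside(samples, sample_rate, intervals):
    """Return a copy of `samples` with everything outside `intervals` zeroed."""
    out = [0.0] * len(samples)
    for start, end in intervals:
        lo = max(0, int(round(start * sample_rate)))
        hi = min(len(samples), int(round(end * sample_rate)))
        out[lo:hi] = samples[lo:hi]
    return out


if __name__ == "__main__":
    windows = [
        (0.0, 1.6, ["Speech", "Clicking"]),
        (1.6, 3.2, ["Speech"]),
        (3.2, 4.8, ["Sine wave"]),
        (4.8, 6.4, ["Speech"]),
    ]
    print(speech_intervals(windows))
```

With a 1.6 s window, a short cough that overlaps speech will most likely still be tagged "Speech", as in the output above, so a mask like this can only remove stretches that contain no speech at all. Dropping windows that also carry an unwanted tag is a possible refinement, at the cost of cutting some real speech.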

I also tried whisperHallu, but it had its own issues, such as cutting some words off halfway.

All I need is to keep only the speech parts. After that, I will have to figure out a way to remove retakes. Sometimes a retake is a single word and sometimes it is half a sentence repeated several times, but it is always the last take that should be kept.
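The "keep the last take" rule can be sketched on a word-level transcript, assuming the retakes actually appear in it (the transcript above suggests the recognizer may already drop some of them). The function name `drop_retakes` and the `max_len` parameter are hypothetical, and the sketch only catches a phrase that is repeated immediately and word for word.

```python
def drop_retakes(words, max_len=8):
    """Return the indices of the words to keep.

    Whenever a run of up to `max_len` words is immediately followed by an
    identical run, the earlier copy is dropped, so the last take survives.
    """
    # Compare words case-insensitively and without trailing punctuation.
    norm = [w.lower().strip(".,!?;:") for w in words]
    keep = []
    i = 0
    while i < len(words):
        dropped = False
        longest = min(max_len, (len(words) - i) // 2)
        for n in range(longest, 0, -1):  # prefer the longest repeated phrase
            if norm[i:i + n] == norm[i + n:i + 2 * n]:
                i += n  # skip the earlier take, then re-check from the later one
                dropped = True
                break
        if not dropped:
            keep.append(i)
            i += 1
    return keep


if __name__ == "__main__":
    words = "Keenan Keenan Keenan Crane is one of one of the leading".split()
    print(" ".join(words[i] for i in drop_retakes(words)))
```

Because the function returns indices rather than text, the dropped words can be mapped back to their timestamps and muted. Intentional repetitions (for example "very very") would also be removed, and retakes that differ slightly in wording would be missed, so the result needs a manual check.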

Any ideas?

dgoryeo commented 3 months ago

Hi @orionflame, did you by any chance find a solution to your question?

orionflame commented 3 months ago

Unfortunately, no. Do you have any leads?

dgoryeo commented 3 months ago

I was wondering if one could distill the 527-class AudioSet label set down to a much smaller list of events, say fewer than 10, to be used with this method:

audio_tag_result = whisper.parse_at_label(result, language='follow_asr', top_k=5, p_threshold=-1, include_class_list=list(range(527)))
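A short class list for `include_class_list` could be built from label names instead of hand-picked indices. The sketch below assumes a label table with `index` and `display_name` columns, which is the layout of AudioSet's `class_labels_indices.csv` as far as I know. The helper name `class_indices` is hypothetical, and the rows in `TOY_CSV` are made-up stand-ins, not the real AudioSet indices.

```python
import csv
import io

# Toy label table for illustration only. These are NOT the real AudioSet
# indices; load the actual label file in practice.
TOY_CSV = """index,mid,display_name
0,/m/toy0,"Speech"
1,/m/toy1,"Male speech, man speaking"
2,/m/toy2,"Cough"
3,/m/toy3,"Throat clearing"
4,/m/toy4,"Sine wave"
"""


def class_indices(label_rows, wanted_names):
    """Return the sorted class indices whose display name is in `wanted_names`."""
    wanted = {name.lower() for name in wanted_names}
    return sorted(
        int(row["index"])
        for row in label_rows
        if row["display_name"].lower() in wanted
    )


if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
    short_list = class_indices(rows, ["Speech", "Cough", "Throat clearing"])
    print(short_list)
    # The list would then replace list(range(527)) in the call quoted above:
    # whisper.parse_at_label(result, language='follow_asr', top_k=5,
    #                        p_threshold=-1, include_class_list=short_list)
```

Whether restricting the classes this way helps separate throat clearing from speech is untested here. It only narrows which labels are reported.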