Open orionflame opened 7 months ago
Hi @orionflame, did you by any chance find a solution to your question?
Unfortunately, no. Do you have any leads?
I was wondering whether the 527-class AudioSet label set can be distilled to a much smaller list of events, say fewer than 10, to be used with this method:
audio_tag_result = whisper.parse_at_label(result, language='follow_asr', top_k=5, p_threshold=-1, include_class_list=list(range(527)))
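One way to shrink the list might be to pass only the indices of the classes of interest as `include_class_list`. The sketch below is a guess at how to build that list from label names. `class_indices` and `label_names` are hypothetical names, not part of the library, and `label_names` is assumed to hold the 527 AudioSet display names in the same order the model uses.

```python
def class_indices(wanted, label_names):
    """Return the sorted positions of the `wanted` labels inside `label_names`.

    Assumption: `label_names` lists the AudioSet display names in the
    model's class order, so a position in the list equals the class index.
    """
    lookup = {name: i for i, name in enumerate(label_names)}
    missing = [name for name in wanted if name not in lookup]
    if missing:
        # Fail loudly instead of silently dropping a misspelled label.
        raise KeyError(f"labels not found: {missing}")
    return sorted(lookup[name] for name in wanted)
```

The returned list would then replace `list(range(527))` in the call above, assuming the parameter restricts the reported tags to those indices.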
Hi,
I recorded a lot of narration myself for a tutorial, and I am trying to clean up the audio files by removing everything that is not speech, which is mostly throat clearing and similar noises. Here is a very short sample:
https://www.dropbox.com/scl/fi/kotmse874x4rsi86kr8f8/voice3.mp3?rlkey=l5m56g5axort1ru70goo3rvch&dl=1
I couldn't install this library locally yet because of some dependency errors, so I used the Hugging Face version (time res = 1.6) and got this:
0.0s-6.9s: pretty much everything you could want that occur around the normal vector not
6.9s-13.3s: along it. Keenan Crane is one of the leading
13.3s-17.2s: researchers in computational geometry.
The first thing I noticed is that I said "Keenan" three times in the recording. Those were retakes, so only the last one should remain, yet the transcript shows the word once. You can hear the retakes in the audio. Is this library also de-duplicating words?
For tags I got these:
0.0s-1.6s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
1.6s-3.2s: Speech, Narration, monologue, Speech synthesizer, Clicking, Male speech, man speaking
3.2s-4.8s: Speech, Inside, small room, Clicking, Speech synthesizer, Narration, monologue
4.8s-6.4s: Speech, Narration, monologue, Speech synthesizer, Male speech, man speaking
6.4s-8.0s: Speech, Narration, monologue, Clicking, Speech synthesizer, Inside, small room
8.0s-9.6s: Speech, Clicking, Inside, small room
9.6s-11.2s: Speech, Clicking, Inside, small room, Narration, monologue, Male speech, man speaking
11.2s-12.8s: Speech, Speech synthesizer
12.8s-14.4s: Sine wave
14.4s-16.0s: Sine wave, Hum, Chime, White noise, Boiling
How can I use these tags to keep only the speech? I already wrote code that uses the word timestamps to mute everything between words. I tried plain Whisper, but it still kept the coughing and throat-clearing parts.
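One possible approach, sketched under assumptions: treat each tagged window as a `(start, end, labels)` tuple, keep windows that carry a speech label and no unwanted label, and merge neighbouring windows into intervals. The tuple layout, the function name `speech_intervals`, and the choice of `KEEP` and `REJECT` labels are illustrative, not something the library provides. Labels are passed as a list because some AudioSet names contain commas themselves.

```python
# Assumed label sets; adjust to whatever the tagger actually reports.
KEEP = {"Speech", "Narration, monologue", "Male speech, man speaking"}
REJECT = {"Cough", "Throat clearing"}

def speech_intervals(segments, keep=KEEP, reject=REJECT):
    """Merge consecutive tagged windows that look like clean speech.

    `segments` is an iterable of (start, end, labels) tuples in time order.
    A window is kept when it has at least one `keep` label and no
    `reject` label. Touching kept windows are merged into one interval.
    """
    intervals = []
    for start, end, labels in segments:
        tags = set(labels)
        if not (tags & keep) or (tags & reject):
            continue
        if intervals and abs(intervals[-1][1] - start) < 1e-6:
            # Window starts where the previous kept one ended: extend it.
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals
```

With the tag windows listed above, this would keep one interval covering the windows tagged Speech and drop the trailing Sine wave windows. At a 1.6 s resolution a short throat clear inside a spoken window cannot be isolated this way, so the intervals are probably best combined with the word timestamps rather than used alone.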
I tried whisperHallu, but that also had issues: it cut some words off halfway.
All I need is to keep only the speech parts. After that I will have to figure out a way to remove retakes. Sometimes a retake is a single word and sometimes it is half a sentence repeated several times, but it is always the last take that should be kept.
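For the retakes, a rough starting point could be to look for a run of words that is immediately repeated in the word-level transcript and drop every copy except the last. The sketch assumes words arrive as `(text, start, end)` tuples. `drop_retakes` and `max_len` are made-up names, and the comparison ignores case and trailing punctuation.

```python
def _norm(text):
    # Compare words case-insensitively and without surrounding punctuation.
    return text.lower().strip(".,!?;:\"'")

def drop_retakes(words, max_len=8):
    """Remove earlier copies of immediately repeated word runs.

    `words` is a list of (text, start, end) tuples in time order.
    When a run of up to `max_len` words is followed directly by the same
    run, the first copy is deleted, so the last take survives along with
    its timestamps.
    """
    out = list(words)
    i = 0
    while i < len(out):
        removed = False
        # Try the longest possible repeated run first.
        for n in range(min(max_len, (len(out) - i) // 2), 0, -1):
            first = [_norm(w[0]) for w in out[i:i + n]]
            second = [_norm(w[0]) for w in out[i + n:i + 2 * n]]
            if first == second:
                del out[i:i + n]  # drop the earlier take
                removed = True
                break
        if not removed:
            i += 1
    return out
```

This only catches exact, back-to-back repeats. A retake that restarts mid-phrase with different wording would slip through, and a genuine repetition such as "very very" would be removed, so the result likely needs a manual check.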
Any ideas?