jmgirard opened 5 months ago
Yes, something like that. I've made a generic function `is.voiced`, which is part of audio.vadwebrtc and will also work on the output of silero. It allows you to be a bit more liberal on small sections, e.g. consider non-voiced segments smaller than 1 sec as voiced: we can be liberal when identifying voiced segments and only need to remove the larger chunks of silence to exclude non-voiced hallucinations. See https://github.com/bnosac/audio.vadwebrtc/blob/0ca8192268b74f37a5b068775bad2f596cd339d5/R/vad.R#L133
Examples at https://github.com/bnosac/audio.whisper/blob/cbf7c004495f6a4682a494cc5b560057e72e2991/R/whisper.R#L216
These sections or offset/duration values can come from a VAD model or from the result of `is.voiced`.
I also tend to prefer the `offset`/`duration` arguments over the `sections` argument, as transcription tends to be able to recover in a new section if a previous section had repetitions.
Probably the VAD can be used for better diarization as well, if we do VAD by channel and see which section in the transcription corresponds to voiced elements as detected by the VAD.
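The gap-merging idea described above ("consider non-voiced segments smaller than 1 sec as voiced") can be sketched in a few lines of base R. This is a hypothetical helper for illustration only, not the actual `is.voiced()` implementation, and the column names (`start`, `end`, in seconds) are assumptions:

```r
# Sketch: absorb non-voiced gaps shorter than `silence_min` seconds into the
# preceding voiced segment, so only the larger silences are dropped.
# Hypothetical helper; the real logic lives in audio.vadwebrtc::is.voiced().
merge_short_silences <- function(segments, silence_min = 1) {
  # segments: data.frame with columns start, end (seconds), ordered by start
  merged <- segments[1, , drop = FALSE]
  for (i in seq_len(nrow(segments))[-1]) {
    gap <- segments$start[i] - merged$end[nrow(merged)]
    if (gap < silence_min) {
      # Small gap: extend the previous voiced segment across it
      merged$end[nrow(merged)] <- segments$end[i]
    } else {
      merged <- rbind(merged, segments[i, , drop = FALSE])
    }
  }
  merged
}

voiced <- data.frame(start = c(0, 2.2, 11), end = c(2, 9.5, 12))
merge_short_silences(voiced, silence_min = 1)
# The 0.2 s gap is absorbed; two segments remain: (0, 9.5) and (11, 12)
```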
Ok, good to know. I'll try `offset` and `duration` instead.
Hmm... `offset` and `duration` seem to be running a lot slower than without them. Has that been your experience too?
For this you need to understand that whisper runs in chunks of 30 seconds. The behaviour is different for the two arguments.
Feel free to provide feedback how the transcription works on your audio.
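A toy cost model is consistent with the slowdown reported below. The internals here are my assumption, not taken from the package: with `sections` the non-voiced audio is removed and the remaining audio is transcribed in one pass, while with `offset`/`duration` each segment is transcribed separately and pays for at least one full 30-second window:

```r
# Toy cost model (assumption): whisper processes audio in 30-second windows.
chunk <- 30
seg_durations <- c(4, 12, 7, 25, 3, 18)  # example voiced segments, in seconds

# `sections`: non-voiced audio removed, one pass over the remaining audio
windows_sections <- ceiling(sum(seg_durations) / chunk)

# `offset`/`duration`: one run per segment, each costing at least one window
windows_offsets <- sum(ceiling(seg_durations / chunk))

windows_sections  # 3 windows for 69 s of concatenated voiced audio
windows_offsets   # 6 windows: every short segment pays a full window
```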
Gotcha. In my use case, `sections` took about 20 minutes to run one file, whereas `offset`/`duration` took several hours. The output for `sections` looks good so far, but I'll do a more thorough check once more files are processed.
It would be good if you can test whether the timepoints in the output are ok when using `sections`.
Regarding speed, that's normal. With `sections` you basically remove the non-voiced audio, so it will be faster than transcribing the full audio file. You probably fed the VAD output directly in there, but the VAD can produce many small chunks, and it makes sense to combine these a bit. The function `is.voiced`, which is part of audio.vadwebrtc (and works on the output of silero as well), combines them into somewhat larger chunks: https://github.com/bnosac/audio.vadwebrtc/blob/master/R/vad.R#L126-L184
Ok, `is.voiced()` is useful. Trying this now:
```r
convert_silero <- function(vadfile, smin = 500, vmin = 500) {
  # Extract segment information from a saved silero VAD result
  sections <- audio.vadwebrtc::is.voiced(
    readRDS(vadfile),
    units = "milliseconds",
    silence_min = smin,
    voiced_min = vmin
  )
  # Drop non-voiced segments
  out <- sections[sections$has_voice == TRUE, ]
  # Format output
  out <- out[, c("start", "duration")]
  return(out)
}
```
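For a quick sanity check without the file I/O, here is what the filtering and formatting steps above do on a mock data frame shaped like the `is.voiced()` output discussed in the thread (the exact columns returned by the package are an assumption inferred from the code):

```r
# Mock of an is.voiced()-style result; column names (has_voice, start,
# duration in milliseconds) are inferred from the thread, not the package docs.
mock <- data.frame(
  has_voice = c(TRUE, FALSE, TRUE),
  start     = c(0, 5200, 9100),
  duration  = c(5200, 3900, 4400)
)

# Same steps as convert_silero() after the is.voiced() call
out <- mock[mock$has_voice == TRUE, c("start", "duration")]
out
# Only the two voiced rows remain, with just start and duration columns
```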
```r
transcribe_file <- function(infile, outfile, vadfile, approach = "offsets", ...) {
  approach <- match.arg(approach, choices = c("sections", "offsets"), several.ok = FALSE)
  if (file.exists(outfile)) {
    return("skipped")
  }
  vad_segments <- convert_silero(vadfile, ...)
  # `model` is assumed to be a whisper model loaded in the calling environment
  switch(approach,
    sections = {
      transcript <- predict(
        model,
        infile,
        type = "transcribe",
        language = "en",
        n_threads = 1,
        n_processors = 1,
        sections = vad_segments,
        trace = FALSE
      )
    },
    offsets = {
      transcript <- predict(
        model,
        infile,
        type = "transcribe",
        language = "en",
        n_threads = 1,
        n_processors = 1,
        offset = vad_segments$start,
        duration = vad_segments$duration,
        trace = FALSE
      )
    }
  )
  saveRDS(transcript, file = outfile, compress = "gzip")
  return("created")
}
```
> Probably the VAD can be used for better diarization as well if we do VAD by channel and see which section in the transcription corresponds to voiced elements as detected by the VAD
Added `predict.whisper_transcription` for this in version 0.4.1.
Does this look correct for what `predict.whisper()` is looking for in its `sections` argument?