ChristofHenkel / speech_recognition

hard code pure silence detection and delete from training #3

Open ChristofHenkel opened 6 years ago

mochanz commented 6 years ago

Do you mean detecting the silent part of a recording (for example, silence at the beginning) and then cutting it from the wav file? Or deciding, based on the full length of a recording, whether it contains only silence?

ChristofHenkel commented 6 years ago

No, I mean that some files in the test data are more or less pure silence. I guess you can just "hard" predict them as silence with sum(abs(np.asarray(wav))) < T, where T is some silence threshold (like 10000). Or what do you think?
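
A minimal sketch of the hard rule proposed above, assuming `wav` holds 16-bit PCM samples. The function name `is_pure_silence` is hypothetical, and the default threshold is only the value floated in the comment, not a tuned one:

```python
import numpy as np

def is_pure_silence(wav, threshold=10000):
    """Return True if the summed absolute amplitude is below `threshold`.

    Assumes `wav` is a sequence of 16-bit PCM samples. The function name
    and the default threshold are illustrative, not tuned values.
    """
    # Cast to int64 so that abs(-32768) and the running sum cannot
    # overflow the int16 range.
    samples = np.asarray(wav, dtype=np.int64)
    return bool(np.abs(samples).sum() < threshold)
```

Because the sum grows with clip length, a threshold chosen this way only transfers between clips of the same duration.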

mochanz commented 6 years ago

Oh I see... That's actually a good idea. Assuming the data in a silent wav file is fairly constant, the delta or delta-delta MFCC coefficients will be far smaller than for wav files that contain speech. Based on that, we can guess a good threshold to separate them. If necessary, we can also use it as a soft filter with some weight and combine the result with the NN output. Let me work on this.
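
One way the "soft filter with some weight" idea could look, as a sketch only. It assumes the network outputs a probability vector and that some heuristic yields a silence confidence in [0, 1]. The function name, its parameters, and the blending rule are all hypothetical and not taken from the repository:

```python
import numpy as np

def blend_with_silence_score(nn_probs, silence_score, silence_index, weight=0.5):
    """Shift probability mass toward the silence class.

    nn_probs:      1-D class probabilities from the network (sums to 1).
    silence_score: heuristic confidence in [0, 1] that the clip is silent.
    silence_index: position of the silence class in nn_probs.
    weight:        how much the heuristic is trusted, in [0, 1].
    All names and the blending rule are illustrative assumptions.
    """
    probs = np.asarray(nn_probs, dtype=float)
    alpha = weight * float(np.clip(silence_score, 0.0, 1.0))
    one_hot = np.zeros_like(probs)
    one_hot[silence_index] = 1.0
    # Convex combination of the two distributions, so the result
    # still sums to 1.
    return (1.0 - alpha) * probs + alpha * one_hot
```

With `weight=0` the network output passes through unchanged; with `weight=1` and a silence score of 1 the heuristic overrides the network entirely.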

ChristofHenkel commented 6 years ago

we can start with the webrtcvad model in python

https://github.com/wiseman/py-webrtcvad
https://www.kaggle.com/holzner/voice-activity-detection-example

ChristofHenkel commented 6 years ago

Right now I use this, which gives only 50% accuracy on silent wav files but 100% accuracy on non-silent ones:

import struct

import numpy as np
import webrtcvad

vad = webrtcvad.Vad()

def is_silence(wav, vad_mode = 1, speech_portion_threshold = 0.3, window_duration = 0.03):
    # wav: sequence of 16-bit PCM samples recorded at 16 kHz
    vad.set_mode(vad_mode)   # set aggressiveness from 0 to 3
    raw_samples = struct.pack("%dh" % len(wav), *wav)
    samples_per_window = int(window_duration * 16000 + 0.5)
    bytes_per_sample = 2
    speech_analysis = []
    # webrtcvad accepts only 10, 20 or 30 ms frames, so a trailing partial window is skipped
    for start in range(0, len(wav) - samples_per_window + 1, samples_per_window):
        stop = start + samples_per_window
        is_speech = vad.is_speech(raw_samples[start * bytes_per_sample: stop * bytes_per_sample],
                                  sample_rate=16000)
        speech_analysis.append(is_speech)

    if not speech_analysis:   # clip shorter than one window: treated as silence here
        return True
    # fraction of windows that the VAD flagged as speech
    speech_port = speech_analysis.count(True)/len(speech_analysis)
    return speech_port < speech_portion_threshold

mochanz commented 6 years ago

I've implemented simple silence detection based on threshold cuts on the amplitude envelope, autocorrelation, and zero crossings, but the accuracy is lower than with webrtcvad.
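
The implementation mentioned above is not shown in the thread. As a rough illustration of two of the named features, the sketch below computes per-frame RMS amplitude (a simple stand-in for the amplitude envelope) and zero-crossing rate. The function name `frame_features` and the 480-sample frame (30 ms at 16 kHz) are assumptions:

```python
import numpy as np

def frame_features(wav, frame_len=480):
    """Per-frame RMS amplitude and zero-crossing rate.

    frame_len=480 corresponds to 30 ms at 16 kHz (an assumption).
    Trailing samples that do not fill a whole frame are dropped.
    """
    samples = np.asarray(wav, dtype=np.float64)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square amplitude of each frame.
    rms = np.sqrt((frames ** 2).mean(axis=1))
    # A zero crossing is a sign change between consecutive samples;
    # the rate is the fraction of adjacent pairs that change sign.
    signs = np.signbit(frames)
    zcr = (signs[:, 1:] != signs[:, :-1]).mean(axis=1)
    return rms, zcr
```

A threshold cut would then compare, for example, the mean or maximum of `rms` over the clip against a cutoff chosen on labeled data.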

mochanz commented 6 years ago

I tried another VAD from https://github.com/marsbroshok/VAD-python: 90% accuracy on hand-labeled silence data, but sadly only 60% on the non-silence data.

Arjola commented 4 years ago

@ChristofHenkel how did you choose the silence threshold (T = 10000)? And what about speech_portion_threshold = 0.3, how did you choose that?
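
The thread does not say how these values were chosen. One common approach, sketched here under the assumption that a small hand-labeled set of clips is available, is to sweep candidate thresholds and keep the one with the best balanced accuracy. The function name `pick_threshold` and its parameters are hypothetical:

```python
import numpy as np

def pick_threshold(scores, is_silent, candidates):
    """Return (threshold, balanced accuracy) for the best candidate.

    scores:     one number per clip, lower meaning quieter.
    is_silent:  hand labels, True for silent clips.
    candidates: thresholds to try; a clip is predicted silent when its
                score is below the threshold.
    Assumes both silent and non-silent clips are present in the labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_silent, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in candidates:
        pred = scores < t
        # Average the accuracy on silent and non-silent clips so that
        # the larger class does not dominate.
        acc_silent = pred[labels].mean()
        acc_speech = (~pred[~labels]).mean()
        balanced = 0.5 * (acc_silent + acc_speech)
        if balanced > best_acc:
            best_t, best_acc = t, balanced
    return best_t, best_acc
```

The same sweep could be applied to the summed-amplitude score or to the fraction of windows that a VAD flags as speech.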