Open ChristofHenkel opened 6 years ago
No, I mean some files in the test data are more or less pure silence. I guess you can just "hard" predict them as silence with sum(abs(np.asarray(wav))) < T,
where T is some silence threshold (like 10000).
Or what do you think?
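A minimal sketch of the hard threshold idea above, assuming `wav` holds 16-bit PCM samples. The function name `is_probably_silence` and the two test signals are made up for illustration, and 10000 is just the example value from the comment, not a tuned number. The samples are widened to int64 before summing, since int16 arithmetic can overflow.

```python
import numpy as np

def is_probably_silence(wav, threshold=10000):
    """Return True if the summed absolute amplitude is below threshold."""
    # Widen first: abs() and sums on int16 values can overflow.
    samples = np.asarray(wav, dtype=np.int64)
    return int(np.abs(samples).sum()) < threshold

# Hypothetical one-second clips at 16 kHz, for illustration only.
quiet = np.zeros(16000, dtype=np.int16)
tone = (3000 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)).astype(np.int16)

print(is_probably_silence(quiet))
print(is_probably_silence(tone))
```

Note that a summed amplitude grows with clip length, so a threshold chosen this way only makes sense when all clips have roughly the same duration.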
Oh I see... That's actually a good idea. Assuming the data in a silence wav file is fairly constant, the coefficient values of the delta or delta-delta MFCCs will be far smaller than for wav files that contain speech. Based on that, we can pick a good threshold to separate them. If necessary, we can also use it as a soft filter with some weight and combine the result with the NN output. Let me work on this.
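One way the "soft filter with some weight" could look, as a rough sketch only. It assumes the NN outputs a probability vector with a dedicated silence class, and that the heuristic yields a silence score in [0, 1]. The name `blend_silence_score`, the `silence_index` parameter and the linear blending rule are all assumptions for illustration, not something settled in this thread.

```python
import numpy as np

def blend_silence_score(nn_probs, silence_score, silence_index, weight=0.5):
    """Blend a heuristic silence score into the NN class probabilities."""
    probs = np.asarray(nn_probs, dtype=np.float64).copy()
    # Weighted average of the heuristic score and the NN's silence probability.
    blended = weight * silence_score + (1.0 - weight) * probs[silence_index]
    other_mass = 1.0 - probs[silence_index]
    rest = 1.0 - blended
    if other_mass > 0:
        # Rescale the non-silence classes so the vector still sums to 1.
        probs *= rest / other_mass
    else:
        # NN put everything on silence: spread the remainder uniformly.
        probs[:] = rest / (len(probs) - 1)
    probs[silence_index] = blended
    return probs

# Hypothetical 3-class output where index 2 is the silence class.
example = blend_silence_score([0.7, 0.2, 0.1], silence_score=0.9, silence_index=2)
print(example)
```

With weight=0 the NN output is returned unchanged, and with weight=1 the heuristic alone decides the silence probability.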
We can start with the webrtcvad model in Python:
https://github.com/wiseman/py-webrtcvad https://www.kaggle.com/holzner/voice-activity-detection-example
Right now I use this, which gives only 50% accuracy on silent wav files but 100% accuracy on non-silent ones:
import struct

import webrtcvad

vad = webrtcvad.Vad()

def is_silence(wav, vad_mode=1, speech_portion_threshold=0.3, window_duration=0.03):
    """Return True if the share of frames flagged as speech is below the threshold."""
    vad.set_mode(vad_mode)  # set aggressiveness from 0 to 3
    # wav is expected to hold 16-bit PCM samples recorded at 16 kHz
    raw_samples = struct.pack("%dh" % len(wav), *wav)
    samples_per_window = int(window_duration * 16000 + 0.5)
    bytes_per_sample = 2
    speech_analysis = []
    for start in range(0, len(wav), samples_per_window):
        stop = min(start + samples_per_window, len(wav))
        if stop - start not in (160, 320, 480):
            continue  # webrtcvad only accepts 10, 20 or 30 ms frames
        is_speech = vad.is_speech(raw_samples[start * bytes_per_sample: stop * bytes_per_sample],
                                  sample_rate=16000)
        speech_analysis.append(is_speech)
    if not speech_analysis:
        return True  # no usable frame, treat the clip as silence
    speech_port = speech_analysis.count(True) / len(speech_analysis)
    return speech_port < speech_portion_threshold
I've implemented simple silence detection based on threshold cuts on the amplitude envelope, autocorrelation and zero crossings, but the accuracy is lower than with webrtcvad.
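For reference, a rough sketch of what two of those features might look like (amplitude envelope and zero-crossing rate, autocorrelation left out). This is not the implementation mentioned above; the function names, the 480-sample frame length and both thresholds are assumptions picked for illustration.

```python
import numpy as np

def frame_features(wav, frame_len=480):
    """Return (peak envelope, zero-crossing rate) for each full frame."""
    samples = np.asarray(wav, dtype=np.float64)
    n_frames = len(samples) // frame_len  # drop the incomplete tail
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    envelope = np.abs(frames).max(axis=1)
    signs = np.signbit(frames)
    # Fraction of neighbouring sample pairs whose sign differs.
    zcr = (signs[:, 1:] != signs[:, :-1]).mean(axis=1)
    return envelope, zcr

def is_silence_by_envelope(wav, amp_threshold=500.0, max_active_fraction=0.1):
    """Call a clip silent when few frames exceed the amplitude threshold."""
    envelope, _ = frame_features(wav)
    if len(envelope) == 0:
        return True
    return float((envelope > amp_threshold).mean()) < max_active_fraction

# Hypothetical test signals: one second of zeros and a 440 Hz tone at 16 kHz.
quiet = np.zeros(16000)
tone = 3000.0 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(is_silence_by_envelope(quiet), is_silence_by_envelope(tone))
```

The zero-crossing rate is returned but not used in the decision here; it could be thresholded the same way to separate noisy frames from voiced ones.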
I tried another VAD from https://github.com/marsbroshok/VAD-python: 90% accuracy on the hand-labeled silence data, but sadly only 60% on the non-silence data.
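To compare detectors on equal terms, the per-class accuracies quoted in this thread could be computed along these lines. This is a sketch that assumes a list of clips with hand-made silence labels; `per_class_accuracy` and the toy detector are hypothetical names, and the tiny clips are made-up data.

```python
def per_class_accuracy(detector, clips, labels):
    """Accuracy of detector(clip) -> bool, split by the hand label.

    labels[i] is True when clips[i] was labeled as silence.
    """
    correct = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for clip, label in zip(clips, labels):
        label = bool(label)
        total[label] += 1
        if bool(detector(clip)) == label:
            correct[label] += 1
    return {
        "silence": correct[True] / total[True] if total[True] else None,
        "non_silence": correct[False] / total[False] if total[False] else None,
    }

# Toy detector and made-up clips, for illustration only.
toy_detector = lambda clip: max(abs(s) for s in clip) < 10
clips = [[0, 0], [1, 2], [100, 5], [0, 50]]
labels = [True, True, False, True]
result = per_class_accuracy(toy_detector, clips, labels)
print(result)
```

Any of the detectors discussed here (for example `is_silence` above) can be passed in as `detector`.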
@ChristofHenkel how do you define the threshold of silence (T = 10000)? And what about speech_portion_threshold = 0.3, how do you define it?
Do you mean detecting the silent part of a recording (for example, silence at the beginning) and then cutting it from the wav file? Or deciding, based on the full length of a recording, whether it contains only silence or not?
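To make the two readings concrete, here is a small sketch of both, assuming a plain per-sample amplitude threshold. `trim_edges`, `is_all_silence` and the threshold of 500 are hypothetical choices for illustration, not something proposed earlier in the thread.

```python
import numpy as np

def trim_edges(wav, amp_threshold=500):
    """Reading 1: cut leading and trailing silence from a clip."""
    samples = np.asarray(wav)
    # Widen before abs() so int16 values cannot overflow.
    active = np.flatnonzero(np.abs(samples.astype(np.int64)) > amp_threshold)
    if active.size == 0:
        return samples[:0]  # nothing above the threshold
    return samples[active[0]:active[-1] + 1]

def is_all_silence(wav, amp_threshold=500):
    """Reading 2: decide for the whole clip whether it is only silence."""
    return len(trim_edges(wav, amp_threshold)) == 0

# Made-up clip: 100 silent samples, 50 loud ones, 100 silent samples.
clip = np.concatenate([
    np.zeros(100, dtype=np.int16),
    np.full(50, 2000, dtype=np.int16),
    np.zeros(100, dtype=np.int16),
])
print(len(trim_edges(clip)), is_all_silence(clip))
```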