MycroftAI / mimic-recording-studio

Mimic Recording Studio is a Docker-based application you can install to record voice samples, which can then be trained into a TTS voice with Mimic2.
Apache License 2.0

Provide 0.3s buffer for silence trimming #47

Closed: krisgesling closed this 3 years ago

krisgesling commented 3 years ago

Description

We want to trim the silence from recordings but risk cutting off the first and last syllables, particularly if they are softer sounds.

This adds a reasonable-length buffer (0.3 s) for data collection. During post-processing the recordings can be trimmed further if desired.

Fixes #35

Type of PR

Testing

Record samples and ensure the first sound is never cut off. You can also tweak the trim_buffer variable to see that it works, e.g. setting it to 1000 leaves 1 second of silence at the start and end of each recording.
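
The buffered trim described above can be sketched as plain arithmetic on millisecond offsets. This is a minimal illustration, not the code in this PR: the function name `buffered_trim_bounds` and its parameters are hypothetical, and only `trim_buffer` (in milliseconds) comes from the description. It assumes the lengths of leading and trailing silence have already been detected.

```python
def buffered_trim_bounds(duration_ms, leading_silence_ms, trailing_silence_ms,
                         trim_buffer=300):
    """Return (start, end) slice offsets in ms that keep up to `trim_buffer` ms
    of silence on each side of the detected speech.

    All names except `trim_buffer` are hypothetical, for illustration only.
    """
    # Step back from the first detected sound, but never before the file start.
    start = max(0, leading_silence_ms - trim_buffer)
    # Step forward from the last detected sound, but never past the file end.
    end = min(duration_ms, duration_ms - trailing_silence_ms + trim_buffer)
    return start, end


# A 5 s recording with 0.8 s of leading and 1.2 s of trailing silence.
print(buffered_trim_bounds(5000, 800, 1200))        # (500, 4100)
# With trim_buffer=1000 the buffer exceeds the leading silence, so start clamps to 0.
print(buffered_trim_bounds(5000, 800, 1200, 1000))  # (0, 4800)
```

Clamping means a recording that contains less silence than the buffer simply keeps all the silence it has, rather than producing a negative offset.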

krisgesling commented 3 years ago

@el-tocino - do you think 0.3s is too much of a buffer?

My worry is not providing enough of a buffer and ending up with the same issue, whereas if we are too conservative you still have the raw data and can re-process it.

krisgesling commented 3 years ago

I was just thinking that it would probably be better to retain both the raw recordings and the trimmed versions, but this will take a little more refactoring.

amoljagadambe commented 3 years ago
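    @staticmethod
    def trim_silence(path: str) -> AudioSegment:
        # AudioSegment is pydub's audio container; slicing it uses milliseconds.
        sound = AudioSegment.from_wav(path + ".wav")
        # Length of silence at the start and, via the reversed clip, at the end.
        start_trim = Audio._detect_leading_silence(sound)
        end_trim = Audio._detect_leading_silence(sound.reverse())
        duration = len(sound)
        # Keep half of the detected silence on each side as a buffer.
        # The end offset is measured from the start of the clip, so only
        # end_trim is halved.
        trimmed_sound = sound[int(start_trim / 2):int(duration - end_trim / 2)]
        return trimmed_sound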

I used this approach rather than trimming aggressively. It keeps half of the detected silence as a buffer.

Edit: just fixed the code formatting - Kris
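
To make the half-silence buffer concrete, here is a small worked sketch with hypothetical numbers. The helper name `half_silence_bounds` is an assumption for illustration. The end offset is measured from the start of the file, so half of the trailing silence is subtracted from the full duration.

```python
def half_silence_bounds(duration_ms, start_trim, end_trim):
    """Keep half of the detected silence at each end (hypothetical helper)."""
    start = start_trim // 2            # drop only the first half of the leading silence
    end = duration_ms - end_trim // 2  # drop only the last half of the trailing silence
    return start, end


# 5 s recording, 0.8 s leading silence, 1.2 s trailing silence.
print(half_silence_bounds(5000, 800, 1200))  # (400, 4400)
```

Unlike a fixed buffer, the amount of silence retained here varies with how long the speaker paused before and after the utterance.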

el-tocino commented 3 years ago

I would err on the side of caution in this case, as we've seen that trimming too aggressively will definitely cause problems. At the other extreme, too much silence also causes training issues, but retaining valid data is the more important need here. Additionally, Coqui has do_trim_silence, which can be set to trim files. Personally I would lean towards a fixed amount of silence: trimming things afterwards would be a much simpler task knowing that I had N amount of time that would be silent. Automated tools are getting better at this, so it's not a big issue if a variable amount is preferred.

krisgesling commented 3 years ago

Hey @amoljagadambe I'm going to merge this simple fix for the moment. It looks like you're doing some interesting additions and are using it for slightly different use cases, so we can keep looking at those but I want to ensure people using it now aren't getting audio cut off.

amoljagadambe commented 3 years ago

np, @krisgesling. I have another approach in mind for this, but it will require rewriting the audio.py file to use librosa, NumPy, and pandas. The approach goes like this: we will also trim the silence in between, using sliding windows over the fftfreq representation of the audio, with a moderate setting for the noise ratio.
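
A rough sketch of the sliding-window idea, under loud assumptions: it uses time-domain RMS energy per window in place of the FFT-based analysis mentioned above, and plain Python lists in place of librosa, NumPy, and pandas. The function names, window sizes, and threshold are all hypothetical and are not part of audio.py.

```python
import math


def window_rms(samples, window, hop):
    """RMS energy of each sliding window over a list of float samples."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + window]) / window)
        for i in range(0, len(samples) - window + 1, hop)
    ]


def silent_spans(samples, window=4, hop=4, threshold=0.05):
    """Return (start, end) sample-index pairs covering runs of quiet windows."""
    spans = []
    current_start = None
    current_end = None
    for idx, rms in enumerate(window_rms(samples, window, hop)):
        pos = idx * hop
        if rms < threshold:
            # Start a new quiet run, or extend the one in progress.
            if current_start is None:
                current_start = pos
            current_end = pos + window
        elif current_start is not None:
            # A loud window closes the quiet run.
            spans.append((current_start, current_end))
            current_start = None
    if current_start is not None:
        spans.append((current_start, current_end))
    return spans


# Toy signal: silence, a burst of sound, a mid-utterance pause, more sound, silence.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8 + [0.4, -0.4] * 4 + [0.0] * 4
print(silent_spans(signal))  # [(0, 8), (16, 24), (32, 36)]
```

The middle span is the in-between silence; a real implementation would still need to decide how much of each span to keep so that natural pauses are not removed entirely.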

amoljagadambe commented 3 years ago

Also, I am using the above fix to record 13300 dictionary audios, so I can assure you this fix will work quite well.