dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Apache License 2.0

/notebooks/automatic_model_training.ipynb seems to cause bad clipping of some of the resultant files #74

Closed · StuartIanNaylor closed this 1 year ago

StuartIanNaylor commented 1 year ago

I was just browsing the files and noticed some are badly clipped, which will likely affect training. I haven't drilled down, and I don't know if this example notebook is up to date, but it happens in this section, likely when the RIRs are applied:

# Imports assumed by the rest of this section
import os
from pathlib import Path

import datasets
import numpy as np
import scipy.io.wavfile
from tqdm import tqdm

# Download room impulse responses collected by MIT
# https://mcdermottlab.mit.edu/Reverb/IR_Survey.html

output_dir = "./mit_rirs"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    rir_dataset = datasets.load_dataset("davidscripka/MIT_environmental_impulse_responses", split="train", streaming=True)

    # Save clips to 16-bit PCM wav files
    for row in tqdm(rir_dataset):
        name = row['audio']['path'].split('/')[-1]
        scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))

# Free Music Archive dataset (https://github.com/mdeff/fma)
output_dir = "./fma"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
    fma_dataset = datasets.load_dataset("rudraml/fma", name="small", split="train", streaming=True)
    fma_dataset = iter(fma_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000)))

    n_hours = 1  # use only 1 hour of clips for this example notebook, recommend increasing for full-scale training
    for i in tqdm(range(n_hours*3600//30)):  # this works because the FMA dataset is all 30 second clips
        row = next(fma_dataset)
        name = row['audio']['path'].split('/')[-1].replace(".mp3", ".wav")
        scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))

output_dir = "./audioset_16k"
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

    # Convert audioset files to 16khz sample rate
    audioset_dataset = datasets.Dataset.from_dict({"audio": [str(i) for i in Path("audioset/audio").glob("**/*.flac")]})
    audioset_dataset = audioset_dataset.cast_column("audio", datasets.Audio(sampling_rate=16000))
    for row in tqdm(audioset_dataset):
        name = row['audio']['path'].split('/')[-1].replace(".flac", ".wav")
        scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, (row['audio']['array']*32767).astype(np.int16))
dscripka commented 1 year ago

Hmm, this code is just downloading and converting files, it doesn't yet apply them to the generated example clips for data augmentation.

Are the clipping issues happening with these downloaded files? Which dataset (RIRs, FMA, or Audioset) seems to have the clipped files?

StuartIanNaylor commented 1 year ago

On completion, in the audioset_16k folder: not all of them, and it's a bit of a pain finding them, but some files are heavily clipped, as if the *32767 overflowed. I'm not sure if it's just my setup, as I was pulling apart the notebook and running it locally while having that CUDA mismatch thing that always happens on Ubuntu with driver updates, likely due to an older version of TensorFlow and an RTX 3050 card. It has been strange: when I upped the quantity to 100 hours I started to see errors (I forget exactly what, but I thought it strange at the time; it came from ffmpeg or some other library). I will switch back OSes and try a cleaner install. Is there a better local install script than the notebook or Colab examples? It might be better to use the NVIDIA Docker container, and I may have a go at setting one up rather than fighting with CUDA and TF versions.

dscripka commented 1 year ago

Ah, I see. When the datasets library pulls the audio data it should all be within (-1, 1), so there shouldn't be overflow issues. But the Audioset data is very diverse (it originally came from YouTube videos), and there might be some recordings that are naturally clipped. As long as it's relatively rare it shouldn't have too much of an impact on the model training.

But it's likely a good idea to check before converting to 16-bit PCM format, so as not to add any unnecessary clipping.
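For example, something along these lines (a sketch, not code from the notebook; the to_int16_pcm helper is purely illustrative):

import numpy as np

def to_int16_pcm(audio):
    # Convert a float waveform (expected in [-1, 1]) to 16-bit PCM safely.
    peak = np.abs(audio).max()
    if peak > 1.0:
        frac = np.mean(np.abs(audio) >= 1.0)
        print(f"warning: peak {peak:.3f}, {frac:.2%} of samples at or over full scale")
    # Clip before casting: a bare (audio * 32767).astype(np.int16) lets
    # out-of-range values wrap around, which sounds like harsh distortion.
    return (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)

The write calls above would then become scipy.io.wavfile.write(os.path.join(output_dir, name), 16000, to_int16_pcm(row['audio']['array'])).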

StuartIanNaylor commented 1 year ago

I will have a look and see where it happens, or whether it's in the original audio.

StuartIanNaylor commented 1 year ago

I created an NVIDIA Docker container with PyTorch, ran the pip installs, and I think it's likely happening again. It's in a container, so browsing is less easy, but I am occasionally getting `[src/libmpg123/layer3.c:INT123_do_layer3():1841] error: dequantization failed!`, which is weird with a fresh download of all the data as well. Something isn't happy when running that part of the script. It's also very slow compared to the CLI version of sox, or even pysox, which does warn when clipping. This is now in Docker with new data, unless something quirky is going on with my PC?
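Something like this pysox sketch is what I mean; it streams file-to-file through the sox CLI (which warns about clipped samples) instead of decoding everything through the datasets library (the paths are placeholders):

from pathlib import Path

import sox  # pysox: a thin wrapper around the sox CLI

in_dir = Path("audioset/audio")   # placeholder input directory
out_dir = Path("audioset_16k")    # placeholder output directory
out_dir.mkdir(exist_ok=True)

# Resample to 16 kHz mono 16-bit; sox itself reports any clipping.
tfm = sox.Transformer()
tfm.convert(samplerate=16000, n_channels=1, bitdepth=16)

for f in in_dir.glob("**/*.flac"):
    tfm.build(str(f), str(out_dir / (f.stem + ".wav")))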

StuartIanNaylor commented 1 year ago

@dscripka I am thinking maybe my machine has some bad RAM; the dataset function pulls everything into RAM, I am guessing (I didn't check htop). The more hours I add, the more errors I seem to get, and at 100 hours it actually fails, complaining about a sound device. I get what it is doing; previously I just used sox, and I will likely just do the whole lot once with sox rather than holding the dataset in memory. I have been impressed by how well openWakeWord works and was thinking about having a look at how the internals work. I was also thinking about Silero, as my fickle mind thought: hold on, this is practically VAD (speech embedding), and a super-low-load version could be created with the positives/negatives being speech vs. data without speech (probably excluding singing as well). I was just slowly scratching the surface to give it a test, since even if it's not as accurate as Silero, the head classifications have near zero load...

For noise datasets, here are some I have used before and thought I would share:

- http://downloads.tuxfamily.org/pdsounds/pdsounds_march2009.7z
- https://zenodo.org/records/2529934
- https://github.com/karoldvl/ESC-50/archive/master.zip
- https://urbansounddataset.weebly.com/urbansound8k.html
- https://zenodo.org/records/5117901
- https://zenodo.org/records/4060432
- https://github.com/microsoft/MS-SNSD (quite good, as it has tools to mix clean & noise at a given SNR; see the sketch below)
- https://zenodo.org/records/1227121 (multi-channel)
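The core of what MS-SNSD's mixing tools do is scaling the noise so the clean/noise power ratio hits the target SNR; roughly this (my own simplified sketch, not MS-SNSD's actual code):

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Mix `noise` into `clean` at the requested signal-to-noise ratio (dB).
    noise = np.resize(noise, clean.shape)        # loop/trim noise to match length
    clean_rms = np.sqrt(np.mean(clean**2))
    noise_rms = np.sqrt(np.mean(noise**2)) + 1e-12
    # Solve 20*log10(clean_rms / (gain*noise_rms)) == snr_db for gain
    gain = clean_rms / (noise_rms * 10 ** (snr_db / 20))
    mixed = clean + gain * noise
    peak = np.abs(mixed).max()
    return mixed / peak if peak > 1.0 else mixed  # renormalize if the mix clips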

I have likely gone through and painstakingly curated a noise dataset maybe ten times over while playing with various models, and lost the results over time, and I have wondered if it's worth going full-on geek with some sort of ResNet to create a large, curated, balanced noise dataset. With MFCC-based classification models, what you pick up in the noise dataset can have a big impact on overall accuracy. Sharing a curated 'noise' dataset might be beneficial, as many existing ones focus on certain types of noise or contain voice; with credits to the sources, and rather than doing loads of work and then deleting it like I have, maybe some sort of repo would be beneficial to the community?

PS: have you ever tested, out of curiosity, whether a head model can detect, say, fire or water?

dscripka commented 1 year ago

Bad RAM could certainly cause this. I actually had very similar issues in the past (Jupyter kernels crashing, long data preprocessing scripts failing randomly, conda environments corrupting, etc.) and it turned out to be faulty RAM.

Thank you for sharing the links to noise data. This is a good list; I've also experimented with many of these datasets. On the question of whether a heavily curated noise dataset would help, I'm actually not sure, as in my testing (though admittedly not particularly rigorous testing) both curation and simple scale seem to improve performance. For example, if I use clips from the ACAV100M dataset as background noise, that also works quite well.

Not sure what you mean by detect "fire" or "water"? Do you mean creating models based on those spoken words?

StuartIanNaylor commented 1 year ago

Maybe it's my RAM (32 GB), but I never noticed it before. I will probably just sox them once.

I was just wondering, since we are talking embeddings: if we had samples of water / fire in a dataset, could you have a head model for fire / water noises? Likewise, with the right dataset where the negatives are noise and the positives are speech (with no singing), that would likely make a great lite head VAD as an alternative to Silero.
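Roughly what I have in mind for such a head, assuming embeddings have already been extracted from the shared speech embedding model (the embeddings.npy / labels.npy files here are hypothetical placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical precomputed arrays: one embedding vector per clip, with
# label 1 for "water" sounds and 0 for everything else.
X = np.load("embeddings.npy")   # shape (n_clips, embedding_dim)
y = np.load("labels.npy")       # shape (n_clips,)

# A head this small adds near zero load at inference time.
head = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", head.score(X, y))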

There is a filter that would probably make a good candidate for the hardware these things seem aimed at: DTLN. https://github.com/SaneBow/PiDTLN SaneBow did a great job; even though you have Speex NS, one of the reasons for PiDTLN is that Speex NS is really showing its age. It would be interesting, as many filters leave artifacts, so really you would want to preprocess a noisy dataset through the filter to create the resultant training dataset.