adelavega closed this issue 4 years ago
Oh, this looks awesome. Agree it should be in pliers and not just in NeuroScout.
Oh, and we don't have a resampling filter right now (at least for audio), AFAIK. That would be straightforward to add, and I think we should make it its own AudioResamplingFilter rather than build it into this one. We could conceivably add an optional class variable for audio that indicates what sampling rate(s) an Extractor needs, but that's probably overkill, and the internal logic for these hierarchies is already getting pretty convoluted. So it's probably best to just raise an exception if a different sampling rate is provided.
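Just to sketch what I have in mind (untested; the Filter import path, and the assumption that an AudioStim can be constructed from an in-memory array, are guesses about the pliers internals):

```python
import librosa

from pliers.filters.base import Filter  # assumed import path
from pliers.stimuli import AudioStim


class AudioResamplingFilter(Filter):
    """Resample an AudioStim to a target sampling rate (sketch only)."""

    _input_type = AudioStim

    def __init__(self, target_sr=16000):
        self.target_sr = target_sr
        super().__init__()

    def _filter(self, stim):
        # Resample the raw waveform (keyword args per librosa >= 0.10).
        data = librosa.resample(stim.data, orig_sr=stim.sampling_rate,
                                target_sr=self.target_sr)
        # Assumes AudioStim accepts in-memory data + sampling_rate; the
        # real constructor may differ.
        return AudioStim(data=data, sampling_rate=self.target_sr,
                         onset=stim.onset)
```

Then the Yamnet extractor could just check `stim.sampling_rate` and raise if it isn't 16000.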
Closing this, as the extractor has been implemented.
Google published a dataset called AudioSet, which consists of about 2 million 10-second YouTube clips that are manually annotated using a hierarchical ontology. (Thanks to @rbroc for finding this!)
They then developed and shared TensorFlow models for 1) classifying these labels from audio (YAMNet) and 2) producing 128-dimensional embeddings (VGGish). Both models are available with pre-trained weights!
I gave YAMNet a shot on some of our stimuli, and the results are actually pretty good:
Life dataset (60s): *(results screenshot)*

Sherlock: *(results screenshot)*
Generating moment-by-moment labels is quite fast; it should only take a few seconds to produce labels for an entire movie.
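For reference, this is roughly what I'm doing (a minimal sketch assuming the TF Hub release of YAMNet; the input file name is a placeholder, and the audio is assumed to already be mono 16 kHz):

```python
import csv

import numpy as np
import soundfile as sf
import tensorflow_hub as hub

# Load the pre-trained YAMNet model (assuming the TF Hub release).
model = hub.load('https://tfhub.dev/google/yamnet/1')

# YAMNet expects a mono float32 waveform at 16 kHz, scaled to [-1.0, 1.0].
waveform, sr = sf.read('stimulus_16k.wav', dtype='float32')
assert sr == 16000

# scores has shape (n_frames, 521): one class distribution per ~0.48 s frame.
scores, embeddings, spectrogram = model(waveform)

# Map each frame to its top label using the class map shipped with the model.
with open(model.class_map_path().numpy()) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]
top = np.argmax(scores.numpy(), axis=1)
print([class_names[i] for i in top])
```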
The question is whether this makes sense as a pliers extractor, or as something we do outside of pliers, as with facenet. Given that the repo seems fairly well maintained, and extraction from a WAV input is fairly standard, I vote for including it in pliers.
The only minor catch is that the audio needs to be sampled at 16,000 Hz. I'm not sure if we have a filter for downsampling audio, although this is fairly easy to do with libraries like pydub (see the sketch below).
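For example, a downsampling one-liner with pydub might look like this (file names are placeholders):

```python
from pydub import AudioSegment

# Load the original audio and resample to 16 kHz mono for YAMNet/VGGish.
audio = AudioSegment.from_file('stimulus.wav')
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export('stimulus_16k.wav', format='wav')
```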