adelavega closed this issue 4 years ago
Oh, this looks awesome. Agree it should be in pliers and not just in NeuroScout.
Oh, and we don't have a resampling filter right now (at least for audio), AFAIK. That would be straightforward to add, and I think we should make it its own AudioResamplingFilter rather than build it into this one. We could conceivably add an optional class variable for audio that indicates what sampling rate(s) an Extractor needs, but that's probably overkill, and the internal logic for these hierarchies is already getting pretty convoluted. So it's probably best to just raise an exception if a different sampling rate is provided.
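Just to sketch what I have in mind (untested; the Filter import path, and the assumption that an AudioStim can be constructed from an in-memory array, are guesses about the pliers internals):

```python
import librosa

from pliers.filters.base import Filter  # assumed import path
from pliers.stimuli import AudioStim


class AudioResamplingFilter(Filter):
    """Resample an AudioStim to a target sampling rate (sketch only)."""

    _input_type = AudioStim

    def __init__(self, target_sr=16000):
        self.target_sr = target_sr
        super().__init__()

    def _filter(self, stim):
        # Resample the raw waveform (keyword args per librosa >= 0.10).
        data = librosa.resample(stim.data, orig_sr=stim.sampling_rate,
                                target_sr=self.target_sr)
        # Assumes AudioStim accepts in-memory data + sampling_rate; the
        # real constructor may differ.
        return AudioStim(data=data, sampling_rate=self.target_sr,
                         onset=stim.onset)
```

Then the Yamnet extractor could just check `stim.sampling_rate` and raise if it isn't 16000.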
Closing this, as the extractor has been implemented.
Google published a dataset called AudioSet, which consists of about 2 million 10-second YouTube clips that are manually annotated using a hierarchical ontology. (Thanks to @rbroc for finding this!)
They then developed and shared TensorFlow models for 1) classifying these labels from audio (YAMNet) and 2) producing 128-dimensional embeddings (VGGish). Both models are available with pre-trained weights!
I gave YAMNet a shot on some of our stimuli, and the results are actually pretty good:
Life dataset (60s): *(results screenshot)*

Sherlock: *(results screenshot)*
Generating moment-by-moment labels is quite fast; it should only take a few seconds to produce labels for an entire movie.
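For reference, this is roughly what I'm doing (a minimal sketch assuming the TF Hub release of YAMNet; the input file name is a placeholder, and the audio is assumed to already be mono 16 kHz):

```python
import csv

import numpy as np
import soundfile as sf
import tensorflow_hub as hub

# Load the pre-trained YAMNet model (assuming the TF Hub release).
model = hub.load('https://tfhub.dev/google/yamnet/1')

# YAMNet expects a mono float32 waveform at 16 kHz, scaled to [-1.0, 1.0].
waveform, sr = sf.read('stimulus_16k.wav', dtype='float32')
assert sr == 16000

# scores has shape (n_frames, 521): one class distribution per ~0.48 s frame.
scores, embeddings, spectrogram = model(waveform)

# Map each frame to its top label using the class map shipped with the model.
with open(model.class_map_path().numpy()) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]
top = np.argmax(scores.numpy(), axis=1)
print([class_names[i] for i in top])
```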
The question is whether this makes sense as a pliers extractor, or as something we do outside of pliers, as with facenet. Given that the repo seems fairly well maintained, and extraction from a WAV input is fairly standard, I vote for including it in pliers.
The only minor catch is that the audio needs to be sampled at 16,000 Hz. I'm not sure if we have a filter for downsampling audio, although this is fairly easy to do with libraries like pydub (see the sketch below).
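For example, a downsampling one-liner with pydub might look like this (file names are placeholders):

```python
from pydub import AudioSegment

# Load the original audio and resample to 16 kHz mono for YAMNet/VGGish.
audio = AudioSegment.from_file('stimulus.wav')
audio = audio.set_frame_rate(16000).set_channels(1)
audio.export('stimulus_16k.wav', format='wav')
```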