MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
http://essentia.upf.edu
GNU Affero General Public License v3.0

Feature request: time stamps for embedding extractors #1253

Closed seunggookim closed 6 months ago

seunggookim commented 2 years ago

I very much appreciate that Essentia has added interfaces for the pre-trained models (openL3 and VGGish)! I'm interested in relating these embeddings to neural/behavioral time series, so having accurate time stamps for each patch (e.g., the center time of each patch) is important to me. I would be grateful if the functions for those two models (openL3 and VGGish) could also return time stamps (as another NumPy array), to avoid possible confusion or mistakes when the embeddings are used in other research.

seunggookim commented 1 year ago

openL3

In the current gist, essentia.FrameGenerator is used with startFromZero=false (the default) to create a 1-sec audio_chunk without overlap, so the first patch spans -(patchSize/2) to (patchSize/2) and its mid-patch time is zero. With patchSize = 1 sec and patchHopSize = 1 sec, the first patch starts at -0.5 sec and ends at 0.5 sec. @palonso, did I figure this out correctly?
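For readers who want the OpenL3 timing described above as an array, here is a minimal sketch. It assumes the first patch is centered at zero and that successive patches advance by patchHopSize, as stated in this comment. The function name `openl3_patch_times` is hypothetical and not part of Essentia; the number of patches would come from the first dimension of your embedding array.

```python
import numpy as np

def openl3_patch_times(n_patches, patch_size=1.0, patch_hop_size=1.0):
    """Hypothetical helper (not an Essentia function).

    Returns start, center, and end times in seconds for each patch,
    assuming the first patch is centered at t = 0.
    """
    centers = np.arange(n_patches) * patch_hop_size
    starts = centers - patch_size / 2
    ends = centers + patch_size / 2
    return starts, centers, ends

starts, centers, ends = openl3_patch_times(3)
print(starts, centers, ends)
```

With the 1-sec patch and hop sizes quoted above, the first patch covers -0.5 to 0.5 sec and each later patch is shifted by one second.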

vggish

In the current implementation, it's unclear how essentia.FrameGenerator is used. If it is applied only once, creating frames directly from the audio (without preliminary chunking), the first frame of the first patch would start at -(frameSize/2) = -12.5 ms. With patchSize = 96 frames, the patch width would be frameSize + 95*frameHopSize = 25 + 950 = 975 ms, so the mid-patch time of the first patch would be -12.5 + 975/2 = 475 ms. If essentia.FrameGenerator is used with startFromZero=true, these times would shift by +(frameSize/2) = 12.5 ms. Which one is correct? I assume startFromZero=false is the default? Or is the first mid-patch time once again zero? Thanks in advance!

palonso commented 1 year ago

OpenL3: Correct!

VGGish: Correct. We use FrameCutter with the default startFromZero=false, and we do not center the first patch at zero in this case.

Each frame advances by 160 samples / 16 kHz = 10 ms. The patchHopSize parameter controls the number of frames to jump between patches (e.g., patchHopSize=96 corresponds to 960 ms). With this, you can compute the timestamps of the successive patches.
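The VGGish timestamp computation discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not Essentia's implementation: the frame size of 400 samples (25 ms) is inferred from the 12.5 ms half-frame quoted above, the hop of 160 samples and the 16 kHz rate come from the comment above, and the function name `vggish_patch_center_times` is hypothetical. The code does not call Essentia; it only reproduces the arithmetic.

```python
import numpy as np

def vggish_patch_center_times(n_patches, frame_size=400, hop_size=160,
                              patch_size=96, patch_hop_size=96,
                              sample_rate=16000, start_from_zero=False):
    """Hypothetical helper (not an Essentia function).

    Returns the mid-patch time in seconds of each patch. Sizes are in
    samples (frame_size, hop_size) or frames (patch_size, patch_hop_size).
    """
    # With startFromZero=false the first frame is centered on sample 0,
    # so it starts half a frame before the audio begins.
    first_frame_start = 0.0 if start_from_zero else -frame_size / 2
    # A patch spans one full frame plus (patch_size - 1) hops.
    patch_width = frame_size + (patch_size - 1) * hop_size
    # Each patch starts patch_hop_size frames after the previous one.
    patch_starts = first_frame_start + np.arange(n_patches) * patch_hop_size * hop_size
    return (patch_starts + patch_width / 2) / sample_rate

centers = vggish_patch_center_times(3)
print(centers)
```

With the defaults, the first center is 0.475 sec, matching the 475 ms worked out earlier in the thread, and each subsequent center is 960 ms later. Passing `start_from_zero=True` shifts every value by 12.5 ms.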

dbogdanov commented 6 months ago

Adding timestamps is outside the scope of our core algorithms. Perhaps timestamps could be added to the helper functions in https://github.com/MTG/essentia/tree/master/src/python/essentia/pytools. Closing this for now.