Closed seunggookim closed 6 months ago
In the current gist, essentia.FrameGenerator is used with startFromZero=false by default to create a 1-sec audio_chunk without overlaps, so the first patch spans -(patchSize/2) to (patchSize/2) and its mid-patch time is zero. With patchSize=1 sec and patchHopSize=1 sec, the first patch starts at -0.5 sec and ends at 0.5 sec. @palonso, did I figure this out correctly?
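A minimal sketch of the patch timing described above, assuming the first patch is centered at zero and every later patch is shifted by patchHopSize. The function name openl3_patch_times is hypothetical and not part of Essentia; it only restates the arithmetic.

```python
def openl3_patch_times(n_patches, patch_size=1.0, patch_hop_size=1.0):
    """Return (starts, centers, ends) in seconds for each patch.

    Assumption: the first patch is centered at t = 0, so it spans
    -(patch_size / 2) to +(patch_size / 2).
    """
    centers = [k * patch_hop_size for k in range(n_patches)]
    starts = [c - patch_size / 2 for c in centers]
    ends = [c + patch_size / 2 for c in centers]
    return starts, centers, ends


starts, centers, ends = openl3_patch_times(3)
print(starts)   # [-0.5, 0.5, 1.5]
print(centers)  # [0.0, 1.0, 2.0]
print(ends)     # [0.5, 1.5, 2.5]
```

The returned lists can be wrapped in a NumPy array if they need to sit alongside the embeddings.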
In the current implementation, it's unclear how essentia.FrameGenerator is used. If it is used only once to create frames directly from the audio (without preliminary chunking), the first frame of the first patch likely starts at -(frameSize/2) = -12.5 ms. With patchSize = 96 frames, the patch width would be frameSize + 95*frameHopSize = 975 ms. Thus, the mid-patch time of the first patch would be -12.5 + 975/2 = 475 ms. If essentia.FrameGenerator is used with startFromZero=true, these would shift by +(frameSize/2) = 12.5 ms. Which one is correct? I assume it would be startFromZero=false by default? Or is the first mid-patch time once again zero? Thanks in advance!
OpenL3: Correct!
VGGish: Correct. We use FrameCutter with the default startFromZero=false, and we do not center the first patch at zero in this case.
Each frame advances by 160 samples / 16 kHz = 10 ms. The patchHopSize parameter controls the number of frames to jump between patches (e.g., patchHopSize=96 corresponds to 960 ms). With this, you can compute the timestamps for the successive patches.
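As a rough sketch of that computation, the snippet below derives mid-patch times from the values quoted in this thread: a 16 kHz sample rate, a 160-sample hop, 96-frame patches, and a 25 ms frame (400 samples, implied by frameSize/2 = 12.5 ms). The function name vggish_patch_centers and its arguments are hypothetical and not part of Essentia, and the sketch assumes the first frame starts at -(frameSize/2) when startFromZero=false.

```python
def vggish_patch_centers(n_patches, sample_rate=16000, frame_size=400,
                         hop_size=160, patch_size=96, patch_hop_size=96,
                         start_from_zero=False):
    """Return the mid-patch time in seconds for each patch.

    Assumption: with start_from_zero=False the first frame is centered
    at t = 0, so it starts at -(frame_size / 2) samples.
    """
    # Start of the first frame, in samples.
    first_start = 0.0 if start_from_zero else -frame_size / 2
    # A patch covers one full frame plus (patch_size - 1) hops.
    patch_width = frame_size + (patch_size - 1) * hop_size
    centers = []
    for k in range(n_patches):
        # Each successive patch starts patch_hop_size frames later.
        patch_start = first_start + k * patch_hop_size * hop_size
        centers.append((patch_start + patch_width / 2) / sample_rate)
    return centers


print(vggish_patch_centers(3))
# [0.475, 1.435, 2.395]
print(vggish_patch_centers(1, start_from_zero=True))
# [0.4875]
```

With the defaults, the first mid-patch time is 475 ms and successive patches are 960 ms apart, matching the numbers above.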
Adding timestamps is outside the scope of our core algorithms. Perhaps timestamps could be added to the helper functions in https://github.com/MTG/essentia/tree/master/src/python/essentia/pytools. Closing this for now.
I very much appreciate that Essentia has added interfaces for the pre-trained models (OpenL3 and VGGish)! I'm interested in relating these embeddings to neural/behavioral time series, so having accurate timestamps for each patch (e.g., the center time of each patch) is important to me. I would be grateful if the functions for those two models (OpenL3 and VGGish) could also return timestamps (as another NumPy array) to avoid possible confusion or mistakes when they are used in other research.