MTG / essentia

C++ library for audio and music analysis, description and synthesis, including Python bindings
http://essentia.upf.edu
GNU Affero General Public License v3.0

StartStopCut not working #1358

Closed: Galvo87 closed this issue 10 months ago

Galvo87 commented 1 year ago

in_cue, out_cue = es.StartStopCut(
    frameSize=1024,
    hopSize=512
)(audio)

always returns (0,0), no matter which audio filetype is given as input...

I load the audio using MonoLoader like this:

    audio = es.MonoLoader(filename=str(filepath), sampleRate=44100, resampleQuality=0, downmix="mix")()

I notice that other similar algos like SilenceRate and StartStopSilence have the same behaviour with Python bindings... Am I doing something wrong?

Galvo87 commented 1 year ago

Any update on this one please?

palonso commented 1 year ago

@Galvo87, note that StartStopSilence expects to process a sequence of frames, not the full audio vector. Check this Python example

Galvo87 commented 1 year ago

Thanks, so I cannot work directly with Loaders, but I must generate frames with FrameGenerator or similar? I suppose I must input frame (vector_real) (the actual input audio frames) to StartStopSilence then... what is the best approach?

palonso commented 1 year ago

After checking StartStopCut's doc, I realized your result could be correct. An output of (0, 0) means that there are no cuts at either the start or the end of the audio (i.e., there is no audio signal in the first 10 ms or the last 10 ms of audio). This means that your audio has a healthy silence margin.

You can check that the algorithm works correctly by trying the opposite case, for example with a continuous noise signal:

import numpy as np
from essentia.standard import StartStopCut

# One second of noise. Essentia expects float32 arrays, while randn returns float64.
audio = np.random.randn(44100).astype(np.float32)
print(StartStopCut()(audio))
# prints (1, 1)

Note that the output should be interpreted as a pair of flags, each indicating whether a cut was detected (1) or not (0).
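
To make the flag semantics concrete, here is a minimal sketch of a helper that turns the pair into a readable message. The name `describe_cuts` is hypothetical and not part of Essentia; it only assumes that the two values behave like 0/1 flags, as described above.

```python
def describe_cuts(start_cut, stop_cut):
    """Turn a (start_cut, stop_cut) flag pair into a readable message.

    Hypothetical helper: a flag of 1 is read as "cut detected",
    a flag of 0 as "silence margin present".
    """
    start = "cut at start" if start_cut else "clean start"
    stop = "cut at end" if stop_cut else "clean end"
    return f"{start}, {stop}"


print(describe_cuts(0, 0))  # clean start, clean end
print(describe_cuts(1, 1))  # cut at start, cut at end
```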

Galvo87 commented 1 year ago

Ok got it, thanks. What about algos like StartStopSilence that expect frame (vector_real) as input? Is a FrameGenerator necessary in that case? Also, what about algos like FadeDetection, which expects an array of RMS values?

palonso commented 1 year ago

From StartStopSilence's doc:

Note: In standard mode the algorithm is to be run iteratively on a sequence of frames. The outputs are updated on each iteration, and the final result is produced at the end of the sequence.

For example:

from essentia.standard import MonoLoader, FrameGenerator, StartStopSilence

audio = MonoLoader(filename="your_audio.mp3")()
startStopSilence = StartStopSilence()

for frame in FrameGenerator(audio):
    start, stop = startStopSilence(frame)

print("start:", start, "stop:", stop)
start: 81 stop: 11061

Note that the outputs are frame indices, not seconds.
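
To illustrate why the algorithm has to be called once per frame, here is a toy pure-Python stand-in that keeps state between calls and updates its outputs on each iteration. It is only a sketch of the pattern, not Essentia's implementation: the class `StartStopTracker`, the helper `frames`, and the power-threshold rule are all assumptions made for the example. Unlike this simplified helper, FrameGenerator may pad and center frames.

```python
class StartStopTracker:
    """Toy frame-by-frame tracker of the first and last non-silent frame."""

    def __init__(self, threshold_db=-60.0):
        # Convert the (assumed) dB threshold to linear power once.
        self.threshold = 10.0 ** (threshold_db / 10.0)
        self.frame_index = 0
        self.start = 0
        self.stop = 0
        self.seen_sound = False

    def __call__(self, frame):
        # Mean power of the current frame.
        power = sum(x * x for x in frame) / len(frame)
        if power > self.threshold:
            if not self.seen_sound:
                self.start = self.frame_index  # first non-silent frame
                self.seen_sound = True
            self.stop = self.frame_index  # latest non-silent frame so far
        self.frame_index += 1
        return self.start, self.stop


def frames(signal, frame_size, hop_size):
    """Yield consecutive frames, dropping any incomplete tail."""
    for begin in range(0, len(signal) - frame_size + 1, hop_size):
        yield signal[begin:begin + frame_size]


# 4 silent frames, 3 loud frames, 2 silent frames (frame_size == hop_size == 4).
signal = [0.0] * 16 + [0.5] * 12 + [0.0] * 8
tracker = StartStopTracker()
for frame in frames(signal, frame_size=4, hop_size=4):
    start, stop = tracker(frame)

print("start:", start, "stop:", stop)  # start: 4 stop: 6
```

Only the values returned after the last frame are meaningful, which matches the note in the doc that the final result is produced at the end of the sequence.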

For FadeDetection, the algorithm expects a vector of RMS values, computed with a frame rate of 4 frames per second by default:

from essentia.standard import MonoLoader, FrameGenerator, FadeDetection, RMS

audio = MonoLoader(filename="rock.mp3")()
rms = RMS()
fadeDetection = FadeDetection()

rms_values = [rms(frame) for frame in FrameGenerator(audio, frameSize=11025, hopSize=11025)]
fade_ins, fade_outs = fadeDetection(rms_values)

print("fade-ins:", fade_ins, "fade-outs:", fade_outs)
fade-ins: [] fade-outs: [[122. 129.]

Note that in my example, the algorithm did not detect any fade-in.
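
The frameSize and hopSize of 11025 in the example follow from the frame rate: 44100 samples per second / 4 frames per second = 11025 samples per frame. The sketch below shows that arithmetic and a plain-Python RMS envelope with non-overlapping frames. The helpers `frame_size_for` and `rms_envelope` are hypothetical names used for illustration, not Essentia functions.

```python
import math


def frame_size_for(frame_rate, sample_rate=44100):
    """Samples per frame so that non-overlapping frames give `frame_rate` values per second."""
    return int(sample_rate / frame_rate)


def rms_envelope(signal, frame_size):
    """RMS of each non-overlapping frame, dropping any incomplete tail."""
    values = []
    for begin in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[begin:begin + frame_size]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return values


frame_size = frame_size_for(4)       # 44100 / 4 = 11025
signal = [0.5] * (2 * 44100)         # two seconds of a constant signal
envelope = rms_envelope(signal, frame_size)

print(frame_size, len(envelope))     # 11025 8
```

If the RMS values are computed at a different rate, the fade detector presumably needs to be told about it, since its default assumes 4 values per second.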

The output matrices are:

Galvo87 commented 1 year ago

Thank you, understood. Is there any handy Essentia method for converting frames to actual audio timestamps?

Also, is it possible for StartStopSilence to have ms precision?

palonso commented 1 year ago

Frames to seconds: frame_index * hop_size / sample_rate
Default hop_size: 512
Default sample_rate: 44100
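
As a small sketch of that formula, applied to the frame indices from the StartStopSilence example earlier in the thread (81 and 11061). The function name `frames_to_seconds` is hypothetical, and the defaults are the ones quoted above.

```python
def frames_to_seconds(frame_index, hop_size=512, sample_rate=44100):
    """Convert a frame index to a time offset in seconds."""
    return frame_index * hop_size / sample_rate


start_s = frames_to_seconds(81)
stop_s = frames_to_seconds(11061)

print(f"start: {start_s:.3f} s, stop: {stop_s:.3f} s")
# start: 0.940 s, stop: 128.418 s
```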

StartStopSilence's resolution depends on the analysis hop size. By default it is 1000 * 512 / 44100 ≈ 11.6 ms. You can increase the time resolution by decreasing the analysis hop size.
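
A quick sketch of that arithmetic, showing the resolution for a few hop sizes and the largest hop size that stays at or below 1 ms at 44100 Hz. The helper names `resolution_ms` and `max_hop_size_for` are hypothetical; whether very small hop sizes are practical for a given file is not something this snippet checks.

```python
import math


def resolution_ms(hop_size, sample_rate=44100):
    """Time between consecutive frames, in milliseconds."""
    return 1000 * hop_size / sample_rate


def max_hop_size_for(target_ms, sample_rate=44100):
    """Largest hop size (in samples) whose resolution is at most target_ms."""
    return math.floor(target_ms * sample_rate / 1000)


for hop in (512, 256, 128, 64):
    print(hop, round(resolution_ms(hop), 2))
# 512 11.61
# 256 5.8
# 128 2.9
# 64 1.45

print(max_hop_size_for(1.0))  # 44
```

The hop size is set where the frames are generated, as in the FrameGenerator(audio, frameSize=..., hopSize=...) call from the FadeDetection example, and the same value must then be used when converting frame indices to time.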