MTG / essentia.js

JavaScript library for music/audio analysis and processing powered by Essentia WebAssembly
https://essentia.upf.edu/essentiajs
GNU Affero General Public License v3.0

Output of EssentiaWorkletProcessor in rms-rt example #60

Closed · Kappers closed this issue 3 years ago

Kappers commented 3 years ago

Hey all!

I am working through the rms-rt example, and am unsure what the difference is between the output values below in essentia-worklet-processor.js:

  process(inputs, outputs, parameters) {
    ...
    let output = outputs[0];
    ...
    // 'input' is inputs[0], defined in the code elided above
    let vectorInput = this.essentia.arrayToVector(input[0]); // convert the first channel's frame to an essentia vector
    ...
    let rmsFrame = this.essentia.RMS(vectorInput); // compute RMS of the input audio frame
    output[0][0] = rmsFrame.rms; // write the RMS value as the first sample of the first output channel
    return true; // keep the processor alive
  }

and those retrieved in index.html by:

        analyserNode.getFloatTimeDomainData(analyserData);
        let rms = analyserData[0];

There doesn't seem to be a one-to-one relationship between them. So, if I want the exact value of rmsFrame.rms in my top-level drawing function etc., how would I access it? Sorry if I am missing something obvious!

Example: https://github.com/MTG/essentia.js/tree/dev/examples/rms-rt

Thanks!

jmarcosfer commented 3 years ago

Hi @Kappers!

I don't think you're missing anything obvious, you've definitely caught some strange behaviour.

Could you please share what browser you're using? I've seen the same thing happen in Chrome, but not in Firefox. I've already looked into it and have an idea of what it could be, but just to be sure we're talking about the same thing...

Kappers commented 3 years ago

Thanks for taking a look at this @jmarcosfer, I'm pleased that I'm not missing something!

I'm using Chrome (v92.0.4515.107) for this on Mac (10.15.3).

I didn't think Firefox would support AudioWorklets and all the other good stuff required for this demo, so I didn't even try.

jmarcosfer commented 3 years ago

Thanks @Kappers,

That's the same version of Chrome I'm using.

Solution:

Create the AudioWorkletNode specifying outputChannelCount in its options object. Specifically, change the super call in the node subclass as follows:

      super(context, 'essentia-worklet-processor', {
        outputChannelCount: [1]
      });
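
In case it helps to see the fix in context, here's a minimal sketch of the full node subclass. The class name and the numberOfInputs/numberOfOutputs options are illustrative assumptions, not the example's exact code:

    // Illustrative sketch, not the example's exact code: an
    // AudioWorkletNode subclass whose constructor forces a mono output.
    class EssentiaNode extends AudioWorkletNode {
      constructor(context) {
        super(context, 'essentia-worklet-processor', {
          numberOfInputs: 1,
          numberOfOutputs: 1,
          outputChannelCount: [1] // one channel for the single output
        });
      }
    }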

Problem explanation:

It appears to be an error related to channel management. I'll explain following the audio chain:

a. The microphone stream track that we grab via getUserMedia has channelCount=1, at least in my case when using my laptop's built-in microphone (I'm assuming you're also using a mono mic).

b. The MediaStreamAudioSourceNode that uses it infers its channelCount from the channels in the media tracks passed to it (at least, that's what the Web Audio API spec says; it's not what actually happens for me on either Chrome or Firefox).

c. Then the AudioWorkletNode does the same. Unless you specify outputChannelCount, it infers how many output channels it should have from the input channels (this is incorrectly set to 2 in Chrome, but not in Firefox, which seems to understand that the media track is mono even if the audio node using it is stereo). This means that, at least in Chrome, we have a stereo AudioWorklet output, with one channel all zeros. There's a bug filed for this here.

d. This is finally connected to an AnalyserNode. It turns out that the AnalyserNode downmixes the signal when you retrieve time-domain data via getFloatTimeDomainData. So what you're seeing is potentially the RMS values computed with essentia, but halved, because they're downmixed with an all-zeros right channel.
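
To make the halving in (d) concrete, here's the arithmetic as an illustrative snippet (0.8 is a made-up RMS value; the 0.5 * (left + right) stereo-to-mono mixing rule is the Web Audio spec's):

    // Illustrative arithmetic, not code from the example:
    const left = 0.8;   // RMS value the worklet wrote to its first channel
    const right = 0.0;  // the spurious all-zeros channel Chrome adds
    const downmixed = 0.5 * (left + right); // 0.4 — what the AnalyserNode reports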

Kappers commented 3 years ago

Thanks @jmarcosfer for the deep dive, that appears to have fixed things.

As you mentioned in (c), I had noticed the multiple output channels and was wondering why a boolean value (1.0) sent through the first output channel was being squashed to 0.5 after getFloatTimeDomainData!

I appreciate you pointing out the bug filed for this, too.

Thanks again!

Kappers commented 3 years ago

Hey @jmarcosfer, maybe I rushed to close the issue. There seems to be another problem, possibly related to this issue.

In the rms-rt example, I've noticed that EssentiaWorkletProcessor.process is called multiple (6) times for each draw, and hence for each analyserNode.getFloatTimeDomainData call. Clearly, this is problematic!

I've done some digging but truly have no idea what the cause of this is. Maybe it's expected?

jmarcosfer commented 3 years ago

Hi @Kappers,

This is expected. The explanation for this behaviour is simple: AudioWorklet's process method gets called regularly by its dedicated audio thread, and this runs at a much higher rate than the requestAnimationFrame that draw uses.

Typical audio sampling rates in music are 44.1 kHz or 48 kHz, but even 8 kHz (common in telephony) is a lot faster than screen refresh rates (around 50 to 60 Hz), which is how fast requestAnimationFrame runs.

Now, any kind of analysis needs a window of samples to operate on, so we don't really get 44100 analysis values per second. At that sampling rate and a frame size of 128 (constant for Web Audio, unless you implement your own buffering for bigger frame sizes), we actually get ~344 values per second. This is why you see about 6 EssentiaWorkletProcessor.process calls for each draw (344/50 ≈ 6.9, or 344/60 ≈ 5.7).
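
As a back-of-the-envelope check on those numbers (illustrative arithmetic only):

    // Illustrative arithmetic for the rates discussed above:
    const sampleRate = 44100;  // Hz
    const frameSize = 128;     // Web Audio's fixed render quantum
    const framesPerSecond = sampleRate / frameSize; // ≈ 344.5 process() calls/s
    const callsPerDraw60 = framesPerSecond / 60;    // ≈ 5.7 at a 60 Hz refresh rate
    const callsPerDraw50 = framesPerSecond / 50;    // ≈ 6.9 at a 50 Hz refresh rate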

This is only a problem if you're interested in seeing/getting all analysis values from the AudioWorklet. For the rms-rt demo this wasn't an issue. Peak or momentary loudness changes very fast, so I was always interested in smoothing the analysis output to get a steady value for visualisation anyway.
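
For reference, the smoothing I mean can be as simple as an exponential moving average in the draw loop. A minimal sketch, reusing the analyserNode and analyserData names from your snippet (the smoothing factor is an illustrative choice, not the demo's actual code):

    // Minimal smoothing sketch (illustrative, not the demo's actual code):
    let smoothedRms = 0;
    const ALPHA = 0.1; // smaller = steadier (but laggier) display

    function draw() {
      analyserNode.getFloatTimeDomainData(analyserData);
      const rms = analyserData[0];
      smoothedRms = ALPHA * rms + (1 - ALPHA) * smoothedRms; // exponential moving average
      // ... render smoothedRms ...
      requestAnimationFrame(draw);
    }
    requestAnimationFrame(draw);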

If you do want every value (maybe you have some other computation that depends on the first analysis and missing values will significantly affect your math down the line), you can use SharedArrayBuffer. You can check out our realtime melspectrogram example to see how this is used with AudioWorklets. But bear in mind that, for visualisation purposes, you're still limited by how fast your screen refreshes, so it won't make sense to try to get all 344 values/sec (unless your monitor runs at 300+ Hz).
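
To sketch what that could look like (a rough illustration, not the melspectrogram example's actual code; it assumes your page is cross-origin isolated, which browsers require before exposing SharedArrayBuffer, and essentiaNode is a stand-in name for your AudioWorkletNode):

    // Rough sketch: a single-writer ring buffer shared between the
    // worklet and the main thread (illustrative, not the example's code).
    // Main thread: allocate the buffer and hand it to the worklet.
    const CAPACITY = 512; // room for >1s of values at ~344 values/s
    const sab = new SharedArrayBuffer(
      Int32Array.BYTES_PER_ELEMENT + CAPACITY * Float32Array.BYTES_PER_ELEMENT
    );
    const writeIndex = new Int32Array(sab, 0, 1);
    const values = new Float32Array(sab, Int32Array.BYTES_PER_ELEMENT, CAPACITY);
    essentiaNode.port.postMessage({ sab });

    // Worklet side (after receiving sab on this.port and creating the
    // same two views): publish each analysis value from process().
    //   this.values[Atomics.load(this.writeIndex, 0) % CAPACITY] = rmsFrame.rms;
    //   Atomics.add(this.writeIndex, 0, 1);

    // Main thread: drain every value written since the last draw.
    let readIndex = 0;
    function drainNewValues() {
      const w = Atomics.load(writeIndex, 0);
      for (; readIndex < w; readIndex++) {
        const value = values[readIndex % CAPACITY];
        // ... use every analysis value here ...
      }
    }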

I hope that helps!

Kappers commented 3 years ago

Thanks for clarifying all of this @jmarcosfer, this all makes sense.

Of course a visualisation process is limited in framerate compared to audio, but in my case it isn't appropriate to smooth over the analysis output for the sake of visualisation. The realtime melspectrogram example sounds useful for this - I should have looked at it more closely.

Thank you!