Spectrogram processing #288

Open RobinSchmidt opened 4 years ago

RobinSchmidt commented 4 years ago

i open a new thread to continue the discussion which started here:

but doesn't really belong to the main topic of that thread. there's a class rsSpectrogram for converting an array of audio samples into a spectrogram, represented as a matrix of complex numbers, and/or (re)synthesizing an audio signal from such a spectrogram. in between analysis and resynthesis, one can apply arbitrary transformations to the spectrogram data. one of the simplest things to do is filtering by zeroing out the bins above or below a certain cutoff point. the function spectrogramFilter in the TestsRosicAndRapt project demonstrates how this can be done. while writing this test function, i discovered a flaw in the underlying matrix class which unfortunately requires an inconvenient additional copying workaround (there are comments about this in the code). i think i will soon replace this matrix class - which i've wanted to do for some time anyway. i now have other ideas about how a proper implementation of a matrix class should look (probably next week - i'm a bit sick at the moment)
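A minimal sketch of the bin-zeroing idea, independent of rsSpectrogram. The `ComplexMatrix` alias, `frequencyToBin`, `zeroBinsFrom` and `zeroBinsBelow` are hypothetical names for illustration, and the `s[frame][bin]` layout is an assumption rather than the layout of the actual matrix class:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the spectrogram matrix: s[frame][bin].
using ComplexMatrix = std::vector<std::vector<std::complex<double>>>;

// Nearest bin index for a frequency in Hz (bin spacing = sampleRate / trafoSize).
std::size_t frequencyToBin(double freq, double sampleRate, std::size_t trafoSize)
{
    assert(freq >= 0.0 && sampleRate > 0.0);
    return static_cast<std::size_t>(std::lround(freq * trafoSize / sampleRate));
}

// Brickwall lowpass: zero all bins at or above cutoffBin in every frame.
void zeroBinsFrom(ComplexMatrix& s, std::size_t cutoffBin)
{
    for (auto& frame : s)
        for (std::size_t k = cutoffBin; k < frame.size(); ++k)
            frame[k] = 0.0;
}

// Brickwall highpass: zero all bins below cutoffBin in every frame.
void zeroBinsBelow(ComplexMatrix& s, std::size_t cutoffBin)
{
    for (auto& frame : s)
        for (std::size_t k = 0; k < cutoffBin && k < frame.size(); ++k)
            frame[k] = 0.0;
}
```

The filtered matrix would then be passed to the resynthesis step to get the filtered audio signal.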

elanhickler commented 4 years ago

Thanks for doing this.

Can you explain how blocksize, hopsize, trafosize would affect the filter?

Edit: Looking at this I can't see where the waveform output is... oh you renamed vector to vec.

RobinSchmidt commented 4 years ago

for filtering, the trafoSize should probably stay fixed at being equal to the blockSize. i think using zero-padding (i.e. trafoSize > blockSize) makes sense only for higher resolution analysis and visualization. it's actually a faux increase in frequency resolution - the spectrum is still washed out, you just get more spectral samples of it. hopSize = blockSize/2 is probably also best left as is. you can get a similar faux time-resolution increase for visualization by choosing smaller (power-of-two) fractions here. the only parameter that really matters is the blockSize - it dials in a trade-off between time-resolution and frequency-resolution of the filter. it basically means that longer block-sizes lead to longer ringing filters that give better frequency separation. so the parameter here is the blockSize - the other two are dependent as: trafoSize=blockSize, hopSize=blockSize/2. ...i mean, you can experiment with other settings (i didn't, so far), but i would not really expect much improvement or difference from doing so
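To put numbers on the trade-off, here is a small sketch that derives the dependent settings from the blockSize and reports the bin spacing and block duration. `StftSettings` and the helper functions are hypothetical names, not part of rsSpectrogram, and the overlap factor is left as a parameter in case a hop size other than blockSize/2 is wanted:

```cpp
#include <cassert>

// Hypothetical helper type, not part of rsSpectrogram.
struct StftSettings
{
    int blockSize;  // samples per analysis block
    int trafoSize;  // FFT size (equal to blockSize, i.e. no zero-padding)
    int hopSize;    // samples between the starts of successive blocks
};

// Derive trafoSize and hopSize from the one free parameter, the blockSize.
StftSettings settingsFromBlockSize(int blockSize, int overlapFactor = 2)
{
    assert(blockSize > 0 && overlapFactor > 0 && blockSize % overlapFactor == 0);
    return { blockSize, blockSize, blockSize / overlapFactor };
}

// Spacing between adjacent bin center frequencies in Hz.
double binSpacingHz(const StftSettings& s, double sampleRate)
{
    return sampleRate / s.trafoSize;
}

// Duration of one block in seconds, a rough measure of the time smearing.
double blockDurationSeconds(const StftSettings& s, double sampleRate)
{
    return s.blockSize / sampleRate;
}
```

Doubling the blockSize halves the bin spacing and doubles the block duration, which is the time/frequency trade-off in numbers.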

elanhickler commented 4 years ago

You have lowpass and highpass here. Wouldn't it be just as good to only do one and then subtract the result from the original?

RobinSchmidt commented 4 years ago

yes, that should work, too. i'm currently just not sure, if the subtraction will work correctly on the spectrogram level due to the flaws and/or bugs in the matrix class (i mean, if you just subtract spectrogram matrices using the "-" operator of the matrix class). but subtracting time-domain signals should work
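A minimal sketch of the time-domain variant, assuming the lowpassed signal has already been resynthesized, has the same length as the original and is time-aligned with it. The function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Complementary band obtained as: original minus lowpassed signal.
std::vector<double> complementBySubtraction(const std::vector<double>& original,
                                            const std::vector<double>& lowpassed)
{
    assert(original.size() == lowpassed.size());
    std::vector<double> highpassed(original.size());
    for (std::size_t n = 0; n < original.size(); ++n)
        highpassed[n] = original[n] - lowpassed[n];
    return highpassed;
}
```

By construction, adding the two bands back together reproduces the original up to roundoff.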

elanhickler commented 4 years ago

ok so something I kinda get but also totally don't get.

You have two classes, the spectrogram and the sinusoidal model stuff.

Let's say I want to create a simple denoiser. 0 out bins that are not above an amplitude threshold. Let's say I want to improve that denoiser by detecting harmonics. Let's say after x seconds the harmonics of the signal start dropping below the noise floor. This is where noise reduction starts to become useless. What I really want is to do a guesstimate of how long the harmonics last and just resynthesize from scratch. Get rid of everything after X seconds and just resynthesize from there. That's where the sinusoidal model comes in? Is it confusing to combine these two classes into a function? How's this going to work, etc. etc. This is just brainstorming.
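The first step of that idea (zeroing bins that are not above an amplitude threshold) could be sketched as below. `ComplexMatrix` and `spectralGate` are hypothetical names, and a real denoiser would likely need a per-bin or time-varying threshold rather than a single number:

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the spectrogram matrix: s[frame][bin].
using ComplexMatrix = std::vector<std::vector<std::complex<double>>>;

// Zero every bin whose magnitude is below threshold.
// Returns how many bins were set to zero.
std::size_t spectralGate(ComplexMatrix& s, double threshold)
{
    assert(threshold >= 0.0);
    std::size_t numZeroed = 0;
    for (auto& frame : s)
        for (auto& bin : frame)
            if (std::abs(bin) < threshold) {
                bin = 0.0;
                ++numZeroed;
            }
    return numZeroed;
}
```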

Edit: This is basically the function of the sample tail extender but as you said, it doesn't use sinusoidal models? It only uses spectrogram stuff and bin amplitude? So it's not as good?

elanhickler commented 4 years ago

quick question: is it ok to use float types with rsSpectrogram? Because the audio is in floats not doubles.

RobinSchmidt commented 4 years ago

is it ok to use float types with rsSpectrogram?

i didn't try that but don't see, why it should be a problem. you should just need an appropriate template instantiation

elanhickler commented 4 years ago

What's the demodulation thing about? What is modulation in this case?

elanhickler commented 4 years ago


I get some distortion at the end of the file using the lowpass/highpass stuff. I resized my audio samples so that it has blocksize number of extra 0s at the end. That seemed to fix it.

Edit: adding a single extra 0 sample seems to work as well. maybe the last number just needs to be 0.

RobinSchmidt commented 4 years ago

What's the demodulation thing about? What is modulation in this case?

before analyzing a frame, an analysis window is applied and after (re)synthesizing a frame, a synthesis window is applied. depending on the choice of these windows (and the ratio of blocksize and hopsize), there may be an amplitude modulation in the resynthesized signal - but this modulation is predictable and can be compensated for. this is the demodulation step.

see here:

in most implementations of spectrogram processing systems, the window-functions and blocksize-to-hopsize ratio is tuned such that this modulation does not occur (this happens when the (overlapped) products of analysis and synthesis windows sum up to unity). but i wanted to have more freedom in my choices of window-functions, so i incorporated this demodulation step. this will do nothing (i.e. divide by one), in case of a choice where this overlap-to-unity condition is satisfied. in other cases, it divides by the sum of the overlapped window-products
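A sketch of the demodulation step as described: accumulate the overlapped products of analysis and synthesis window, then divide the resynthesized signal by that sum. The function names are hypothetical, and the frame placement (blocks starting at sample 0, hopSize, 2*hopSize, ...) is an assumption that may differ from how rsSpectrogram positions its frames:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the overlapped analysis*synthesis window products for a signal of
// numSamples samples, with blocks starting at 0, hopSize, 2*hopSize, ...
std::vector<double> overlappedWindowProductSum(const std::vector<double>& analysisWindow,
                                               const std::vector<double>& synthesisWindow,
                                               std::size_t hopSize, std::size_t numSamples)
{
    assert(analysisWindow.size() == synthesisWindow.size() && hopSize > 0);
    std::vector<double> sum(numSamples, 0.0);
    for (std::size_t start = 0; start < numSamples; start += hopSize)
        for (std::size_t n = 0; n < analysisWindow.size() && start + n < numSamples; ++n)
            sum[start + n] += analysisWindow[n] * synthesisWindow[n];
    return sum;
}

// Divide the resynthesized signal by the modulation signal.
// Samples where the modulation is (almost) zero are left untouched.
void demodulate(std::vector<double>& signal, const std::vector<double>& modulation,
                double epsilon = 1e-12)
{
    assert(signal.size() == modulation.size());
    for (std::size_t n = 0; n < signal.size(); ++n)
        if (modulation[n] > epsilon)
            signal[n] /= modulation[n];
}
```

When the window products already overlap to unity, the modulation signal is 1 everywhere and the division changes nothing.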

RobinSchmidt commented 4 years ago

I get some distortion at the end of the file

:-O what is this?! i've never seen that. what are your settings? maybe i should try it with your sample? or (easier for me to check) can you produce this artifact also with a simple artificial input signal (noise, dc, sine, whatever?)

maybe the last number just needs to be 0.

this would be really weird!

RobinSchmidt commented 4 years ago

oh - i just noticed that the hopsize should be blockSize/4 and not blockSize/2 for the overlapped windows to sum up to a constant (with the Hann window). blockSize/2 would be appropriate only if the window would be applied once - but it's applied twice (in analysis and synthesis) - but with halving the hopsize, we again get a sum-up-to-constant property. the hann window is really nice in this respect.

so, with blockSize/2, the demodulation actually does something - which could produce artifacts. ...although probably not of the kind that you are seeing at the end. that is probably something else.

i have checked in an update - my experiment now also plots the sum of the overlapped window products - in case you want to take a look and play around with it
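The sum-to-constant property can also be checked numerically. The sketch below uses a periodic Hann window and measures the ripple of the overlapped sum of squared windows in steady state (away from the fades at the ends). The function names are hypothetical. For this unnormalized window definition the constant at hopSize = blockSize/4 comes out as 1.5, so a scale factor is still needed for exact unity:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Periodic Hann window of length N: w[n] = 0.5 - 0.5*cos(2*pi*n/N).
std::vector<double> periodicHann(std::size_t N)
{
    assert(N > 0);
    const double pi = 3.14159265358979323846;
    std::vector<double> w(N);
    for (std::size_t n = 0; n < N; ++n)
        w[n] = 0.5 - 0.5 * std::cos(2.0 * pi * n / N);
    return w;
}

// Peak-to-peak ripple of the overlapped sum of squared windows in steady state.
// A sample at offset n within a hop is covered by window indices n, n+hopSize, ...
double squaredWindowSumRipple(const std::vector<double>& w, std::size_t hopSize)
{
    const std::size_t N = w.size();
    assert(hopSize > 0 && N % hopSize == 0);
    double minSum = 1e300, maxSum = -1e300;
    for (std::size_t n = 0; n < hopSize; ++n) {
        double sum = 0.0;
        for (std::size_t k = n; k < N; k += hopSize)
            sum += w[k] * w[k];
        minSum = std::min(minSum, sum);
        maxSum = std::max(maxSum, sum);
    }
    return maxSum - minSum;
}
```

With a 1024-sample window, a hop of 256 gives a ripple at roundoff level, while a hop of 512 gives a ripple of 0.5, which is the amplitude modulation the demodulation has to undo.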

elanhickler commented 4 years ago

wtf, when I combine two files using JUCE classes the result is that there are errors around -135 dB. If I combine using REAPER, it is -inf, so, perfect. WHYYYYYY

it's simple math!


Edit: Hmm what if the audio file writer is adding dither? No that can't be, then it wouldn't combine well in REAPER.

I changed hopsize to blocksize / 4 and removed my "fix" for the distortion at the end. I'm not getting any distortion now, I'll wait for it to happen again. It's an intermittent problem. Sometimes happens, sometimes doesn't (if I don't have my fix implemented).

RobinSchmidt commented 4 years ago

It's an intermittent problem. Sometimes happens, sometimes doesn't

could it be related to the length of the buffer/file? if it's a "nice" number (maybe that an integer number of blocks fits in or something), it doesn't happen and otherwise it does? or the other way around?

is the juce sample buffer single precision and the reaper double and we see roundoff error here or something? anyway, this is not related to my code, right?

RobinSchmidt commented 4 years ago

btw - at the moment, i would actually recommend prepending and appending a blocksize worth of zeros before analysis and cutting them off after resynthesis (half or maybe even a quarter of that should actually suffice but better be safe). because my block overlapping may produce fade-in/out effects at start and end (which you won't ever see because of the demodulation - they'll be compensated too - but it's probably better not having to compensate for anything)
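A sketch of that padding workaround with two hypothetical helpers: one adds padLength zeros at both ends before analysis, the other cuts them off again after resynthesis:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Prepend and append padLength zeros to the signal.
std::vector<double> padWithZeros(const std::vector<double>& x, std::size_t padLength)
{
    std::vector<double> padded(x.size() + 2 * padLength, 0.0);
    std::copy(x.begin(), x.end(),
              padded.begin() + static_cast<std::ptrdiff_t>(padLength));
    return padded;
}

// Remove padLength samples from both ends of the (resynthesized) signal.
std::vector<double> removePadding(const std::vector<double>& y, std::size_t padLength)
{
    assert(y.size() >= 2 * padLength);
    return std::vector<double>(y.begin() + static_cast<std::ptrdiff_t>(padLength),
                               y.end() - static_cast<std::ptrdiff_t>(padLength));
}
```

Following the advice above, padLength would be set to the blockSize.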

...the spectrogram stuff is still very much under development

RobinSchmidt commented 4 years ago

ah - by the way - in the function plotOverlappingWindowSum, i have made a plot of the overlapping windows and their sum, using hopSize=blockSize/4 with the Hann window (squared, because it's applied twice - if it were not squared, blockSize/2 would work). as you see, they sum up to unity:


except for the fades at the ends (because there, fewer windows contribute to the sum). when you use the same overlap factor with a blackman window, you can clearly see the amplitude modulation:


...but i think, for blackman, you can get this sum-to-unity property (at least approximately) with another overlap factor, too - i have to look that up or try it... edit: ah - here:

elanhickler commented 4 years ago

Yeah I'll have to experiment with double precision. I wonder if maybe using 64-bit wav files will force JUCE to use higher precision... or at least I should try 32 bit just to see what happens.

RobinSchmidt commented 4 years ago

This is basically the function of the sample tail extender but as you said, it doesn't use sinusoidal models? It only uses spectrogram stuff and bin amplitude? So it's not as good?

in some sense, it creates a (restricted version of a) sinusoidal model from the spectrogram data. for tail extension, i actually think the general approach is quite appropriate - if it were improved to estimate the decay rates from the data. but with the current one-decay-rate-for-all-harmonics implementation, the tail sounds rather static and artificial.

...i don't really get, what exactly you get and also don't get

elanhickler commented 4 years ago

Yep, using a 32-bit wav file solved the slight error.

OK, so how do you create a smooth spectral filter rather than brick wall? Do you simply reduce amplitude a little more for each frequency? Wouldn't you be able to hear the steppyness of the frequency rows? Especially if the filter frequency was changing over time?

RobinSchmidt commented 4 years ago

how do you create a smooth spectral filter rather than brick wall? Do you simply reduce amplitude a little more for each frequency?

yes - i would probably linearly fade the amplitudes down over a certain number of bins. and/or maybe with a smoother (sin/cos) shaped function. we would just have to take care that complementary low- and highpass add up to unity. ...unless you get your highpass by subtraction - then, it would be perfect reconstruction, regardless. however, the highpass may not be an exact mirror image of the lowpass, if you use just any fade-function
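A sketch of such a fade using a raised-cosine transition between two bin indices, with the highpass gains defined as one minus the lowpass gains so that the pair adds up to unity in every bin. The function names and the choice of fade shape are assumptions for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-bin lowpass gains: 1 below fadeStart, 0 from fadeEnd on,
// raised-cosine fade from 1 down to 0 in between.
std::vector<double> lowpassGains(std::size_t numBins, std::size_t fadeStart,
                                 std::size_t fadeEnd)
{
    assert(fadeStart <= fadeEnd && fadeEnd <= numBins);
    const double pi = 3.14159265358979323846;
    std::vector<double> g(numBins);
    for (std::size_t k = 0; k < numBins; ++k) {
        if (k < fadeStart)
            g[k] = 1.0;
        else if (k >= fadeEnd)
            g[k] = 0.0;
        else {
            const double t = double(k - fadeStart) / double(fadeEnd - fadeStart);
            g[k] = 0.5 + 0.5 * std::cos(pi * t);  // t runs from 0 towards 1
        }
    }
    return g;
}

// Complementary highpass gains, chosen so that low + high == 1 in every bin.
std::vector<double> complementaryGains(const std::vector<double>& low)
{
    std::vector<double> high(low.size());
    for (std::size_t k = 0; k < low.size(); ++k)
        high[k] = 1.0 - low[k];
    return high;
}
```

To apply the filter, each bin of every frame would be multiplied by the gain for its bin index.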

Wouldn't you be able to hear the steppyness of the frequency rows? Especially if the filter frequency was changing over time?

i guess, the overlap would smooth these steps out. what exactly do you mean to happen without time-modulation?

RobinSchmidt commented 4 years ago

ahh - i guess, i see what you mean: the rounding of the cutoff-bin to integer values? yeah...i could probably allow float numbers for that by scaling the last bin by a number between 0 and 1. i must think about that
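One possible reading of that idea, as a sketch: bins fully below a non-integer cutoff position keep gain 1, the bin that contains the cutoff is scaled by the fractional part, and everything above gets 0. The function name and this exact rule are assumptions, not the implemented behavior:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-bin lowpass gains for a cutoff position given in (fractional) bins.
std::vector<double> fractionalCutoffGains(std::size_t numBins, double cutoffBin)
{
    assert(cutoffBin >= 0.0);
    std::vector<double> g(numBins, 0.0);
    const std::size_t whole = static_cast<std::size_t>(std::floor(cutoffBin));
    const double frac = cutoffBin - std::floor(cutoffBin);
    for (std::size_t k = 0; k < numBins && k < whole; ++k)
        g[k] = 1.0;            // bins fully below the cutoff
    if (whole < numBins)
        g[whole] = frac;       // last bin scaled by a number between 0 and 1
    return g;
}
```

With this rule, a cutoff that moves slowly over time changes the gain of the edge bin gradually instead of switching it on or off.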

elanhickler commented 4 years ago

btw, I don't see me needing a sloped filter. But I guess interpolating values is something that will be needed a lot, especially for a denoiser.

elanhickler commented 4 years ago

One thing a lot of my clients ask for is a matching filter, to for example make a loud violin sample sound like a softer violin sample, usually to correct mistakes in the recording process, or even to generate new samples that are in between recorded dynamics. You do this by analyzing the sample you want to manipulate and get some kind of difference based on the sample you want it to sound like.

Do you know how matching stuff works?

Edit: This is a spectral process.

RobinSchmidt commented 4 years ago

Do you know how matching stuff works?

hmm - do you have some example product to show me exactly what you mean? if it's about applying the spectral envelope of one signal to another signal, then i have done such things in the context of my master thesis (it was partially about spectral envelopes - and i implemented a vocoder based on this algorithm)

elanhickler commented 4 years ago

Oh, well do YOU have some examples?

RobinSchmidt commented 4 years ago

ok - yeah - this is fun. this is me reading and then vocoding the title of my thesis (it's in german - "representation and modification of spectral envelopes of natural sounds based on formant models"):

the speech of the output is really intelligible .....for germans

elanhickler commented 4 years ago

So you can use this to convolve a soft guitar pluck with a loud guitar pluck and create something in between? Obviously won't sound as good as the real thing, but just enough to be usable in music composition.

RobinSchmidt commented 4 years ago

well, it's a vocoder - so the process is asymmetrical, so it's not like a morph (which i would expect to act somewhat symmetrical - and adjustable) - carrier and modulator play different roles - but yeah, you get something "in between".

that said - morphing stuff could certainly be done as well. in fact, we'll have a lot of interesting stuff to explore with this spectrogram processor. the basic system is in place and (more or less) working - now the fun can begin

elanhickler commented 4 years ago

This is just an image comparing your voice envelope to the output envelope, not sure what purpose this image has. They look like they match up pretty well. image

This is Vocodex example but there's some problems with it. It's great for a musical sound but your version sounds like a better starting point to improve on legibility and musicality.

RobinSchmidt commented 4 years ago

it has a totally different character (thinner and a bit gnarly - i like the gnarl! a bit goa'esque). do you know how vocodex works? is it also spectrogram/stft based - or does it use the classical filter bank approach? edit: from the product description:

Up to 100 bands individually locatable anywhere in the spectrum. that probably means filterbank

elanhickler commented 4 years ago

Vocodex is the best vocoder for music. Here's more exaggerated example

So can you make a plugin with your vocoder?

RobinSchmidt commented 4 years ago

So can you make a plugin with your vocoder?

the algorithm is implemented as a non-realtime matlab file. in principle, it could be turned into a realtime algorithm (in fact, in rosic, i already have some sort of framework for realtime spectral processors - that factors out all of the messy re-buffering, windowing, overlapping, yadda-yadda business). but: this "true-spectral-envelope" algorithm is expensive. for a single frame, it iterates multiple fft/ifft roundtrips until convergence (typically 5-10 iterations). not really good for realtime performance. however - i have some other ideas for simpler spectral envelope estimation algorithms - based on connecting peaks by lines or splines - which should probably give similar quality
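The simpler "connect the peaks by lines" idea could be sketched as follows. This is not the iterative algorithm from the thesis. It is a rough illustration with a hypothetical function name, a naive local-maximum peak picker, and the assumption that bins outside the first and last peak just hold that peak's value:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rough spectral envelope of one frame: find local maxima of the magnitude
// spectrum and connect them by straight lines.
std::vector<double> peakEnvelope(const std::vector<double>& mag)
{
    const std::size_t N = mag.size();
    std::vector<std::size_t> peaks;
    for (std::size_t k = 1; k + 1 < N; ++k)
        if (mag[k] > mag[k - 1] && mag[k] >= mag[k + 1])
            peaks.push_back(k);
    if (peaks.empty())
        return mag;  // nothing to connect

    std::vector<double> env(N);
    for (std::size_t k = 0; k < peaks.front(); ++k)
        env[k] = mag[peaks.front()];          // hold first peak to the left
    for (std::size_t k = peaks.back(); k < N; ++k)
        env[k] = mag[peaks.back()];           // hold last peak to the right
    for (std::size_t i = 0; i + 1 < peaks.size(); ++i) {
        const std::size_t a = peaks[i], b = peaks[i + 1];
        assert(b > a);
        for (std::size_t k = a; k < b; ++k) {
            const double t = double(k - a) / double(b - a);
            env[k] = (1.0 - t) * mag[a] + t * mag[b];  // linear interpolation
        }
    }
    return env;
}
```

Replacing the linear interpolation with a spline would give a smoother envelope, at the cost of possible overshoot between peaks.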

elanhickler commented 4 years ago

Nevermind that! Explain how I can do some offline tests to see if it's at all viable for morphing two similar sound sources. I could make a function in SALT and use the scripting engine to play with it.

RobinSchmidt commented 4 years ago

ok - i just added all the .m files that i wrote back then for my thesis to my research repo:

to run the vocoder, you need to install octave:

and run this script:

you should get exactly the output i posted above (the wavefile is dropped into the signals folder). put your input files there, too and modify the "audioread" calls appropriately. note that you must also give it the fundamental frequencies of carrier and modulator (well - rough ballpark value is enough - it just scales the amount of envelope smoothing, if i remember correctly - mind you, i've not touched this code for 13 years!)

elanhickler commented 4 years ago

not sure what to do with this image

RobinSchmidt commented 4 years ago

what? why is there html code in it?! this is a matlab file! the first thing i'd recommend is to undock the command window from the center dock (i moved it to the right of the screen), so you can see the command window and editor window at once. then - why is your working folder D:/Desktop? isn't it supposed to be something:/RS-MET-Research/Prototypes/Octave/Thesis?

my screen looks like this (after running the script - btw: warning: when finished, the script plays the resulting audio):


elanhickler commented 4 years ago

I used the vocoder to create some in-between samples of guitar dynamics. It seems to work, maybe a few improvements could possibly be made, seems worth pursuing.

RobinSchmidt commented 4 years ago

interesting, non-standard use of a vocoder! :-O can you post some results? i'm curious to hear them

elanhickler commented 4 years ago

3 dynamics to 5 dynamics. The two in-between samples are the vocoder output plus I mixed in some of the original audio by hand and adjusted overall amplitude until it sounded right.

elanhickler commented 4 years ago

for some reason I am getting an infinite hang after deletion of the 2nd matrix in the code, I think it's the 2nd one if things are deleted in the order they appeared. The debugger isn't being helpful! Nothing seems out of the ordinary.


I'm trying to zero out harmonics (small ranges of frequencies)

for (int ch = 0; ch < channels; ++ch)
{
    // compute the complex spectrogram:
    WindowType W = WindowType::hanningZN;
    Spectrogram sp;
    sp.setBlockAndTrafoSize(blockSize, trafoSize);
    Matrix s = sp.complexSpectrogram(origAudio->audio->getReadPointer(ch), samples);

    // workaround to create the deep copies
    int numFrames = s.getNumRows();
    int numBins = sp.getNumNonRedundantBins(); // == s.getNumColumns()

    Matrix sl(numFrames, numBins);

    // collect the bin indices in a band of width fRange around each harmonic of f
    vector<int> binsToZero;
    for (double cf = f; cf < sampleRate * 0.5;)
    {
        int lo = sp.frequencyToBinIndex(cf - fRange * 0.5, sampleRate);
        int hi = sp.frequencyToBinIndex(cf + fRange * 0.5, sampleRate);

        for (int b = lo; b < hi; ++b)
            binsToZero.push_back(b);

        cf += f;
    }

    // zero out harmonic bandwidths to get only noise
    for (int i = 0; i < numFrames; i++)
        for (int b : binsToZero)
            s(i, b) = 0;

    // subtract noise only from original to get harmonics only
    vec x = sp.synthesize(sl);
    auto ptr = harmAudio->audio->getWritePointer(ch);
    for (int s = 0; s < samples; ++s)
        ptr[s] -= x[s];

    // transfer noise only
    ptr = noisAudio->audio->getWritePointer(ch);
    for (int s = 0; s < samples; ++s)
        ptr[s] = x[s];
} // deletion occurs here!
elanhickler commented 4 years ago

I just commented out the deletion function haha. Look at my noise/harmonic spectrograms: image

I'm using a frequency bandwidth of 15 Hz. That is INSANELY PRECISE! Your bidirectional filters could never do this.

Ok, I need to do another test though. It might be better to capture a large portion of frequencies per harmonic rather than the smallest possible portion. Again your bidirectional filters would fail with this task because it's not as flat (so there would be filter overlap, causing issues)

Also, this function is insanely fast. 1 or 2 seconds to process a 10 second stereo clip.

elanhickler commented 4 years ago

Here are the noise only waveforms:

tiny portion per harmonic: image

large portion per harmonic: image

So, the large portion is bad because you can see that the noise has a regularity to it, you can spot some oscillations. That's bad because it's going to interfere with the resynthesized harmonics when combining together. Looks like it's best to capture as tiny a range of frequencies per harmonic as possible.

Edit: Hmmm, but capturing not just a large portion but EVERYTHING (so there is no noise left), and then resynthesizing it might actually work better because you then don't need to combine noise/harmonics at the end.

Edit: I think this is what I originally wanted to do from the beginning but never could due to not having flat enough filters.

RobinSchmidt commented 4 years ago

with "portion-per-harmonic" you mean the bandwidth of the bandpass filters to isolate the harmonics?

the large portion is bad because you can see that the noise has a regularity to it, you can spot some oscillations

that makes sense. if the harmonic bandwidth gets wider, the "in-between" bandwidths of the noise bands get narrower. and narrow-bands have high sinusoidality/regularity

RobinSchmidt commented 4 years ago

I think this is what I originally wanted to do from the beginning but never could due to not having flat enough filters.

hmmmm...actually, butterworth filters are quite nicely flat. ...but maybe not steep enough? in this case, one could go for elliptics at the cost of ripples (in passband and stopband). the FFT based filters have ripples, too.

That is INSANELY PRECISE! Your bidirectional filters could never do this.

so - this is good, right? i still cannot see, what fixed FFT filters can do that regular time-domain filters can't. maybe i should take the challenge of separating some signal with filters vs FFT. like a sort of battle experiment...haha

elanhickler commented 4 years ago

bidirectional filters can't change frequency right? But you could kinda move the filter around in a spectral process by changing bin volumes. Moving filters are needed.

RobinSchmidt commented 4 years ago

yes - it's difficult to make time-varying bidirectional IIR filters while preserving their desirable zero-phase properties. with FIR filters it would be easier but computationally expensive, such that you'd probably end up doing some FFT based process here also (FFT convolution). but at the moment, we are talking about fixed filters, right? i mean the stuff you did above where you said that bidirectional filters could not do it. for time varying stuff - especially tracking harmonic frequencies - i'd also opt for spectrogram based algorithms.

elanhickler commented 4 years ago

I'm having some issues with phaselocking, basically the same issues I had last time with bidirectional filters except I think things are sounding better and easier to use. I don't understand how SampleModelling gets perfectly phaselocked samples.

Included is the original: image

phaselocked example attempts to capture the phase per harmonic with

auto p = RAPT::rsSinePhaseAt<float>(, x.size(), x.size() * 0.5);
RAPT::rsRecreateSine<float>(,, x.size(), currentf, currentf, sampleRate, p, 0);


phaselocked_singlePhase example sounds better because I don't take phase measurements but set [p]hase to "0" for all harmonics, whatever that means, but then it loses a lot of stereo information. image

I think RAPT::rsSinePhaseAt is not working well enough... or maybe I need to make a measurement exactly at a spectral peak, one of those spikes.

elanhickler commented 4 years ago

ALRIGHT YES! Taking the phase measurement precisely at the most energetic spot spectrally has improved the result: image

Now there's a weird issue with some amplitude modulation circled in red. No idea what that could be from.

elanhickler commented 4 years ago

Can you tell me how to, instead of zeroing out bins, manipulate the amplitude? First, retrieve the amplitude, and then change it to something else.

RobinSchmidt commented 4 years ago

each matrix entry is a complex number, so to get the amplitude, you can just call std::abs on it - the implementation for std::complex will extract the complex magnitude. you can also multiply the complex values by real numbers to change their magnitude. you may also want to look into std::arg and std::polar, if you want to deal with phase separately
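A small sketch of those calls on a single bin. The helper names are hypothetical, while std::abs, std::arg and std::polar are the standard library functions mentioned above:

```cpp
#include <cassert>
#include <complex>

// Return a bin with the same phase as the input but the given magnitude.
std::complex<double> withMagnitude(const std::complex<double>& bin, double newMagnitude)
{
    assert(newMagnitude >= 0.0);
    return std::polar(newMagnitude, std::arg(bin));
}

// Scale the magnitude by a real gain; for gain >= 0 the phase is unchanged.
std::complex<double> scaled(const std::complex<double>& bin, double gain)
{
    return gain * bin;
}
```

For a spectrogram entry, `std::abs(s(i, b))` would retrieve the amplitude, and assigning the result of one of the helpers back to `s(i, b)` would change it.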