csu-hmc / perturbed-data-paper

A paper on an elaborate gait data set.
https://peerj.com/articles/918/

Frequency spectra are noisy #77

Closed tvdbogert closed 9 years ago

tvdbogert commented 9 years ago

@spinningplates @moorepants I'm having trouble downloading the data (it stalls before finishing; I'm on my 4th attempt now), so I could not play with this myself. I was able to look at the code and the graphs that were there.

The spectra give some idea of the frequency content, but I think they should be smoother.

Because of the long duration of the signal, you get an extremely high resolution in the frequency domain. A signal of 10000 seconds has a resolution of 1/10000 Hz. You don't need such a high resolution. We don't need to know if the signal strength of the 1 Hz component of the signal is different from the 1.0001 Hz component.
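In other words, the bin spacing of the DFT is the reciprocal of the record length:

    \Delta f = \frac{1}{T_{\mathrm{record}}} = \frac{1}{10000\ \mathrm{s}} = 10^{-4}\ \mathrm{Hz}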

A few things I would try:

  1. No zero padding. You can either truncate the signal at the largest power of 2 that is smaller than the signal length, or simply ask the FFT to use the actual number of samples rather than a power of 2. This will be slower but perhaps still fast enough. If this eliminates the noise, there is no need to go further.
  2. Do the spectrum of many shorter pieces of the signal (short enough that you get the required frequency resolution) and then average those. I found an interesting example here: http://www.dspguide.com/ch9/1.htm

You could also use the current method and then smooth the spectrum before plotting it, but I'm not sure if that's considered valid.
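A minimal sketch of option 2 above, assuming the signal is a row vector x (the variable names here are illustrative, not from our scripts):

    % Average the amplitude spectra of shorter pieces (Bartlett-style).
    % Assumes x is a row vector; Tpiece*Fs should be even.
    Fs = 100;                       % sample rate in Hz (assumed)
    Tpiece = 10;                    % piece length in s -> 0.1 Hz resolution
    M = Tpiece*Fs;                  % samples per piece
    L = floor(length(x)/M);         % number of whole pieces
    P = zeros(1, M/2 + 1);
    for k = 1:L
        seg = x((k-1)*M+1 : k*M);
        Y = abs(fft(seg))/M;        % amplitude spectrum of this piece
        P = P + Y(1:M/2+1);         % accumulate the one-sided half
    end
    P = P/L;                        % average over the pieces
    f = (0:M/2)*Fs/M;               % frequency axis in Hz
    plot(f, 2*P)                    % 2x to account for the discarded half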

moorepants commented 9 years ago

I just sent you trial 079.

Currently the spectra don't really give a "nice" picture of the frequency content. I don't think smoothing them is a bad idea, and I don't think it would invalidate what we want to show.

Using a shorter duration of the signal would probably be fine.

tvdbogert commented 9 years ago

Thanks for sending the file. I will try a few things on this.


tvdbogert commented 9 years ago


This is not as easy as I thought. It's not completely clear to me yet, but here's what I found so far.

  1. Computation speed is still very fast when the "nextpow2" is eliminated, and eliminating it made no difference in the result. So the zero padding did not cause any trouble, but the code can be a little simpler without it.
  2. The output signals came from the record module, so they were not sampled at 100 Hz; they are not even sampled at a constant rate. I resampled them at 100 Hz and at 1000 Hz, then did the FFT. This shifted the spectrum somewhat to the right and also eliminated the peak at 3.333 Hz, which I think is better. Resampling at 100 or at 1000 Hz seemed to make no difference.
  3. On a log-log plot, you can see that the spectrum dips at 0.02344 Hz and its multiples. This is where most of the "noise" comes from. That frequency corresponds to a 42 s period and is probably due to the file having two sections with constant speed. This noise is gone when I use only the 8 minutes of the perturbation signal. There is still noise left, though.
  4. Smoothing the spectrum with a moving average makes it look "nice" while keeping the information content we want to communicate. I am pretty sure that the moving average is mathematically equivalent to doing the spectrum on many short pieces of the file and averaging those. Now I wonder if the spectrum would even look OK if I used the entire original file, with the zero belt speed and constant belt speed sections included. Maybe, and that would simplify things somewhat.

Here is a plot and the code that produced it (this does not show that I selected only the 8 minutes of perturbed walking).

[figure: amplitude spectrum of slow walking (0.8 m/s), unsmoothed vs. smoothed]

    % resample to 1000 Hz (output_slow and output_slow_time are loaded from the trial data)
    Fs = 1000;
    T = 1/Fs;
    output_slow = interp1(output_slow_time, output_slow, ...
                          output_slow_time(1):T:output_slow_time(end));

    % FFT of the full resampled signal, with no zero padding
    L_output_slow = length(output_slow);
    NFFT_output_slow = L_output_slow;
    fi_output_slow = Fs/2*linspace(0, 1, NFFT_output_slow/2 + 1);
    Y_output_slow = abs(fft(output_slow, NFFT_output_slow))/L_output_slow;

    % smooth the spectrum by doing a moving average
    window = 0.1;                                  % smoothing window in Hz
    Nwindow = round(window/(Fs/NFFT_output_slow)); % window size in frequency bins
    smY_output_slow = conv(Y_output_slow, ones(Nwindow,1)/Nwindow, 'same');

    figure(2)
    plot(fi_output_slow, 2*Y_output_slow(1:NFFT_output_slow/2+1), 'Color', [0.2157 0.4706 0.7490])
    hold on
    plot(fi_output_slow, 2*smY_output_slow(1:NFFT_output_slow/2+1), 'r')
    legend('Output, not smoothed', 'smoothed')
    title('Slow Walking (0.8 m/s)', 'Fontweight', 'bold')
    ylabel('Amplitude', 'Fontweight', 'bold')
    ylim([0 0.003]); xlim([0 8])

humphreysb commented 9 years ago

I'm downloading the data now (should have taken you up on the USB offer). Couple of comments on this because I noticed the spectrum in the paper too (all of this you may know or may know in more detail, but I thought I would add my 2 cents):

1) "Standard" practice, as known to me, is to average the individual shorter time spectra together (referred to as spectral averaging -> think ensemble averaging). The duration of each average is referred to as the temporal resolution (frequency resolution is then the df inside each fft or PSD). You can also min/max envelope to show variation. I don't think I have seen smoothing done by moving average across the spectrum itself. Spectral averaging is good for beating down transient noise as it gets "averaged out". But for the same reason spectral averaging should only be used on data that is fully periodic data and is "stationary" (non-changing mean).

2) I have not looked through the code yet, but my guess is that you are not demeaning the data and/or not applying a Hanning/Hamming window? The 0 Hz spike seems to indicate that.

3) For amplitude - Why not scale the fft to get to PSD units?

4) We almost never worry about a power-of-2 nfft any more. The power-of-2 NFFT was dictated by speed, not accuracy, and faster algorithms and computers have alleviated those limitations. No need to worry about zero padding or clipping data lengths to size anymore.

5) If you do need to use the full time history in one spectrum, there are a couple of techniques, such as the cumulative power spectral density, that can give you a nice smooth plot of RMS vs. frequency.

I've got a bunch of code that does all of this and will run the data through it. If you want to make a bunch of PSDs to average together quickly, take a look at spectrogram.m and pwelch.m.
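For example, a minimal pwelch call would look like this (a sketch; x, fs, and the nfft value here are assumed/illustrative):

    % Welch-averaged one-sided PSD: 50%-overlapped Hamming windows
    nfft = 1024;
    [pxx, f] = pwelch(x, hamming(nfft), nfft/2, nfft, fs);
    loglog(f, pxx)
    xlabel('Frequency (Hz)')
    ylabel('PSD (signal units^2/Hz)')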

moorepants commented 9 years ago

@humphreysb Thanks for these tips. I didn't know much of this. How do you do #1? Do you have code or a reference that shows this?

moorepants commented 9 years ago

Here is a short script in my issue-77 branch that has the hamming window and Ton's moving average:

https://github.com/csu-hmc/perturbed-data-paper/blob/issue-77/src/treadmill_motion.py

It produces these two figures:

figure_1 figure_2

The first shows what a moving average of 20 points does on the un-windowed FFT. The second shows the difference between Hamming windowing and no windowing (moving average applied in both). The Hamming window doesn't seem to clean up the spectrum much. This is only from the 8 min of perturbation data sampled at 100 Hz, and this is trial 20; I didn't do 79 (like Ton did) for some reason.

moorepants commented 9 years ago

@humphreysb

3) For amplitude - Why not scale the fft to get to PSD units?

I think we'd like amplitude in the units of the signal. What are PSD units and why are they useful?

4) We almost never worry about a power-of-2 nfft any more. The power-of-2 NFFT was dictated by speed, not accuracy, and faster algorithms and computers have alleviated those limitations. No need to worry about zero padding or clipping data lengths to size anymore.

Good to know. All examples I've ever come across have all this. Do you have a nice example of doing this without all that?

5) If you do need to use the full time history in one spectrum, there are a couple of techniques, such as the cumulative power spectral density, that can give you a nice smooth plot of RMS vs. frequency.

Example?

I've got a bunch of code that does all of this and will run the data through it. If you want to make a bunch of PSDs to average together quickly, take a look at spectrogram.m and pwelch.m.

Where are these files?

humphreysb commented 9 years ago

This may have turned into so much more than you want to know...

I have MATLAB code, but no Python. In MATLAB, spectrogram.m (in the Signal Processing Toolbox) can make the series of FFTs (actually PSDs) that you can then average across. I think there should also be a function in base MATLAB still around called specgram.m. pwelch and specgram return pretty good results; we tweaked them for scaling issues we found and wrapped a bunch of analysis around them.

So the window does two things, both related to the "start-up" and "stop" transients at the beginning and end of the signal: 1) If the first sample (or the last) is a non-zero value, due to where you "grabbed" the data, it has an artificial offset, and so too will the spectrum. 2) The offset also "rings" the spectrum like an impulse transient. In an unaveraged spectrum, it will show up as high-frequency ringing. Keep in mind, both of these are a function of which sample you chose to start with in any FFT. Windows such as Hanning or Hamming "pinch" the ends of the signal to zero. Ahhh, consistency! There are then scaling factors that you multiply the FFT by to adjust for the pinching.
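For instance, a quick sketch of applying a Hanning window with the usual amplitude correction (x is an assumed signal vector; N/sum(w) is the standard coherent-gain correction factor):

    % "Pinch" the signal ends to zero, then fix the amplitude scaling
    N = length(x);
    w = hanning(N);
    xw = x(:).*w;        % windowed signal (column vector)
    Y = abs(fft(xw))/N;  % raw amplitude spectrum
    Y = Y*N/sum(w);      % window amplitude correction (about 2 for Hanning)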

Power of 2 was needed when FFT algorithms used a scheme called radix-2, and even then it was only really a speed consideration. Here is a good review of the algorithms, but I might also be able to find something in a textbook that you can reference: http://www.mathworks.com/company/newsletters/articles/faster-finite-fourier-transforms-matlab.html Functions such as specgram, where speed is paramount and signal length (zero padding) is not a concern, still use a power-of-2 NFFT. (You can just not use the last PSD returned from specgram when you spectral average.)

Raw FFT units are notoriously mis-scaled. Oftentimes they don't take into account the 1/Nfft or 1/fs that is embedded in the FFT algorithm. We almost never report them. While the units look convenient, they're just problematic. You are interested in where the power resides in the spectrum. Some people say "who cares about the units", but then why not just do it correctly and use PSD units? If you want to have some fun, run a signal with a couple of pure sinusoids and some random Gaussian noise through an FFT algorithm with raw units output; it probably won't match your expectations. That's why you mostly see PSD units used when there is random content.

Units: FFT: vel (e.g., vel = m/s). PSD: vel_rms^2/Hz <---- not intuitive units, but that's power, and it tells you where the signal's power resides.

So what if you want more intuitive units? If you take the square root of the area under a section of the PSD, you get units of vel_rms (this is Parseval's theorem). These are signals with random components, so RMS values are appropriate. What if you start at 0 Hz and open up more and more of that area? You get what is called the cumulative power spectral density.
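In symbols, with P(f) the one-sided PSD, the RMS in a band and the cumulative curve are:

    \mathrm{RMS}(f_1, f_2) = \sqrt{\int_{f_1}^{f_2} P(f)\,df}, \qquad C(f) = \sqrt{\int_{0}^{f} P(\nu)\,d\nu}

so C evaluated at the Nyquist frequency recovers the RMS of the whole (stationary) signal.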

I'll look at the data from the report. For illustration, I grabbed a quick example from data I took tap-testing some hardware to determine its first flexible-body mode (this breaks some of my own comments above, as I was measuring a transient). See how you get nice vertical lines in the cumulative accel plot where there is energy? (If the signal were periodic and not a transient, the value in the octave-band plot at the bottom would exactly match the RMS value of the time history, but alas, I am using a transient.)

screenshot from 2014-12-16 12 07 28

I can see if I can get together some MATLAB code that does this and get it to you. Is MATLAB OK? Do you have the Signal Processing Toolbox? I may punt on the rest of the review, though, to get it done.

I'll also see if I can get the in-house code that we developed here to you. We (another engineer and I) rolled our own functions and validated them. Years of work in that. I have always wanted to check into whether I could release it into the public domain, but have never done it. I might be able to.

moorepants commented 9 years ago

Thanks Brad, that is helpful and not more than I wanted to know. I always like knowing more :)

But, I'm not sure if we are trying to show what a periodogram shows. We simply want to give an idea of the frequency content of the signal and how far away from white noise it is.

And yes, any Matlab code you have is fine. But before you spend any time writing code to process this data, we need some clarity on what we are trying to show.

BTW, this is Python's equivalent to spectrogram: http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.signal.periodogram.html

Can you clarify the fundamental differences between a periodogram/spectrogram and the Fourier transform plot?

Also, note that we are not worried about the speed of calculation here. All we are doing is making this one figure.

moorepants commented 9 years ago

Here is a suggestion on smoothing:

The variance problem can be reduced by smoothing the periodogram. Various techniques to reduce spectral bias and variance are the subject of spectral estimation.

One such technique to solve the variance problem is known as the method of averaged periodograms[8], or Bartlett's method. The idea behind it is to divide the set of N samples into L sets of M samples, compute the discrete Fourier transform (DFT) of each set, square it to get the power spectral density, and compute the average of all of them. This leads to a decrease in the standard deviation as 1/sqrt(L).

From: http://en.wikipedia.org/wiki/Periodogram

tvdbogert commented 9 years ago

Doesn't Matlab have a tool that does the periodogram this way? That would be best, but we may lose Octave compatibility.

This suggestion was also in the article I linked to initially. But, as I said before, I am very confident that this is mathematically equivalent to what I did: take the noisy, high-resolution spectrum of the entire data record and then apply a moving average to reduce the resolution in the frequency domain.

I don't think we need to worry about the vertical axis, but PSD in proper units would be great if we can do it. We would need to take the square of abs(fft), I think, and then divide by NFFT. Or something like that.
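Concretely, a standard one-sided PSD scaling would be something like this (a sketch, assuming a signal vector x of N samples at sample rate Fs; for odd N the last bin is not exactly the Nyquist bin):

    % One-sided PSD in (signal units)^2/Hz
    N = length(x);
    X = fft(x - mean(x));            % demean to avoid the 0 Hz spike
    Pxx = (abs(X).^2)/(Fs*N);        % two-sided PSD
    Pxx = Pxx(1:floor(N/2)+1);       % keep 0 .. Nyquist
    Pxx(2:end-1) = 2*Pxx(2:end-1);   % fold in the negative frequencies
    f = (0:floor(N/2))*Fs/N;         % frequency axis in Hz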

humphreysb commented 9 years ago

I made up some example code:

fs=1000;  %Sample Rate (sa/sec)
t=0:1/fs:10;  %Time Vector

%Make 3 sine waves
y1=1*sin(10.*t.*2.*pi  + pi*rand(1));  %10Hz with random phase
y2=1*sin(50*t*2*pi + pi*rand(1));  % 50Hz with random phase
y3=1*sin(100*t*2*pi + pi*rand(1));  %100Hz  with random phase

r=rand(1,length(t))*0.1;  %Random noise, uniform (set it to 0 to see no noise)

x=y1+y2+y3+r;  %Add the three sinusoids together.
               %To check scaling, just use one at a time.  More than one
               %sinusoid may cause constructive/destructive interference.
               %I think this is also where we found some scaling issues
               %with MATLAB's algorithm.

x=x-mean(x);  %If you comment out this line, you will get the spike at 0Hz

nfft=1000;  %Each FFT will have a second of data in it
w=nfft;  %The window to use; this defaults to a Hamming window
noverlap=nfft/2;  %Overlap the fft by 50%.  Since we are overlapping 50%,
                  %we will have a temporal resolution of 0.5 sec.
                  %(With noverlap=0, the temporal resolution = 1 sec
                  %because of the nfft.)

[s,f,tpsd,p] = spectrogram(x,w,noverlap,nfft,fs);

pmean=mean(p,2);  %Spectral averaging (mean across the individual PSDs)

grms=cumtrapz(f,pmean);  %Parseval's Theorem
grms=(grms.^0.5);

%%  Plotting
subplot(3,1,1)
plot(t,x);
ylabel('Accel (g)')

subplot(3,1,2)
plot(f,pmean)
set(gca,'xscale','log')
set(gca,'yscale','log')
ylabel('PSD (g_r_m_s^2/Hz)')

subplot(3,1,3)
plot(f,grms)
set(gca,'xscale','log')
xlabel('Frequency (Hz)')
ylabel('Cumulative RMS (g_r_m_s)')

%% Make a spectrogram
figure
spectrogram(x,w,noverlap,nfft,fs);

[image: psd]

[image: spec]

The averaging: [image: aver]

moorepants commented 9 years ago

The two graphs Sandy created use Matlab, not Octave. Right now this paper depends on Matlab, and that's fine.

I'm fine with the moving average too, just as long as we say that is what we did. But after reading more, I don't think de-noising this is trivial: there are lots of methods, and each one sacrifices something.

moorepants commented 9 years ago

I was wrong about the periodogram. It is different from a spectrogram (which does it over intervals like you've shown). Here is a Python version I found: http://matplotlib.org/api/mlab_api.html#matplotlib.mlab.specgram

moorepants commented 9 years ago

This is what I got running a variation of Brad's code on our data:

figure_4

Code is in the issue-77 branch.

moorepants commented 9 years ago

It sure does smooth the hell out of it. Maybe I'm doing something wrong.

tvdbogert commented 9 years ago

Looks like you did the spectrum on 480 pieces of 1 second each. A spectrum of a 1-second signal has a resolution of 1 Hz, and that's too coarse; we want something like 0.1 Hz resolution. So make 48 pieces of 10 seconds each, do the spectrum on each, and average those.
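(The 8 min record is 480 s, so:)

    T_{\mathrm{piece}} = \frac{480\ \mathrm{s}}{48} = 10\ \mathrm{s} \quad\Rightarrow\quad \Delta f = \frac{1}{T_{\mathrm{piece}}} = 0.1\ \mathrm{Hz}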

moorepants commented 9 years ago

Here are 48 sections of 1000 samples each.

figure_4

Looks better and quite smooth.

tvdbogert commented 9 years ago

That's very nice and also consistent with the moving-averaged full-resolution spectrum. I would suggest not using the logarithmic vertical axis, to show more clearly which frequencies were in the signal and which weren't. That's how Sandy had it originally. The high frequencies get too much attention on the log scale.

@spinningplates @moorepants you guys can decide who does the code for the Figure that goes in the paper.

moorepants commented 9 years ago

Here is the plot with no log scales.

figure_4

tvdbogert commented 9 years ago

Now the peak is narrow, which is good, but the frequency resolution should be better to show the peak well (right now it looks like one data point defines the peak).

So, maybe 20-second or 40-second pieces?

moorepants commented 9 years ago

20-second sections: [image: 20sec]

40-second sections: [image: 40sec]

tvdbogert commented 9 years ago

20 seconds has my vote.

moorepants commented 9 years ago

@spinningplates Do you want to use Brad's code to change your plots? You can use the code I have in branch issue-77 for reference on the numbers I used, or you can modify it: https://github.com/csu-hmc/perturbed-data-paper/blob/issue-77/src/treadmill_motion.py

skhnat commented 9 years ago

Yes, I will try to use these suggestions to update the plots.

moorepants commented 9 years ago

PR #98 closes this.