dayalannair / FMCW_FFT_Radar

Implementation of the Xilinx FFT IP using a UART packetised data stream
MIT License
7 stars 1 forks source link

Simulation timing accuracy #36

Open dayalannair opened 2 years ago

dayalannair commented 2 years ago

Hi @jpt13653903 I've noticed this before when tutoring EEE4120F, the timings of FPGA functions are always much faster than MATLAB even for small data sets. For example, my simulation indicates that it takes 25 us from the time the first sample is sent to the FFT to the time the last output is received (excl UART ofcourse). The matlab FFT, considering the operation only, takes about 1 ms. Surely the FPGA cant be this much faster for such a small set?

jpt13653903 commented 2 years ago

How many point FFT are you running? The Matlab below shows me that a Matlab 256-point FFT takes 1.07 μs.

close all
clear all
clc

format compact
format short eng
%% -----------------------------------------------------------------------------

N = 256
x = complex(randn(1, N), randn(1, N));

start = tic()

for n = 1:1e6
  X = fft(x);
end

Time = toc(start) / 1e6
%% -----------------------------------------------------------------------------
jpt13653903 commented 2 years ago

If I move the random input generator to inside the test loop it goes up to 7 μs, but that's because the random number generator is expensive.

jpt13653903 commented 2 years ago

Another interesting experiment (see below). You'll notice that the time only converges when you have a test loop size of at least 1000. You have to have it that big to cater for the overhead involved otherwise.

close all
clear all
clc

format compact
format short eng
%% -----------------------------------------------------------------------------

N = 256;
M = [1 10 100 1e3 10e3 100e3 1e6 10e6];
Time = [];
%% -----------------------------------------------------------------------------

for m = M
  x = complex(randn(1, N), randn(1, N));
  start = tic();

  for n = 1:m
    X = fft(x);
  end

  Time = [Time (toc(start) / m)];
end
%% -----------------------------------------------------------------------------

figure;
loglog(M, Time); grid on;
xlabel('Loop Size');
ylabel('Time [s]');
%% -----------------------------------------------------------------------------
dayalannair commented 2 years ago

Interesting...will look into this. Surely using a loop contributes to the execution time? Especially since its an interpreted language.. So the main difference is that Im doing the FFT row wise and not averaging using a for loop. This should still give the correct time, atleast once in multiple runs? image Realistically I would use the code below, as this operation is also done be the FFT black box on FPGA: image

jpt13653903 commented 2 years ago

Matlab post 2013 does a just-in-time compilation, so it's not "really" interpreted, but if you're worried about the loop, run the one below. Mine does 1.2 μs per row. Point is, to make the overhead negligible, the number of FFTs you need to run is MANY.

close all
clear all
clc

format compact
format short eng
%% -----------------------------------------------------------------------------

N = 256;
M = 1e6;
%% -----------------------------------------------------------------------------

x = complex(randn(M, N), randn(M, N));
start = tic();

X = fft(x, [], 2);

Time = toc(start) / M
%% -----------------------------------------------------------------------------
jpt13653903 commented 2 years ago

And, just for interest:

close all
clear all
clc

format compact
format short eng
%% -----------------------------------------------------------------------------

N = 256;
M = [1 10 100 1e3 10e3 100e3 1e6];
Time = [];

for m = M
  x = complex(randn(m, N), randn(m, N));
  start = tic();

  X = fft(x, [], 2);

  Time = [Time (toc(start) / m)];
end
%% -----------------------------------------------------------------------------

figure;
loglog(M, Time/1e-6); grid on;
xlabel('Number of runs');
ylabel('Time [\mu{s}]');
%% -----------------------------------------------------------------------------

image

dayalannair commented 2 years ago

Interesting... should I then report on the average/over a number of runs execution time? In a real system, where would the timing lie, i.e. does it have an initial slow performance and then settle or will it always run in the low performance region? Not too sure what the overheads relate to and if some processes are being cached even though the data is changing.

jpt13653903 commented 2 years ago

The overheads include, but is not limited to:

In a real system (with a Matlab-based processor) you'll get relatively consistent performance, because you'll process a burst (or multiple bursts) at a time -- for this exact reason. It's more efficient to process more at a time. You also have less OS-related overhead, because the thread doing the processing is consistently busy, so the OS will schedule accordingly.

For a GPU-based implementation you'll typically perform manual scheduling to hide data transfer latency, so you'll get relatively consistent performance there as well. In this case you'll most likely be processing burst to burst, as with the CPU implementation.

In FPGA you are only constrained by the clock and the number of multipliers your circuit is using. You do not have any overhead of the CPU / GPU nature, because everything happens in parallel.

dayalannair commented 2 years ago

Great information, thanks. I found that averaging the runtime for 1000 runs results in between 1.1 and 2.2 μs (for FFT on the same sweep). Multiple runs in using a loop keeps the OS occupied/thread open/cache warm and so overheads are reduced. Running it each time requires a new thread to be opened or something like that. Will look into it some more, though your info is good. Probably very unnecessary, but should i run the fft on different sweeps each time instead of the same one?

jpt13653903 commented 2 years ago

should i run the fft on different sweeps each time instead of the same one?

Depends on your compiler, and what optimisation it's doing.... So it is very difficult to answer.

Ideally you would gather a burst of sweeps, and then process the burst. So, to mimic that, you would probably want to run different sweeps.

jpt13653903 commented 2 years ago

Running it each time requires a new thread to be opened or something like that.

I don't believe that. New thread creation is very expensive. Matlab specifically have only one thread (or at least last time I checked).

This said, they are using the FFTW library as an engine, and it could be multithreaded during the operation... So maybe you're right...

dayalannair commented 2 years ago

should i run the fft on different sweeps each time instead of the same one?

Depends on your compiler, and what optimisation it's doing.... So it is very difficult to answer.

Ideally you would gather a burst of sweeps, and then process the burst. So, to mimic that, you would probably want to run different sweeps.

This makes sense. For the FPGA we can assume that the runtime is exactly the same for every sweep (deterministic) so would ofcourse not require averaging.