Shrediquette / PIVlab

Particle Image Velocimetry for Matlab, official repository
https://shrediquette.github.io/PIVlab/
MIT License

Optimize memory usage #68

Closed Shrediquette closed 1 year ago

Shrediquette commented 2 years ago

We are now using a 25-megapixel camera for PIV, and I am experiencing huge delays when using more than 2 cores (8 physical cores available). This is most likely because virtual memory on the hard disk is used to move variables around; I can see this as a huge increase in hard disk activity in the task manager. This is a serious performance issue, but I am not sure how to solve it. Most likely, piv_fftmulti must be optimized to use fewer temporary variables. It may also be that fft2 and ifft2 simply need that much memory and no further optimization is possible (I ran the profiler with the memory option, and these two functions consume the most memory). I attached a plot of processing time vs. number of cores (with a maximum image size of only 6 megapixels, not 25), and it shows that fewer cores can be used in PIVlab than I would wish. The optimum is 5 cores on my laptop (8 physical cores, 16 GB RAM). Can we find the reason for this?

parallel_speed_vs_image_size.jpg
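A back-of-the-envelope estimate (sketched in Python for concreteness; the 64x64 window size and the no-overlap assumption are illustrative, not the actual PIVlab settings) shows why the FFT stacks fill RAM so quickly:

```python
# Rough memory estimate for the 3D correlation stacks (assumption:
# 64x64 interrogation areas, no overlap; real PIVlab settings differ).
pixels = 25_000_000                  # one 25-megapixel camera frame
win = 64                             # interrogation window size
n_windows = pixels // (win * win)    # ~6100 windows per frame

bytes_double = win * win * n_windows * 8    # one real double-precision stack
bytes_complex = win * win * n_windows * 16  # one complex FFT intermediate

# Two input stacks, two complex spectra, their product, and the inverse
# transform easily add up to several of these copies, per parallel worker.
print(f"{n_windows} windows, {bytes_double / 1e6:.0f} MB per real stack, "
      f"{bytes_complex / 1e6:.0f} MB per complex intermediate")
```

With several such copies alive at once in every worker, a handful of workers is enough to exhaust 16 GB and push the OS into swapping.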

Shrediquette commented 2 years ago

I tried splitting the huge images into smaller images and processing these parts in parallel. That somewhat reduces the problem, but it would require quite a lot of code rewriting. And even then, my computer reaches its top speed with 5 out of 8 cores, the CPU load doesn't rise above 55%, and RAM use is at about 50%. I have the feeling that everything could be faster, but where is the bottleneck?

drcswang commented 2 years ago

My guess is that the bottleneck is reading the image data into RAM. Only one process at a time can access the I/O device, so the other processes have to wait; increasing the number of cores also increases the queuing time. I've tested the performance of the code on a hard disk drive and on an SSD, and the latter is significantly faster. Anyway, I'll look at this issue and see if there is a better solution that doesn't require an SSD.

quynhneo commented 2 years ago

I'd like to reproduce the above plot on my computer. Do you have the test script and images?

Shrediquette commented 2 years ago

Hi, I have an NVMe SSD, and its load is constantly at zero percent as long as the RAM is not full. Then it suddenly jumps to 100%...

The test script is here; it does everything automatically, but takes some time to run the tests: https://github.com/Shrediquette/PIVlab/tree/main/test_parallel_performance

And here are my results with this script: Williams_results

Shrediquette commented 2 years ago

I think the questions I would like to answer are the following:

Shrediquette commented 2 years ago

Here is another test I ran last night with larger images. For large images, the highest computation speed is achieved by not using parallel computing at all: parspeed2

Shrediquette commented 2 years ago

From some more tests, I conclude that the main issue is the size (in bytes) of result_conv and of the input matrices to the cross-correlation, image1_cut and image2_cut. By using single instead of double precision for these variables, the RAM consumption can be halved.

As far as I can tell, the root cause has nothing to do with parallel computing; parallel computing just makes the symptoms worse.

I wrote a script that demonstrates the drop in speed when the matrices for the 3D cross-correlation become too large (see attachments). I will ask a question on "MATLAB Answers" for ideas on how to further prevent this drop in speed. Maybe there is a way to check the available memory on the fly and then divide the 3D matrices into smaller pieces and process them in series? That might work...

Here is the code for testing the effect of the size of the input variables of the cross-correlation:


clear all %#ok<CLALL>
close all
clc

% Usually, in PIV, we have input image A and input image B which are
% captured with a short pause in between. These images are then cut into
% small pieces of e.g. 64x64 pixels. Each of these "interrogation areas"
% in image A is then cross-correlated with the same part from image B.
% With the cross-correlation code in PIVlab, it is possible to do this for
% a 3D matrix (saves a lot of time, but apparently runs into RAM problems
% when matrices are large).

counter=0;
stack_sizes=1000:10000:200000; % these numbers are fine to demonstrate the effect on a 16 GB RAM laptop
calc_time=zeros(1,numel(stack_sizes)); % preallocate the result vectors
time_per_subimage=zeros(1,numel(stack_sizes));
for size_of_the_stack=stack_sizes
    %% generate some quick + dirty "particle" image pairs, arranged in a 3D matrix
    A=rand(64,64,size_of_the_stack);
    A(A<0.98)=0;
    A = imgaussfilt(A,0.9);
    B = circshift(A,round(rand*10)); % displace the second image
    %% converting to single already saves 50% of memory, apparently without negative effects
    A=single(A);
    B=single(B);
    %% do the cross-correlation with FFT
    result_conv = zeros(size(A),'single'); % same starting conditions for all repetitions of the loop %#ok<PREALL>
    tic
    result_conv = fftshift(fftshift(real(ifft2(conj(fft2(A)).*fft2(B))), 1), 2); % cross-correlation code in PIVlab
    counter=counter+1;
    calc_time(counter)=toc;
    time_per_subimage(counter)=calc_time(counter)/size_of_the_stack;
end
%% plot results
bar(stack_sizes,time_per_subimage*1000)
grid on;
xlabel('stack size')
ylabel('time per correlation in ms')

The time per cross-correlation suddenly increases when RAM is full: time_per_image_pair

The hard disk activity also goes up when RAM is full (virtual memory is used, I guess, which slows the process down): task
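The on-the-fly chunking idea can be sketched as follows (in Python/NumPy for illustration, not PIVlab code; the `max_bytes` budget and the chunking logic are my assumptions): split the 3D stack along the third dimension so that each chunk's FFT intermediates stay under a memory budget, and correlate the chunks in series.

```python
import numpy as np

def chunked_xcorr(A, B, max_bytes=256 * 2**20):
    """Cross-correlate two h-by-w-by-n stacks chunk by chunk, so that the
    complex FFT intermediates stay under a memory budget (sketch only)."""
    h, w, n = A.shape
    # rough cost per slice: complex64 spectra (8 bytes/element), a few copies
    per_slice = h * w * 8 * 4
    step = max(1, max_bytes // per_slice)
    out = np.empty_like(A)
    for i in range(0, n, step):
        a, b = A[..., i:i + step], B[..., i:i + step]
        # same correlation as PIVlab's fftshift(fftshift(real(ifft2(...)),1),2)
        c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(a, axes=(0, 1)))
                                 * np.fft.fft2(b, axes=(0, 1)), axes=(0, 1)))
        out[..., i:i + step] = np.fft.fftshift(c, axes=(0, 1))
    return out
```

Because each chunk is independent, the result is identical to correlating the whole stack at once; only the peak memory use changes.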

Shrediquette commented 2 years ago

The Matlab Answers discussion is here: https://de.mathworks.com/matlabcentral/answers/1452894-sudden-drop-in-speed-for-large-matrices-ram-full-prevention

ErichZimmer commented 2 years ago

Have you tried real FFTs? They use less memory and are typically faster.

Shrediquette commented 2 years ago

Have you tried real FFTs? They use less memory and are typically faster.

I might already do that:

result_conv = fftshift(fftshift(real(ifft2(conj(fft2(image1_cut)).*fft2(image2_cut))), 1), 2);

However, this code was written and tested a decade ago, and I think I tested all kinds of methods extensively for speed and accuracy at that time... Btw, I wrote you an email, did you receive it?

ErichZimmer commented 2 years ago

Not quite. Maybe something like

result_conv = fftshift(fftshift(real(irfft2(conj(rfft2(image1_cut)).*rfft2(image2_cut))), 1), 2);

but I don't have a MATLAB license so I can't test it out. My python GUI (based on openpiv_tk_gui) uses

f2a = conj(rfft2(image_a))
f2b = rfft2(image_b)
r = f2a * f2b
norm = np.sqrt(np.absolute(f2a) * np.absolute(f2b))
r = np.divide(r, norm, where=(norm != 0))
corr = fftshift(irfft2(r).real, axes=(-2, -1)) * 100  # scaled so it is more compatible with float32

and it consequently uses less memory when doing parallel processing on low-performance laptops with less than 2 GB of available RAM.
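For reference, a plain (unnormalized) version of the real-FFT correlation can be sketched in NumPy; MATLAB has no built-in `rfft2`/`irfft2`, so this shows the Python equivalent and where the memory saving comes from (the truncated last axis of the real-input spectrum):

```python
import numpy as np

def rfft_xcorr(a, b):
    """Cross-correlation via real FFTs (sketch). rfft2 stores only ~half the
    spectrum of a real input, so the complex intermediates take roughly half
    the memory of a full fft2."""
    f = np.conj(np.fft.rfft2(a)) * np.fft.rfft2(b)
    # irfft2 needs the original shape to undo the truncated last axis
    return np.fft.fftshift(np.fft.irfft2(f, s=a.shape))

def fft_xcorr(a, b):
    """Full-spectrum reference, mirroring the PIVlab correlation line."""
    return np.fft.fftshift(np.real(np.fft.ifft2(np.conj(np.fft.fft2(a))
                                                * np.fft.fft2(b))))
```

For real inputs the product of the two spectra is Hermitian-symmetric, so both functions return the same correlation plane; only the intermediate storage differs.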

ErichZimmer commented 2 years ago

On PIV Challenge 2014 case B with interrogation windows of 64x64, 32x32, and 16x16 at 50% overlap and 3rd-order bivariate spline deformation, I get 0.2508 vec/ms. In comparison, Fluere achieves 0.2410 vec/ms with 7th-order sinc deformation and the same windowing/overlap.

ErichZimmer commented 2 years ago

@Shrediquette Regarding "Btw, I wrote you an email, did you receive it?": sorry for the late response, but it seems that I have not received an email on or around November 5 (± 5 days).

Shrediquette commented 1 year ago

Solved! https://github.com/Shrediquette/PIVlab/commit/0dfe5f609db8c7e3ae7902ced91fe158574f9213