PollyNET / Pollynet_Processing_Chain

NRT lidar data processing program for multiwavelength polarization Raman lidar network (PollyNET)
https://polly.tropos.de/
GNU General Public License v3.0

Make plot generation faster #75

Open martin-rdz opened 4 years ago

martin-rdz commented 4 years ago

Currently, generating the plots from the results takes roughly 5 min for 6 h of observations (at least for the 13-channel LACROS system on rsd2). Profiling reveals that a single colorplot takes approximately 4-6 s with the current setup.

Switching to an alternative matplotlib backend (quick test with gr) did not provide improvements.
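For context, a minimal timing harness for a single colorplot looks roughly like the sketch below; `height`, `times` and `signal` are made-up stand-ins, not Picasso's real arrays, and the non-interactive Agg backend is forced explicitly so no GUI overhead is included.

```python
# Time one synthetic colorplot end-to-end (illustrative sketch only).
import time
import matplotlib
matplotlib.use('Agg')                      # non-interactive backend, no GUI overhead
import matplotlib.pyplot as plt
import numpy as np

height = np.arange(0, 15000, 7.5)          # hypothetical range bins [m]
times = np.arange(0, 720)                  # hypothetical 30 s profiles over 6 h
signal = np.random.rand(height.size, times.size)

t0 = time.perf_counter()
fig, ax = plt.subplots(figsize=(10, 5))
pcm = ax.pcolormesh(times, height, signal, cmap='jet')
fig.colorbar(pcm, ax=ax)
fig.savefig('test_colorplot.png', dpi=150)
plt.close(fig)
print(f'one colorplot: {time.perf_counter() - t0:.2f} s')
```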

Further ideas:

Opinions? Suggestions?

ZPYin commented 4 years ago

Hi Martin, thanks for your efforts and suggestions.

I knew quite some time ago that the data visualization speed would become a hidden issue, as the plots are massive compared with many other projects. The motivation for choosing matplotlib was also its much higher efficiency for data visualization. Therefore, I'm quite eager to try your suggestions to improve the efficiency, taking into account the expansion of PollyNET in the upcoming years.

Regarding your test results for each single plot, it seems to take much longer than I expected. I would suggest checking whether there were multiple tasks running in the background consuming CPU resources.

As I checked the recent log files on rsd1, it still takes less than 1 s per plot (144 plots within 2 min). If so, I wonder whether there is still room for improvement...

[screenshot: start time of data visualization]

[screenshot: stop time of data visualization]

I guess the hardware of rsd2 should be better than rsd1 (hope so... 😄). Therefore, under normal conditions, the new server should be capable of processing 10 Pollys in parallel. So... maybe we can still relax for another two years...

But it's very interesting to discuss the data visualization, and I will leave this issue open. Any comments are welcome!!!

martin-rdz commented 4 years ago

This is from yesterday's test run on rsd2, with 4 out of 8 CPUs idle. The processor frequency is the same as on rsd_old, but the number of cores has increased.

```
[2020-10-20 17:54:13] Start to visualize results.
...
[2020-10-20 17:58:12] Finish.
```

It's 4 instead of 5 min because I tried the subsampling for the saturation plots.
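For illustration, the kind of subsampling meant here can be sketched as follows (hypothetical `subsample` helper and array names; not the code actually used for the saturation plots):

```python
# Illustrative subsampling before plotting: keep every n-th time step / range bin
# so pcolormesh has far fewer quads to draw.
import numpy as np

def subsample(times, height, signal, t_step=3, h_step=2):
    """Return coarser copies of the plotting arrays (hypothetical helper)."""
    return times[::t_step], height[::h_step], signal[::h_step, ::t_step]

# times_s, height_s, signal_s = subsample(times, height, signal)
# ax.pcolormesh(times_s, height_s, signal_s)
```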

In my opinion the problem is not the operational processing, but reprocessing old datasets (over and over again, whenever the algorithm improves). I guess there is still room for improvement, though it might be tricky to implement. A frame rate of 1.2 fps should not be the technical limit ;)

Let's keep the discussion open.

HolgerPollyNet commented 3 years ago

I just read a German forum entry which stated that the time format used is very important. I have no clue yet in which format the times are currently handed over to matplotlib for the time-height plots, but maybe it is worth investigating.
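One way to test this would be to hand matplotlib pre-converted float day numbers (via `matplotlib.dates.date2num`) instead of datetime objects, so that no conversion happens inside the plot call itself; a sketch with made-up times:

```python
# Pre-convert times to matplotlib's float date numbers before plotting (sketch).
from datetime import datetime, timedelta
import matplotlib
matplotlib.use('Agg')
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np

# 720 hypothetical profile times, 30 s apart
times = [datetime(2020, 10, 20) + timedelta(seconds=30 * i) for i in range(720)]
time_num = mdates.date2num(times)          # float days, matplotlib's native time unit

fig, ax = plt.subplots()
ax.plot(time_num, np.random.rand(len(time_num)))
ax.xaxis_date()                            # interpret the floats as dates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
fig.savefig('time_axis_test.png')
plt.close(fig)
```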

ZPYin commented 3 years ago

MATLAB parallel processing could be a feasible way to speed up data visualization. It can reduce the time needed several-fold, depending on how many figures are to be plotted (see the test script below).

```matlab
a = 1:100;
b = sin(a/100 * pi);

% single (serial) processing
startTime1 = now;
for i = 1:100
    fig = figure('visible', 'off');
    % five plot calls per figure, just to simulate a heavier plotting load
    for k = 1:5
        plot(a, b);
        hold on;
    end
    close(fig);   % close the invisible figure so hundreds do not pile up in memory
end
stopTime1 = now;

% parallel processing (requires the Parallel Computing Toolbox), at most 10 workers
startTime2 = now;
parfor (i = 1:100, 10)
    fig = figure('visible', 'off');
    for k = 1:5
        plot(a, b);
        hold on;
    end
    close(fig);
end
stopTime2 = now;

% 'now' returns serial date numbers in days, so convert the differences to seconds
fprintf('Time usage: %f s vs %f s\n', ...
    (stopTime1 - startTime1) * 86400, (stopTime2 - startTime2) * 86400);
```

But it threw an 'out of memory' error when I implemented parallel processing for Picasso. This was caused by multiple copies of the data being kept in the parallel workers' workspaces. Anyway, this error can be resolved with some coding tricks.

So let's keep this in mind.
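For comparison, an analogous pattern on the Python side (purely illustrative, not what Picasso currently does) is to fan the per-figure work out to worker processes and hand each worker only the data slice it needs, which also sidesteps the duplicated-data problem mentioned above:

```python
# Sketch of parallel figure generation in Python: each worker receives only its own
# slice of the data, so nothing large is duplicated across workers.
from multiprocessing import Pool
import matplotlib
matplotlib.use('Agg')                       # safe for worker processes without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_one(args):
    index, chunk = args                     # chunk: the 2-D slice this figure needs
    fig, ax = plt.subplots()
    ax.imshow(chunk, aspect='auto', origin='lower')
    fig.savefig(f'frame_{index:03d}.png', dpi=120)
    plt.close(fig)

if __name__ == '__main__':
    data = np.random.rand(100, 500, 200)    # 100 hypothetical 2-D fields
    with Pool(processes=4) as pool:
        pool.map(plot_one, list(enumerate(data)))
```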

HolgerPollyNet commented 2 years ago

Hey @ulysses78, could you take care of this issue whenever you have time for it? Currently the plotting is the most time-consuming process in the chain.

@martin-rdz already suggested some solutions:

1: https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.pcolorfast.html (see the pcolorfast sketch below)

2: https://docs.rs/matfile/latest/matfile/ — matfile, a Rust library for reading (and in the future writing) Matlab ".mat" files

and

https://docs.rs/plotters/latest/plotters/ — plotters, a Rust drawing library focused on data plotting for both WASM and native applications 🦀📈🚀
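On the first suggestion, a rough pcolormesh vs. pcolorfast comparison could look like the sketch below (synthetic data; only the plotting call is swapped):

```python
# Rough pcolormesh vs. pcolorfast comparison on synthetic data (not Picasso code).
# With a uniform grid, pcolorfast renders the field as a single image, which is
# usually much cheaper than drawing one quad per cell.
import time
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

signal = np.random.rand(2000, 720)          # hypothetical (range bins x time steps)

for name in ('pcolormesh', 'pcolorfast'):
    t0 = time.perf_counter()
    fig, ax = plt.subplots()
    if name == 'pcolormesh':
        ax.pcolormesh(signal)
    else:
        ax.pcolorfast(signal)
    fig.savefig(f'{name}.png', dpi=150)
    plt.close(fig)
    print(f'{name}: {time.perf_counter() - t0:.2f} s')
```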

HolgerPollyNet commented 2 years ago

Currently, the most time-consuming part is the plotting of the vertical profiles:

```
[2022-01-27 08:11:43] --> start displaying overlap function.
[2022-01-27 08:11:45] --> finish.
[2022-01-27 08:11:45] --> start displaying vertical profiles.
[2022-01-27 08:19:12] --> finish.
[2022-01-27 08:19:12] --> start displaying attenuated backscatter.
[2022-01-27 08:19:38] --> finish.
```

Maybe this can be handled first @ulysses78 ?

ZPYin commented 2 years ago

Hi, I just did some code analysis regarding data visualization speed. The speed bottleneck is in the Python script, specifically in matplotlib's savefig.

The Python script consumes more than 80% of the total running time, and savefig, which is used for the figure output, takes half of the Python running time (about 0.5 s per frame).

I did some research and couldn't find a solution to improve it as long as we rely on matplotlib, because matplotlib is optimized for high-quality figures, not for execution speed (correct me if I'm wrong 😄).
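One mitigation that stays inside matplotlib would be to create the figure and colorbar once and only swap the image data between frames, so that savefig remains the only significant per-frame cost; a sketch under that assumption (not the current Picasso plotting code):

```python
# Reuse one figure across frames and only update the image data (illustrative sketch).
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np

frames = np.random.rand(20, 400, 300)       # hypothetical stack of 2-D fields

fig, ax = plt.subplots()
im = ax.imshow(frames[0], aspect='auto', origin='lower', vmin=0, vmax=1)
fig.colorbar(im, ax=ax)

for i, frame in enumerate(frames):
    im.set_data(frame)                      # no new artists, no new axes or colorbar
    fig.savefig(f'frame_{i:03d}.png', dpi=120)
plt.close(fig)
```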

So it would be good to try a different data visualization approach, Rust or whatever.

ulysses78 commented 2 years ago

What about PyQtGraph as an alternative to matplotlib? Quote from https://www.pyqtgraph.org/: "Despite being written entirely in python, the library is very fast due to its heavy leverage of NumPy for number crunching and Qt's GraphicsView framework for fast display."
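For reference, an off-screen export with pyqtgraph would look roughly like the sketch below, following the exporter example from the pyqtgraph documentation (untested here; on a display-less server Qt may additionally need an offscreen platform plugin, e.g. QT_QPA_PLATFORM=offscreen):

```python
# Off-screen line-plot export with pyqtgraph (sketch based on its export example).
import numpy as np
import pyqtgraph as pg
import pyqtgraph.exporters

app = pg.mkQApp()                            # Qt application, required before plotting
plot_widget = pg.plot(np.sin(np.linspace(0, 2 * np.pi, 500)))

exporter = pg.exporters.ImageExporter(plot_widget.plotItem)
exporter.parameters()['width'] = 1200        # output width in pixels
exporter.export('pyqtgraph_test.png')
```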

ulysses78 commented 2 years ago

I started with this issue: https://github.com/PollyNET/Pollynet_Processing_Chain/issues/163 (separate processing and plotting). In the future all the visualizations will be created with Python only. At the same time I replaced pcolormesh with imshow. imshow is up to 8-10 times faster at plotting. As we all know, imshow has problems when there are gaps within the matrix. That's why I fill all the time gaps in the matrix beforehand with NaN values (same for the mask matrix). This works very nicely! The plots look very much the same, but are created much faster.
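The gap-filling approach described above can be sketched as follows (illustrative only; the variable names are made up and this is not the actual implementation in the Python plotting scripts):

```python
# Expand the data onto a regular time grid, insert NaN columns where profiles are
# missing, then draw with the faster imshow; the NaN columns render as blank gaps.
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

dt = 30.0                                          # hypothetical profile spacing [s]
obs_times = np.array([0, 30, 60, 150, 180, 210])   # note the gap between 60 and 150
signal = np.random.rand(100, obs_times.size)       # (range bins x available profiles)

full_times = np.arange(obs_times[0], obs_times[-1] + dt, dt)
filled = np.full((signal.shape[0], full_times.size), np.nan)
idx = np.searchsorted(full_times, obs_times)
filled[:, idx] = signal                            # missing columns stay NaN

fig, ax = plt.subplots()
ax.imshow(filled, aspect='auto', origin='lower',
          extent=[full_times[0], full_times[-1], 0, signal.shape[0]])
fig.savefig('imshow_with_gaps.png', dpi=150)
plt.close(fig)
```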