Closed ErichZimmer closed 2 years ago
Is the repository public?
see my fork - I invited you https://github.com/alexlib/Open_PIV_mapping
That repository is quite different from my attempt, which uses a meshed region of interest and for-loops to find the points. As soon as I get a decent internet connection, I'll hopefully get everything pushed to a fork or repository for everyone to see (including my spaghetti-coding skills :P).
Just did some tests, and your fork/repository is considerably better and more robust than my implementation. I'll see if there are any enhancements/refactorings I can do. Does this repository allow for the calibration of vectors as a post-processing method?
It is Theo's work in progress; he has chosen to work in image space, but it should work on vectors as well.
I played around with different ideas and kept reverting back to Theo's repository. The calibration seems pretty simple and would be a nice addition to OpenPIV. On another note, I tested rectangular windows with the GUI, and they work like a charm except that they're 50% slower than square windows. Here is a screenshot of raw vectors using circular correlation.
The image pair is from PIV Challenge 2014 case A (testing micro-PIV).
To avoid major overhead with shared dictionaries, the files are stored in a temporary folder before being loaded into the GUI and deleted. This makes multiprocessing as fast as the simple GUI and removes the need for a batch size. Is this method alright? New processing steps:
I am not quite sure about the step of saving to npz and then loading to hdf5 - could it be maybe stored already in hdf5, to save one conversion or loading/saving step?
H5py doesn't directly support parallel writing, so it's either this weird workaround or the other one based on a shared-memory dictionary that is then loaded into h5py. I am still looking for better options through mpi4py, but so far it hasn't been successful and it complicates the installation process of the GUI. In my opinion, this issue is one of the few problems with h5py where other libraries (e.g., not using h5py, as in the simple GUI) would be better.
I understand. So there are two options: a) use multiprocessing and RAM to keep all the parallel results in memory, or b) have each worker store its result in a separate temporary file and then combine them. If there is a significant speed-up in option b) compared to a single-thread/single-process path, let's do it this way.
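A minimal sketch of option b), assuming NumPy arrays as the per-frame results. The worker function and file-name pattern here are hypothetical; in the real GUI each `process_frame` call would run in its own worker process, and only the combine step would run serially:

```python
import os
import tempfile
import numpy as np

def process_frame(k):
    # hypothetical stand-in for one PIV evaluation; returns u, v fields
    rng = np.random.default_rng(k)
    return rng.standard_normal((8, 8)), rng.standard_normal((8, 8))

tmpdir = tempfile.mkdtemp()

# each iteration here would be a separate worker process in the real GUI
for k in range(4):
    u, v = process_frame(k)
    np.savez(os.path.join(tmpdir, f"frame_{k:06d}.npz"), u=u, v=v)

# combine the temporary files serially, deleting each one after loading
results = {}
for name in sorted(os.listdir(tmpdir)):
    path = os.path.join(tmpdir, name)
    with np.load(path) as data:
        results[name[:-4]] = {"u": data["u"], "v": data["v"]}
    os.remove(path)
os.rmdir(tmpdir)
```

Because every worker owns its own file, no locking or shared-memory coordination is needed; the only serial cost is the final combine pass.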
take a look at zarr
https://github.com/pydata/xarray/issues/3096
https://github.com/pydata/xarray/pull/4035
https://zarr.readthedocs.io/en/stable/tutorial.html
can it help? it seems to have some solution and it's pip-installable.
I looked at it, and it seems promising and easy to implement with minimal changes to the code.
on calibration
I got somewhat familiar with the image calibration interface, and I like it so far. However, an improvement in precision can be attained by using a centroid algorithm with find_first_peak/find_second_peak in pyprocess.
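A minimal pure-NumPy sketch of such a centroid refinement (this is not OpenPIV's actual implementation; the function names here are illustrative): find the integer peak of the correlation map, then apply a three-point centroid along each axis to estimate the subpixel position.

```python
import numpy as np

def find_peak(corr):
    """Integer (row, col) location of the highest correlation value."""
    return np.unravel_index(np.argmax(corr), corr.shape)

def centroid_subpixel(corr):
    """Refine the integer peak with a 3-point centroid along each axis."""
    i, j = find_peak(corr)
    # fall back to the integer peak when it sits on the border
    if not (0 < i < corr.shape[0] - 1 and 0 < j < corr.shape[1] - 1):
        return float(i), float(j)
    col = corr[i - 1:i + 2, j]  # three values through the peak, vertically
    row = corr[i, j - 1:j + 2]  # three values through the peak, horizontally
    di = (np.arange(i - 1, i + 2) * col).sum() / col.sum()
    dj = (np.arange(j - 1, j + 2) * row).sum() / row.sum()
    return di, dj
```

Logic in the spirit of find_second_peak (masking out the first peak before searching again) could then be layered on top, e.g. to reject ambiguous calibration dots.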
My version of the calibration software follows the instructions of an article mentioned previously and ignores scaling to minimize user input. It is based on Theo's script and Fluere, and can only be applied to the vector field via a for loop. I still like Theo's script more, though, as it is more flexible.
It would be great to incorporate the script into something like OpenPIV.tools or its own calibration file, as some cameras (e.g. my Raspberry Pi-controlled 1 Mp global shutter sensor) have quite a fisheye distortion that messes up the measurements.
The subpixel function works for the original script, so I'll simply use the original script by Theo.
Good idea. Please move the discussion to the openpiv-python repo issues.
Zarr is creating a file for each frame, so I'll have to figure out what I'm doing wrong here. It does allow multiprocessing though ;)
Using npy files wasn't a smart decision. They save and load fast, but the individual file sizes can reach 3 MB for 50,000 vectors. For large sessions, this uses up quite a bit of space before the files are deleted. Zarr still creates a bunch of files and, in a way, acts like the temporary npy files. I'll try mpi4py again for built-in parallelism with h5py. Additionally, h5py files can get quite large, with some exceeding 20 GB for large processing sessions. However, a similar amount of space is taken by text files.
Using a batch system similar to the shared-memory dictionary system, the results can be processed in parallel and loaded in serial. If we use this system, Zarr might be a good file format, as it operates in a very similar fashion with multiple linked files.
It also allows for exporting the session in HDF5 and netCDF.
I found that the temporary file system works best, so I'll keep it for now. It doesn't take any extra space on the hard drive.
Here is the somewhat buggy h5py gui. https://github.com/ErichZimmer/openpiv_tk_gui/tree/GUI_enhancement2
It requires h5py as an extra dependency.
To avoid polluting your GUI with features that cannot be merged (at least I wasn't able to merge them, due to my basic programming knowledge), I'm going to close this issue so I can focus more on your GUI.
I also moved the h5py GUI to a new repository to avoid accidentally pushing the wrong GUI to my fork of your GUI. https://github.com/ErichZimmer/openpiv-python-gui
I honestly like your GUI a little more because of its simplicity.
Background/Issue
The current GUI stores data in separate files, which can make more thorough data processing difficult. To combat this, an already-suggested solution was to store all results in a single dictionary for the dataset and export them in whatever manner the user deems sufficient. However, on large processing sessions (>60,000 images), the GUI can become quite slow, especially on lower-performing laptops, and its performance degrades further as such a session goes on. This can disrupt efficient workflows and increase glitches (this mostly applies to lower-performing computers).
Proposed solution
After exploring different ways of storing huge amounts of data, h5py was found to perform pretty well even on underperforming computers (e.g. my laptop 😢). When properly configured, most data is stored on the hard drive, leaving RAM mostly unused, unlike dictionary-style designs. Additionally, the structure of an HDF5 file makes it very simple to load specific sections of data/results, which has its advantages. Taking advantage of these features, the HDF5 file is structured like the following:
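As a hedged sketch of how such a session file might be built with h5py (every group name, dataset name, and attribute here is hypothetical, not the GUI's actual schema):

```python
import h5py
import numpy as np

# in-memory HDF5 file for demonstration; a real session would use a disk path
f = h5py.File("session.h5", "w", driver="core", backing_store=False)

# hypothetical layout: one group per frame, one dataset per field
for k in range(3):
    grp = f.create_group(f"frame_{k:06d}")
    grp.create_dataset("x", data=np.arange(8, dtype="f4"))
    grp.create_dataset("u", data=np.zeros((8, 8), dtype="f4"),
                       compression="gzip")  # keeps large sessions smaller on disk
    grp.attrs["window_size"] = 32  # per-frame processing metadata
```

With this kind of layout, loading one frame's results (`f["frame_000001/u"]`) touches only that dataset rather than pulling the whole session into RAM, which is the advantage mentioned above.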
Possible downfalls
PS, I'm back 😁 (got medically discharged from an injury) and am ready to relearn everything, and hopefully not be so ill-informed on testing methods as I was back then -_-. Additionally, your input on using HDF5 or other formats for storage would be helpful for further research and design.