Open jgroehl opened 2 years ago
Yes, I agree!
I am currently having a similar problem: cropping 3D simulation data down to 2D after the simulation pipeline is very inefficient when working with the large HDF5 files. The proposed option to repack the simulation data at the end of the pipeline sounds reasonable to me!
Btw.: Processing reloaded data (i.e. loading data after the end of the pipeline) with the goal of saving the processed data under a different SAVE_PATH than the original one is another inefficiency. As far as I know, one needs to reload the data, change the SAVE_PATH, save the data, and then reload the data from the changed SAVE_PATH again.
For the last part you could use shutil and manually copy the files.
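Something along these lines (just a sketch, the paths are made up and would need to match your setup):

```python
# Minimal sketch: copy the existing HDF5 container to the new location once,
# then point the settings at the copy and write processed results there.
# This avoids the reload -> change SAVE_PATH -> save -> reload cycle.
import shutil

original_path = "/data/simulations/original_simulation.hdf5"  # hypothetical original SAVE_PATH
new_path = "/data/processed/processed_simulation.hdf5"        # hypothetical new SAVE_PATH

shutil.copy(original_path, new_path)
```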
But agreed that this is a feature that could be added :)
The more I think about it, the less happy I am with our "one HDF5 container throughout the entire simulation pipeline" data management. I think we could be a lot more efficient using numpy arrays internally and offering a `data_packaging` module in which we give the user the option (if they want to) to repackage the simulation data at the end of a pipeline run. For my simulations, I am currently creating ~40GB HDF5 files, only to later manually save them into 10MB (!) custom npz files...
HDF5 also has other issues that might make it less flexible for us to use, and I think access to individual data fields can be a lot slower compared to a one-file-per-item approach DURING SIMULATION. Repackaging into a single file afterwards would probably be much appreciated.
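Rough idea of what such a repackaging step could look like (just a sketch using h5py/numpy; the dataset names and the flat HDF5 layout here are placeholders, not the actual container structure):

```python
# Sketch of a post-pipeline repackaging step: pull only the fields the user
# actually needs out of the large HDF5 container and store them in one small
# compressed .npz file.
import h5py
import numpy as np


def repackage_to_npz(hdf5_path: str, npz_path: str, fields_to_keep: list):
    """Extract the requested datasets from the HDF5 container and save them
    as a single compressed .npz file."""
    extracted = {}
    with h5py.File(hdf5_path, "r") as hdf5_file:
        for field in fields_to_keep:
            if field in hdf5_file:
                # Read the dataset fully into memory as a numpy array.
                extracted[field] = np.asarray(hdf5_file[field])
    np.savez_compressed(npz_path, **extracted)


# Hypothetical usage: keep only the reconstructed image and the segmentation.
# repackage_to_npz("simulation.hdf5", "simulation_small.npz",
#                  ["reconstructed_data", "segmentation"])
```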
What do you think?