tonyhutter opened this issue 4 years ago
We have had research efforts looking at compression in the past (both lossless and lossy). For lossless compression, I think we got about 10-20% savings in size. Most of the data in these applications are floating point, which is difficult to compress. Lossy compression can do much better, but for that, one has to work with the application developers to figure out how much loss is tolerable.
There were some old compression functions in SCR that I had experimented with once. In case that's helpful later: https://github.com/LLNL/scr/blob/legacy/src/scr_compress.c
A little off topic, but related... for users who are willing to let us compress their files, they might also be willing to let us combine their many files into fewer files. We had something like that in the old SCR called "containers". This basically appended data from MPI ranks back-to-back into large, fixed-size container files. For that, you have to do some math to determine where each rank needs to write its data in those container files:

https://github.com/LLNL/scr/blob/68414920bf40f85afce8c88c9b042ba30a928f49/src/scr_flush.c#L394
https://github.com/LLNL/scr/blob/68414920bf40f85afce8c88c9b042ba30a928f49/src/scr_flush.c#L649
https://github.com/LLNL/scr/blob/68414920bf40f85afce8c88c9b042ba30a928f49/src/scr_flush_sync.c#L100
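Not from the linked code, just a rough sketch of the kind of offset math involved, assuming each rank contributes one contiguous chunk: an exclusive prefix sum of the per-rank sizes gives each rank its global byte offset, which then maps to a (container index, offset within container) pair for fixed-size containers. The sizes and container size below are made-up placeholders.

```c
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch, not the actual SCR implementation: compute where a
 * rank writes its data when ranks are packed back-to-back into
 * fixed-size container files. */
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* size of this rank's checkpoint data (placeholder value) */
    uint64_t my_size = 1024 * 1024 * (uint64_t)(rank + 1);

    /* exclusive prefix sum gives this rank's global byte offset */
    uint64_t my_offset = 0;
    MPI_Exscan(&my_size, &my_offset, 1, MPI_UINT64_T, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) {
        my_offset = 0; /* MPI_Exscan leaves rank 0's output undefined */
    }

    /* map the global offset into (container file index, offset within container),
     * assuming fixed-size containers of container_size bytes */
    uint64_t container_size   = 1ULL << 30; /* 1 GiB, made-up value */
    uint64_t container_id     = my_offset / container_size;
    uint64_t container_offset = my_offset % container_size;

    printf("rank %d: offset %llu -> container %llu at offset %llu\n",
           rank,
           (unsigned long long)my_offset,
           (unsigned long long)container_id,
           (unsigned long long)container_offset);

    MPI_Finalize();
    return 0;
}
```

A rank whose data straddles a container boundary would have to split its write across two containers, which is presumably part of the extra bookkeeping the linked scr_flush code takes care of.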
That info was maintained in the SCR metadata for the dataset, and we used that info to read the files back out: https://github.com/LLNL/scr/blob/68414920bf40f85afce8c88c9b042ba30a928f49/src/scr_fetch.c#L141
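And just as a hypothetical illustration of the read-back side, assuming the metadata records a (container path, offset, size) triple for each rank's file (function and parameter names here are made up, not SCR's API):

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

/* Hypothetical: read one rank's data back out of a container file,
 * given the (path, offset, size) recorded in the dataset metadata. */
ssize_t read_from_container(const char* container_path,
                            uint64_t offset, void* buf, size_t size)
{
    int fd = open(container_path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    ssize_t n = pread(fd, buf, size, (off_t)offset);
    close(fd);
    return n;
}
```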
This data compression and file aggregation would be very useful for the memory-based interface of VeloC.
We should consider adding compression in SCR. We mention wanting to do it in https://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi and in src/scr_io.c.
I can see cases where it would be beneficial, and cases where it wouldn't. If we did it, I'd recommend we use zstandard (https://github.com/facebook/zstd), which currently offers the best combination of compression ratio and speed among general-purpose compressors.
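For what it's worth, here's a minimal sketch of what the zstd path could look like using the one-shot `ZSTD_compress`/`ZSTD_decompress` API. The buffer contents and compression level 3 (zstd's default) are just placeholders, not a proposed SCR interface; it builds with `cc example.c -lzstd`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

int main(void)
{
    /* stand-in for a checkpoint buffer; real data would be mostly floats */
    const char src[] = "example checkpoint payload ...";
    size_t src_size = sizeof(src);

    /* worst-case compressed size, so a single output buffer always suffices */
    size_t dst_capacity = ZSTD_compressBound(src_size);
    void* dst = malloc(dst_capacity);

    /* level 3 is the default; higher levels trade speed for ratio */
    size_t dst_size = ZSTD_compress(dst, dst_capacity, src, src_size, 3);
    if (ZSTD_isError(dst_size)) {
        fprintf(stderr, "compress failed: %s\n", ZSTD_getErrorName(dst_size));
        free(dst);
        return 1;
    }
    printf("compressed %zu bytes to %zu bytes\n", src_size, dst_size);

    /* round-trip to verify */
    void* out = malloc(src_size);
    size_t out_size = ZSTD_decompress(out, src_size, dst, dst_size);
    if (ZSTD_isError(out_size) || out_size != src_size ||
        memcmp(out, src, src_size) != 0) {
        fprintf(stderr, "round-trip mismatch\n");
        free(dst);
        free(out);
        return 1;
    }
    printf("round-trip OK\n");

    free(dst);
    free(out);
    return 0;
}
```

For large checkpoints we'd probably want the streaming API instead of one-shot calls so we don't need the whole file in memory at once, but the idea is the same.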