DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/
BSD 3-Clause "New" or "Revised" License

Improve intermediate file performance #271

Open yousefmoazzam opened 3 months ago

yousefmoazzam commented 3 months ago

How the benchmarking runs were organised

Various runs were performed on the production SLURM cluster at DLS, keeping the hardware, pipeline, and dataset fixed (described below) and varying the parameters relevant to intermediate data saving.

Hardware

The fixed hardware configuration was as follows:

Pipeline

The fixed pipeline was https://github.com/dkazanc/dls_pipelines/blob/main/pipelines/bench_pipeline_gpu_intense_separate_rescale.yaml.

Data

The fixed dataset was "the sandstone data" 119647.nxs (20GB).

Parameters investigated

The parameters that were varied were:

- the intermediate file format: hdf5 vs zarr
- chunked vs unchunked datasets
- compressed vs uncompressed datasets
- whether the --save-all flag was passed
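For concreteness, the sketch below shows roughly what these variants look like at the h5py/zarr API level, outside of httomo. The chunk shape, compressor choice, and file/dataset names are illustrative assumptions (and the zarr calls assume the zarr v2 API); they are not the settings httomo itself uses.

```python
import h5py
import numpy as np
import zarr

# A stand-in block of data; shape/dtype chosen only for illustration.
data = np.random.random((180, 128, 160)).astype(np.float32)
chunk_shape = (1, 128, 160)  # assumed chunk shape: one slice per chunk

# hdf5: unchunked + uncompressed (contiguous dataset layout)
with h5py.File("unchunked.h5", "w") as f:
    f.create_dataset("data", data=data)

# hdf5: chunked + uncompressed
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=chunk_shape)

# hdf5: chunked + compressed (gzip filter)
with h5py.File("chunked_gzip.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=chunk_shape, compression="gzip")

# zarr: chunked + uncompressed
z = zarr.open(
    "chunked.zarr", mode="w",
    shape=data.shape, chunks=chunk_shape, dtype=data.dtype, compressor=None,
)
z[:] = data

# zarr: chunked + compressed (zarr v2's default Blosc compressor)
z = zarr.open(
    "chunked_blosc.zarr", mode="w",
    shape=data.shape, chunks=chunk_shape, dtype=data.dtype,
)
z[:] = data

# zarr: "unchunked" (zarr arrays are always chunked; a single chunk spanning
# the whole array is the closest equivalent)
z = zarr.open(
    "single_chunk.zarr", mode="w",
    shape=data.shape, chunks=data.shape, dtype=data.dtype, compressor=None,
)
z[:] = data
```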

Times

The pipeline execution times are listed from fastest to slowest, grouped by whether the --save-all flag was used or not.

The total time taken for the pipeline to execute was read from the logfile that httomo creates.

Without --save-all

  1. zarr: chunked + uncompressed = 149s
  2. zarr: chunked + compressed = 168s
  3. hdf5: chunked + uncompressed = 243s
  4. hdf5: unchunked + uncompressed = 353s
  5. zarr: unchunked + uncompressed = 430s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)

With --save-all

  1. zarr: chunked + uncompressed = 275s
  2. zarr: chunked + compressed = 342s
  3. hdf5: chunked + uncompressed = 794s
  4. hdf5: unchunked + uncompressed = 1148s
  5. zarr: unchunked + uncompressed = 2233s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)
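
For comparison, the relative overhead of --save-all in each configuration can be computed directly from the times above (a small standalone helper, not part of httomo):

```python
# Measured pipeline times in seconds: (without --save-all, with --save-all).
# The hdf5 chunked + compressed case is omitted because it segfaulted.
times = {
    "zarr: chunked + uncompressed": (149, 275),
    "zarr: chunked + compressed": (168, 342),
    "hdf5: chunked + uncompressed": (243, 794),
    "hdf5: unchunked + uncompressed": (353, 1148),
    "zarr: unchunked + uncompressed": (430, 2233),
}

for config, (baseline, save_all) in times.items():
    print(f"{config}: {save_all / baseline:.1f}x slower with --save-all")
```

The chunked zarr configurations roughly double in time with --save-all, the hdf5 configurations slow down by a factor of about 3, and the unchunked zarr case by a factor of about 5.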

Concluding remarks

yousefmoazzam commented 3 months ago

More info on hdf5 chunked + uncompressed

Viewing the nsys profile reports in Nsight Systems reveals that the ~70s "wait" before writing some blocks to intermediate data first occurs after the first execution of the paganin filter method (taking ~59s). It then occurs several times while the first block is processed in the section containing FBP (taking ~70s each time).

More specifically, writing the stripe removal method's output in that section doesn't cause the wait, but the output writes for methods later in that section do (all only on the first block iteration within that section); see the screenshot below.

[Screenshot: hdf5-chunked-save-all-waits]

Info that could help understand why this is happening

Regarding my comment about this possibly being related to MPI synchronisations, that suspicion came from seeing function calls mentioning mutexes in the region where the waiting/blocking occurs:

[Screenshot: hdf5-chunk-save-all-message-1]

Another interesting function call that seems to appear early on in every instance where the waiting/blocking occurs mentions "memcpy" and "unaligned":

[Screenshot: hdf5-chunk-save-all-message-2]
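
If the "unaligned" memcpy turns out to be relevant, one quick (and entirely speculative) follow-up is to check whether the numpy buffers being handed to the hdf5 writer are aligned; a minimal sketch, with the alignment boundary and array shape as assumptions:

```python
import numpy as np

def is_aligned(arr: np.ndarray, boundary: int = 64) -> bool:
    """Return True if the array's data pointer is aligned to `boundary` bytes."""
    return arr.ctypes.data % boundary == 0

# Example: a stand-in block of float32 data (shape chosen only for illustration).
block = np.empty((8, 128, 160), dtype=np.float32)
print(is_aligned(block))
```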

The name of the region being "vdso" also seems noteworthy. At first there was a suspicion that it was referring to VDS (virtual datasets), a feature of hdf5. However, that feature isn't being used when writing data in httomo. Given that some of the functions in that region appear to be system calls, the latest speculation is that it's referring to the vDSO that the kernel maps into processes' address spaces: https://www.man7.org/linux/man-pages/man7/vdso.7.html
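
As a quick sanity check of that speculation (a Linux-only snippet, not httomo code): the vDSO is mapped into every process by the kernel and shows up as a [vdso] entry in /proc/<pid>/maps, so its presence in a profile doesn't by itself point at hdf5's VDS feature.

```python
# Linux-only: show the vDSO-related mappings of the current process.
# Any Python process will print a [vdso] entry, independent of hdf5/VDS usage.
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[vdso]" in line or "[vvar]" in line:
            print(line.rstrip())
```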