DiamondLightSource / httomo

High-throughput tomography pipeline
https://diamondlightsource.github.io/httomo/
BSD 3-Clause "New" or "Revised" License

Improve intermediate file performance #271

Open yousefmoazzam opened 3 months ago

yousefmoazzam commented 3 months ago

How the benchmarking runs were organised

Various runs were performed on the production SLURM cluster at DLS, keeping the hardware, pipeline, and dataset fixed (described below) and varying the parameters relevant to intermediate data saving.

Hardware

The fixed hardware configuration was as follows:

Pipeline

The fixed pipeline was https://github.com/dkazanc/dls_pipelines/blob/main/pipelines/bench_pipeline_gpu_intense_separate_rescale.yaml.

Data

The fixed dataset was "the sandstone data" 119647.nxs (20GB).

Parameters investigated

The parameters that were varied were:

- the intermediate file format: hdf5 vs zarr
- chunked vs unchunked datasets
- compressed vs uncompressed datasets
- whether the --save-all flag was passed
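For concreteness, the sketch below shows roughly what these variants look like at the h5py/zarr API level, outside of httomo. The chunk shape, compressor choice, and file/dataset names are illustrative assumptions (and the zarr calls assume the zarr v2 API); they are not the settings httomo itself uses.

```python
import h5py
import numpy as np
import zarr

# A stand-in block of data; shape/dtype chosen only for illustration.
data = np.random.random((180, 128, 160)).astype(np.float32)
chunk_shape = (1, 128, 160)  # assumed chunk shape: one slice per chunk

# hdf5: unchunked + uncompressed (contiguous dataset layout)
with h5py.File("unchunked.h5", "w") as f:
    f.create_dataset("data", data=data)

# hdf5: chunked + uncompressed
with h5py.File("chunked.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=chunk_shape)

# hdf5: chunked + compressed (gzip filter)
with h5py.File("chunked_gzip.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=chunk_shape, compression="gzip")

# zarr: chunked + uncompressed
z = zarr.open(
    "chunked.zarr", mode="w",
    shape=data.shape, chunks=chunk_shape, dtype=data.dtype, compressor=None,
)
z[:] = data

# zarr: chunked + compressed (zarr v2's default Blosc compressor)
z = zarr.open(
    "chunked_blosc.zarr", mode="w",
    shape=data.shape, chunks=chunk_shape, dtype=data.dtype,
)
z[:] = data

# zarr: "unchunked" (zarr arrays are always chunked; a single chunk spanning
# the whole array is the closest equivalent)
z = zarr.open(
    "single_chunk.zarr", mode="w",
    shape=data.shape, chunks=data.shape, dtype=data.dtype, compressor=None,
)
z[:] = data
```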

Times

The pipeline execution times are listed from fastest to slowest, grouped by whether the --save-all flag was used or not.

The total time taken for the pipeline to execute was read from the logfile that httomo creates.

Without --save-all

  1. zarr: chunked + uncompressed = 149s
  2. zarr: chunked + compressed = 168s
  3. hdf5: chunked + uncompressed = 243s
  4. hdf5: unchunked + uncompressed = 353s
  5. zarr: unchunked + uncompressed = 430s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)

With --save-all

  1. zarr: chunked + uncompressed = 275s
  2. zarr: chunked + compressed = 342s
  3. hdf5: chunked + uncompressed = 794s
  4. hdf5: unchunked + uncompressed = 1148s
  5. zarr: unchunked + uncompressed = 2233s
  6. hdf5: chunked + compressed = N/A (segfault when saving FBP output)
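
For comparison, the relative overhead of --save-all in each configuration can be computed directly from the times above (a small standalone helper, not part of httomo):

```python
# Measured pipeline times in seconds: (without --save-all, with --save-all).
# The hdf5 chunked + compressed case is omitted because it segfaulted.
times = {
    "zarr: chunked + uncompressed": (149, 275),
    "zarr: chunked + compressed": (168, 342),
    "hdf5: chunked + uncompressed": (243, 794),
    "hdf5: unchunked + uncompressed": (353, 1148),
    "zarr: unchunked + uncompressed": (430, 2233),
}

for config, (baseline, save_all) in times.items():
    print(f"{config}: {save_all / baseline:.1f}x slower with --save-all")
```

The chunked zarr configurations roughly double in time with --save-all, the hdf5 configurations slow down by a factor of about 3, and the unchunked zarr case by a factor of about 5.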

Concluding remarks

yousefmoazzam commented 3 months ago

More info on hdf5 chunked + uncompressed

Viewing the nsys profile reports in Nsight Systems reveals that the ~70s "wait" before writing some blocks to intermediate data first occurs after the first execution of the paganin filter method (taking ~59s). It then occurs several times while the first block is processed in the section containing FBP (taking ~70s each time).

More specifically, writing the stripe removal method's output in that section doesn't cause the wait, but the output writes for methods later in that section do (all only on the first block iteration within that section); see the screenshot below.

[Screenshot: hdf5-chunked-save-all-waits]

Info that could help understand why this is happening

Regarding my comment about this possibly being related to MPI synchronisations, that suspicion came from seeing function calls mentioning mutexes in the region where the waiting/blocking occurs:

[Screenshot: hdf5-chunk-save-all-message-1]

Another interesting function call that seems to appear early on in every instance where the waiting/blocking occurs mentions "memcpy" and "unaligned":

[Screenshot: hdf5-chunk-save-all-message-2]
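
If the "unaligned" memcpy turns out to be relevant, one quick (and entirely speculative) follow-up is to check whether the numpy buffers being handed to the hdf5 writer are aligned; a minimal sketch, with the alignment boundary and array shape as assumptions:

```python
import numpy as np

def is_aligned(arr: np.ndarray, boundary: int = 64) -> bool:
    """Return True if the array's data pointer is aligned to `boundary` bytes."""
    return arr.ctypes.data % boundary == 0

# Example: a stand-in block of float32 data (shape chosen only for illustration).
block = np.empty((8, 128, 160), dtype=np.float32)
print(is_aligned(block))
```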

The name of the region being "vdso" also seems noteworthy. At first there was a suspicion that it was referring to VDS (virtual datasets), a feature of hdf5. However, that feature isn't being used when writing data in httomo. Given that some of the functions in that region appear to be system calls, the latest speculation is that it's referring to the vDSO that the kernel maps into processes' address spaces: https://www.man7.org/linux/man-pages/man7/vdso.7.html
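
As a quick sanity check of that speculation (a Linux-only snippet, not httomo code): the vDSO is mapped into every process by the kernel and shows up as a [vdso] entry in /proc/<pid>/maps, so its presence in a profile doesn't by itself point at hdf5's VDS feature.

```python
# Linux-only: show the vDSO-related mappings of the current process.
# Any Python process will print a [vdso] entry, independent of hdf5/VDS usage.
with open("/proc/self/maps") as maps:
    for line in maps:
        if "[vdso]" in line or "[vvar]" in line:
            print(line.rstrip())
```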