DiamondLightSource / durin

BSD 3-Clause "New" or "Revised" License

Investigate performance #7

Closed c-mita closed 5 years ago

c-mita commented 6 years ago

Single threaded execution using Dectris' neggia plugin

$ cat input.txt
/home/mew09119/neggia/build/src/dectris/neggia/plugin/dectris-neggia.so
/dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_??????.h5
1 600

$ time <input.txt build/test_plugin
 enter parameter of LIB= keyword:
 enter parameter of NAME_TEMPLATE_OF_DATA_FRAMES= keyword:
 enter parameters of the DATA_RANGE= keyword:
 master_file=/dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5
 [generic_data_plugin] - INFO - generic_open
       + library          = </home/mew09119/neggia/build/src/dectris/neggia/plugin/dectris-neggia.so>
       + template_name    = </dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5>
       + dll_filename     = </home/mew09119/neggia/build/src/dectris/neggia/plugin/dectris-neggia.so>
       + image_data_filename   = </dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5>
This is neggia 1.0.1 (Copyright Dectris 2017)
 [generic_data_plugin] - INFO - generic_get_header
nx,ny,nbyte,qx,qy,number_of_frames=  2070  2167     2  0.000075  0.000075 12000
INFO(1:5)=vendor/major version/minor version/patch/timestamp=   1   0   5   3          -1
 generic_getfrm: data are from Dectris
 average counts:  -4.08068784E-02
 [generic_data_plugin] - INFO - 'call generic_close()'
 [generic_data_plugin] - INFO - 'plugin close flag:           0 '

real    0m9.716s
user    0m8.825s
sys 0m0.576s

Single threaded execution using durin

$ cat input.txt
build/durin-plugin.so
/dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_??????.h5
1 600

$ time <input.txt build/test_plugin 
 enter parameter of LIB= keyword:
 enter parameter of NAME_TEMPLATE_OF_DATA_FRAMES= keyword:
 enter parameters of the DATA_RANGE= keyword:
 master_file=/dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5
 [generic_data_plugin] - INFO - generic_open
       + library          = <./build/durin-plugin.so>
       + template_name    = </dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5>
       + dll_filename     = <./build/durin-plugin.so>
       + image_data_filename   = </dls/mx-scratch/mew09119/eiger_dectris/trypsin_6_topup_3_1_master.h5>
 [generic_data_plugin] - INFO - generic_get_header
nx,ny,nbyte,qx,qy,number_of_frames=  2070  2167     2  0.000075  0.000075 12000
INFO(1:5)=vendor/major version/minor version/patch/timestamp=   1   0   0   0          -1
 generic_getfrm: data are from Dectris
 average counts:  -4.08068784E-02
 [generic_data_plugin] - INFO - 'call generic_close()'
 [generic_data_plugin] - INFO - 'plugin close flag:           0 '

real    0m19.783s
user    0m14.379s
sys 0m0.384s

That's over double the time required...

Allowing parallel execution using OpenMP in the host process (four threads).

neggia

$ time <input.txt build/test_plugin
....
real    0m3.183s
user    0m9.107s
sys 0m1.093s

durin

$ time <input.txt build/test_plugin
...
real    0m16.435s
user    0m14.656s
sys 0m1.043s

This is a poor showing...

c-mita commented 6 years ago

It is known that the HDF5 library is not very efficient in multi-threaded contexts - read operations in particular are much faster in multi-process contexts.

Perhaps we could look at spawning a second process and transferring the data across shared memory? Sounds unpleasant. HDF5 1.10.2 adds an H5DOread_chunk function to the high-level interface - the counterpart to the H5DOwrite_chunk function used by the odin file writer. It skips all pipeline operations (leaving it up to us to decompress) but may perform much faster.

c-mita commented 6 years ago

Some early investigation using H5DOread_chunk yields mixed results.

Skipping a significant amount of the work done by the HDF5 library allows more of the work to be done in parallel (particularly decompression), but complicates the code and requires adding the decompression libraries. The calls to read the data chunks still hold locks, so the degree of speed-up is highly dependent on the performance of the file system, but what I've seen so far suggests a 50-100% slow-down at worst when compared to neggia's approach, with only minor differences when reading data off an SSD (a delta of a few seconds when reading 12000 frames - 0-20%).

I'll start implementing this strategy in the plugin, to be used so long as all the conditions for it are met (bitshuffle_lz4 being the only filter in the pipeline, one frame == one chunk, etc.).

c-mita commented 6 years ago

This branch https://github.com/DiamondLightSource/durin/tree/chunk_read contains the work required to use H5DOread_chunk. There is some refactoring to do to clean up the code and remove unnecessary work.

c-mita commented 5 years ago

The chunk_read branch was merged a while ago and has significantly improved performance in the multi-threaded case, given a reasonably performant file system. In poor cases (slow mounts of networked filesystems) it isn't as good, but there's not much that can be done there.