ComputationalRadiationPhysics / libSplash

libSplash - Simple Parallel file output Library for Accumulating Simulation data using Hdf5
GNU Lesser General Public License v3.0

Performance Tuning: MPI_Info and H5P Options #108

Open ax3l opened 10 years ago

ax3l commented 10 years ago

We should make use of a MPI_Info object.

Some resources:

f-schmitt commented 10 years ago

This is not quite correct and not directly a libSplash issue. The PDC constructor assumes nothing about the MPI_Info object; only PIConGPU sets it to MPI_INFO_NULL right now. So this is actually a PIConGPU issue.
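For illustration, populating the MPI_Info on the PIConGPU side and handing it to libSplash unchanged could look roughly like the sketch below. The constructor arguments mirror the PIConGPU call site quoted further down; the hint names and values are placeholders, and the Dimensions/uint32_t parameter types are assumed here.

    #include <mpi.h>
    #include <splash/splash.h>

    /* Sketch: create MPI-IO hints and forward them to libSplash; the
     * ParallelDomainCollector passes the info object on as it receives it. */
    splash::ParallelDomainCollector*
    createCollectorWithHints(MPI_Comm comm,
                             const splash::Dimensions &splashMpiSize,
                             uint32_t maxOpenFilesPerNode)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        /* example hints only; useful values are file-system dependent */
        MPI_Info_set(info, "romio_cb_write", "enable");
        MPI_Info_set(info, "cb_buffer_size", "268435456"); /* 256 MiB */

        return new splash::ParallelDomainCollector(comm, info, splashMpiSize,
                                                   maxOpenFilesPerNode);
    }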

ax3l commented 10 years ago

Just a place to collect the data.

Actually it is a parallel HDF5 issue, which is in turn an MPI-IO issue, which in most cases is in turn a ROMIO issue. Since the best settings are likely file-system dependent (Lustre vs. GPFS vs. whatever else), it might still be possible to find a generic solution based on the information we provide to libSplash.

Anyway, the first hacks will go into PIConGPU for sure.

ax3l commented 10 years ago

First diff on PIConGPU (see ref):

diff --git a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
index 3f27d8f..261d943 100644
--- a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
+++ b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
@@ -35,6 +35,7 @@
 #include "particles/frame_types.hpp"

 #include <splash/splash.h>
+#include <hdf5.h>

 #include "fields/FieldB.hpp"
 #include "fields/FieldE.hpp"
@@ -322,9 +323,33 @@ private:
         if ( mThreadParams.dataCollector == NULL)
         {
             GridController<simDim> &gc = Environment<simDim>::get().GridController();
+
+            /* hacked: some flags for HDF5 and MPI-I/O */
+            //H5Pset_sieve_buf_size( fapl_id, 4194304 ); /* 4MB; >=FS Blocksize*/
+            //H5Pset_alignment( fapl_id, 4194304, 2097152 ); /* 4/2MB ~same */
+
+            MPI_Info info = MPI_INFO_NULL;
+            MPI_Info_create( &info );
+
+            /*MPI_Info_set( info, "cb_nodes", "32"); aggregators: ref may be wrong,
+                                                     should be stripe factor / ~2-4
+            MPI_Info_set( info, "striping_factor", "32");
+            MPI_Info_set( info, "striping_unit", "4194304"); bytes per stripe*/
+
+            MPI_Info_set (info, "cb_align", "2");
+            /*MPI_Info_set (info, "cb_nodes_list", "*:*"); */
+            MPI_Info_set( info, "direct_io", "true" );
+            /* lustre specific (no locking): Disable ROMIO's data-sieving */
+            MPI_Info_set( info, "romio_ds_read", "disable" );
+            MPI_Info_set( info, "romio_ds_write", "disable" );
+            /* lustre specific (no locking): Enable ROMIO's collective buffering */
+            MPI_Info_set( info, "romio_cb_write", "enable" );
+            MPI_Info_set( info, "romio_cb_read", "enable" );
+            /* some buffer - check which */
+            MPI_Info_set( info, "cb_buffer_size", "268435456" ); /* 256MB */
+
             mThreadParams.dataCollector = new ParallelDomainCollector(
                         gc.getCommunicator().getMPIComm(),
-                        gc.getCommunicator().getMPIInfo(),
+                        info,
                         splashMpiSize,
                         maxOpenFilesPerNode);
         }

and on libSplash:

diff --git a/src/ParallelDataCollector.cpp b/src/ParallelDataCollector.cpp
index b8874ea..9e34929 100644
--- a/src/ParallelDataCollector.cpp
+++ b/src/ParallelDataCollector.cpp
@@ -50,9 +50,12 @@ namespace splash

         // set new cache size
         H5Pget_cache(fileAccProperties, &metaCacheElements, &rawCacheElements, &rawCacheSize, &policy);
-        rawCacheSize = 64 * 1024 * 1024;
+        rawCacheSize = 256 * 1024 * 1024;
         H5Pset_cache(fileAccProperties, metaCacheElements, rawCacheElements, rawCacheSize, policy);

+        H5Pset_sieve_buf_size(fileAccProperties, 4194304); /* 4MB; >=FS Blocksize */
+        H5Pset_alignment(fileAccProperties, 4194304, 2097152); /* 4/2MB ~same */
+
         log_msg(3, "Raw Data Cache = %llu KiB", (long long unsigned) (rawCacheSize / 1024));
     }
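Taken together, the two patches amount to the following file access property list setup. This is a standalone sketch of the tuned values above, not the actual libSplash code; comm and info are whatever the caller provides.

    #include <hdf5.h>
    #include <mpi.h>

    /* Sketch: HDF5 file access property list with the experimental tuning
     * from the patches above (parallel MPI-IO driver + cache/sieve/alignment). */
    hid_t createTunedFileAccessList(MPI_Comm comm, MPI_Info info)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* parallel MPI-IO driver, forwarding the MPI_Info hints */
        H5Pset_fapl_mpio(fapl, comm, info);

        /* enlarge the raw data chunk cache (patched from 64 to 256 MiB) */
        int metaCacheElements;
        size_t rawCacheElements, rawCacheSize;
        double policy;
        H5Pget_cache(fapl, &metaCacheElements, &rawCacheElements,
                     &rawCacheSize, &policy);
        rawCacheSize = 256 * 1024 * 1024;
        H5Pset_cache(fapl, metaCacheElements, rawCacheElements,
                     rawCacheSize, policy);

        /* sieve buffer and alignment, >= file system block size */
        H5Pset_sieve_buf_size(fapl, 4194304);     /* 4 MiB */
        H5Pset_alignment(fapl, 4194304, 2097152); /* 4 MiB threshold, 2 MiB alignment */

        return fapl;
    }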

ax3l commented 10 years ago

Note: using a plain, unpatched version of PIConGPU and libSplash and increasing the number of OSTs by hand with lfs setstripe -c 16 /path already speeds up the 128 GPU test a bit (vs. the default of 4 OSTs).

This can be done in a machine-specific job file template with lfs setstripe -c 16 /path/to/run/base/dir on Lustre, which would at least allow omitting that setting in MPI_Info.
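For reference, the equivalent striping request via MPI-IO hints (the ones commented out in the patch above) would look roughly like this; the hints typically only take effect when the file is newly created, and the values are examples:

    /* Sketch: request Lustre striping through MPI-IO hints instead of
     * `lfs setstripe`; only honoured for newly created files. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");    /* number of OSTs */
    MPI_Info_set(info, "striping_unit", "4194304"); /* 4 MiB per stripe */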

ax3l commented 10 years ago

For testing purposes, I configured T3PIO with ./configure --prefix=$MEMBERWORK/$proj/lib/t3pio --with-phdf5=$HDF5_DIR --host x86_64 --with-numcores=1 --with-node-memory=32000 on Titan.

Performance so far is equal to the non-tuned version for various OST configurations on the root dir of the simulation, set via lfs setstripe -c <n> /dir/to/sim.

I tried it again with --with-numcores=1 to better reflect the one output task per GPU and node (https://github.com/TACC/t3pio/issues/3): it gave me a significant improvement in speed (over --with-numcores=16). It also looks like a 1.25x to 1.5x speedup over providing no MPI_Info.

diff --git a/src/picongpu/CMakeLists.txt b/src/picongpu/CMakeLists.txt
index 27fd54e..3aa47ff 100644
--- a/src/picongpu/CMakeLists.txt
+++ b/src/picongpu/CMakeLists.txt
@@ -252,6 +252,12 @@ if(Boost_VERSION EQUAL 105500)
       "${CUDA_NVCC_FLAGS} \"-DBOOST_NOINLINE=__attribute__((noinline))\" ")
 endif(Boost_VERSION EQUAL 105500)

+################################################################################
+# Find T3PIO
+################################################################################
+
+include_directories(SYSTEM "$ENV{T3PIO_ROOT}/include")
+set(LIBS ${LIBS} "$ENV{T3PIO_ROOT}/lib/libt3pio.a")

 ################################################################################
 # PMacc options
diff --git a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
index a780e3d..c2d9d5e 100644
--- a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
+++ b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
@@ -35,6 +35,8 @@
 #include "particles/frame_types.hpp"

 #include <splash/splash.h>
+#include <t3pio.h>
+#include <boost/filesystem.hpp>

 #include "fields/FieldB.hpp"
 #include "fields/FieldE.hpp"
@@ -83,6 +85,7 @@ using namespace splash;

 namespace po = boost::program_options;
+namespace fs = boost::filesystem;

 /**
  * Writes simulation data to hdf5 files using libSplash.
@@ -231,9 +234,43 @@ private:
         if (mThreadParams.dataCollector == NULL)
         {
             GridController<simDim> &gc = Environment<simDim>::get().GridController();
+
+            /* hacked: some flags for HDF5 and MPI-I/O */
+
+            MPI_Info info = MPI_INFO_NULL;
+            MPI_Info_create( &info );
+            int globalSize = gc.getGpuNodes().productOfComponents();
+
+            fs::path p(h5Filename);
+            /* keep the string alive: parent_path() returns a temporary */
+            const std::string parentDir = p.parent_path().string();
+            const char *dir = parentDir.c_str();
+
+            t3pio_set_info( gc.getCommunicator().getMPIComm(),
+                            info,
+                            dir,
+                            T3PIO_GLOBAL_SIZE,
+                            globalSize);
+
             mThreadParams.dataCollector = new ParallelDomainCollector(
                                          gc.getCommunicator().getMPIComm(),
-                                         gc.getCommunicator().getMPIInfo(),
+                                         info,
                                          splashMpiSize,
                                          maxOpenFilesPerNode);
         }

psychocoderHPC commented 9 years ago

IMO the HDF5 fill_value is not written by default:

Documentation: H5D_FILL_TIME_IFSET Write fill values to the dataset when storage space is allocated only if there is a user-defined fill value, i.e., one set with H5Pset_fill_value. (Default)
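If one wants to make this explicit, or rule out fill writes even when a fill value is set, the dataset creation property list can be adjusted; a minimal sketch with the plain HDF5 C API:

    #include <hdf5.h>

    /* Sketch: dataset creation property list that never writes fill values,
     * so space allocation does not touch every block on disk.
     * The default is H5D_FILL_TIME_IFSET, as quoted above. */
    hid_t makeNoFillDcpl(void)
    {
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);
        return dcpl; /* pass as the dcpl argument of H5Dcreate2() */
    }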