ax3l opened this issue 10 years ago
This is not quite correct and not directly a libSplash issue. The ParallelDataCollector (PDC) constructor assumes nothing about the MPI_Info object; only PIConGPU sets it to MPI_INFO_NULL right now. So this is actually a PIConGPU issue.
Just a place to collect the data.
Actually it is a parallel HDF5 issue, which is in turn an MPI-IO issue, which is in turn a ROMIO issue in most cases. Since a general solution will likely be file system dependent (Lustre vs. GPFS vs. whoKnows), it might still be possible to find a generic approach based on the information we provide to libSplash.
Anyway, the first hacks will go into PIConGPU for sure.
First diff on PIConGPU (see ref):
diff --git a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
index 3f27d8f..261d943 100644
--- a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
+++ b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
@@ -35,6 +35,7 @@
#include "particles/frame_types.hpp"
#include <splash/splash.h>
+#include <hdf5.h>
#include "fields/FieldB.hpp"
#include "fields/FieldE.hpp"
@@ -322,9 +323,33 @@ private:
if ( mThreadParams.dataCollector == NULL)
{
GridController<simDim> &gc = Environment<simDim>::get().GridController();
+
+ /* hacked: some flags for HDF5 and MPI-I/O */
+ //H5Pset_sieve_buf_size( fapl_id, 4194304 ); /* 4MB; >=FS Blocksize*/
+ //H5Pset_alignment( fapl_id, 4194304, 2097152 ); /* 4/2MB ~same */
+
+ MPI_Info info = MPI_INFO_NULL;
+ MPI_Info_create( &info );
+
+ /*MPI_Info_set( info, "cb_nodes", "32"); aggregators: ref may be wrong,
+ should be stripe factor / ~2-4
+ MPI_Info_set( info, "striping_factor", "32");
+ MPI_Info_set( info, "striping_unit", "4194304"); bytes per stripe*/
+
+ MPI_Info_set (info, "cb_align", "2");
+ /*MPI_Info_set (info, "cb_nodes_list", "*:*"); */
+ MPI_Info_set( info, "direct_io", "true" );
+ /* lustre specific (no locking): Disable ROMIO's data-sieving */
+ MPI_Info_set( info, "romio_ds_read", "disable" );
+ MPI_Info_set( info, "romio_ds_write", "disable" );
+ /* lustre specific (no locking): Enable ROMIO's collective buffering */
+ MPI_Info_set( info, "romio_cb_write", "enable" );
+ MPI_Info_set( info, "romio_cb_read", "enable" );
+ /* some buffer - check which */
+ MPI_Info_set( info, "cb_buffer_size", "268435456" ); /* 256MB */
+
mThreadParams.dataCollector = new ParallelDomainCollector(
gc.getCommunicator().getMPIComm(),
- gc.getCommunicator().getMPIInfo(),
+ info,
splashMpiSize,
maxOpenFilesPerNode);
}
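Independent of these patches, it is possible to check which of these hints ROMIO actually keeps: open a throwaway file with the same MPI_Info and dump what the implementation reports back via MPI_File_get_info. A minimal stand-alone sketch (the probe file name and the subset of hints are illustrative, not part of the patch above):

```cpp
#include <mpi.h>
#include <cstdio>

/* Open a scratch file with our hints and print the hints MPI-IO really uses;
 * ROMIO silently drops keys it does not understand, so reading them back is
 * the only way to see what survived. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "268435456");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "hints_probe.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    MPI_Info used;
    MPI_File_get_info(fh, &used);

    int rank, nkeys;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Info_get_nkeys(used, &nkeys);
    if (rank == 0)
    {
        for (int i = 0; i < nkeys; ++i)
        {
            char key[MPI_MAX_INFO_KEY], value[256];
            int flag;
            MPI_Info_get_nthkey(used, i, key);
            MPI_Info_get(used, key, sizeof(value) - 1, value, &flag);
            std::printf("%s = %s\n", key, flag ? value : "(unset)");
        }
    }

    MPI_Info_free(&used);
    MPI_Info_free(&info);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```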
And the corresponding diff on libSplash:
diff --git a/src/ParallelDataCollector.cpp b/src/ParallelDataCollector.cpp
index b8874ea..9e34929 100644
--- a/src/ParallelDataCollector.cpp
+++ b/src/ParallelDataCollector.cpp
@@ -50,9 +50,12 @@ namespace splash
// set new cache size
H5Pget_cache(fileAccProperties, &metaCacheElements, &rawCacheElements, &rawCacheSize, &policy);
- rawCacheSize = 64 * 1024 * 1024;
+ rawCacheSize = 256 * 1024 * 1024;
H5Pset_cache(fileAccProperties, metaCacheElements, rawCacheElements, rawCacheSize, policy);
+ H5Pset_sieve_buf_size(fileAccProperties, 4194304); /* 4MB; >=FS Blocksize */
+ H5Pset_alignment(fileAccProperties, 4194304, 2097152); /* 4/2MB ~same */
+
log_msg(3, "Raw Data Cache = %llu KiB", (long long unsigned) (rawCacheSize / 1024));
}
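The sieve buffer and alignment values can be read back from the file access property list to confirm they were applied; a small stand-alone sketch against a fresh fapl (not a libSplash patch), assuming the same 4 MiB / 2 MiB values as above:

```cpp
#include <hdf5.h>
#include <cstdio>

int main()
{
    /* stand-alone file access property list, mirroring the patched values */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_sieve_buf_size(fapl, 4194304);      /* 4 MiB; >= FS block size */
    H5Pset_alignment(fapl, 4194304, 2097152);  /* 4 MiB threshold, 2 MiB alignment */

    size_t sieve = 0;
    hsize_t threshold = 0, alignment = 0;
    H5Pget_sieve_buf_size(fapl, &sieve);
    H5Pget_alignment(fapl, &threshold, &alignment);

    std::printf("sieve buffer = %zu B, threshold = %llu B, alignment = %llu B\n",
                sieve, (unsigned long long)threshold, (unsigned long long)alignment);

    H5Pclose(fapl);
    return 0;
}
```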
Note: with plain, unpatched versions of PIConGPU and libSplash, increasing the number of OSTs by hand with lfs setstripe -c 16 /path already speeds up the 128-GPU test a bit (vs. the default of 4 OSTs).
On Lustre this can be done in a machine-specific job file template with lfs setstripe -c 16 /path/to/run/base/dir, which would allow omitting at least that hint from the MPI_Info.
For testing purposes I configured T3PIO on Titan with
./configure --prefix=$MEMBERWORK/$proj/lib/t3pio --with-phdf5=$HDF5_DIR --host x86_64 --with-numcores=1 --with-node-memory=32000
Performance is so far equal to the non-tuned version for various OST configurations set on the root dir of the simulation via lfs setstripe -c <n> /dir/to/sim.
I tried it again with --with-numcores=1 to better reflect the one output task per GPU and node (https://github.com/TACC/t3pio/issues/3): this gave a significant improvement in speed over --with-numcores=16. It also looks like a 1.25x to 1.5x speedup over providing no MPI_Info at all.
diff --git a/src/picongpu/CMakeLists.txt b/src/picongpu/CMakeLists.txt
index 27fd54e..3aa47ff 100644
--- a/src/picongpu/CMakeLists.txt
+++ b/src/picongpu/CMakeLists.txt
@@ -252,6 +252,12 @@ if(Boost_VERSION EQUAL 105500)
"${CUDA_NVCC_FLAGS} \"-DBOOST_NOINLINE=__attribute__((noinline))\" ")
endif(Boost_VERSION EQUAL 105500)
+################################################################################
+# Find T3PIO
+################################################################################
+
+include_directories(SYSTEM "$ENV{T3PIO_ROOT}/include")
+set(LIBS ${LIBS} "$ENV{T3PIO_ROOT}/lib/libt3pio.a")
################################################################################
# PMacc options
diff --git a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
index a780e3d..c2d9d5e 100644
--- a/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
+++ b/src/picongpu/include/plugins/hdf5/HDF5Writer.hpp
@@ -35,6 +35,8 @@
#include "particles/frame_types.hpp"
#include <splash/splash.h>
+#include <t3pio.h>
+#include <boost/filesystem.hpp>
#include "fields/FieldB.hpp"
#include "fields/FieldE.hpp"
@@ -83,6 +85,7 @@ using namespace splash;
namespace po = boost::program_options;
+namespace fs = boost::filesystem;
/**
* Writes simulation data to hdf5 files using libSplash.
@@ -231,9 +234,43 @@ private:
if (mThreadParams.dataCollector == NULL)
{
GridController<simDim> &gc = Environment<simDim>::get().GridController();
+
+ /* hacked: some flags for HDF5 and MPI-I/O */
+
+ MPI_Info info = MPI_INFO_NULL;
+ MPI_Info_create( &info );
+ int globalSize = gc.getGpuNodes().productOfComponents();
+
+ fs::path p(h5Filename);
+ /* keep the string alive: c_str() on the temporary returned by
+ parent_path() would leave dir dangling once this statement ends */
+ const std::string dirString = p.parent_path().string();
+ const char *dir = dirString.c_str();
+
+ t3pio_set_info( gc.getCommunicator().getMPIComm(),
+ info,
+ dir,
+ T3PIO_GLOBAL_SIZE,
+ globalSize);
+
mThreadParams.dataCollector = new ParallelDomainCollector(
gc.getCommunicator().getMPIComm(),
- gc.getCommunicator().getMPIInfo(),
+ info,
splashMpiSize,
maxOpenFilesPerNode);
}
IMO the HDF5 fill_value is not used by default. Documentation:
H5D_FILL_TIME_IFSET: Write fill values to the dataset when storage space is allocated only if there is a user-defined fill value, i.e., one set with H5Pset_fill_value. (Default)
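If we want to make that explicit, the fill value can be unset and fill-time disabled on the dataset creation property list. A minimal sketch (the helper name and the float datatype are just for illustration):

```cpp
#include <hdf5.h>

/* Build a dataset creation property list that never writes fill values:
 * unset the fill value (NULL) and set fill time to NEVER. */
hid_t make_no_fill_dcpl()
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_fill_value(dcpl, H5T_NATIVE_FLOAT, NULL);  /* remove any fill value */
    H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);      /* never write fill data */
    return dcpl;
}
```

The returned dcpl would then be passed to H5Dcreate2 as the dataset creation property list.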
We should make use of an MPI_Info object.
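One possible shape for that, as a sketch only: consolidate the hard-coded ROMIO hints from the first diff into a helper that PIConGPU passes to the ParallelDomainCollector constructor instead of MPI_INFO_NULL (the helper name is made up; the hint values are copied from above):

```cpp
#include <mpi.h>

/* Hypothetical helper: collect the Lustre/ROMIO hints from the hack above
 * in one place so PIConGPU can hand them to libSplash's constructor. */
MPI_Info buildLustreIOHints()
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* lustre specific (no locking): disable ROMIO's data sieving */
    MPI_Info_set(info, "romio_ds_read",  "disable");
    MPI_Info_set(info, "romio_ds_write", "disable");
    /* lustre specific (no locking): enable ROMIO's collective buffering */
    MPI_Info_set(info, "romio_cb_read",  "enable");
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_buffer_size", "268435456"); /* 256 MiB */

    return info;
}
```

The returned info would replace gc.getCommunicator().getMPIInfo() in the constructor call and can be released with MPI_Info_free once it is no longer needed.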
Some resources:
- TACC Parallel I/O Workshop (page 56ff. on PHDF5/MPI_Info: striping_factor/striping_unit)
- (direct_io should rather be true; use the env variable MPICH_MPIIO_HINTS instead of hard-coding; states to use cb_align 2)
- fill_value: disable via NULL in H5Pset_fill_value and H5D_FILL_TIME_NEVER
- H5Pset_alignment to the disk block size is reported to improve performance, also IBM_largeblock_io=true for MPI hints in H5Pset_fapl_mpio
- Hacking spot (assumes MPI_INFO_NULL): takes the user input from the constructor