I want to check in code that eliminates the restriction that the dimensions of the process topology (XLEN,YLEN,ZLEN) must divide the corresponding dimensions of the mesh (nxc, nyc, nzc).
Until now this has always been assumed in the code, and since the resolution of
issue #10 this assumption has been explicitly enforced.
One of the main motivations for eliminating this restriction is that I/O performance is potentially compromised by it. The fundamental issue is that, for every nonperiodic dimension, the dimension of the array of nodes that must be saved (e.g. nxn) is one greater than the dimension of the array of cells that must be saved (e.g. nxc): nxn = nxc + 1. (Boundary nodes on the face between two process subdomains are shared by both processes.) This means that, under the current restriction, it is impossible for all processes to write the same amount of data unless redundant data is written, which has performance implications for parallel HDF5 in its current state. Eliminating the restriction will allow the user to decrement nxc by 1 so that nxn is divisible by XLEN; if XLEN divides nxn, and likewise for the y and z dimensions, then the subarray of nodes written by each process has the same shape, allowing for efficient use of chunking and collective I/O.
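To make the arithmetic concrete, here is a minimal illustrative sketch (not code from the repository; the even-split rule and the helper name are assumptions for illustration) of how the node count along x decomposes across XLEN processes before and after decrementing nxc:

```c
#include <stdio.h>

/* Illustrative only: how many nodes each of XLEN processes would own
 * along x if nodes are dealt out as evenly as possible.
 * For a nonperiodic dimension, nxn = nxc + 1. */
static void show_decomposition(int nxc, int XLEN)
{
    int nxn = nxc + 1;
    printf("nxc = %d, nxn = %d, XLEN = %d:", nxc, nxn, XLEN);
    for (int rank = 0; rank < XLEN; rank++) {
        /* even split, with the remainder spread over the first ranks */
        int local = nxn / XLEN + (rank < nxn % XLEN ? 1 : 0);
        printf(" %d", local);
    }
    printf("\n");
}

int main(void)
{
    show_decomposition(64, 4); /* nxn = 65: slabs of 17,16,16,16 (unequal) */
    show_decomposition(63, 4); /* nxn = 64: slabs of 16,16,16,16 (equal)   */
    return 0;
}
```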
To give some background, HDF5 allows the use of chunking when saving a 3D array: the array is partitioned into 3D chunks, all with the same dimensions.
Collective I/O means that intermediate buffering agents (aggregators) gather the data of the processes that need to access a contiguous region of the file. The relationship between the chunk dimensions and the subgrid dimensions of each process determines whether collective I/O can be used and, if so, whether it is per-chunk or trans-chunk.
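As a sketch of what this looks like in the HDF5 C API (illustrative only: the file name, dataset name "Ex", dimensions, and the assumption of 4 ranks decomposed along x are placeholders, not the actual output code), chunking is enabled on the dataset-creation property list and collective I/O is requested on the dataset-transfer property list:

```c
#include <hdf5.h>
#include <mpi.h>

/* Sketch: create a chunked 3D dataset and request collective writes.
 * Assumes 4 MPI ranks, each owning one 16x64x64 slab along x. */
void write_chunked_collective(MPI_Comm comm, const double *local_buf)
{
    hsize_t dims[3]  = {64, 64, 64};   /* global node array              */
    hsize_t chunk[3] = {16, 64, 64};   /* one chunk per process (XLEN=4) */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);                  /* enable chunking    */

    hid_t filespace = H5Screate_simple(3, dims, NULL);
    hid_t dset = H5Dcreate(file, "Ex", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    int rank;
    MPI_Comm_rank(comm, &rank);
    hsize_t start[3] = {(hsize_t)rank * chunk[0], 0, 0};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, chunk, NULL);
    hid_t memspace = H5Screate_simple(3, chunk, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* request collective */

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local_buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(dcpl); H5Fclose(file); H5Pclose(fapl);
}
```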
In the current implementation of parallel HDF5, even if collective I/O is requested, it is not used when processes cooperate to write a single array unless one of two requirements is met. The first is that the portion of the array written by each process is contained in a single chunk; in that case the data for each chunk is collected and written independently. The documentation of the second case is less clear to me, but it seems that each chunk must be contained in the portion of data written by a single process and that the number of chunks must be the same for all processes; in this case, I believe that collective I/O is trans-chunk and that multiple chunks are collected and written together. If neither case holds, the current implementation of parallel HDF5 falls back to independent I/O, where each process writes its output on its own, without any intermediate buffering to prevent contention for write access. See the official page for parallel HDF5 documentation, in particular the parallel HDF5 hints. Supposedly one can use H5Pset_dxpl_mpio_chunk_opt() to specify the mode of collective I/O (linked-chunk or multi-chunk), but the documentation is opaque about the restrictions under which this request is honored.
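For reference, a minimal sketch of how one would request a specific collective-chunking mode on the transfer property list (the choice of mode here is just an example; whether the library actually honors it is subject to the restrictions discussed above):

```c
#include <hdf5.h>

/* Sketch: request linked-chunk (one-link-chunk) collective I/O, or
 * alternatively multi-chunk collective I/O, on a transfer property list. */
hid_t make_transfer_plist(void)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* Request linked-chunk collective I/O ... */
    H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);
    /* ... or instead request per-chunk (multi-chunk) collective I/O:
     * H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_MULTI_IO); */

    return dxpl;
}
```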
As noted in these notes from an HDF5 extreme scaling workshop, one can call H5Pget_mpio_actual_io_mode() to query whether collective I/O was actually used and H5Pget_mpio_actual_chunk_opt_mode() to query which chunk optimization mode (linked-chunk or multi-chunk, if any) was actually applied.
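A minimal sketch of these queries, issued on the transfer property list after the write has completed (variable names are placeholders, and error checking is omitted):

```c
#include <hdf5.h>
#include <stdio.h>

/* Sketch: after H5Dwrite, query what parallel HDF5 actually did. */
void report_actual_io(hid_t dxpl)
{
    H5D_mpio_actual_io_mode_t        io_mode;
    H5D_mpio_actual_chunk_opt_mode_t opt_mode;

    H5Pget_mpio_actual_io_mode(dxpl, &io_mode);
    H5Pget_mpio_actual_chunk_opt_mode(dxpl, &opt_mode);

    /* io_mode tells whether the write was collective, independent, or mixed;
     * opt_mode tells whether linked-chunk or multi-chunk optimization ran.  */
    printf("collective?  %s\n",
           io_mode == H5D_MPIO_NO_COLLECTIVE ? "no" : "at least partly");
    printf("chunk opt:   %s\n",
           opt_mode == H5D_MPIO_LINK_CHUNK  ? "linked-chunk" :
           opt_mode == H5D_MPIO_MULTI_CHUNK ? "multi-chunk"  :
                                              "none");
}
```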