E3SM-Project / scorpio

A high-level Parallel I/O Library for structured grid applications

Slow performance of pio_read_darray with ADIOS IO type #562

Closed: dqwu closed this issue 4 months ago

dqwu commented 4 months ago

In a benchmark IO performance test with the scream eamxx ne256 F case on Frontier (768 MPI tasks, 96 nodes, 8 MPI tasks per node, and 1 IO task per node), the ADIOS read of the restart data was considerably slower than the PnetCDF read.

[ADIOS IO type]

    "avg_rtput(MB/s)": 1041.038119
    "tot_rb(bytes)": 145219393536
    "tot_rtime(s)": 133.032598

[PnetCDF IO type]

    "avg_rtput(MB/s)": 28625.460208
    "tot_rb(bytes)": 145219393536
    "tot_rtime(s)": 4.838071

In another test, GPTL timers show that pio_read_darray dominates the read time, with a large spread across ranks (about 134 s on the slowest rank versus under 3 s on the fastest):

"a_i:PIO:PIOc_read_darray"                                    -        768      768 3.148800e+04   5.334925e+04   134.149 (   767      0)     2.679 (     1      0)
"a_i:PIO:PIOc_read_darray_adios"                              -        768      768 1.996800e+04   5.281263e+04   133.477 (   767      0)     2.018 (     1      0)

This performance issue can be replicated with a standalone SCORPIO C test on Frontier using the same PE layout. The test involves writing a single 6D record variable (approximately 54 GB in size) with pio_write_darray and subsequently reading it back with pio_read_darray.

[ADIOS IO Type]

    Max read time: 81.463893
    Min read time: 0.187587
    Average read time: 43.897539

[PnetCDF IO Type]

    Max read time: 1.046299
    Min read time: 0.860642
    Average read time: 0.977389

@pnorbert successfully reproduced this issue with 24 Frontier nodes and 768 MPI tasks. A trace was generated, revealing an intriguing pattern: the higher the rank, the more frequently BP5Reader::PerformGets() is called.

dqwu commented 4 months ago

A standalone SCORPIO C test to reproduce this performance issue on Frontier:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <pio.h>
#ifdef TIMING
#include <gptl.h>
#endif

#define ERR { if (ret != PIO_NOERR) printf("rank = %d, error at line = %d\n", my_rank, __LINE__); }

int main(int argc, char* argv[])
{
    const int dim10_len = 10;
    const int dim2_len = 2;
    const int elem_len = 393216;
    const int gp_len = 4;
    const int ilev_len = 129;
    const int lev_len = 128;
    const int ncol_len = 1572864;
    const int element_per_pe = 10485760;
    const int niotasks = 96;
    const int ioproc_stride = 8;
    const int ioproc_start = 0;

    int my_rank;
    int ntasks;

    int format = PIO_IOTYPE_ADIOS; /* Change to PIO_IOTYPE_PNETCDF for comparison */

    int iosysid;

    int ncid;
    int ncid_read;

    int dimids[6];
    int dimid_time;
    int dimid_dim10;
    int dimid_dim2;
    int dimid_elem;
    int dimid_gp;
    int dimid_ilev;
    int dimid_lev;
    int dimid_ncol;

    int varid_write;
    int varid_time;
    int varid_Qdp_dyn;

    int ioid_double_5D;
    int gdimlen[5] = {elem_len, dim10_len, gp_len, gp_len, lev_len};

    PIO_Offset* compmap = calloc(element_per_pe, sizeof(PIO_Offset));
    double* write_buffer = calloc(element_per_pe, sizeof(double));
    double* read_buffer = calloc(element_per_pe, sizeof(double));

    double read_time;
    double max_read_time;
    double min_read_time;
    double sum_read_time;

    int ret = PIO_NOERR;

#ifdef TIMING
    GPTLinitialize();
#endif

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (ntasks != 768)
    {
        if (my_rank == 0)
            printf("This test must be run with exactly 768 MPI tasks!\n");

        MPI_Finalize();

#ifdef TIMING
        GPTLfinalize();
#endif

        return -1;
    }

    for (int i = 0; i < element_per_pe; i++)
    {
        compmap[i] = my_rank * (PIO_Offset)element_per_pe + i + 1;
        write_buffer[i] = my_rank;
    }

    ret = PIOc_Init_Intracomm(MPI_COMM_WORLD, niotasks, ioproc_stride, ioproc_start, PIO_REARR_BOX, &iosysid); ERR

    ret = PIOc_InitDecomp(iosysid, PIO_DOUBLE, 5, gdimlen, element_per_pe, compmap, &ioid_double_5D, NULL, NULL, NULL); ERR

    ret = PIOc_createfile(iosysid, &ncid, &format, "test_read.nc", PIO_CLOBBER); ERR

    ret = PIOc_def_dim(ncid, "time", PIO_UNLIMITED, &dimid_time); ERR
    ret = PIOc_def_dim(ncid, "dim10", dim10_len, &dimid_dim10); ERR
    ret = PIOc_def_dim(ncid, "dim2", dim2_len, &dimid_dim2); ERR
    ret = PIOc_def_dim(ncid, "elem", elem_len, &dimid_elem); ERR
    ret = PIOc_def_dim(ncid, "gp", gp_len, &dimid_gp); ERR
    ret = PIOc_def_dim(ncid, "ilev", ilev_len, &dimid_ilev); ERR
    ret = PIOc_def_dim(ncid, "lev", lev_len, &dimid_lev); ERR
    ret = PIOc_def_dim(ncid, "ncol", ncol_len, &dimid_ncol); ERR

    ret = PIOc_def_var(ncid, "time", PIO_DOUBLE, 1, &dimid_time, &varid_time); ERR

    dimids[0] = dimid_time;
    dimids[1] = dimid_elem;
    dimids[2] = dimid_dim10;
    dimids[3] = dimid_gp;
    dimids[4] = dimid_gp;
    dimids[5] = dimid_lev;
    ret = PIOc_def_var(ncid, "Qdp_dyn", PIO_DOUBLE, 6, dimids, &varid_Qdp_dyn); ERR

    ret = PIOc_enddef(ncid); ERR

    ret = PIOc_setframe(ncid, varid_Qdp_dyn, 0); ERR
    ret = PIOc_write_darray(ncid, varid_Qdp_dyn, ioid_double_5D, element_per_pe, write_buffer, NULL); ERR

    ret = PIOc_closefile(ncid); ERR

    ret = PIOc_openfile(iosysid, &ncid_read, &format, "test_read.nc", PIO_NOWRITE); ERR

    varid_Qdp_dyn = -1;
    ret = PIOc_inq_varid(ncid_read, "Qdp_dyn", &varid_Qdp_dyn); ERR

    MPI_Barrier(MPI_COMM_WORLD);

    read_time = MPI_Wtime();
    ret = PIOc_read_darray(ncid_read, varid_Qdp_dyn, ioid_double_5D, element_per_pe, read_buffer); ERR
    read_time = MPI_Wtime() - read_time;

    MPI_Reduce(&read_time, &max_read_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&read_time, &min_read_time, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&read_time, &sum_read_time, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
    {
        printf("max read time = %lf, min read time = %lf, average read time = %lf\n",
               max_read_time, min_read_time, sum_read_time / ntasks);
    }

    ret = PIOc_closefile(ncid_read); ERR

    ret = PIOc_freedecomp(iosysid, ioid_double_5D); ERR

    free(compmap);
    free(write_buffer);
    free(read_buffer);

    ret = PIOc_finalize(iosysid); ERR

    MPI_Finalize();

#ifdef TIMING
    GPTLfinalize();
#endif

    return 0;
}
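
For reference, one possible way to build and run this reproducer (the thread does not show the build steps; the binary path in the last comment suggests the file was added to SCORPIO's examples/c directory and built with the project's usual CMake setup):

    srun -n 768 -N 96 --cpu_bind=cores ./test_adios
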
dqwu commented 4 months ago

Initial comment from @pnorbert

pio_darray.c:2901

Each process needs to find its piece in the decomp/512 variable, which has 96 blocks. The processes loop through the ADIOS blocks of this variable, read in one block, and check whether their index is within the loaded block. If they find it, they cache that block and quit the loop. The last 8 processes in this run will find their index in the last block, but to get there they ask ADIOS to open 96 data.X files and read one block of data from each, reading 96x more than the first 8 processes do.

I don't have a solution in mind for this. It would be nice to know which block to read for a given set of indices (but we don't know without reading them), or to have the decomposition variables in a single file (which would require rank 0 to gather this data from everyone and write it alone), or to manage the decomposition variable in a distributed manner among the reading processes (which I don't know how to do).

I wonder why Dmitry did not see this in his tests since the high ranks deterministically read more blocks from more data files.
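
To make this skew concrete, here is a small standalone sketch (an illustration of the access pattern described above, not the actual SCORPIO code path; it assumes this run's layout of 768 compute ranks whose decomposition offsets were written in rank order by 96 IO tasks):

#include <stdio.h>

/* Illustration only: count how many decomposition blocks each rank would
 * read under a front-to-back scan, assuming block b holds the offsets of
 * ranks 8*b .. 8*b+7 (768 ranks, 96 blocks written in rank order). */
int main(void)
{
    const int nranks = 768;
    const int nblocks = 96;
    const int ranks_per_block = nranks / nblocks; /* 8 */

    long total_reads = 0;
    int max_reads = 0;

    for (int rank = 0; rank < nranks; rank++)
    {
        int home_block = rank / ranks_per_block; /* block holding this rank's offsets */
        int reads = home_block + 1;              /* blocks opened scanning from block 0 */

        total_reads += reads;
        if (reads > max_reads)
            max_reads = reads;
    }

    printf("max blocks read by one rank: %d\n", max_reads);                   /* 96 */
    printf("avg blocks read per rank: %.1f\n", (double)total_reads / nranks); /* 48.5 */
    printf("total block reads across all ranks: %ld\n", total_reads);         /* 37248 */

    return 0;
}

Since every block probe opens another data.X subfile, the highest ranks do roughly 96x the work of the lowest ones, which matches the trace showing more frequent BP5Reader::PerformGets() calls at higher ranks.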

dqwu commented 4 months ago

A patch provided by @pnorbert

diff --git a/src/clib/pio_darray.c b/src/clib/pio_darray.c
index c092ed20..410d4438 100644
--- a/src/clib/pio_darray.c
+++ b/src/clib/pio_darray.c
@@ -2898,7 +2898,14 @@ static int PIOc_read_darray_adios(file_desc_t *file, int fndims, io_desc_t *iode
         int64_t *decomp_int64_t = NULL;
         start_block_found = false;

-        for (size_t block = 0; block < decomp_blocks_size; block++)
+        iosystem_desc_t *ios;  /* Pointer to io system information. */
+        ios = file->iosystem;
+        int s = ios->num_comptasks / decomp_blocks_size;
+        int guess_block = ios->comp_rank/s;
+
+        /* Loop through the blocks starting from guess block to end, then 0 to guess_block - 1*/
+        size_t block = guess_block;
+        while(true)
         {
             adiosErr = adios2_set_block_selection(decomp_adios_var, block);
             if (adiosErr != adios2_error_none)
@@ -2966,6 +2973,12 @@ static int PIOc_read_darray_adios(file_desc_t *file, int fndims, io_desc_t *iode

             if (start_block_found)
                 break;
+
+            ++block;
+            if (block == decomp_blocks_size)
+                block = 0;
+            if (block == guess_block)
+                break; /* Arrived back to first block in search */
         }

         if (!start_block_found)

Latest comment from @pnorbert:

I figured we actually know the block we need when we restart with the same number of processes.

int s = ios->num_comptasks / decomp_blocks_size;
int guess_block = ios->comp_rank/s;

I don't know if this is true for all possible cases, so I changed the for loop in pio_darray.c to start with this block and, just in case, loop through all of them: from this block to the end, then from 0 back around to this block.
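
With this run's layout the guess is exact (assuming, as described in the earlier comment, that the decomposition blocks were written in rank order by the 96 IO tasks):

    s = num_comptasks / decomp_blocks_size = 768 / 96 = 8
    guess_block = comp_rank / 8, so ranks 0-7 probe block 0 first, ..., ranks 760-767 probe block 95 first

Each rank then finds its offsets in the first block it opens instead of scanning up to 96 of them, which is consistent with the drop in average read time from ~44 s to ~0.5 s reported below.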

The time result I got on Frontier is now:

    srun -n 768 -N 96 --cpu_bind=cores /ccs/proj/e2e/pnorbert/scorpio-first-io-issue/build/examples/c/test_adios

    max read time = 0.790125, min read time = 0.172398, average read time = 0.493598