Closed — dqwu closed this issue 6 years ago
With GPTL timers the bottleneck has been identified in the following code:
int box_rearrange_create(...)
{
    ...
    /* For each IO task send starts/counts to all compute tasks. */
    for (int i = 0; i < ios->num_iotasks; i++)
    {
        ...
        /* The count array from iotask i is sent to all compute tasks. */
        ...
        if ((ret = pio_swapm(iodesc->firstregion->count, ...)))
        ...
        /* The start array from iotask i is sent to all compute tasks. */
        ...
        if ((ret = pio_swapm(iodesc->firstregion->start, ...)))
        ...
    }
    ...
}
Note that there are two back-to-back pio_swapm calls inside a loop. In an E3SM Benchmark F case run, each call was executed 11,700 times. The first one (on the count array) took 119.684 seconds, but the second one (on the start array) took 1469.004 seconds. If we swap the order of the two calls, the result is the same: whichever call runs second always takes much more time.
Thanks. I will look into this issue.
One improvement has been confirmed. If we combine the two pio_swapm calls into a single one (send/receive start and count together, use arrays of doubled size), the total time for that single pio_swapm call is only 70 seconds. As a result, PIOc_initdecomp time is reduced from 1603 seconds to 103 seconds. The case run time is reduced from 42 minutes to 17 minutes.
@dqwu : Can you issue a PR with the fix?
@jayeshkrishna I think we can even get rid of the loop (at the cost of a larger local array to store the starts and counts). We can look at how llen is processed: it needs no loop around pio_swapm, but does use a larger array.
/* All-gather the llen to all tasks into array iomaplen. */
LOG((3, "calling pio_swapm to allgather llen into array iomaplen, ndims = %d dtypes[0] = %d",
     ndims, dtypes[0]));
if ((ret = pio_swapm(&iodesc->llen, sendcounts, sdispls, dtypes, iomaplen, recvcounts,
                     rdispls, dtypes, ios->union_comm, &iodesc->rearr_opts.io2comp)))
    return pio_err(ios, NULL, ret, __FILE__, __LINE__);
OK, go ahead and fix that issue too. But I would recommend merging your current fix into develop/master first, since it significantly improves performance. I am assigning this issue to you.
@dqwu: After a quick look at the function, I think we can simplify the algorithm even further. Please merge your current changes, and then I can work on that simplification.
This issue has been confirmed by GPTL timing information from some high-resolution ACME F cases run on supercomputers:
[Cori@NERSC, 17408 MPI tasks]
PIO1: 1 min 25 sec for 76 PIO_initdecomp calls (1.12 sec per call)
PIO2: 17 min 34 sec for 76 PIO_initdecomp calls (13.87 sec per call)

[Titan@OLCF, 43200 MPI tasks]
PIO1: 2 min 45 sec for 76 PIO_initdecomp calls (2.17 sec per call)
PIO2: 67 min 21 sec for 76 PIO_initdecomp calls (53.17 sec per call)
When box rearranger is used, PIO_initdecomp calls box_rearrange_create. Most of the run time is spent in pio_swapm (called by box_rearrange_create). box_rearrange_create is implemented differently in PIO1, which uses "gather and then broadcast" instead of calling pio_swapm.
PIO2 should reduce the run time of box_rearrange_create to a level comparable to PIO1's by adopting a different implementation.