Closed — dqwu closed this issue 6 years ago
With GPTL timers the bottleneck has been identified in the following code:
int box_rearrange_create(...)
{
    ...
    /* For each IO task send starts/counts to all compute tasks. */
    for (int i = 0; i < ios->num_iotasks; i++)
    {
        ...
        /* The count array from iotask i is sent to all compute tasks. */
        ...
        if ((ret = pio_swapm(iodesc->firstregion->count, ...)))
        ...
        /* The start array from iotask i is sent to all compute tasks. */
        ...
        if ((ret = pio_swapm(iodesc->firstregion->start, ...)))
        ...
    }
    ...
}
Note that there are two back-to-back pio_swapm calls inside a loop. In an E3SM Benchmark F case run, each call was executed 11,700 times. The first one (on the count array) took 119.684 seconds, but the second one (on the start array) took 1469.004 seconds. If we swap the order of the two calls, the result is the same: whichever call runs second always takes much more time.
Thanks. I will look into this issue.
One improvement has been confirmed. If we combine the two pio_swapm calls into a single one (send/receive start and count together, use arrays of doubled size), the total time for that single pio_swapm call is only 70 seconds. As a result, PIOc_initdecomp time is reduced from 1603 seconds to 103 seconds. The case run time is reduced from 42 minutes to 17 minutes.
@dqwu : Can you issue a PR with the fix?
@jayeshkrishna I think we can even get rid of the loop (at the cost of a larger local array to store the starts and counts). We can look at how llen is processed: it needs no loop around pio_swapm, but does use a larger array.
/* All-gather the llen to all tasks into array iomaplen. */
LOG((3, "calling pio_swapm to allgather llen into array iomaplen, ndims = %d dtypes[0] = %d",
     ndims, dtypes[0]));
if ((ret = pio_swapm(&iodesc->llen, sendcounts, sdispls, dtypes, iomaplen, recvcounts,
                     rdispls, dtypes, ios->union_comm, &iodesc->rearr_opts.io2comp)))
    return pio_err(ios, NULL, ret, __FILE__, __LINE__);
OK, go ahead and fix that issue too. But I would recommend merging your current fix into develop/master first, since it significantly improves performance. I am assigning this issue to you.
@dqwu: After a quick look at the function, I think we can simplify the algorithm even further. Please merge your current changes, and then I can work on that simplification.
This issue has been confirmed by GPTL timing information from some high-resolution ACME F cases run on supercomputers:
[Cori@NERSC, 17408 MPI tasks]
PIO1: 1 min 25 sec for 76 PIO_initdecomp calls (1.12 sec per call)
PIO2: 17 min 34 sec for 76 PIO_initdecomp calls (13.87 sec per call)

[Titan@OLCF, 43200 MPI tasks]
PIO1: 2 min 45 sec for 76 PIO_initdecomp calls (2.17 sec per call)
PIO2: 67 min 21 sec for 76 PIO_initdecomp calls (53.17 sec per call)
When box rearranger is used, PIO_initdecomp calls box_rearrange_create. Most of the run time is spent in pio_swapm (called by box_rearrange_create). box_rearrange_create is implemented differently in PIO1, which uses "gather and then broadcast" instead of calling pio_swapm.
PIO2 should reduce the run time of box_rearrange_create to a level comparable to PIO1's by adopting a different implementation.