ecmwf-ifs / ectrans

Global spherical harmonics transforms library underpinning the IFS
Apache License 2.0

Role of LSYNC_TRANS #174

Open samhatfield opened 1 week ago

samhatfield commented 1 week ago

SETUP_TRANS0 has an option LSYNC_TRANS, which is described in TPM_GEN as "activate barriers in trmtol and trltom".

In the cpu subtree, this variable does nothing. The barriers in TRMTOL and TRLTOM have been commented out for 11 years now.

In the gpu subtree there are many uses of LSYNC_TRANS, e.g. in TRGTOL, LEINV, LTINV etc. In each case the variable controls whether an MPL_BARRIER is executed, and the time taken to satisfy the barrier is measured using GSTATS.
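The pattern described above (an optional barrier whose wait time is recorded under its own counter) can be sketched as follows. This is a Python toy, not ectrans code: threads stand in for MPI ranks, `threading.Barrier` stands in for MPL_BARRIER, and the `timers` dictionary stands in for GSTATS. The names `timed_barrier`, `run_demo`, and the counter label are hypothetical.

```python
import threading
import time
from collections import defaultdict

# Stand-in for GSTATS counters (assumption: a plain dict of accumulated seconds).
timers = defaultdict(float)
timers_lock = threading.Lock()


def timed_barrier(barrier, label, lsync_trans):
    """If lsync_trans is set, wait at the barrier and record the wait under `label`."""
    if not lsync_trans:
        return 0.0
    start = time.perf_counter()
    barrier.wait()  # stand-in for MPL_BARRIER
    waited = time.perf_counter() - start
    with timers_lock:
        timers[label] += waited
    return waited


def run_demo(lsync_trans, delay=0.05):
    """Run two 'ranks'; rank 1 reaches the barrier `delay` seconds late.

    Returns the time each rank spent waiting at the barrier.
    """
    barrier = threading.Barrier(2)
    waits = [0.0, 0.0]

    def rank(i):
        time.sleep(delay * i)  # rank 0 arrives immediately, rank 1 arrives late
        waits[i] = timed_barrier(barrier, "sync_before_comm", lsync_trans)

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waits
```

With the switch on, the rank that arrives early accumulates roughly `delay` seconds in the barrier counter, while the late rank accumulates almost nothing. With the switch off, no barrier is executed and nothing is recorded.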

However, it's not clear what all the different uses have in common. I thought initially that LSYNC_TRANS might be used to measure device<->host transfer times, but that doesn't seem to be right.

@lukasm91 could you give us some advice here? Do we need to review the uses of LSYNC_TRANS?

At the very least we should document this option properly, as the description in TPM_GEN is no longer valid.

lukasm91 commented 1 week ago

On the GPU, this option has existed since before I started using ectrans, and I really appreciate having it. It is very likely not an option you want to enable in operations, but besides that, whenever you are doing performance testing, LSYNC_TRANS and the related barriers are extremely useful because you can be sure that you do not attribute load imbalance to the wrong counters.

On GPU I tried to make the option as useful as possible. The idea is to make it possible to understand the performance of:

a) The communications (TRLTOG, TRGTOL, TRLTOM, TRMTOL), and if possible really only the communication, so that the expectation can be that those scale as well as possible. Packing/unpacking does not belong to the communication.
b) FFTs, because they are a major component and they can have significant load imbalance.
c) GEMMs, because they are a major component and they can have significant load imbalance.
d) The whole rest, which is usually relatively small and not the major source of load imbalance.
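To illustrate why a barrier in front of the communication keeps load imbalance out of the communication counter, here is a small Python simulation. It is only a sketch under stated assumptions, not ectrans code: threads stand in for MPI ranks, `time.sleep` stands in for compute work and data transfer, and a `threading.Barrier` models a collective that no rank can complete before all ranks have arrived. The function name `simulate` and the counter names are hypothetical.

```python
import threading
import time


def simulate(sync, work=(0.01, 0.06), transfer=0.01):
    """Time compute, optional sync, and communication per rank.

    work[i] is the (imbalanced) compute time of rank i in seconds.
    Returns one dict of counters per rank.
    """
    n = len(work)
    sync_barrier = threading.Barrier(n)  # the optional explicit barrier
    exchange = threading.Barrier(n)      # models the collective's implicit wait
    counters = [{"compute": 0.0, "sync": 0.0, "comm": 0.0} for _ in range(n)]

    def rank(i):
        t0 = time.perf_counter()
        time.sleep(work[i])              # imbalanced compute phase
        t1 = time.perf_counter()
        if sync:
            sync_barrier.wait()          # absorb the imbalance here
        t2 = time.perf_counter()
        exchange.wait()                  # collective cannot finish until all arrive
        time.sleep(transfer)             # the actual data movement
        t3 = time.perf_counter()
        counters[i].update(compute=t1 - t0, sync=t2 - t1, comm=t3 - t2)

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counters


if __name__ == "__main__":
    for sync in (False, True):
        fast_rank = simulate(sync)[0]
        print(sync, {k: round(v, 3) for k, v in fast_rank.items()})
```

Without the barrier, the fast rank's `comm` counter contains both the transfer and the time spent waiting for the slow rank. With the barrier, that waiting time lands in `sync` and `comm` is left with roughly the transfer time only.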

On the CPU, I see that b) and c) may be problematic because they are inside the OpenMP loop, but rather than removing LSYNC_TRANS from the GPU code, I would suggest making those as meaningful as possible for the CPU.

There is also NTRANS_SYNC_LEVEL. I have no strong opinion on this. I think in a reasonable solution it is not needed: either we do synchronizations or we don't, so there is no need for a level.

Does that help? If you have more questions let me know :)