Open samhatfield opened 1 week ago
On the GPU, this option has existed since before I started using ectrans, and I really appreciate having it. It is very likely not an option you want enabled in operations, but whenever you are doing performance testing, LSYNC_TRANS and the related barriers are extremely useful because you can be sure that you do not attribute load imbalance to the wrong counters.
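To illustrate the point about counters, here is a minimal toy model in Python (not ecTrans code). It assumes a blocking collective cannot finish on any rank until the slowest rank has arrived, and that a barrier placed just before it absorbs that wait. The names `simulate_counters`, `compute_times` and `comm_time` are made up for this sketch.

```python
def simulate_counters(compute_times, comm_time, sync):
    """Return one (compute, barrier, comm) timing tuple per rank.

    compute_times: seconds each rank spends in the compute phase (hypothetical).
    comm_time: seconds the communication itself takes once all ranks arrived.
    sync: if True, a barrier is timed separately before the communication.
    """
    slowest = max(compute_times)
    counters = []
    for t in compute_times:
        wait = slowest - t  # idle time until the slowest rank catches up
        if sync:
            # The barrier absorbs the wait, so the communication counter
            # contains only the communication itself.
            counters.append((t, wait, comm_time))
        else:
            # No barrier: the wait is hidden inside the blocking
            # communication call and inflates its counter.
            counters.append((t, 0.0, wait + comm_time))
    return counters


# Three ranks with imbalanced compute, communication takes 0.5 s.
without_sync = simulate_counters([1.0, 2.0, 4.0], 0.5, sync=False)
with_sync = simulate_counters([1.0, 2.0, 4.0], 0.5, sync=True)
```

In this model the fastest rank reports 3.5 s of "communication" without the barrier, although only 0.5 s of it is communication. With the barrier, every rank reports 0.5 s for the communication and the imbalance shows up in the barrier counter instead.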
On the GPU I tried to make the option as useful as possible. The idea is to make it possible to understand the performance of:
a) The communications (TRLTOG, TRGTOL, TRLTOM, TRMTOL), and if possible really only the communication, so that the expectation can be that they scale as well as possible. Packing/unpacking does not belong to the communication.
b) FFTs, because they are a major component and they can have significant load imbalance.
c) GEMMs, because they are a major component and they can have significant load imbalance.
d) The whole rest, which is usually relatively small and not the major source of load imbalance.
On the CPU, I see that b and c may be problematic because they are inside the OpenMP loop, but rather than removing LSYNC_TRANS from the GPU, I would suggest making those as meaningful as possible for the CPU.
There is also NTRANS_SYNC_LEVEL. No strong opinion on this. I think in a reasonable solution it is not needed: either we synchronize or we don't, so there is no need for a level.
Does that help? If you have more questions let me know :)
`SETUP_TRANS0` has an option `LSYNC_TRANS` which is defined as `activate barriers in trmtol and trltom` in `TPM_GEN`.
In the `cpu` subtree, this variable does nothing. The barriers in `TRMTOL` and `TRLTOM` have been commented out now for 11 years.
In the `gpu` subtree there are many uses for `LSYNC_TRANS`, e.g. in `TRGTOL`, `LEINV`, `LTINV` etc. In each case the variable controls whether an `MPL_BARRIER` is executed, and the time taken to satisfy the barrier is measured using GSTATS.
However it's not clear what all the different uses have in common. I thought initially that `LSYNC_TRANS` might be used to measure device<->host transfer times, but that doesn't seem to be right.
@lukasm91 could you give us some advice here? Do we need to review the uses of `LSYNC_TRANS`?
At the very least we should document this option properly, as the description in `TPM_GEN` is no longer valid.