Closed rasolca closed 6 months ago
@rasolca looks reasonable to me. Just to check my understanding, what you're benchmarking is a broadcast from rank 0 to all other ranks:
* with the default options for contiguous/non-contiguous
* with all the combinations of CPU/GPU memory, contiguous/non-contiguous send/recv
Exactly. For a given backend B (which determines where the data is located), I try all the available combinations.
Probably my ignorance, but do you need to ensure that each rank does as many receives as the root rank does sends? Or, since it's a broadcast, do you not need to do a receive on every rank?
The idea is to allocate a local matrix of the same size on all the ranks. I still have a couple of open TODOs when creating the matrix. It might be that I'm still using a distributed matrix by mistake.
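For context on the question above: a broadcast is a collective operation, so every rank (root and non-root alike) makes the same call, and non-root ranks do not post explicit receives; the collective delivers the root's buffer to everyone. The toy sketch below (plain Python, not MPI; the `bcast` name and buffer layout are illustrative, not the benchmark's actual code) mimics that call pattern:

```python
# Illustrative only: mimic the semantics of a broadcast collective.
# Every simulated rank participates in the same single call; after it
# returns, each rank's local buffer holds a copy of the root's data,
# with no per-rank send/receive bookkeeping on the caller's side.

def bcast(rank_buffers, root=0):
    """Simulate a broadcast among len(rank_buffers) ranks."""
    data = rank_buffers[root]
    for rank in range(len(rank_buffers)):
        if rank != root:
            # Each non-root rank receives its own copy of the root's buffer.
            rank_buffers[rank] = list(data)
    return rank_buffers

# Rank 0 holds the data; the other ranks start with empty buffers.
buffers = [[1, 2, 3] if r == 0 else [0, 0, 0] for r in range(4)]
bcast(buffers, root=0)
print(buffers)  # every rank now holds [1, 2, 3]
```

With real MPI the shape is the same: all ranks call `MPI_Bcast` with identical root and count arguments, so there is no need to match up individual sends and receives by hand.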
Summary of the first benchmarks (A100):
Note: Compilation of the new miniapp is very slow.
cscs-ci run
cscs-ci run
> Note: Compilation of the new miniapp is very slow.
Likely due to the same reason as #1013. I have also been getting increasingly annoyed by the compile times of the other miniapps recently, though I don't think they have necessarily gotten worse. It may be worth bumping this up my to-do list, or, if someone else feels motivated, they could look into it.
cscs-ci run