Closed rasolca closed 6 months ago
@rasolca looks reasonable to me. Just to check my understanding, what you're benchmarking is a broadcast from rank 0 to all other ranks:
* with the default options for contiguous/non-contiguous
* with all the combinations of CPU/GPU memory, contiguous/non-contiguous send/recv
Exactly. For a given backend B (which determines where the data is located), I try all the available combinations.
Probably my ignorance, but do you need to ensure that each rank does as many receives as the root rank does sends? Or, since it's a broadcast, do you not need to do a receive on every rank?
The idea is to allocate a local matrix of the same size on all the ranks. I still have a couple of open TODOs when creating the matrix. It might be that I'm still using a distributed matrix by mistake.
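For context on the question above: a broadcast is a collective operation, so every rank (root and non-root alike) makes the same call, and non-root ranks do not post explicit receives; the collective delivers the root's buffer to everyone. The toy sketch below (plain Python, not MPI; the `bcast` name and buffer layout are illustrative, not the benchmark's actual code) mimics that call pattern:

```python
# Illustrative only: mimic the semantics of a broadcast collective.
# Every simulated rank participates in the same single call; after it
# returns, each rank's local buffer holds a copy of the root's data,
# with no per-rank send/receive bookkeeping on the caller's side.

def bcast(rank_buffers, root=0):
    """Simulate a broadcast among len(rank_buffers) ranks."""
    data = rank_buffers[root]
    for rank in range(len(rank_buffers)):
        if rank != root:
            # Each non-root rank receives its own copy of the root's buffer.
            rank_buffers[rank] = list(data)
    return rank_buffers

# Rank 0 holds the data; the other ranks start with empty buffers.
buffers = [[1, 2, 3] if r == 0 else [0, 0, 0] for r in range(4)]
bcast(buffers, root=0)
print(buffers)  # every rank now holds [1, 2, 3]
```

With real MPI the shape is the same: all ranks call `MPI_Bcast` with identical root and count arguments, so there is no need to match up individual sends and receives by hand.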
Summary of the first benchmarks (A100):
Note: Compilation of the new miniapp is very slow.
cscs-ci run
cscs-ci run
> Note: Compilation of the new miniapp is very slow.
Likely due to the same reason as #1013. I have also been getting increasingly annoyed by the compile times of the other miniapps recently, though I don't think they have necessarily gotten worse. It may be worth bumping this up my to-do list, or, if someone else feels motivated, they could look into it.
cscs-ci run