eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License
64 stars 14 forks source link

Wrap communicators in `RoundRobin` and `Pipeline` in `CommunicatorGrid` #993

Closed msimberg closed 10 months ago

msimberg commented 1 year ago

Nothing to see here yet. Only opening for CI at this point. Better description to follow.

To do:

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

cscs-ci run

msimberg commented 1 year ago

To dos in the PR description are still relevant, but I think this would benefit from some reviews already now. Some notes:

msimberg commented 1 year ago

Binary sizes have evolved like follows (in the CUDA codecov configuration). I'm comparing the latest ancestor on master (https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/commit/e778dc99130efb9e332c6126b98613531b3c7c9c) with the current latest commit on this branch (https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/commit/bb4b5967ea86368a1076aa7c8a1736408be802c6). For the shared libraries the sizes have slightly grown.

master (including tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5288434329#L3084):

-rwxr-xr-x 1 root root 1.5G Oct 13 12:36 src/libDLAF.so
-rwxr-xr-x 1 root root  11M Oct 13 12:34 src/libdlaf.auxiliary.so
-rwxr-xr-x 1 root root 3.9M Oct 13 12:35 src/libdlaf.c_api.so
-rwxr-xr-x 1 root root 348M Oct 13 12:34 src/libdlaf.core.so
-rwxr-xr-x 1 root root 612M Oct 13 12:35 src/libdlaf.eigensolver.so
-rwxr-xr-x 1 root root  98M Oct 13 12:34 src/libdlaf.factorization.so
-rwxr-xr-x 1 root root 161M Oct 13 12:34 src/libdlaf.multiplication.so
-rwxr-xr-x 1 root root 134M Oct 13 12:34 src/libdlaf.permutations.so
-rwxr-xr-x 1 root root 106M Oct 13 12:34 src/libdlaf.solver.so
-rwxr-xr-x 1 root root 244M Oct 13 12:35 src/libdlaf.tridiagonal_eigensolver.so

This branch (including tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5289553001#L3095):

-rwxr-xr-x 1 root root 1.6G Oct 13 14:53 src/libDLAF.so
-rwxr-xr-x 1 root root  11M Oct 13 14:50 src/libdlaf.auxiliary.so
-rwxr-xr-x 1 root root 5.1M Oct 13 14:52 src/libdlaf.c_api.so
-rwxr-xr-x 1 root root 418M Oct 13 14:50 src/libdlaf.core.so
-rwxr-xr-x 1 root root 622M Oct 13 14:52 src/libdlaf.eigensolver.so
-rwxr-xr-x 1 root root 100M Oct 13 14:50 src/libdlaf.factorization.so
-rwxr-xr-x 1 root root 164M Oct 13 14:51 src/libdlaf.multiplication.so
-rwxr-xr-x 1 root root 135M Oct 13 14:51 src/libdlaf.permutations.so
-rwxr-xr-x 1 root root 108M Oct 13 14:50 src/libdlaf.solver.so
-rwxr-xr-x 1 root root 248M Oct 13 14:51 src/libdlaf.tridiagonal_eigensolver.so

There's a small increase in size in most libraries, and a larger increase in size in libdlaf.core.so, presumably because of the added instantiations.

See the full links above for a comparison of the test binary sizes. I think these are still within reason, but not ideal. What do you think?

Unrelated to the changes here, but I noticed the with_temporary_tile test binary is impressively large. I opened an issue to potentially look into that later, but it's low priority: https://github.com/eth-cscs/DLA-Future/issues/1013.

msimberg commented 11 months ago

https://github.com/eth-cscs/DLA-Future/pull/993/files#diff-9fb38d01eaaba80087b718b707650bf8e68af04fe5c2d3acd4d1319a1b159e9eR1502-R1504

How do we deal with internal algorithm implementations?

1. We pass the grid and we get the pipeline from it.
   (In the eigensolver gemm case (link above) it will mean that each gemm has a different pipeline instead of the same (current implementation).

2. The required pipelines are added to the arguments (as for the eigensolver gemm case (link above)).
   Drawback: The internal algorithm has still access to other pipelines via the grid

3. Same as 2, but divide grid in grid_info and {communicators + pipelines} and the internal implementations require a grid_info and the pipelines needed.

I would go for 3.

If I correctly understand what you're suggesting I think I agree with option 3. In that case do we want to introduce a

struct/class CommunicatorInfo {
  int size;
  int rank;
};

(with better naming welcome) and have this something you can query from Communicator and CommunicatorGrid?

We could also consider adding

struct/class CommunicatorPipeline {
  Pipeline<Communicator> pipeline;
  CommunicatorInfo info;
};

so that we don't lose the connection between the original communicator metadata and the pipeline.

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

Ready for another round of reviews. Open questions are:

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

I've resolved what I could resolve, but there are still some open questions. Please have another look at the unresolved comments.

msimberg commented 11 months ago

cscs-ci run

msimberg commented 11 months ago

I've now marked all open comments as resolved. If you would happen to approve this now without further comments, @rasolca please don't merge it yet as I would like to run the benchmarks with the final version first.

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

The test failure is a timeout while creating codecov reports: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5625542709.

albestro commented 10 months ago

The test failure is a timeout while creating codecov reports: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5625542709.

@rasolca is already investigating this. it is happening recently and from the log it is not yet clear what's going on, considering that it seems that all ranks finished with a "done" message (in a time range of a couple of seconds max). almost surely not related to this PR. 😉

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

Another codecov upload timeout: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5672721973. Otherwise CI looks happy now.

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

alby/trisolver-dist-opt-step3 is now merged into this PR. #997 and #998 are ahead in the queue to be merged. However, it will need at least one more conflict resolution.

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

cscs-ci run