Closed msimberg closed 10 months ago
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
To dos in the PR description are still relevant, but I think this would benefit from some reviews already now. Some notes:
CommunicatorGrid
move only. If we suspect that it's going to cause annoying conflicts I can simply revert that commit to make the copy constructor and assignment simply deprecated.MPIPIKA
tests to work around this. For the C API tests I've explicitly specified the number of pipelines to avoid the issue since they also test pika/DLAF initialization/shutdown.Binary sizes have evolved like follows (in the CUDA codecov configuration). I'm comparing the latest ancestor on master
(https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/commit/e778dc99130efb9e332c6126b98613531b3c7c9c) with the current latest commit on this branch (https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/commit/bb4b5967ea86368a1076aa7c8a1736408be802c6). For the shared libraries the sizes have slightly grown.
master
(including tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5288434329#L3084):
-rwxr-xr-x 1 root root 1.5G Oct 13 12:36 src/libDLAF.so
-rwxr-xr-x 1 root root 11M Oct 13 12:34 src/libdlaf.auxiliary.so
-rwxr-xr-x 1 root root 3.9M Oct 13 12:35 src/libdlaf.c_api.so
-rwxr-xr-x 1 root root 348M Oct 13 12:34 src/libdlaf.core.so
-rwxr-xr-x 1 root root 612M Oct 13 12:35 src/libdlaf.eigensolver.so
-rwxr-xr-x 1 root root 98M Oct 13 12:34 src/libdlaf.factorization.so
-rwxr-xr-x 1 root root 161M Oct 13 12:34 src/libdlaf.multiplication.so
-rwxr-xr-x 1 root root 134M Oct 13 12:34 src/libdlaf.permutations.so
-rwxr-xr-x 1 root root 106M Oct 13 12:34 src/libdlaf.solver.so
-rwxr-xr-x 1 root root 244M Oct 13 12:35 src/libdlaf.tridiagonal_eigensolver.so
This branch (including tests: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5289553001#L3095):
-rwxr-xr-x 1 root root 1.6G Oct 13 14:53 src/libDLAF.so
-rwxr-xr-x 1 root root 11M Oct 13 14:50 src/libdlaf.auxiliary.so
-rwxr-xr-x 1 root root 5.1M Oct 13 14:52 src/libdlaf.c_api.so
-rwxr-xr-x 1 root root 418M Oct 13 14:50 src/libdlaf.core.so
-rwxr-xr-x 1 root root 622M Oct 13 14:52 src/libdlaf.eigensolver.so
-rwxr-xr-x 1 root root 100M Oct 13 14:50 src/libdlaf.factorization.so
-rwxr-xr-x 1 root root 164M Oct 13 14:51 src/libdlaf.multiplication.so
-rwxr-xr-x 1 root root 135M Oct 13 14:51 src/libdlaf.permutations.so
-rwxr-xr-x 1 root root 108M Oct 13 14:50 src/libdlaf.solver.so
-rwxr-xr-x 1 root root 248M Oct 13 14:51 src/libdlaf.tridiagonal_eigensolver.so
There's a small increase in size in most libraries, and a larger increase in size in libdlaf.core.so
, presumably because of the added instantiations.
See the full links above for a comparison of the test binary sizes. I think these are still within reason, but not ideal. What do you think?
Unrelated to the changes here, but I noticed the with_temporary_tile
test binary is impressively large. I opened an issue to potentially look into that later, but it's low priority: https://github.com/eth-cscs/DLA-Future/issues/1013.
How do we deal with internal algorithm implementations?
1. We pass the grid and we get the pipeline from it. (In the eigensolver gemm case (link above) it will mean that each gemm has a different pipeline instead of the same (current implementation). 2. The required pipelines are added to the arguments (as for the eigensolver gemm case (link above)). Drawback: The internal algorithm has still access to other pipelines via the grid 3. Same as 2, but divide grid in grid_info and {communicators + pipelines} and the internal implementations require a grid_info and the pipelines needed.
I would go for 3.
If I correctly understand what you're suggesting I think I agree with option 3. In that case do we want to introduce a
struct/class CommunicatorInfo {
int size;
int rank;
};
(with better naming welcome) and have this something you can query from Communicator
and CommunicatorGrid
?
We could also consider adding
struct/class CommunicatorPipeline {
Pipeline<Communicator> pipeline;
CommunicatorInfo info;
};
so that we don't lose the connection between the original communicator metadata and the pipeline.
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
Ready for another round of reviews. Open questions are:
CommunicatorPipeline
. I'm not entirely sure I have the right behaviour there, see https://github.com/eth-cscs/DLA-Future/pull/993#pullrequestreview-1729404029.read
/readwrite
on CommunicatorPipeline
. We don't have to rename them, and being consistent with Matrix
/Pipeline
is probably a good idea, but we have an opportunity to rename them to something more appropriate for communicators.cscs-ci run
cscs-ci run
I've resolved what I could resolve, but there are still some open questions. Please have another look at the unresolved comments.
cscs-ci run
I've now marked all open comments as resolved. If you would happen to approve this now without further comments, @rasolca please don't merge it yet as I would like to run the benchmarks with the final version first.
cscs-ci run
cscs-ci run
cscs-ci run
The test failure is a timeout while creating codecov reports: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5625542709.
The test failure is a timeout while creating codecov reports: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5625542709.
@rasolca is already investigating this. it is happening recently and from the log it is not yet clear what's going on, considering that it seems that all ranks finished with a "done" message (in a time range of a couple of seconds max). almost surely not related to this PR. 😉
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
cscs-ci run
Another codecov upload timeout: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5672721973. Otherwise CI looks happy now.
cscs-ci run
alby/trisolver-dist-opt-step3
is now merged into this PR. #997 and #998 are ahead in the queue to be merged. However, it will need at least one more conflict resolution.
cscs-ci run
cscs-ci run
cscs-ci run
Nothing to see here yet. Only opening for CI at this point. Better description to follow.
To do: