Closed PDoakORNL closed 4 years ago
This PR provides an implementation of the pipelined ringG algorithm to address the memory-bound issues in DCA++.
The algorithm's cost is dominated by network communication time, which depends on the number of ranks and the number of measurements. As the rank count increases, more communication steps are performed; as the number of measurements increases, more iterations are needed.
One caveat is that G4 currently gets a full-size allocation on the CPU at the very beginning, here, through a default constructor. With the distributed G4 version, however, the full allocation is unnecessary and even dangerous: each rank should only allocate 1/p of the full G4 size.
The allocation is managed by a generic function that may be used elsewhere, so I could not simply restrict the G4 allocation through the constructor without impacting other callers. Suggestions or PRs are welcome. This concern shouldn't affect the core of the distributed G4 algorithm or its review; I just wanted to flag it up front. Thanks.
Thanks @efdazedo for the suggestion via email. This issue is now fixed via commits 337c1f0 and 5f2ce39.
Once the include issue in mci_parameters.hpp is fixed I will merge this. It's good enough for a first cut and well protected behind its feature switch. I will file an issue for the unit tests, which I am serious about and which ideally would have been written as soon as the function signature was known.
Thanks for your review! I think I have added the missing headers now, and I will prepare the unit tests mentioned in your comments soon. Let me know if you have any other suggestions. Thanks!
Perhaps there is an important use case where the distributed G4s can fit on a single node (6 GPUs) but not on a single GPU. It would be nice to have a sub-communicator for all MPI ranks on the same compute node. The within-node communication in the ring algorithm could then be performed by memory-to-memory copies without touching the network. Perhaps use MPI_Get_processor_name() to find MPI ranks on the same node. Just a thought.
One idea is to consider merging the CPU and GPU kernels into one code path. Perhaps something like a collapsed loop to compute `id_i` and `id_j`, using the following for, say, the simple computeGSinglebandKernel():
```cpp
for (int ij = ij_start; ij < (n_rows * n_cols); ij += ij_size) {
  const int id_j = ij / n_rows;
  const int id_i = ij - id_j * n_rows;
  assert(ij == (id_i + id_j * n_rows));
}
```
where for execution on the CPU we have `ij_start == 0` and `ij_size == 1`, and for execution on the GPU we have `ij_start = blockIdx.x * blockDim.x + threadIdx.x;` (global thread number) and `ij_size = blockDim.x * gridDim.x;` (total number of threads).
Similar minor code changes would apply to the other CUDA kernels. Note the GPU kernel will work correctly even if the total number of threads is less than (n_rows * n_cols). Just a thought.
So here's how I'm considering this PR: it's an initial implementation of the G4 distribution, from which to develop both the architecture we will use for distribution more generally and a good-enough, maintainable implementation of the G4 distribution itself. At this point it needs to be understandable, pass unit tests, not be missing significant unit tests, and the "production" code needs to be protected from it by a feature switch. Since we have another collaboration I'd like to keep in sync with to some degree, I'd like to get this merged as soon as it does no harm to the production code.
I think @gbalduzz's general comment is a good punch list for future PRs, and we need to copy it somewhere, but those 4 things are beyond the scope of this PR.
Thanks for the consideration; that's a good point. Is anything else missing for this particular PR?
I'll have a PR for you soon that fixes a build issue I see with the circular dependence involving mpi_concurrency.
retest this, please
I've checked compatibility with rb-tree and I think this is safe to go in now.
@weilewei will fill this in