Closed PDoakORNL closed 4 years ago
This PR provides an implementation of the pipelined ringG algorithm to address the memory-bound issues in DCA++.
The algorithm's cost is dominated by network communication time, which depends on the number of ranks and the number of measurements. As the rank count increases, more communication steps are performed; as the number of measurements increases, more iterations are needed.
One caveat is that G4 currently gets a full-size allocation on the CPU at the very beginning, here, through a default constructor. With the distributed G4 version, however, the full allocation is unnecessary and even dangerous: each rank should only allocate 1/p of the full G4 size.
The allocation is managed by a generic function that may be used elsewhere, so I could not simply restrict the G4 allocation through the constructor without impacting other callers. Suggestions or PRs are welcome. This concern shouldn't affect the core of the distributed G4 algorithm or its review; I just wanted to flag it up front. Thanks.
Thanks @efdazedo for the suggestion via email. This issue is now fixed via commits 337c1f0 and 5f2ce39.
Once the include issue in mci_parameters.hpp is fixed I will merge this. It's good enough for a first cut and well protected behind its feature switch. I will file an issue for the unit tests, which I am serious about and which ideally would have been written as soon as the function signature was known.
Thanks for your review! I think I have added the missing headers now, and I will prepare the unit tests mentioned in your comments soon. Let me know if you have any other suggestions. Thanks!
Perhaps there is an important use case where the distributed G4s can fit on a single node (6 GPUs) but not on a single GPU. It would be nice to have a sub-communicator for all MPI ranks on the same compute node. The within-node communication in the ring algorithm could then be performed by memory-to-memory copies without touching the network. Perhaps use MPI_Get_processor_name() to find MPI ranks on the same node. Just a thought.
One idea is to consider merging the CPU and GPU kernels into one code path. Perhaps something like a collapsed loop to compute `id_i` and `id_j`, using the following for, say, the simple computeGSinglebandKernel():
```cpp
for (int ij = ij_start; ij < (n_rows * n_cols); ij += ij_size) {
  const int id_j = ij / n_rows;
  const int id_i = ij - id_j * n_rows;
  assert(ij == (id_i + id_j * n_rows));
}
```
where for execution on the CPU we have `ij_start == 0` and `ij_size == 1`, and for execution on the GPU we have `ij_start = blockIdx.x * blockDim.x + threadIdx.x;` (global thread number) and `ij_size = blockDim.x * gridDim.x;` (total number of threads).
Similar minor code changes would apply to the other CUDA kernels. Note the GPU kernel will work correctly even if the total number of threads is less than (n_rows * n_cols). Just a thought.
So here's how I'm considering this PR: it's an initial implementation of the G4 distribution, from which to develop both the architecture we will use for distribution more generally and a good-enough, maintainable implementation of the G4 distribution itself. At this point it needs to be understandable, pass unit tests, not be missing significant unit tests, and the "production" code needs to be protected from it by a feature switch. Since we have another collaboration I'd like to keep in sync with to some degree, I'd like to get this merged as soon as it does no harm to the production code.
I think @gbalduzz's general comment is a good punch list for future PRs, and we need to copy it somewhere, but those 4 things are beyond the scope of this PR.
Thanks for the consideration; that's a good point. Is anything else missing for this particular PR?
I'll have a PR for you soon that fixes a build issue I see with the circular dependence involving mpi_concurrency.
retest this, please
I've checked compatibility with rb-tree and I think this is safe to go in now.
@weilewei will fill this in