GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

When multiple processors use ga::GET to access the same region, it becomes slow #309

Closed: jsboer closed this issue 6 months ago

jsboer commented 1 year ago

Hi, I use GA to create a job pool (with many elements that correspond to many independent jobs), and all processors go to this pool to acquire jobs. Before acquiring a job, I use ga::get to query whether the job pool is empty. But I find that when few jobs remain, all processors query with get and the query becomes very slow. Is this because the accessed memory is locked when ga::get is used? I actually do not need the lock; it is fine if the memory is written by other processors after ga::get.

If there really is an implicit lock, can I remove it? By the way, is only the accessed part of the memory locked?
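
To make the pattern concrete, here is a minimal sketch of what I do (GA C API called from C++; the pool size, names, and the acquire step are placeholders, the real code is more involved):

```c++
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"   // for C_INT and MA_init
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  GA_Initialize();
  MA_init(C_INT, 1000, 1000);       // may or may not be needed, depending on the GA build

  const int npool = 1000;           // number of job slots (placeholder)
  int dims[1] = {npool};
  int chunk[1] = {-1};              // let GA choose the distribution
  int g_pool = NGA_Create(C_INT, 1, dims, (char *)"job_pool", chunk);
  GA_Zero(g_pool);                  // 0 = empty slot, nonzero = job available

  // Every process polls the whole pool before trying to take a job.
  std::vector<int> snapshot(npool);
  int lo[1] = {0}, hi[1] = {npool - 1}, ld[1] = {1};
  bool work_left = true;
  while (work_left) {
    NGA_Get(g_pool, lo, hi, snapshot.data(), ld);  // <-- this is what gets slow
    work_left = false;
    for (int i = 0; i < npool; ++i)
      if (snapshot[i] != 0) { work_left = true; break; }
    // ... try to acquire one job here ...
  }

  GA_Destroy(g_pool);
  GA_Terminate();
  MPI_Finalize();
  return 0;
}
```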

Thank you!

jeffhammond commented 1 year ago

This is complicated, because the answer is most likely dependent on the back-end details, i.e. how ARMCI is implemented. Furthermore, with a back-end based on MPI RMA, there are multiple options, because ARMCI-MPI has these, and MPI implementations vary in how they do things.

In any case, I expect that GA has locks to achieve location consistency, which is similar to sequential consistency. This isn't formally defined anywhere, but it is likely that some GA codes depend on it, although I know most of NWChem does not.

At least for some ARMCI conduits, there should be address range locking, which means good throughput on disjoint accesses, but it won't help at all when processes access the same memory, even though concurrent reads do not need locking.

I don't know if it is possible to disable locking in GA/ARMCI in general, but I think you can do so when you use GA with ARMCI-MPI by setting ARMCI_RMA_ATOMICITY=0 at runtime. This is not the default because it likely violates the historical semantics of ARMCI, but with it set there should be no locks for Put or Get, only for Acc and Rmw.
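
For example, something like this should do it (a sketch; it assumes ARMCI-MPI picks up the variable via getenv during its initialization, so it has to be set before GA_Initialize, either in the code or in the job script / mpiexec environment):

```c++
#include <cstdlib>   // setenv (POSIX)
#include <mpi.h>
#include "ga.h"

int main(int argc, char **argv) {
  // Must be in the environment before ARMCI-MPI is initialized (i.e. before
  // GA_Initialize), because the option is read once at startup.
  setenv("ARMCI_RMA_ATOMICITY", "0", /*overwrite=*/1);

  MPI_Init(&argc, &argv);
  GA_Initialize();

  // ... GA code: Put/Get no longer take locks, Acc/Rmw still do ...

  GA_Terminate();
  MPI_Finalize();
  return 0;
}
```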

jsboer commented 1 year ago

Hi, thanks for your help, I still have two questions:

  1. Must ARMCI_RMA_ATOMICITY=0 be set globally, or can I set it for only part of the code?
  2. If I set ARMCI_RMA_ATOMICITY=0, will this affect the correctness of the result when I use both put and get on the same memory address? By correctness I mean: if I ga::put and ga::get the same memory segment consisting of 3 elements, is the data acquired by ga::get guaranteed to be entirely from before the ga::put or entirely from after it? Or could some of the 3 elements be from before the ga::put and some from after, even if that is rare?

jeffhammond commented 1 year ago

Currently, ARMCI_RMA_ATOMICITY=0 is a global setting done in the initialization of ARMCI-MPI, but I can change that if I want. However, if I change it, then you'll have to write ARMCI-MPI specific code, and you'll have to put a preprocessor macro in your build system to guard it, because GA (rightly) doesn't expose the ARMCI configuration in the header file.

If you set this option, your code will be correct as long as you do only one of Get, Put or Acc in a phase, where phases are delineated by GA::Sync calls. Put won't be atomic anymore either, so concurrent Puts to the same location will become undefined, whereas with ARMCI_RMA_ATOMICITY=1, they'll be atomic, which is the historical GA/ARMCI semantic (as best I can tell).
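
To illustrate the phase rule, here is a sketch with the GA C API (GA_Sync is the C counterpart of GA::Sync; g_a is assumed to be a 1-D integer global array with at least GA_Nnodes() elements, created elsewhere):

```c++
#include "ga.h"

void phased_example(int g_a) {
  int me = GA_Nodeid();
  int lo[1] = {me}, hi[1] = {me}, ld[1] = {1};
  int val = me;

  // Phase 1: only Puts, each rank writing its own (disjoint) element.
  NGA_Put(g_a, lo, hi, &val, ld);
  GA_Sync();              // phase boundary: all Puts complete and visible

  // Phase 2: only Gets, possibly many ranks reading the same element.
  // With ARMCI_RMA_ATOMICITY=0 these take no locks, and they are still
  // correct because no Put/Acc is concurrent with them.
  int peek;
  int lo0[1] = {0}, hi0[1] = {0};
  NGA_Get(g_a, lo0, hi0, &peek, ld);
  GA_Sync();              // phase boundary before the next Put/Acc phase
}
```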

However, if you want all this fine-grained control, you might just use MPI RMA directly. I don't know which parts of GA you rely on, but you can get the Put/Get/Acc/Rmw part of GA from MPI RMA rather easily. On the other hand, you will have to reproduce any GA math library wrappers and figure out the array distribution yourself (GA's code for this is nontrivial). See https://eecs.wsu.edu/~assefaw/publications/icpp2016-elemental.pdf for an example.
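
For the Get/Put piece, a minimal passive-target MPI RMA sketch looks roughly like this (the distribution logic, math wrappers, and so on that GA provides are exactly the parts you would have to add yourself):

```c++
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, np;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  // Each rank contributes a block of 100 ints to the "global array".
  const int nlocal = 100;
  int *base = nullptr;
  MPI_Win win;
  MPI_Win_allocate(nlocal * sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);

  MPI_Win_lock_all(MPI_MODE_NOCHECK, win);        // passive-target epoch on all ranks
  for (int i = 0; i < nlocal; ++i) base[i] = me;  // fill the local block
  MPI_Win_sync(win);                              // publish the local stores
  MPI_Barrier(MPI_COMM_WORLD);                    // everyone's data is ready

  // Read one element from the next rank; concurrent reads of the same
  // location by many ranks are fine here.
  int target = (me + 1) % np;
  int value = -1;
  MPI_Get(&value, 1, MPI_INT, target, /*target_disp=*/0, 1, MPI_INT, win);
  MPI_Win_flush(target, win);                     // complete the Get
  std::printf("rank %d read %d from rank %d\n", me, value, target);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```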