Open pkarnakov opened 3 years ago
I took a look to see if I could find why this was designed this way in the first place, and it seems to be a purposeful API design decision.
Throughout the code we allow MPI to be disabled if AMGX_WITH_MPI is not defined. If you were to pass the MPI communicator into the API by value then it would not be possible to accept the typeless void*
and then cast to MPI_Comm
inside the guarded region if MPI is enabled.
This same functionality could be alternatively achieved by having distinct APIs for AMGX_resources_create
when MPI is enabled and disabled. However, this conditionality would need to be followed through all the way to the resource constructor and makes the API more confusing. I think requiring users to pass a pointer to the MPI comm handle is not onerous enough to motivate an API change.
AMGX_resources_create()
would still take a pointer, but the implementation would copy the communicator
diff --git a/base/include/resources.h b/base/include/resources.h
index 1fe13c6..2ac7d6c 100644
--- a/base/include/resources.h
+++ b/base/include/resources.h
@@ -61,7 +61,7 @@ class Resources
int m_high_priority_stream;
int m_serialize_threads;
#ifdef AMGX_WITH_MPI
- MPI_Comm *m_mpi_comm;
+ MPI_Comm m_mpi_comm;
#endif
public:
Resources(AMG_Configuration *cfg, void *comm, int device_num, const int *devices);
@@ -71,7 +71,7 @@ class Resources
ThreadManager *get_tmng() { return m_tmng; }
AMG_Config *getResourcesConfig() { return m_cfg; }
#ifdef AMGX_WITH_MPI
- MPI_Comm *getMpiComm() { return m_mpi_comm; }
+ MPI_Comm *getMpiComm() { return &m_mpi_comm; }
#endif
int getDevice(int device_num) const { return m_devices[device_num]; }
bool getHandleErrors() const { return m_handle_errors; }
diff --git a/base/src/resources.cu b/base/src/resources.cu
index 52a5a5f..a91af8f 100644
--- a/base/src/resources.cu
+++ b/base/src/resources.cu
@@ -150,7 +150,7 @@ Resources::Resources(AMG_Configuration *cfg, void *comm, int device_num, const i
memory::setDeviceMemoryPoolFlag(true);
// create communicator
#ifdef AMGX_WITH_MPI
- m_mpi_comm = (MPI_Comm *) comm;
+ m_mpi_comm = *(MPI_Comm *) comm;
#endif
}
If we are happy to store a temporary copy of the handle when MPI is enabled then this would be an ideal solution. In my mind though changing whether we store a pointer to or value of the handle constitutes an API change (internal rather then external).
The resource creation (AMGX_resources_create) is only expected to be called once in an application but it is possible that different solvers within an application (sharing that resource handle) require different communicators (I believe #109 is an example where this might be relevant). My concern is that if we store the temporary we will lose the ability to change the communicator without re-creating the resources, which is prohibitively expensive.
Now I cannot argue that this is actually working as I haven't tried it out personally and I have no reason to believe this was the original design intention (rather than the reason I speculated before). However, we are trying to make sure we can support applications with multiple distinct solves and even multi-physics applications, where you could expect this sort of requirement to be more common and so abandoning that flexibility seems like heading in the wrong direction. We could maintain different versions of AMGX_resources_create to handle this or some extension of the API but I don't think there is enough benefit.
@marsaev was involved in the earlier stages of development so might have some different thoughts on this.
Good design (and i think this is what typically done in distributed codes) would be duplicate input communicator with MPI_Comm_dup() specifically for the case of comm being released by user's host code. Some minor additional resources (time spent on dup, storage) would be needed for that.
Way of storage doesn't matter that much in that case, but if we revisit this piece of code - i would replace it with shared_ptr
Tracking internally: AMGX-47
AMGX_resources_create()
takes avoid* comm
and calls the constructor ofResources
, which copies the pointer to a member variablehttps://github.com/NVIDIA/AMGX/blob/919fb92a2c02ea3d8d3d0f0660e3b507d84705bc/base/src/resources.cu#L153
This prevents the caller from passing a temporary
MPI_Comm
toAMGX_resources_create()
. An alternative is to store a copy of the communicator in:Is there a reason to store the pointer instead?