NVIDIA / AMGX

Distributed multigrid linear solver library on GPU
feature request : enable multiple instances of the resource handle #109

Open Jaberh opened 4 years ago

Jaberh commented 4 years ago

Hi Marat, I had to open a new issue because I was not sure whether you get notifications for an issue that is closed. I have one more question. The solver supports multi-stream solves (such as solving for solid and fluid in the same setup; this is not related to multi-stream in CUDA). I have an interface class, and I construct several objects from it using different configs. The solution is correct, but now that I free resources (although they belong to different instances of the same class), I get the following error in the cleanup phase. With a single object there are no issues.

AMGX_solver_destroy() 
!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!

If I comment out AMGX_SAFE_CALL(AMGX_resources_destroy(m_resources)); the error changes to the following, which makes sense because it detects the resources handle that was not freed.

*** Process received signal ***
 Signal: Segmentation fault (11)
Signal code:  (128)
Failing at address: (nil)
[ 0] /lib64/libpthread.so.0(+0xf630)[0x7f85af5b5630]
[ 1] /lib64/libcuda.so.1(+0x1f3b8d)[0x7f856b8e1b8d]
[ 2] /lib64/libcuda.so.1(+0x1ddbc7)[0x7f856b8cbbc7]
[ 3] /lib64/libcuda.so.1(+0xf9b4b)[0x7f856b7e7b4b]
[ 4] /lib64/libcuda.so.1(cuEventDestroy_v2+0x59)[0x7f856b969ae9]
[ 5] centos/7/nvidia/cuda/10.2.89/lib64/libcublas.so.10(+0x5e8dd0)[0x7f858178cdd0]
[ 6] /tools/centos/7/nvidia/cuda/10.2.89/lib64/libcublas.so.10(+0x61cea4)[0x7f85817c0ea4]
[ 7] /tools/centos/7/nvidia/cuda/10.2.89/lib64/libcublas.so.10(+0x29aad)[0x7f85811cdaad]
[ 8] /tools/centos/7/nvidia/cuda/10.2.89/lib64/libcublas.so.10(+0x2ae16)[0x7f85811cee16]
[ 9] /centos/7/nvidia/cuda/10.2.89/lib64/libcublas.so.10(cublasDestroy_v2+0xe7)[0x7f858125cf77]
[10] libamgxsh.so(_ZN4amgx6Cublas14destroy_handleEv+0x25)[0x7f8588d15085]
[11] libamgxsh.so(_ZN4amgx9ResourcesD1Ev+0x5d)[0x7f8588d14ead]
[12] lib/libamgxsh.so(_ZNSt15_Sp_counted_ptrIPN4amgx9ResourcesELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0x12)[0x7f8587dcff32]
[13] libamgxsh.so(_ZNSt15_Sp_counted_ptrIPN4amgx11CWrapHandleIP28AMGX_resources_handle_structNS0_9ResourcesEEELN9__gnu_cxx12_Lock_policyE2EE10_M_disposeEv+0xba)[0x7f8587dd308a]
[14] libamgxsh.so(_ZNSt8_Rb_treeIPN4amgx11CWrapHandleIP28AMGX_resources_handle_structNS0_9ResourcesEEESt4pairIKS6_St10shared_ptrIS5_EESt10_Select1stISB_ESt4lessIS6_ESaISB_EE8_M_eraseEPSt13_Rb_tree_nodeISB_E+
[15] libamgxsh.so(_ZN4amgx10MemManagerIJNS_11CWrapHandleIP28AMGX_resources_handle_structNS_9ResourcesEEEEED1Ev+0x2c)[0x7f8587e42e5c]
[16] /lib64/libc.so.6(+0x39ce9)[0x7f8586377ce9]
[17] /lib64/libc.so.6(+0x39d37)[0x7f8586377d37]
[18] /lib64/libc.so.6(__libc_start_main+0xfc)[0x7f858636055c]
[19]
*** End of error message ***

Hopefully this is the last test case. I would like your advice on this before debugging further. I highly doubt that the order of destruction is the issue, as it would also affect the single-object case, but I am listing it here for the sake of completeness:

      AMGX_SAFE_CALL(AMGX_vector_destroy(m_rhs));
      AMGX_SAFE_CALL(AMGX_vector_destroy(m_solution));
      AMGX_SAFE_CALL(AMGX_matrix_destroy(m_matrix));
      AMGX_SAFE_CALL(AMGX_solver_destroy(m_solver));
      AMGX_SAFE_CALL(AMGX_resources_destroy(m_resources));
      AMGX_SAFE_CALL(AMGX_config_destroy(m_config));

It is probably related to

    size_t n_erased = get_mode_bookkeeper<Envelope>().erase(envl);
    bool flag = get_mem_manager<LetterW>().template free<LetterW>(letter);

marsaev commented 4 years ago

Multiple solver instances are supported. AMGX initialization should be called once, and one resource handle should be created and used for all solver, matrix and vector instances. There are parts of the code that are not thread safe, but let me recheck it. Does your interface class create an individual AMGX_Resource handle for each instance?

AMGX_solver_destroy() !!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!

That is surprising. This message should be printed only during AMGX_resource_free() or AMGX_finalize().

      AMGX_SAFE_CALL(AMGX_vector_destroy(m_rhs));
      AMGX_SAFE_CALL(AMGX_vector_destroy(m_solution));
      AMGX_SAFE_CALL(AMGX_matrix_destroy(m_matrix));
      AMGX_SAFE_CALL(AMGX_solver_destroy(m_solver));
      AMGX_SAFE_CALL(AMGX_resources_destroy(m_resources));
      AMGX_SAFE_CALL(AMGX_config_destroy(m_config));

That is the correct sequence of calls.

Jaberh commented 4 years ago

Yes, the m_ prefix denotes the members of each instance of the class. I arrange the global initialize and finalize like a singleton so they are called only once; only the handles cause the issue. The config files for the solvers are different, as one solves for flow and the other solves for structure, so each needs a separate file, and hence m_config, m_resources, etc. are object specific. Is it too much work to make them independent? I can't get the AMGX integration to pass QA without solving the multi-instance cases. I could only make this work as is if the config handle is not actually used in the resources_create function.

piyueh commented 4 years ago

Disclaimer: I'm not an AmgX developer.

From my experience and based on the manual, as long as the hardware configurations for different solvers are the same (e.g., using the same GPUs, same MPI communicators, etc.), multiple solver instances can share the same resource object. The configuration file passed to AMGX_resources_create is to obtain settings of resources only.

Page 56 in the manual

The Resources object also stores settings that control usage of resources and communication patterns. These settings are passed in via the config input. ... Note that this Config may be the same one passed to AMGX_solver_create or it may be separately created - any parameters which are irrelevant to AMGX_resources_create will simply be ignored.

The resource-related configurations in a config file are described on page 129 of the manual. So I guess that unless you use different settings for any of them, it's safe to share one resource among different solvers.
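The sharing arrangement described above can be sketched as an ownership pattern. This is only an illustration under stated assumptions: it does not call AMGX, and `StubResources`, `SolverInstance`, `run_shared_resources_demo` and the file names are hypothetical stand-ins. The comments mark where the real AMGX_resources_create and AMGX_resources_destroy calls would presumably go. It assumes all instances can use the same hardware settings and communicator.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an AMGX resources handle. It only counts live
// objects so the sketch runs without AMGX installed.
struct StubResources {
    static int live;
    StubResources() { ++live; }    // real code: AMGX_resources_create(...)
    ~StubResources() { --live; }   // real code: AMGX_resources_destroy(...)
    StubResources(const StubResources&) = delete;
    StubResources& operator=(const StubResources&) = delete;
};
int StubResources::live = 0;

// Hypothetical interface class: per-instance config, shared resources.
class SolverInstance {
public:
    SolverInstance(std::string config_file, std::shared_ptr<StubResources> rsrc)
        : m_config_file(std::move(config_file)), m_resources(std::move(rsrc)) {}
    const std::string& config_file() const { return m_config_file; }
private:
    std::string m_config_file;                   // differs per physics
    std::shared_ptr<StubResources> m_resources;  // same object for all instances
};

// Returns {live handles while the solvers exist, live handles afterwards}.
std::pair<int, int> run_shared_resources_demo() {
    int while_in_use = 0;
    {
        auto rsrc = std::make_shared<StubResources>();
        SolverInstance fluid("fluid.json", rsrc);
        SolverInstance solid("solid.json", rsrc);
        rsrc.reset();  // creator lets go; the two instances keep the handle alive
        while_in_use = StubResources::live;
    }
    return {while_in_use, StubResources::live};
}
```

With shared ownership the single handle is released exactly once, after the last instance that uses it is gone, so no instance has to decide on its own when to destroy it.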

Jaberh commented 4 years ago

Thanks for the refs! Yes, the first reply also said that "one resource handle should be created". Besides, it is not only a matter of being safe: it will throw an exception for multiple instances, so sharing is the only way for now, I think. That did not solve the error though, as I need to have a separate comm for each resource handle, as per the first item on page 129.

Jaberh commented 4 years ago

Hi Marat, unfortunately the communicator can change between the runs, as some of the ranks might not participate in the communication because they have zero elements. If the resource handle also did not have to be unique, it would give much more flexibility for multi-physics simulations without the need for workaround logic. In a nutshell, my communicators are different: we do ICE simulations with multi-physics and everything is dynamic, so this pushes the limits of the solution a bit.

marsaev commented 4 years ago

Hi Jaberh, getting back to the issues.

Just to clarify, you still have an issue with multiple solvers: do you get an error when you try to create and use them?

Unfortunately the communicator can change between the runs, as some of the ranks might not participate in the communication because they have zero elements. If the resource handle also did not have to be unique, it would give much more flexibility for multi-physics simulations without the need for workaround logic. In a nutshell, my communicators are different: we do ICE simulations with multi-physics and everything is dynamic, so this pushes the limits of the solution a bit.

If I understand everything correctly, I think we briefly discussed this in another issue. Let me try to summarize:
1) You need to solve an arbitrary number of linear systems
2) The set of ranks that have at least one row of the matrix may change from system to system

A few follow-up questions:
1) How do you handle this in your current solver? I.e., if you need a collective MPI call for a particular linear system solve, do you create a separate comm for that?
2) What does a rank that has zero elements in the matrix do? Is it blocked until the participating ranks have finished the solve?
3) Does every global rank (from comm world) have a unique GPU assigned to it?

Jaberh commented 4 years ago

No Sir, the previous issue is resolved, and I modified the title as well to reflect the need. All the questions you have asked are related to previous issues. Basically, my last issue is that I need to be able to create multiple resources handles and free them without getting weird errors.

lengfeih commented 3 years ago

Hi Jaberh,

First, your destroy sequence is not the one recommended in the manual: "AMGX solver destroy must be called prior to AMGX matrix destroy." (page 65).
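One way to make such an ordering hard to get wrong is to let C++ destroy members in reverse declaration order. The sketch below is a minimal illustration under assumptions, not AMGX code: `Tracked`, `SolverBundle`, `g_destroy_log` and `destroy_order_demo` are hypothetical names, and each `Tracked` destructor stands in for the matching AMGX_*_destroy call. Only "solver before matrix" comes from the manual quote above; the rest of the order is simply the reverse of creation.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records the order in which the stand-in handles are released.
std::vector<std::string> g_destroy_log;

// Hypothetical stand-in for one AMGX handle; the destructor is where the
// matching AMGX_*_destroy call would go in real code.
struct Tracked {
    explicit Tracked(std::string n) : name(std::move(n)) {}
    ~Tracked() { g_destroy_log.push_back(name); }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
    std::string name;
};

// Members are declared in creation order. C++ destroys them in reverse
// declaration order, so the solver goes first and the config last.
struct SolverBundle {
    Tracked config{"config"};
    Tracked resources{"resources"};
    Tracked matrix{"matrix"};
    Tracked rhs{"rhs"};
    Tracked solution{"solution"};
    Tracked solver{"solver"};
};

// Builds one bundle, lets it go out of scope, and returns the release order.
std::vector<std::string> destroy_order_demo() {
    g_destroy_log.clear();
    {
        SolverBundle bundle;
    }
    return g_destroy_log;
}
```

Calling `destroy_order_demo()` yields the order solver, solution, rhs, matrix, resources, config, so the solver is always released before the matrix it was set up on.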

Hi Marat,

I also have the same issue. Here is my workflow: I have to delete the current solver each time I change to another solver, due to the memory leak issue.

Ideally, I would like to create the two solvers only once in the first timestep and only update the linear system matrix inside the loop. After the loop, I can delete the two solvers.

  AMGX_initialize();
  for (each timestep) {
      generate 1st linear system;
      create AMGX GMRES solver (including config/resource/matrix/vector/solver);
      solve the 1st linear system;
      delete the AMGX GMRES solver (including config/resource/matrix/vector/solver);

      generate 2nd linear system (with a different nonzero pattern from the 1st system) using the solution of the 1st system;
      create AMGX AMG solver (including config/resource/matrix/vector/solver);
      solve the 2nd linear system;
      delete the AMGX AMG solver (including config/resource/matrix/vector/solver);
  }
  AMGX_finalize();
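For contrast, the "ideal" flow described above (create both solvers once, update only the systems inside the loop) could look roughly like the sketch below. It is self-contained and does not call AMGX: `StubSolver`, `Counters`, `run_timesteps` and the config names are hypothetical. It also assumes that each system keeps its own sparsity pattern from one timestep to the next, which the comment above does not state; in real code the update step would presumably be something like AMGX_matrix_replace_coefficients followed by a solver setup.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Counts lifecycle events so the sketch can be checked without AMGX.
struct Counters {
    int created = 0;
    int destroyed = 0;
    int solves = 0;
};

// Hypothetical stand-in for one solver together with its matrix and vectors.
class StubSolver {
public:
    StubSolver(std::string config, Counters& c)
        : m_config(std::move(config)), m_counters(c) { ++m_counters.created; }
    ~StubSolver() { ++m_counters.destroyed; }
    StubSolver(const StubSolver&) = delete;
    StubSolver& operator=(const StubSolver&) = delete;

    // Real code: upload new coefficients, redo setup, then solve.
    void update_and_solve() { ++m_counters.solves; }
    const std::string& config() const { return m_config; }
private:
    std::string m_config;
    Counters& m_counters;
};

// Creates the two solvers once, reuses them for every timestep, and
// destroys them once after the loop.
Counters run_timesteps(int n_steps) {
    Counters c;
    {
        StubSolver gmres("gmres_config", c);  // 1st linear system
        StubSolver amg("amg_config", c);      // 2nd linear system
        for (int step = 0; step < n_steps; ++step) {
            gmres.update_and_solve();
            amg.update_and_solve();  // uses the solution of the 1st system
        }
    }
    return c;
}
```

For example, `run_timesteps(5)` performs ten solves but only two creations and two destructions, instead of ten of each as in the create/delete-per-step flow.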