ValeevGroup / mpqc

The Massively Parallel Quantum Chemistry program, MPQC, computes properties of atoms and molecules from first principles using the time independent Schrödinger equation.

MPI Comm collision #1

Closed asadchev closed 11 years ago

asadchev commented 11 years ago

The MPI memory group thread operates on MPI_COMM_WORLD. ARMCI also operates on MPI_COMM_WORLD, with no apparent way to create a new communicator (in the default PNNL version). Can we duplicate the MPI memory group comm?

asadchev commented 11 years ago

Apparently, MTMPIMemoryGrp::init_mtmpimg DOES duplicate the comm, but for whatever reason under MPICH2 there is a collision.

jeffhammond commented 11 years ago

I fixed this at some point. Which ARMCI are you using? It's trivial to swap MPI_COMM_WORLD for ARMCI's own world communicator. I know it's there for TCGMSG. I'll add it to ARMCI SVN today if I can get wireless at ACS.

jeffhammond commented 11 years ago

Details? Stack trace?

asadchev commented 11 years ago

At times I get this (impi built-in debugger crashes):

Assertion failed in file ch3u_request.c at line 149: MPIU_Object_get_ref(((req->dev.datatype_ptr))) >= 0

and

Assertion failed in file segment.c at line 495: 0

and finally, when I can get a stack trace:

0: [New Thread 0x7fffc7859700 (LWP 12746)]
0: [Thread 0x7fffc7859700 (LWP 12746) exited]
0: [New Thread 0x7fffc7859700 (LWP 12750)]
0:
0: Program received signal SIGSEGV, Segmentation fault.
0: [Switching to Thread 0x7fffc7859700 (LWP 12750)]
0: 0x00007ffff6522774 in MPID_Segment_init () from /usr/lib/libmpich.so.3
0: (gdb) bt
0: #0  0x00007ffff6522774 in MPID_Segment_init () from /usr/lib/libmpich.so.3
0: #1  0x00007ffff649614a in MPIDI_CH3_ReqHandler_PutRespDerivedDTComplete () from /usr/lib/libmpich.so.3
0: #2  0x00007ffff64a20ff in MPIDI_CH3_PktHandler_Put () from /usr/lib/libmpich.so.3
0: #3  0x00007ffff648eef0 in MPIDI_CH3I_Progress () from /usr/lib/libmpich.so.3
0: #4  0x00007ffff650d6b7 in PMPI_Recv () from /usr/lib/libmpich.so.3
0: #5  0x00000000010a6cfb in sc::MTMPIThread::run_one (this=0x7ffff7f90010) at /home/andrey/mpqc/src/lib/util/group/memmtmpi.cc:106
0: #6  0x00000000010a6c45 in sc::MTMPIThread::run (this=0x7ffff7f90010) at /home/andrey/mpqc/src/lib/util/group/memmtmpi.cc:86
0: #7  0x000000000109ef34 in sc::Thread::run_Thread_run (vth=0x7ffff7f90010) at /home/andrey/mpqc/src/lib/util/group/thread.cc:77
0: #8  0x000000000109ee10 in Thread__run_Thread_run (vth=0x7ffff7f90010) at /home/andrey/mpqc/src/lib/util/group/thread.cc:49
0: #9  0x00007ffff67eee9a in start_thread (arg=0x7fffc7859700) at pthread_create.c:308
0: #10 0x00007ffff520ccbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
0: #11 0x0000000000000000 in ?? ()

It kinda looks like an error caused by threads, BUT MPI is initialized with MPI_THREAD_MULTIPLE, and I further verify that:

MPI_Query_thread(&thread);
if (thread < MPI_THREAD_MULTIPLE) throw std::runtime_error("thread < MPI_THREAD_MULTIPLE");

jeffhammond commented 11 years ago

Is this thread multiple? ARMCI doesn't mutex MPI...

jeffhammond commented 11 years ago

I checked. ARMCI trunk uses its own world communicator. This was fixed in 5.1, but 5.0 is going to break.

ARMCI uses send/recv only for collectives, though. Does MPQC use ARMCI collectives?

Using ARMCI-MPI is one way to debug this. It's definitely proper with respect to MPI.

asadchev commented 11 years ago

No, no collectives. I am using the ARMCI from a recent MPICH distribution, but the system MPICH2 is 1.4. Going to try with MPICH3.

jeffhammond commented 11 years ago

Use ARMCI-MPI from the git repo. What's in the MPICH repo is old and has now been deleted.

asadchev commented 11 years ago

A newer MPICH seems to have solved the problem. I'd venture a guess it had something to do with thread safety being broken in the presence of MPI_THREAD_MULTIPLE. Will replace the MPICH2 ARMCI in MPQC with ARMCI-MPI from GitHub.