charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

NAMD crash on BG/Q xlc #481

Closed nikhil-jain closed 10 years ago

nikhil-jain commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/481


NAMD crashes on BG/Q during start up.

Further runs are not reported in enough detail for me to encode as above.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-23 21:26:32


I have located the commit that broke NAMD:

433874a75b6360d0e2b79c824bc9edfa752913d2 BGQ xlC #104: replace -qsmp options to handle missing __tls_get_addr with -qtls=local-exec, allowing static linking

This commit is not new; it worked as recently as last week. This Monday, ALCF updated the system software, and it has not worked since.

Since Phil committed this change, I will forward this to him.

PhilMiller commented 5 years ago

Original date: 2014-04-23 21:38:36


This is a driver or compiler problem, as documented by Nikhil and Ronak. The same code ran fine on Vulcan when it was integrated, and on Mira or Vesta last week. Recent system-side upgrades include V1R2M1 efix 29 and February 2014 XL compilers (XL C/C++ v12.1.0.7).

PhilMiller commented 5 years ago

Original date: 2014-04-23 21:38:58


We need an answer on this prior to release.

rbuch commented 5 years ago

Original date: 2014-04-25 22:52:18


I just built a version on Vesta, and it runs fine. This is using the latest Charm (ea1ff85093610ff60b034168cf791e0af1a7cdc4) and a fresh checkout of NAMD. I tested SMP and non-SMP.

My config line was: ./config BlueGeneQ-xlC --charm-base ~/ronak/charm/ --charm-arch pamilrts-bluegeneq --without-tcl --fftw-prefix /home/phil/ronak/namd2/src/fftw/

My Charm++ build line was: ./build charm++ pamilrts-bluegeneq -j16 --with-production

xlC reports version 12.01.0000.0006 (November 2013).

I retested using the same method described above, except with the February 2014 (12.01.0000.0007) compiler. Everything still works.

PhilMiller commented 5 years ago

Original date: 2014-04-25 22:53:33


Please include the details of the compiler and driver version used, and the charm++ build command. Optimization flags, in particular, may be relevant to this result.

jcphill commented 5 years ago

Original date: 2014-04-28 15:39:51


Please test pami-bluegeneq-async-smp-xlc and pamilrts-bluegeneq-async-smp-xlc.

Jim

rbuch commented 5 years ago

Original date: 2014-04-28 21:16:16


Both pami-bluegeneq-async-smp-xlc and pamilrts-bluegeneq-async-smp-xlc worked fine for me on Vesta, using the same configuration as my earlier post.

jcphill commented 5 years ago

Original date: 2014-04-29 20:15:26


I can't reproduce the crash either, using git HEAD v6.6.0-rc3-33-g2904aa0 from yesterday (April 28).

PhilMiller commented 5 years ago

Original date: 2014-04-30 03:03:00


This looks like it was some sort of transient version skew issue, so no one can reproduce it now.

nikhil-jain commented 5 years ago

Original date: 2014-04-30 14:49:13


Fails for me on Vesta with the latest Charm++ and NAMD from last week, using pamilrts-bluegeneq and driver:

/bgsys/drivers/V1R2M1/ppc64

PhilMiller commented 5 years ago

Original date: 2014-04-30 14:51:28


Ugh. Can we get Nikhil, Yanhua, Ronak, and Jim in a room looking at this all together, and see what's being done differently?

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-30 15:03:33


I tried yesterday and it still failed for me. Everything uses the default configuration. The only change I have is in .bashrc:

export PATH=/soft/perftools/bin:$PATH

I cannot think of anything else.

env output:

LD_LIBRARY_PATH=/bgsys/drivers/ppcfloor/comm/lib:/soft/compilers/ibmcmp-aug2013/vac/bg/12.1/bglib64:/soft/compilers/ibmcmp-aug2013/vacpp/bg/12.1/bglib64:/soft/compilers/ibmcmp-aug2013/xlf/bg/14.1/bglib64:/soft/compilers/ibmcmp-aug2013/xlmass/bg/7.3/bglib64:/soft/compilers/ibmcmp-aug2013/xlsmp/bg/3.1/bglib64:/dbhome/db2cat/sqllib/lib64:/dbhome/db2cat/sqllib/lib32
XLF_USR_CONFIG=/soft/compilers/ibmcmp-aug2013/xlf/bg/14.1/etc/V1R2M1.xlf.cfg.rhel6.5.gcc447
COMM_SELECT=xl
NLSPATH=/soft/compilers/ibmcmp-aug2013/msg/bg/%L/%N
MAIL=/var/spool/mail/jessie
PATH=/usr/lib64/qt-3.3/bin:/soft/perftools/darshan/darshan/bin:/soft/compilers/wrappers/xl:/home/jessie/bin:/soft/environment/softenv-1.6.2/bin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/bgsys/drivers/V1R2M1/ppc64/gnu-linux/bin:/bgsys/drivers/V1R2M1/ppc64/hlcs/bin:/soft/debuggers/scripts/bin:/soft/accttools/bin:/soft/compilers/ibmcmp-aug2013/vac/bg/12.1/bin:/soft/compilers/ibmcmp-aug2013/vacpp/bg/12.1/bin:/soft/compilers/ibmcmp-aug2013/xlf/bg/14.1/bin:/bgsys/drivers/ppcfloor/bin:/bgsys/drivers/ppcfloor/sbin:/dbhome/db2cat/sqllib/bin:/dbhome/db2cat/sqllib/adm:/dbhome/db2cat/sqllib/misc:/usr/lpp/mmfs/bin:/home/jessie/bin

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-30 15:05:03


For me, pamilrts-bluegeneq hangs during startup, which is even worse than a crash. The SMP build crashed.

jcphill commented 5 years ago

Original date: 2014-04-30 15:30:54


Maybe it's the aug2013 compilers in your path.

I'm using:

./build charm++ pami-bluegeneq async smp xlc --no-build-shared --with-production

Linux miralac1 2.6.32-431.el6.ppc64 #1 SMP Sun Nov 10 22:17:43 EST 2013 ppc64 ppc64 ppc64 GNU/Linux

/usr/lib64/qt-3.3/bin:/home/jphillip/bin:/soft/environment/softenv-1.6.2/bin:/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:/sbin:/usr/sbin:/bgsys/drivers/V1R2M1/ppc64/gnu-linux/bin:/bgsys/drivers/V1R2M1/ppc64/hlcs/bin:/soft/debuggers/scripts/bin:/soft/cobalt/bin:/soft/accttools/bin:/soft/compilers/ibmcmp-feb2014/vac/bg/12.1/bin:/soft/compilers/ibmcmp-feb2014/vacpp/bg/12.1/bin:/soft/compilers/ibmcmp-feb2014/xlf/bg/14.1/bin:/dbhome/db2cat/sqllib/bin:/dbhome/db2cat/sqllib/adm:/dbhome/db2cat/sqllib/misc:/usr/lpp/mmfs/bin:/home/jphillip/bin

./config BlueGeneQ-xlC --charm-base $HOME/charm-6.6 --charm-arch pami-bluegeneq-async-smp-xlc

Selected arch file arch/BlueGeneQ-xlC.arch contains:

NAMD_ARCH = BlueGeneQ
CHARMARCH = pamilrts-bluegeneq-async-smp-xlc

CXX = bgxlC_r -qstaticinline -qsuppress=1540-0448:1500-036 -DNO_SOCKET -DDUMMY_VMDSOCK -DNOHOSTNAME -DNO_CHDIR -DNO_STRSTREAM_H -DNO_GETPWUID -DARCH_POWERPC
CXXMEMUSAGE = $(CHARMC)
CXXOPTS = -O3 -Q -qhot
CXXNOALIASOPTS = -O3 -Q -qalias=noallptrs:notypeptr -qdebug=plst3:cycles -qdebug=QPLAT:QPLAT27 -DA2_QPX
CXXTHREADOPTS = -O3 -Q
CC = bgxlc_r
COPTS = -O3 -qhot

(NOTE: -qdebug=QPLAT:QPLAT27 is a NAMD-specific optimization for A2_QPX code.)

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-04-30 16:15:16


Yes. As Jim said, I am using the Aug 2013 compilers, while he is using Feb 2014. Nikhil, can you check yours?

Also, where do we configure which drivers to use?

PhilMiller commented 5 years ago

Original date: 2014-05-05 18:45:17


Since it's not recorded here, I can't tell whether Yanhua's questions were answered:

This is configured using the 'softenv' system, in the file ~/.soft. This file is read at login and when running the resoft command.
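
For illustration, a minimal ~/.soft selecting the Feb 2014 compilers might look like the sketch below; the key name is a guess based on the paths quoted above, so check the list of keys actually available on the machine:

+ibmcmp-feb2014
@default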

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-05 22:33:44


Ronak, can you check which drivers and compiler you used for your successful runs? This is pretty urgent.

rbuch commented 5 years ago

Original date: 2014-05-05 23:20:12


The runs on Mira and Vesta were done on Phil's account, so I don't have access to the machine right now. They were run with the default configuration, so if Phil could post the details, they would apply to my runs. Some details of my environment are in my earlier post.

PhilMiller commented 5 years ago

Original date: 2014-05-06 00:39:12


Ronak specified what compilers, build targets, and FFTW he used in comments #481-4 and #481-7. The only unspecified bit was that my environment references the current default V1R2M1 drivers.

PhilMiller commented 5 years ago

Original date: 2014-05-06 00:57:19


So, reviewing today's discussion systematically, here are the potential variables identified and the values I believe they could take on in these experiments.

In their full combinatorial explosion, that's at least 432 possibilities. Please start carrying out controlled experiments, enumerating the value of each of these variables, to narrow those down.

I'm going to try to extract the specific results reported so far into the issue description. For missing details, please edit it to fill in the runs you performed.

PhilMiller commented 5 years ago

Original date: 2014-05-06 01:09:15


Edit reports of your runs, successful, crash, or hang, into the issue description, so that we can see everything concisely and concretely.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-06 03:49:31


I switched the driver and compiler to Feb 2014. Still the same problem.

PhilMiller commented 5 years ago

Original date: 2014-05-07 20:37:55


Yanhua described over lunch how she got different results (I think) depending on whether she compiled Charm++ with --with-production or not. I believe the builds that had it ran, and those without it didn't. Here are the changes applied by --with-production:

OPTS="-optimize -production $OPTS"
CONFIG_OPTS="--disable-controlpoint --disable-tracing --disable-tracing-commthread --disable-charmdebug --disable-replay --disable-error-checking --disable-stats $CONFIG_OPTS"

The likely culprits out of that set would seem to be --disable-error-checking and -optimize.
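
A sketch of how to test those two suspects in isolation, assuming each option can be passed to ./build directly, just as --with-production passes it:

./build charm++ pamilrts-bluegeneq -j16 -optimize
./build charm++ pamilrts-bluegeneq -j16 --disable-error-checking

Comparing those two builds against a plain build should tell us which option is masking or exposing the crash.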

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-07 22:51:42


When I enable error checking but disable everything else, it reproduces the problem. The problem is somewhere in the error-checking code.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-08 02:44:26


The error for the crash in SMP is:

Reason: CqsEnqueueGeneral: invalid queueing strategy.

[4096] Assertion "startNode >=0 && startNode<_Cmi_numnodes" failed in file machine-broadcast.c line 104.
[34] Assertion "startNode >=0 && startNode<_Cmi_numnodes" failed in file machine-broadcast.c line 104.

PhilMiller commented 5 years ago

Original date: 2014-05-08 02:48:00


That looks like at least two different assertion failures. Are there others as well? What's the smallest node/process count you can run it and observe a crash? 4k seems rather large.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-08 23:28:09


It crashes even on 1 core. It is a double-free corruption.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-08 23:30:47


The stack trace shows the following:

/gpfs/vesta-home/jessie/NAMD/namd2/BlueGeneQ-xlC.nosmp/src/ObjectArena.h:21
/gpfs/vesta-home/jessie/NAMD/namd2/BlueGeneQ-xlC.nosmp/src/Molecule.C:3243 (delete tmpArena; tmpArena = 0;)

Commenting it out leads to a segfault later; however, no core dump is generated.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-09 02:40:45


I tried to use memory-paranoid. However, it complains about multiple definitions of free and malloc.

rbuch commented 5 years ago

Original date: 2014-05-09 03:25:15


I retested on Vulcan without --with-production, and it failed immediately with: *** glibc detected *** ./namd2: double free or corruption (!prev): 0x00000019c9409ae0 ***

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-09 05:07:21


My recent experiments show that it runs fine with -optimize (-O3). Without it, it crashes.

jcphill commented 5 years ago

Original date: 2014-05-09 15:27:24


Yanhua Sun wrote:

The stack trace shows the following:

/gpfs/vesta-home/jessie/NAMD/namd2/BlueGeneQ-xlC.nosmp/src/ObjectArena.h:21
/gpfs/vesta-home/jessie/NAMD/namd2/BlueGeneQ-xlC.nosmp/src/Molecule.C:3243 (delete tmpArena; tmpArena = 0;)

Commenting it out leads to a segfault later; however, no core dump is generated.

This code looks clean, and removing it shouldn't cause a segfault.

I would guess it's a race condition in the malloc/free implementation, except you say it happens even on one core. Is that an SMP or non-SMP build? Can you switch to a different malloc?

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-09 16:40:12


It is the non-SMP version running on 1 core.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-13 00:28:01


I tried DDT. I was able to submit jobs but could not see any output. If anyone has experience with how to use it, let me know.

PhilMiller commented 5 years ago

Original date: 2014-05-13 15:04:37


I think I've reproduced this on 2 PEs under Address Sanitizer on net-linux-x86_64-clang. Unfortunately, I can't seem to actually capture the output it's supposed to produce, because it dies before anything gets flushed. Here's the tail end of my output:

Info: 177 BONDS
Info: 435 ANGLES
Info: 446 DIHEDRAL
Info: 45 IMPROPER
Info: 0 CROSSTERM
Info: 83 VDW
Info: 6 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 1.5974e-05
Info: 
Info: Entering startup at 0.027694 s, 2.09716e+07 MB of memory in use
Info: Startup phase 0 took 0.0313032 s, 2.09716e+07 MB of memory in use
Charmrun: error on request socket--
Socket closed before recv.

Similarly, the PE 0 output ends with the "Startup phase 0 took . . ." line.

PhilMiller commented 5 years ago

Original date: 2014-05-13 16:12:15


I think I spoke too soon. It looks like I was actually just not using exactly the right NAMD input configuration, and the error was just sometimes getting swallowed. I'll need to sit with someone who knows NAMD better, and maybe we can make this work.

nikhil-jain commented 5 years ago

Original date: 2014-05-14 16:31:47


Here is what I found:

ResizeArray's behavior may be the culprit for the crashes. Malloc is doing its job right; the double free reported is actually a double free attempted, and it is caused by erroneous behavior of ResizeArray. After a certain number of entries are pushed to it (82, to be precise), all entries are set to the same value as the first entry. I tried replacing it with a vector in ObjectArena (where the crash was happening), and things went smoothly until a crash later (which may be due to a similar use of ResizeArray elsewhere; just a guess).

Why ResizeArray is behaving like this is not known to me (I did not look into it and may not get time to do so).

PhilMiller commented 5 years ago

Original date: 2014-05-14 18:34:17


I've gotten NAMD to build on Charm++ using bgclang with Address Sanitizer enabled for the whole assemblage. The resulting binary can be found at vesta:~phil/namd2.asan. If someone could run a test using that binary, we may be able to get more detailed information.

PhilMiller commented 5 years ago

Original date: 2014-05-14 18:40:01


Oh, yeah, an important note: This binary is dynamically linked, to support Address Sanitizer's restrictions. When running such binaries, I've had to set the following environment variable flag:

--env LD_LIBRARY_PATH=/bgsys/drivers/ppcfloor/comm/lib/:/bgsys/drivers/toolchain/V1R2M1_base_4.7.2/gnu-linux-4.7.2/powerpc64-bgq-linux/lib/

This requirement might be a bug in the systems and their toolchains, or it may just be a limitation we have to deal with.
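
For reference, a hypothetical invocation on Vesta might look like the sketch below (node/rank counts and the input file are placeholders, not from an actual run):

runjob --np 16 -p 16 --block $COBALT_PARTNAME \
  --envs LD_LIBRARY_PATH=/bgsys/drivers/ppcfloor/comm/lib/:/bgsys/drivers/toolchain/V1R2M1_base_4.7.2/gnu-linux-4.7.2/powerpc64-bgq-linux/lib/ \
  : ~phil/namd2.asan apoa1.namd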

nikhil-jain commented 5 years ago

Original date: 2014-05-14 20:41:51


We have narrowed it down to CmiMemcpy, which is producing spurious results. Maybe someone should look at CmiMemcpy on BG/Q to find out why.

nikhil-jain commented 5 years ago

Original date: 2014-05-14 21:12:49


Confirming the cause of the bug as CmiMemcpy: replacing it with a simple memcpy in Charm++ makes NAMD run just fine. For those who are debugging, the CmiMemcpy call inside ResizeArrayRaw.h is the one I have been toying with. For some invocations, it results in all entries in the array being assigned the same value (the value of the zeroth element). I am sending a mail to Sameer to confirm the utility of the special memcpy.
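
For anyone reproducing this, a sketch of the substitution (the exact definition site in the Charm++ source may differ):

/* bypass the QPX copy path and fall back to the libc memcpy */
#define CmiMemcpy(dst, src, n) memcpy((dst), (src), (n))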

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 21:16:58


I am currently debugging CmiMemcpy now, in case anyone else is also working on it.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 22:01:57


I just wrote a simple program to test CmiMemcpy on BG/Q. Even the simple test fails, due to CmiMemcpy_qpx. It is also true that it works with -O3 but fails without it.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 22:09:35


It seems that for every 32 bytes, the first 8 bytes are right and the other 24 bytes are wrong.
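
To illustrate, a minimal stand-in (plain C, not the actual QPX code) for the corruption pattern just described, where each 32-byte chunk ends up holding four copies of its first 8 bytes:

#include <stdio.h>

/* models the observed failure: only the first double of each 4-double
   (32-byte) chunk survives, replicated across the chunk */
static void splat_copy(double *dst, const double *src, int n) {
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src[(i / 4) * 4];
}

int main(void) {
    double s[8] = {0, 1, 2, 3, 4, 5, 6, 7}, d[8];
    int i;
    splat_copy(d, s, 8);
    for (i = 0; i < 8; i++)
        printf("%g ", d[i]);  /* prints: 0 0 0 0 4 4 4 4 */
    printf("\n");
    return 0;
}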

PhilMiller commented 5 years ago

Original date: 2014-05-14 22:14:29


Is the test code feeding it 32-byte aligned arguments?

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 22:36:36


Yes, it is 32-byte aligned.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 23:17:16


The problem is in the assembly load/store. In the -O3 case, it loads 32 bytes and stores 32 bytes. Without it, however, it seems to load only 8 bytes and fill the 32 bytes by repeating the 8-byte content:

asm volatile("qvlfdx %0,%1,%2": "=f"(fp) : "b" (si), "r" (sb));
asm volatile("qvstfdx %2,%0,%1": : "b" (si), "r" (sb), "f"(fp) : "memory");

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 23:40:52


I have written a minimal test that reproduces this bug. It is a standard C program (not Charm++). I guess we can give this test to ANL or IBM to figure out.

Compiled with bgxlc -O3 loadstore.c, it works. Compiled with bgxlc loadstore.c, it fails with output data errors.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

#define QPX_LOAD(si,sb,fp) \
  do { asm volatile("qvlfdx %0,%1,%2": "=f"(fp) : "b" (si), "r" (sb)); } while(0);

#define QPX_STORE(si,sb,fp) \
  do { asm volatile("qvstfdx %2,%0,%1": : "b" (si), "r" (sb), "f"(fp) :"memory"); } while(0);

#ifndef __GNUC__
#define FP_REG(i) asm("f"#i)
#define FP_REG1(i) "fr"#i
#else
#define FP_REG(i) asm("fr"#i)
#define FP_REG1(i) "fr"#i
#endif

/* Copy 512 bytes between 32-byte aligned pointers */
static inline size_t quad_copy_512( char *dest, const char *src )
{
    register double *fpp1_1, *fpp1_2;
    register double *fpp2_1, *fpp2_2;

    register double f0 FP_REG(0);
    register double f1 FP_REG(1);
    register double f2 FP_REG(2);
    register double f3 FP_REG(3);
    register double f4 FP_REG(4);
    register double f5 FP_REG(5);
    register double f6 FP_REG(6);
    register double f7 FP_REG(7);

    int r0 = 0;
    int r1 = 64;
    int r2 = 128;
    int r3 = 192;
    int r4 = 256;
    int r5 = 320;
    int r6 = 384;
    int r7 = 448;

    fpp1_1 = (double *)src;
    fpp1_2 = (double *)src + 4;

    fpp2_1 = (double *)dest;
    fpp2_2 = (double *)dest + 4;

    QPX_LOAD(fpp1_1,r0,f0);
    //asm volatile("qvlfdx 0,%0,%1": : "Ob" (fpp1_1), "r"(r0) :"memory");
    QPX_LOAD(fpp1_1,r1,f1);
    QPX_LOAD(fpp1_1,r2,f2);
    QPX_LOAD(fpp1_1,r3,f3);
    QPX_LOAD(fpp1_1,r4,f4);
    QPX_LOAD(fpp1_1,r5,f5);
    QPX_LOAD(fpp1_1,r6,f6);
    QPX_LOAD(fpp1_1,r7,f7);

    QPX_STORE(fpp2_1,r0,f0);
    QPX_STORE(fpp2_1,r1,f1);
    QPX_STORE(fpp2_1,r2,f2);
    QPX_STORE(fpp2_1,r3,f3);
    QPX_STORE(fpp2_1,r4,f4);
    QPX_STORE(fpp2_1,r5,f5);
    QPX_STORE(fpp2_1,r6,f6);
    QPX_STORE(fpp2_1,r7,f7);

    QPX_LOAD(fpp1_2,r0,f0);
    QPX_LOAD(fpp1_2,r1,f1);
    QPX_LOAD(fpp1_2,r2,f2);
    QPX_LOAD(fpp1_2,r3,f3);
    QPX_LOAD(fpp1_2,r4,f4);
    QPX_LOAD(fpp1_2,r5,f5);
    QPX_LOAD(fpp1_2,r6,f6);
    QPX_LOAD(fpp1_2,r7,f7);

    QPX_STORE(fpp2_2,r0,f0);
    QPX_STORE(fpp2_2,r1,f1);
    QPX_STORE(fpp2_2,r2,f2);
    QPX_STORE(fpp2_2,r3,f3);
    QPX_STORE(fpp2_2,r4,f4);
    QPX_STORE(fpp2_2,r5,f5);
    QPX_STORE(fpp2_2,r6,f6);
    QPX_STORE(fpp2_2,r7,f7);
    return 0;
}

void CmiMemcpy_qpx (void *dst, const void *src, size_t n)
{
    const char *s = src;
    char *d = dst;
    int n512 = n >> 9;
    if ( ((long)d & 31) != 0 || ((long)s & 31) != 0 ) {
        printf("need to be aligned \n");
        exit(0);
    }

    while (n512 > 0) {
        printf("-------%p %p in cmimemcpy %d out of %d \n", d, s, n512, (int)n);
        quad_copy_512(d, s);
        d += 512;
        s += 512;
        n512--;
    }

    if ( (n & 511UL) != 0 )
        memcpy (d, s, n & 511UL);
}

int main() {
    int i;
    double *d1 = (double*) memalign(32, 80*sizeof(double));
    double *d2 = (double*) memalign(32, 80*sizeof(double));
    for(i=0; i<80; i++)
        d1[i] = i;
    CmiMemcpy_qpx(d2, d1, 80*sizeof(double));
    for(i=0; i<80; i++)
    {
        if(d2[i] != i)
            printf("------error-%d  %d - %f \n", i, (int)(i*sizeof(double)), d2[i]);
        else
            printf("------right-%d  %d - %f \n", i, (int)(i*sizeof(double)), d2[i]);
    }
    return 0;
}

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-15 14:49:14


I have sent the test code and object files to Sameer.

PhilMiller commented 5 years ago

Original date: 2014-05-15 18:28:31


Could we get a performance comparison of NAMD compiled against Charm++ with -O3 --with-production, using the current CmiMemcpy versus the platform-native memcpy? If there's no longer a difference, then we should just switch over to the native one. If there is a difference, maybe we could put the QPX version under a flag controlled by charmc -production (as we have on bluegenep for some network-layer optimizations).
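
A sketch of a microbenchmark for that comparison (buffer size and iteration count are arbitrary; CmiMemcpy_qpx is assumed to be linked in from Yanhua's loadstore.c above, with its main() removed and the debugging printf inside its copy loop dropped so it doesn't dominate the timing):

#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

void CmiMemcpy_qpx(void *dst, const void *src, size_t n);  /* from loadstore.c */

static void libc_copy(void *d, const void *s, size_t n) { memcpy(d, s, n); }

static double time_copy(void (*copy)(void *, const void *, size_t),
                        void *d, const void *s, size_t n, int iters) {
    struct timespec t0, t1;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        copy(d, s, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int main(void) {
    size_t n = 1 << 20;  /* 1 MB, 32-byte aligned as CmiMemcpy_qpx requires */
    void *src = memalign(32, n), *dst = memalign(32, n);
    memset(src, 1, n);
    printf("memcpy:        %f s\n", time_copy(libc_copy, dst, src, n, 1000));
    printf("CmiMemcpy_qpx: %f s\n", time_copy(CmiMemcpy_qpx, dst, src, n, 1000));
    return 0;
}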