charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

XC30 ChaNGa crashes at startup #486

Closed PhilMiller closed 10 years ago

PhilMiller commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/486


Per Tom, running on the Swiss machine Piz Daint:

Charm++> Running on Gemini (GNI) with 128 processes
Charm++> static SMSG
Charm++> SMSG memory: 632.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 128,  7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.0-rc3-34-gdaac868
CharmLB> Load balancer assumes all CPUs are same.
CharmLB> Load balancing instrumentation for communication is off.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-7
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 128 unique compute nodes (16-way SMP).
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
WARNING: bOverwrite parameter ignored.
ChaNGa version 3.0, commit v3.0-64-g2698882
Running on 896 processors/ 128 nodes with 65536 TreePieces
.
. (possible missing text not provided)
.
Domain decomposition...SFC Peano-Hilbert
Created 65536 pieces of tree
Loading particles ... trying Tipsy ... took 2.413703 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
[1022] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 1022 Exiting: Called CmiAbort ------------
Reason:
[953] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 953 Exiting: Called CmiAbort ------------
PhilMiller commented 5 years ago

Original date: 2014-05-05 17:05:08


Someone who has the necessary knowledge and access to an XC30 on an appropriate allocation needs to claim this issue. Possible machines include Eos at ORNL, Edison at NERSC, and Piz Daint itself.

When building, running, and reporting results please document the compiler and build command used. Discussion, confirmations, and conflicting results are utterly meaningless without that.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-07 16:25:06


Can Harshitha or Lukasz (from the ChaNGa group), who have Edison accounts, reproduce this bug on Edison?

trquinn commented 5 years ago

Original date: 2014-05-07 22:55:20


On Wed, 7 May 2014, Yanhua Sun wrote:

Hi Tom, Phil said you reported a crash on XC30. Can you tell me how you built Charm++ so I can reproduce the bug?

First the modules I have loaded (note in particular I'm using GCC):

  1) modules/3.2.6.7
  2) eswrap/1.1.0-1.010400.915.0
  3) switch/1.0-1.0501.47124.1.93.ari
  4) craype-network-aries
  5) craype/2.05
  6) craype-sandybridge
  7) slurm
  8) cray-mpich/6.2.2
  9) gcc/4.8.2
 10) totalview-support/1.1.4
 11) totalview/8.11.0
 12) cray-libsci/12.1.3
 13) udreg/2.3.2-1.0501.7914.1.13.ari
 14) ugni/5.0-1.0501.8253.10.22.ari
 15) pmi/5.0.2-1.0000.9906.117.2.ari
 16) dmapp/7.0.1-1.0501.8315.8.4.ari
 17) gni-headers/3.0-1.0501.8317.12.1.ari
 18) xpmem/0.1-2.0501.48424.3.3.ari
 19) job/1.5.5-0.1_2.0501.48066.2.43.ari
 20) csa/3.0.0-1_2.0501.47112.1.91.ari
 21) dvs/2.4_0.9.0-1.0501.1672.2.122.ari
 22) alps/5.1.1-2.0501.8507.1.1.ari
 23) rca/1.0.0-2.0501.48090.7.46.ari
 24) atp/1.7.1
 25) PrgEnv-gnu/5.1.29
 26) craype-hugepages8M

And the build command itself:

./build ChaNGa gni-crayxc hugepages smp -j4 -O2

For ChaNGa it is: ./configure --enable-bigkeys; make

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-07 22:58:44


I do not know whether this is related to building with or without --with-production. NAMD also has crashes in machine-broadcast.c when --with-production is not used, so I suspect the same thing may be happening in ChaNGa.

Tom, can you tell me how you built Charm++?

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-07 23:03:42


Hi Tom

Can you try building Charm++ with --with-production?
./build ChaNGa gni-crayxc hugepages smp -j4 --with-production

trquinn commented 5 years ago

Original date: 2014-05-07 23:20:38


Building with:

./build ChaNGa gni-crayxc hugepages smp -j4 --with-production

and then building ChaNGa gives a different error:


Charm++> Running on Gemini (GNI) with 128 processes
Charm++> static SMSG
Charm++> SMSG memory: 632.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 128,  7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.0-rc3-34-gdaac868
CharmLB> Load balancer assumes all CPUs are same.
CharmLB> Load balancing instrumentation for communication is off.
Charm++> cpu affinity enabled. 
Charm++> cpuaffinity PE-core map : 1-7
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 128 unique compute nodes (16-way SMP).
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
WARNING: bOverwrite parameter ignored.
ChaNGa version 3.0, commit v3.0-64-g2698882
Running on 896 processors/ 128 nodes with 65536 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
Created 65536 pieces of tree
Loading particles ... trying Tipsy ... took 2.452627 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
------------- Processor 901 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2012.938MB)
------------- Processor 140 Exiting: Called CmiAbort ------------
Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 142 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 143 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 144 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 665 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 145 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 141 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 146 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 1022 Exiting: Called CmiAbort ------------
Reason: GNI_RC_CHECK [1022] registerFromMempool; err=GNI_RC_INVALID_PARAM
------------- Processor 926 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2004.550MB)
------------- Processor 1012 Exiting: Called CmiAbort ------------
Reason: GNI_RC_CHECK [1012] registerFromMempool; err=GNI_RC_INVALID_PARAM
------------- Processor 991 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 1996.161MB)
------------- Processor 1021 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2038.104MB)
------------- Processor 854 Exiting: Called CmiAbort ------------
Reason: Converse zero handler executed-- was a message corrupted?

------------- Processor 965 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 2021.327MB)
------------- Processor 983 Exiting: Called CmiAbort ------------
Reason: GNI_RC_CHECK [983] registerFromMempool; err=GNI_RC_PERMISSION_ERROR
[901] Stack Traceback:
  [901:0] [0x20246125]
  [901:1] [0x201aa611]
  [901:2] [0x2024ff3f]
  [901:3] [0x2024a470]
  [901:4] [0x2024b64b]
  [901:5] [0x2024d01e]
  [901:6] [0x2024d3bd]
  [901:7] [0x2024d445]
  [901:8] [0x202776f6]
  [901:9] [0x203ad2f9]
aborting job:
Could not malloc()--are we out of memory? (used: 2012.938MB)

Here are the symbols for that traceback:

trq@daint104:~/scratch_daint> gdb ./ChaNGa.gni
GNU gdb (GDB) SUSE (7.5.1-1.0000.0.3.1)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /scratch/daint/trq/ChaNGa.gni...done.
(gdb) x/i 0x20246125
   0x20246125 <CmiAbortHelper+101>:           add $0x8,%rsp
(gdb) x/i 0x201aa611
   0x201aa611 <CmiOutOfMemory+97>:            add $0xd8,%rsp
(gdb) x/i 0x2024ff3f
   0x2024ff3f <CmiAlloc+63>:                  mov 0x8(%rsp),%rax
(gdb) x/i 0x2024a470
   0x2024a470 <PumpNetworkSmsg+656>:          mov 0x48(%rsp),%rsi
(gdb) x/i 0x2024b64b
   0x2024b64b <LrtsAdvanceCommunication+27>:  callq 0x20248850
(gdb) x/i 0x2024d01e
   0x2024d01e <CommunicationServer.isra.17+14>: mov 0x2032a9a4(%rip),%eax   # 0x405779c8
(gdb) x/i 0x2024d3bd
   0x2024d3bd <ConverseRunPE+685>:            xor %eax,%eax

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-08 01:38:00


It seems it runs out of memory. Maybe you can try the same particle system using more nodes?

trquinn commented 5 years ago

Original date: 2014-05-08 03:02:12


I'll give it a go, but note that 2000 MB is less than 10% of the available memory/node on Piz Daint.

trquinn commented 5 years ago

Original date: 2014-05-08 15:40:50


I ran it on 256 nodes and got a similar error message:

.
.
.
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
------------- Processor 1879 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 1761.702MB)

trquinn commented 5 years ago

Original date: 2014-05-08 16:49:00


I decided to find out how much memory it was asking for when it failed, so I stuck in a statement that looks like this:

diff --git a/src/conv-core/convcore.c b/src/conv-core/convcore.c
index 59b7bf4..152c058 100644
--- a/src/conv-core/convcore.c
+++ b/src/conv-core/convcore.c
@@ -2876,6 +2876,7 @@ void *CmiAlloc(int size)
   res =(char *) malloc_nomigrate(size+sizeof(CmiChunkHeader));
 #endif

+  if(res == NULL) CmiError("Failed malloc of %d bytes\n", size);
   _MEMCHECK(res);

 #ifdef MEMMONITOR

And (running on 128 nodes again) I get:

.
.
.
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
Failed malloc of -1610612736 bytes
Failed malloc of -1073741824 bytes
Failed malloc of -1073741824 bytes
Failed malloc of -2147483648 bytes
Failed malloc of -313925248 bytes
Failed malloc of -1610612736 bytes
------------- Processor 927 Exiting: Called CmiAbort ------------
Reason: Could not malloc() -1 bytes--are we out of memory? (used :1870.332MB)
------------- Processor 970 Exiting: Called CmiAbort ------------
Reason: Could not malloc() -1 bytes--are we out of memory? (used :1778.057MB)
Failed malloc of -536870912 bytes
Failed malloc of -2147483648 bytes
Failed malloc of -165821712 bytes
Failed malloc of -247029697 bytes
------------- Processor 925 Exiting: Called CmiAbort ------------
.
.
.

I assume we don't expect malloc() to be happy with asking for negative amounts of memory.
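
For illustration, the negative sizes are what a 64-bit length above 2 GiB looks like once it is truncated to CmiAlloc's 32-bit int parameter. A minimal stand-alone sketch (not Charm++ code; the helper name is made up, and the wrap-around behavior assumes a typical two's-complement platform):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical illustration: a 64-bit message length is truncated when it
 * is passed through a 32-bit "int size" parameter such as CmiAlloc's. */
static void show_truncation(size_t requested) {
    int as_int = (int)requested;   /* wraps modulo 2^32 on typical platforms */
    printf("requested %zu bytes -> seen as %d\n", requested, as_int);
}

int main(void) {
    show_truncation(2684354560u);  /* 2.5 GiB -> -1610612736, as in the log */
    show_truncation(3221225472u);  /* 3.0 GiB -> -1073741824 */
    show_truncation(2147483648u);  /* 2.0 GiB -> -2147483648 */
    return 0;
}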

nikhil-jain commented 5 years ago

Original date: 2014-05-12 16:20:49


It is far-fetched, but may be worth trying: yesterday I found a bug in the typedef for INT8 on GNI-based machines. Phil implemented a generic fix which was just merged. Can Harshitha/Tom give it a shot with current master?

trquinn commented 5 years ago

Original date: 2014-05-12 17:56:03


Not good: it doesn't compile:

../bin/charmc -O2 -I. -c -o RandCentLB.o RandCentLB.C
In file included from lbdb.h:9:0,
                 from LBDatabase.h:9,
                 from BaseLB.h:9,
                 from CentralLB.h:9,
                 from DummyLB.h:9,
                 from DummyLB.C:6:
converse.h:617:22: error: conflicting declaration 'typedef CmiUInt8 CmiIntPtr'
 typedef CmiUInt8 CmiIntPtr;
                      ^

git show:
commit e860e5b24cf003c9703578b726e5ca9beec8b5d0
Author: Phil Miller <mille121@illinois.edu>
Date:   Sun May 11 22:53:19 2014 -0500

cc --version: gcc (GCC) 4.8.2 20131016 (Cray Inc.)

cat /etc/SUSE-release: SUSE Linux Enterprise Server 11 (x86_64) VERSION = 11 PATCHLEVEL = 2

nikhil-jain commented 5 years ago

Original date: 2014-05-12 18:33:05


Sorry about that. SMP compilation was not tested - Phil checked in a fix. Please try again.

harshithamenon commented 5 years ago

Original date: 2014-05-14 16:35:13


I tested with dwf1.6144 data against the master branch of charm and it crashes.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-14 16:47:54


Currently, I have no account on Edison to debug this problem. The turnaround time on Eos is pretty long.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-20 05:15:07


I have run -O3 non-SMP ApoA1 NAMD with cmimemcpy_qpx and with memcpy, on 1 core and on 16 cores. There is almost no difference in performance. We can probably just use memcpy.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-05-28 17:51:15


What is the progress on this problem? Currently it seems to be the only problem preventing the release.

trquinn commented 5 years ago

Original date: 2014-05-28 18:00:09


Here is a clip of the latest email exchange. Bottom line: it works on EOS, but still fails on Piz Daint:

Looks like Piz Daint has a newer software stack than Eos. Here is what I have on Eos.

1) modules/3.2.6.7 2) eswrap/1.0.20-1.010200.643.0 3) switch/1.0-1.0500.41328.1.120.ari 4) craype-network-aries 5) cray-mpich/6.0.2 6) atp/1.6.3 7) rca/1.0.0-2.0500.41336.1.120.ari 8) dvs/2.3_0.9.0-1.0500.1522.1.180 9) csa/3.0.0-1_2.0500.41366.1.129.ari 10) job/1.5.5-0.1_2.0500.41368.1.92.ari 11) xpmem/0.1-2.0500.41356.1.11.ari 12) gni-headers/3.0-1.0500.7161.11.4.ari 13) dmapp/6.0.1-1.0500.7263.9.31.ari 14) pmi/4.0.1-1.0000.9725.84.2.ari 15) ugni/5.0-1.0500.0.3.306.ari 16) udreg/2.3.2-1.0500.6756.2.10.ari 17) cray-libsci/12.1.01 18) gcc/4.8.1 19) craype/1.06 20) craype-sandybridge 21) altd/1.0 22) lustredu/1.3 23) DefApps 24) PrgEnv-gnu/5.0.41 25) git/1.8.3.4 26) cray-hdf5/1.8.11 27) netcdf/4.3.0 28) fftw/3.3.0.4 29) craype-hugepages8M 30) stat/2.0.0.1

Gengbin

On 5/27/2014 1:18 PM, Tom Quinn wrote:

Here is what I have on Piz Daint:

module list Currently Loaded Modulefiles: 1) modules/3.2.6.7 2) eswrap/1.1.0-1.010400.915.0 3) switch/1.0-1.0501.47124.1.93.ari 4) craype-network-aries 5) craype/2.05 6) craype-sandybridge 7) slurm 8) cray-mpich/6.2.2 9) gcc/4.8.2 10) totalview-support/1.1.4 11) totalview/8.11.0 12) cray-libsci/12.1.3 13) udreg/2.3.2-1.0501.7914.1.13.ari 14) ugni/5.0-1.0501.8253.10.22.ari 15) pmi/5.0.2-1.0000.9906.117.2.ari 16) dmapp/7.0.1-1.0501.8315.8.4.ari 17) gni-headers/3.0-1.0501.8317.12.1.ari 18) xpmem/0.1-2.0501.48424.3.3.ari 19) job/1.5.5-0.1_2.0501.48066.2.43.ari 20) csa/3.0.0-1_2.0501.47112.1.91.ari 21) dvs/2.4_0.9.0-1.0501.1672.2.122.ari 22) alps/5.1.1-2.0501.8507.1.1.ari 23) rca/1.0.0-2.0501.48090.7.46.ari 24) atp/1.7.1 25) PrgEnv-gnu/5.1.29 26) craype-hugepages8M

Tom Quinn
Astronomy, University of Washington
Internet: trq@astro.washington.edu
Phone: 206-685-9009

On Tue, 27 May 2014, Gengbin Zheng wrote:

Maybe we should look at and compare the module versions on both machines (Piz Daint and Eos). I am not sure how much time I have this week looking into this problem to be able to repeat the bug before I go to China next week (for a month). I have some SC papers to review too, due this Saturday.

Gengbin

On Tue, May 27, 2014 at 10:33 AM, Tom Quinn <trq@astro.washington.edu> wrote:

I've tried to use all the same command line arguments, and I still get:

...
Loading particles ... trying Tipsy ... took 2.490938 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
[1023] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 1023 Exiting: Called CmiAbort ------------


One thing that is different: Piz Daint only has 8 physical cores/node, although hyperthreading is enabled. Does Eos have 16 physical cores, or are you using hyperthreading?

Tom Quinn
Astronomy, University of Washington
Internet: trq@astro.washington.edu
Phone: 206-685-9009

On Fri, 23 May 2014, Gengbin Zheng wrote:

I just ran this benchmark on 128 nodes of eos, it works for me. My command line:

aprun -n 256 -N 2 ./ChaNGa +stacksize 2000000 -D 3 -p 65536 -wall 30 -killat 2000 T_S.param ++ppn 15 +commap 0,16 +pemap 1-15,17-31 +setcpuaffinity +balancer MultistepLB_notopo +LBPeriod 0.0 +LBDebug 1 +noAnytimeMigration +LBCommOff

Gengbin

harshithamenon commented 5 years ago

Original date: 2014-06-09 02:55:33


Turning on checksum_flag in machine.c throws a 'checksum doesn't agree' error on Eos and Piz Daint, even with smaller core counts and smaller datasets.

On Piz Daint

Charm++ was built the following way with the Intel compiler and checksum_flag set:
./build ChaNGa gni-crayxc hugepages smp -j8

ChaNGa was run on the cube300 dataset on 8 nodes in SMP mode:
./ChaNGa +stacksize 2000000 -D 1 -wall 5 cube300.param ++ppn 7 +commap 0 +pemap 1-7 +setcpuaffinity

Error thrown: Checksum doesn't agree!

The checksum error is happening in machine.c in the gni-crayxc layer, in the PumpLocalTransactions function at line 2932 where the checksum is checked.

I tried other examples such as hello and stencil3d. stencil3d throws the checksum error after a few load balancing steps:
./stencil3d 256 32 ++ppn 7 +commap 0 +pemap 1-7 +setcpuaffinity +balancer GreedyLB +LBDebug 1

PhilMiller commented 5 years ago

Original date: 2014-06-12 22:42:04


Could there be a relation between this and bug #401? I see bits about negative message sizes and corruption in both reports.

harshithamenon commented 5 years ago

Original date: 2014-06-18 19:12:04


The checksum crash happens for other programs such as stencil3d, so it doesn't seem to be specific to ChaNGa.

PhilMiller commented 5 years ago

Original date: 2014-06-22 21:25:13


I've reproduced this with a slightly simpler setting on Eos:

Currently Loaded Modulefiles:
  1) modules/3.2.6.7
  2) eswrap/1.0.20-1.010200.643.0
  3) switch/1.0-1.0500.41328.1.120.ari
  4) craype-network-aries
  5) cray-mpich/6.0.2
  6) netcdf/4.3.0
  7) atp/1.6.3
  8) rca/1.0.0-2.0500.41336.1.120.ari
  9) dvs/2.3_0.9.0-1.0500.1522.1.180
 10) csa/3.0.0-1_2.0500.41366.1.129.ari
 11) job/1.5.5-0.1_2.0500.41368.1.92.ari
 12) xpmem/0.1-2.0500.41356.1.11.ari
 13) gni-headers/3.0-1.0500.7161.11.4.ari
 14) dmapp/6.0.1-1.0500.7263.9.31.ari
 15) pmi/4.0.1-1.0000.9725.84.2.ari
 16) ugni/5.0-1.0500.0.3.306.ari
 17) udreg/2.3.2-1.0500.6756.2.10.ari
 18) cray-libsci/12.1.01
 19) gcc/4.8.1
 20) craype/1.06
 21) craype-sandybridge
 22) altd/1.0
 23) lustredu/1.4
 24) DefApps
 25) git/1.8.3.4
 26) subversion/1.8.3
 27) hdf5/1.8.11
 28) cray-parallel-netcdf/1.3.1.1
 29) craype-hugepages8M
 30) PrgEnv-gnu/5.0.41

Note this is with GCC and not Intel compilers.

The Charm++ build is gni-crayxc-hugepages (non-SMP), built as ./build charm++ gni-crayxc hugepages -j12 -g (with gni/machine.c modified to set checksum_flag = 1). The command run was:

 aprun -n 8 -N 1 -d 1 ./stencil3d.mono 256 32 +balancer RotateLB +LBDebug 1

The checksum disagreements always seem to happen at load balancing time. I'll hypothesize that the higher message traffic at those points is part of the cause.

PhilMiller commented 5 years ago

Original date: 2014-06-22 21:29:28


Simplifying even further, I can reproduce this on just two nodes, one PE on each, with smaller blocks, at the first LB:

> aprun -n 2 -N 1 -d 1 ./stencil3d.mono 128 16 +balancer RotateLB +LBDebug 3
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.6.0-rc3-81-g019db03
CharmLB> Verbose level 3, load balancing period: 0.5 seconds
CharmLB> Topology torus_nd_5 alpha: 3.500000e-05s beta: 8.500000e-09s.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 unique compute nodes (32-way SMP).
[0] RotateLB created

STENCIL COMPUTATION WITH BARRIERS
Running Stencil on 2 processors with (8, 8, 8) chares
Array Dimensions: 128 128 128
Block Dimensions: 16 16 16
[1] Time per iteration: 4.491426 4.561163
[2] Time per iteration: 6.003448 10.564633
[3] Time per iteration: 5.954097 16.518754
[4] Time per iteration: 5.948594 22.467371
[5] Time per iteration: 5.954061 28.421455
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!

[0] Stack Traceback:
  [0:0] [0x2014e2ae]
  [0:1] [0x2014e2f3]
  [0:2] [0x201512e3]
  [0:3] [0x20151e55]
  [0:4] [0x2014e0ec]
  [0:5] [0x2014e400]
  [0:6] [0x20159648]
aborting job:
Fatal error: checksum doesn't agree!

  [0:7] [0x2015a333]
  [0:8] [0x20155894]
  [0:9] [0x20155bad]
  [0:10] [0x20155ab8]
  [0:11] [0x2014e0b0]
  [0:12] [0x2014dfcd]
  [0:13] [0x2003e19a]
  [0:14] [0x20257491]
  [0:15] [0x20000629]
[NID 00564] 2014-06-22 17:28:00 Apid 554317: initiated application termination
Application 554317 exit codes: 255
Application 554317 exit signals: Killed
Application 554317 resources: utime ~30s, stime ~0s, Rss ~3524, inblocks ~34601, outblocks ~93155
pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-06-23 10:56:22


Actually, it might be that the checksum code itself does not work. The one in git is definitely wrong, since it always checks whether the checksum is zero or not. However, even after I fixed it, it still fails, even on Hopper (Cray XE6). I am digging into the problem now.

nikhil-jain commented 5 years ago

Original date: 2014-06-23 14:07:59


The checksum code and its check in git seem correct. The checksum of the incoming message is compared to zero because it includes an XOR of a freshly computed checksum of the original message with the checksum of the original message stored in the message, which would be zero if nothing went wrong.
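
For illustration, a minimal sketch of that scheme (the names computeChecksum, stampChecksum, checksumAgrees are made up, not the actual GNI machine-layer routines): the sender stores the checksum of the message computed with the checksum field zeroed, so recomputing over the received message XORs the stored value back out, and an intact message always yields zero.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical XOR-fold checksum over a message buffer. */
static uint8_t computeChecksum(const unsigned char *msg, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum ^= msg[i];
    return sum;
}

/* Sender: zero the checksum field, then store the checksum of the
 * whole message (including the zeroed field). */
void stampChecksum(unsigned char *msg, size_t len, size_t cksumOffset) {
    msg[cksumOffset] = 0;
    msg[cksumOffset] = computeChecksum(msg, len);
}

/* Receiver: recomputing over the stamped message XORs the stored
 * checksum back out, so an intact message always yields zero. */
int checksumAgrees(const unsigned char *msg, size_t len) {
    return computeChecksum(msg, len) == 0;
}

int main(void) {
    unsigned char msg[64] = {0};
    msg[10] = 0x42;                      /* pretend payload byte */
    stampChecksum(msg, sizeof(msg), 0);  /* checksum stored in byte 0 */
    return checksumAgrees(msg, sizeof(msg)) ? 0 : 1;
}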

rbuch commented 5 years ago

Original date: 2014-06-23 22:14:40


Reproduced on Edison for NAMD with Intel compilers.

  1) modules/3.2.6.7                       11) pmi/5.0.3-1.0000.9981.128.2.ari       21) PrgEnv-intel/5.1.29
  2) nsg/1.2.0                             12) dmapp/7.0.1-1.0501.8315.8.4.ari       22) craype-ivybridge
  3) eswrap/1.1.0-1.010400.915.0           13) gni-headers/3.0-1.0501.8317.12.1.ari  23) cray-shmem/6.3.1
  4) switch/1.0-1.0501.47124.1.93.ari      14) xpmem/0.1-2.0501.48424.3.3.ari        24) cray-mpich/6.3.1
  5) craype-network-aries                  15) job/1.5.5-0.1_2.0501.48066.2.43.ari   25) torque/4.2.7
  6) craype/2.1.1                          16) csa/3.0.0-1_2.0501.47112.1.91.ari     26) moab/7.2.7-e7c070d1-b3-SUSE11
  7) intel/14.0.2.144                      17) dvs/2.4_0.9.0-1.0501.1672.2.122.ari   27) altd/1.0
  8) cray-libsci/12.2.0                    18) alps/5.1.1-2.0501.8471.1.1.ari        28) usg-default-modules/1.0
  9) udreg/2.3.2-1.0501.7914.1.13.ari      19) rca/1.0.0-2.0501.48090.7.46.ari       29) craype-hugepages8M
 10) ugni/5.0-1.0501.8253.10.22.ari        20) atp/1.7.2
aprun -n 24 -d 2 ./namd2 /global/homes/r/ronak/data/jac/jac.namd
Charm++> Running on Gemini (GNI) with 24 processes
Charm++> static SMSG
Charm++> SMSG memory: 118.5KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 24,  1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.6.0-rc3-82-gfa9e047
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 unique compute nodes (48-way SMP).
[ eliding NAMD startup output ]
Info: 
Info: Entering startup at 0.0792558 s, 1164.36 MB of memory in use
Info: Startup phase 0 took 0.000591237 s, 1164.36 MB of memory in use
------------- Processor 25 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!

aborting job:
Fatal error: checksum doesn't agree!

------------- Processor 36 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!

aborting job:
Fatal error: checksum doesn't agree!

[NID 00598] 2014-06-23 15:11:30 Apid 5821863: initiated application termination
Application 5821863 exit codes: 255
Application 5821863 exit signals: Killed
Application 5821863 resources: utime ~1s, stime ~3s, Rss ~11300, inblocks ~36580, outblocks ~95358
pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-06-24 03:12:10


I checked the code, and after the checksum is set, some field (a seqID) is modified. Therefore it is almost certain not to agree. Since we never used the checksum for testing before, it was not carefully examined. I am fixing this.

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-06-24 11:31:02


I found the problem with the checksum. On the sender side, we do the checksum calculation; on the receiver side, for an RDMA transaction, the message size is aligned first. Therefore the size becomes different and the checksum is wrong. I fixed this problem and checked the fix into gerrit. I tested it on stencil3d and it works. Please help check other applications (ChaNGa, NAMD) to see whether the checksum problem still exists. http://charm.cs.uiuc.edu/gerrit/298

This also gives me a possible hint about the ChaNGa error. After alignment, the RDMA transfer size is different from the original message size; this might lead to an RDMA transaction error. I am thinking about how to fix this.
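
For illustration, a minimal sketch of that mismatch (ALIGN64 and xorsum are made-up stand-ins for the machine-layer code): the sender checksums the original length, while a receiver that verifies over the RDMA-aligned length also folds in padding bytes, so the two no longer agree.

#include <string.h>
#include <stdint.h>
#include <stdio.h>

#define ALIGN64(x) (((x) + 63) & ~((size_t)63))  /* hypothetical rounding used for the RDMA transfer */

static uint8_t xorsum(const unsigned char *p, size_t n) {
    uint8_t s = 0;
    for (size_t i = 0; i < n; i++) s ^= p[i];
    return s;
}

int main(void) {
    unsigned char buf[128];
    memset(buf, 0, sizeof(buf));
    size_t size = 100;                    /* actual message length */
    buf[40] = 0x5a;                       /* some payload byte */
    uint8_t sent = xorsum(buf, size);     /* sender checksums 'size' bytes */

    size_t rdmaSize = ALIGN64(size);      /* receiver sees the aligned length (128 here) */
    buf[100] = 0x7f;                      /* padding bytes carry whatever the transfer left there */
    uint8_t recv = xorsum(buf, rdmaSize); /* verifying over rdmaSize no longer matches */

    printf("sender=%d receiver=%d (%s)\n", sent, recv,
           sent == recv ? "agree" : "checksum doesn't agree");
    return 0;
}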

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-06-24 14:45:14


I just checked in more changes on the same gerrit issue. It would be good to test whether ChaNGa is fixed by this checkin. My network is too slow; it is almost impossible to log in to the supercomputers (access to all US websites is slow, gmail can only be accessed on my phone, and google does not work).

harshithamenon commented 5 years ago

Original date: 2014-06-24 18:43:28


I ran ChaNGa with your changes and it doesn't throw the checksum error but it still crashes.

Here is a run made on Piz Daint, with Charm++ compiled with the Intel compilers, hugepages, and without production:

Charm++> Running on Gemini (GNI) with 128 processes
Charm++> static SMSG
Charm++> SMSG memory: 632.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 128, 7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.0-rc3-83-g60ac501
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
CharmLB> Load balancing instrumentation for communication is off.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-7
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 128 unique compute nodes (16-way SMP).
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
WARNING: bOverwrite parameter ignored.
ChaNGa version 3.0, commit v3.0-73-ge8a5b4b
Running on 896 processors/ 128 nodes with 65536 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...Oct
Created 65536 pieces of tree
Loading particles ... trying Tipsy ... took 2.119015 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... bumping joinThreshold: 1157, size: 70186
bumping joinThreshold: 1272, size: 70186
bumping joinThreshold: 1399, size: 70186
bumping joinThreshold: 1538, size: 70186
bumping joinThreshold: 1691, size: 70186
bumping joinThreshold: 1860, size: 65566
Sorter: Histograms balanced after 52 iterations.
Using 59755 chares. histogramming 0.0859755 sec ...
[969] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 969 Exiting: Called CmiAbort ------------
Reason:
aborting job:

------------- Processor 154 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

[918] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54. ------------- Processor 918 Exiting: Called CmiAbort ------------ Reason: aborting job:

aborting job: Converse zero handler executed-- was a message corrupted? glibc detected ------------- Processor 427 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?

aborting job: Converse zero handler executed-- was a message corrupted?

[932] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54. ------------- Processor 932 Exiting: Called CmiAbort ------------ Reason: aborting job:

Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
------------- Processor 432 Exiting: Called CmiAbort ------------
Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job:
Registered idx is out of bounds-- is message or memory corrupted?
------------- Processor 433 Exiting: Called CmiAbort ------------
Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job:
Registered idx is out of bounds-- is message or memory corrupted?
------------- Processor 428 Exiting: Called CmiAbort ------------
Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job:
Registered idx is out of bounds-- is message or memory corrupted?
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
40,> called with invalid index 25 (should be less than 0)
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
_pmiu_daemon(SIGCHLD): [NID 03399] [c7-1c2s1n3] [Tue Jun 24 20:36:36 2014] PE RANK 117 exit signal Segmentation fault
[NID 03399] 2014-06-24 20:36:37 Apid 2499099: initiated application termination

pplimport commented 5 years ago

Original author: Yanhua Sun Original date: 2014-06-29 09:26:36


Is this crash reproducible on Edison? If so, can you give me detailed instructions on how to run this system? What is the minimum number of nodes needed to reproduce it?

harshithamenon commented 5 years ago

Original date: 2014-06-30 04:21:57


I am not able to reproduce the crash on Edison. I tried on 96, 192 and 384 cores. I tried with and without production. I also tried gnu and intel compilers.

PhilMiller commented 5 years ago

Original date: 2014-07-02 20:59:28


Does this message from the Edison operators suggest a variable that might have an effect on our observed results?

If your codes hang or run much slower after the maintenance on 6/25, please set the following environment variable in your job script:

setenv UGNI_CDM_MDD_DEDICATED 2    # for csh/tcsh users
export UGNI_CDM_MDD_DEDICATED=2    # for bash shell users

so as to disable the use of the shared Memory Domain Descriptors (MDDs) in your codes. For more details, please refer to our website,

https://www.nersc.gov/users/computational-systems/edison/updates-and-status/open-issues/if-your-codes-hang-or-run-much-slower-after-the-maintenance-on-6-25-14/?stage=Stage

jcphill commented 5 years ago

Original date: 2014-07-04 13:32:40


For NAMD on Edison, Rafael reports: The option "setenv UGNI_CDM_MDD_DEDICATED 2" makes all the jobs that I tested crash in less than 10k steps.

PhilMiller commented 5 years ago

Original date: 2015-02-19 21:15:59


Fixed with commit 48b15267a08acd66fe4e9ba7fc40224517bb5c13 / change http://charm.cs.uiuc.edu/gerrit/#/c/324/