PhilMiller closed this issue 10 years ago.
Original date: 2014-05-05 17:05:08
Someone who has the necessary knowledge and access to an XC30 on an appropriate allocation needs to claim this issue. Possible machines include Eos at ORNL, Edison at NERSC, and Piz Daint itself.
When building, running, and reporting results please document the compiler and build command used. Discussion, confirmations, and conflicting results are utterly meaningless without that.
Original author: Yanhua Sun Original date: 2014-05-07 16:25:06
Can Harshita or Lukasz (from the ChaNGa group), who have Edison accounts, reproduce this bug on Edison?
Original date: 2014-05-07 22:55:20
On Wed, 7 May 2014, Yanhua Sun wrote:
Hi Tom, Phil said you reported a crash on the XC30. Can you tell me how you built Charm++ so I can reproduce the bug?
First the modules I have loaded (note in particular I'm using GCC): 1) modules/3.2.6.7 2) eswrap/1.1.0-1.010400.915.0 3) switch/1.0-1.0501.47124.1.93.ari 4) craype-network-aries 5) craype/2.05 6) craype-sandybridge 7) slurm 8) cray-mpich/6.2.2 9) gcc/4.8.2 10) totalview-support/1.1.4 11) totalview/8.11.0 12) cray-libsci/12.1.3 13) udreg/2.3.2-1.0501.7914.1.13.ari 14) ugni/5.0-1.0501.8253.10.22.ari 15) pmi/5.0.2-1.0000.9906.117.2.ari 16) dmapp/7.0.1-1.0501.8315.8.4.ari 17) gni-headers/3.0-1.0501.8317.12.1.ari 18) xpmem/0.1-2.0501.48424.3.3.ari 19) job/1.5.5-0.1_2.0501.48066.2.43.ari 20) csa/3.0.0-1_2.0501.47112.1.91.ari 21) dvs/2.4_0.9.0-1.0501.1672.2.122.ari 22) alps/5.1.1-2.0501.8507.1.1.ari 23) rca/1.0.0-2.0501.48090.7.46.ari 24) atp/1.7.1 25) PrgEnv-gnu/5.1.29 26) craype-hugepages8M
And the build command itself:
./build ChaNGa gni-crayxc hugepages smp -j4 -O2
For ChaNGa it is: ./configure --enable-bigkeys; make
Original author: Yanhua Sun Original date: 2014-05-07 22:58:44
I do not know whether this is related to building with or without --with-production. NAMD also has crashes in machine-broadcast when --with-production is not used, so I suspect the same thing may be happening in ChaNGa.
Tom, can you tell me how you built Charm++?
Original author: Yanhua Sun Original date: 2014-05-07 23:03:42
Hi Tom
Can you try building Charm++ with --with-production?
./build ChaNGa gni-crayxc hugepages smp -j4 --with-production
Original date: 2014-05-07 23:20:38
Building with:
./build ChaNGa gni-crayxc hugepages smp -j4 --with-production
and then building ChaNGa gives a different error:
Charm++> Running on Gemini (GNI) with 128 processes
Charm++> static SMSG
Charm++> SMSG memory: 632.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 128, 7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.0-rc3-34-gdaac868
CharmLB> Load balancer assumes all CPUs are same.
CharmLB> Load balancing instrumentation for communication is off.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-7
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 128 unique compute nodes (16-way SMP).
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
WARNING: bOverwrite parameter ignored.
ChaNGa version 3.0, commit v3.0-64-g2698882
Running on 896 processors/ 128 nodes with 65536 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...SFC Peano-Hilbert
Created 65536 pieces of tree
Loading particles ... trying Tipsy ... took 2.452627 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations.
------------- Processor 901 Exiting: Called CmiAbort ------------ Reason: Could not malloc()--are we out of memory? (used: 2012.938MB)
------------- Processor 140 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 142 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 143 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 144 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 665 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 145 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 141 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 146 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 1022 Exiting: Called CmiAbort ------------ Reason: GNI_RC_CHECK [1022] registerFromMempool; err=GNI_RC_INVALID_PARAM
------------- Processor 926 Exiting: Called CmiAbort ------------ Reason: Could not malloc()--are we out of memory? (used: 2004.550MB)
------------- Processor 1012 Exiting: Called CmiAbort ------------ Reason: GNI_RC_CHECK [1012] registerFromMempool; err=GNI_RC_INVALID_PARAM
------------- Processor 991 Exiting: Called CmiAbort ------------ Reason: Could not malloc()--are we out of memory? (used: 1996.161MB)
------------- Processor 1021 Exiting: Called CmiAbort ------------ Reason: Could not malloc()--are we out of memory? (used: 2038.104MB)
------------- Processor 854 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
------------- Processor 965 Exiting: Called CmiAbort ------------ Reason: Could not malloc()--are we out of memory? (used: 2021.327MB)
------------- Processor 983 Exiting: Called CmiAbort ------------ Reason: GNI_RC_CHECK [983] registerFromMempool; err=GNI_RC_PERMISSION_ERROR
[901] Stack Traceback:
[901:0] [0x20246125]
[901:1] [0x201aa611]
[901:2] [0x2024ff3f]
[901:3] [0x2024a470]
[901:4] [0x2024b64b]
[901:5] [0x2024d01e]
[901:6] [0x2024d3bd]
[901:7] [0x2024d445]
[901:8] [0x202776f6]
[901:9] [0x203ad2f9]
aborting job: Could not malloc()--are we out of memory? (used: 2012.938MB)
Here are the symbols for that traceback:
trq@daint104:~/scratch_daint> gdb ./ChaNGa.gni
GNU gdb (GDB) SUSE (7.5.1-1.0000.0.3.1)
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /scratch/daint/trq/ChaNGa.gni...done.
(gdb) x/i 0x20246125
0x20246125 <CmiAbortHelper+101>: add $0x8,%rsp
(gdb) x/i 0x201aa611
0x201aa611 <CmiOutOfMemory+97>: add $0xd8,%rsp
(gdb) x/i 0x2024ff3f
0x2024ff3f <CmiAlloc+63>: mov 0x8(%rsp),%rax
(gdb) x/i 0x2024a470
0x2024a470 <PumpNetworkSmsg+656>: mov 0x48(%rsp),%rsi
(gdb) x/i 0x2024b64b
0x2024b64b <LrtsAdvanceCommunication+27>:   callq  0x20248850
Original author: Yanhua Sun Original date: 2014-05-08 01:38:00
It seems it ran out of memory. Maybe you can try the same particle system on more nodes?
Original date: 2014-05-08 03:02:12
I'll give it a go, but note that 2000 MB is less than 10% of the available memory/node on Piz Daint.
Original date: 2014-05-08 15:40:50
I ran it on 256 nodes and got a similar error message:
.
.
.
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations.
------------- Processor 1879 Exiting: Called CmiAbort ------------
Reason: Could not malloc()--are we out of memory? (used: 1761.702MB)
Original date: 2014-05-08 16:49:00
I decided to find out how much memory it was asking for when it failed, so I added a statement that looks like this:
diff --git a/src/conv-core/convcore.c b/src/conv-core/convcore.c
index 59b7bf4..152c058 100644
--- a/src/conv-core/convcore.c
+++ b/src/conv-core/convcore.c
@@ -2876,6 +2876,7 @@ void *CmiAlloc(int size)
res =(char *) malloc_nomigrate(size+sizeof(CmiChunkHeader));
#endif
+ if(res == NULL) CmiError("Failed malloc of %d bytes\n", size);
_MEMCHECK(res);
#ifdef MEMMONITOR
And (running on 128 nodes again) I get:
.
.
.
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations
.
Failed malloc of -1610612736 bytes
Failed malloc of -1073741824 bytes
Failed malloc of -1073741824 bytes
Failed malloc of -2147483648 bytes
Failed malloc of -313925248 bytes
Failed malloc of -1610612736 bytes
------------- Processor 927 Exiting: Called CmiAbort ------------
Reason: Could not malloc() -1 bytes--are we out of memory? (used: 1870.332MB)
------------- Processor 970 Exiting: Called CmiAbort ------------
Reason: Could not malloc() -1 bytes--are we out of memory? (used: 1778.057MB)
Failed malloc of -536870912 bytes
Failed malloc of -2147483648 bytes
Failed malloc of -165821712 bytes
Failed malloc of -247029697 bytes
------------- Processor 925 Exiting: Called CmiAbort ------------
.
.
.
I assume we don't expect malloc() to be happy with asking for negative amounts of memory.
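This is consistent with a 32-bit size overflow: CmiAlloc takes an int (see the diff above), so any request larger than INT_MAX wraps to a negative value before malloc is ever called. A minimal standalone sketch of the effect (not Charm++ code; the 2.5 GiB figure is chosen only to match one of the reported values):

/* illustration only: a >2 GB request truncated to a 32-bit int goes negative */
#include <limits.h>
#include <stdio.h>

int main(void) {
    size_t requested = 2684354560UL;  /* about 2.5 GiB */
    int size = (int)requested;        /* truncated to 32 bits on LP64 systems */
    printf("requested=%zu as int=%d (INT_MAX=%d)\n", requested, size, INT_MAX);
    /* typically prints: requested=2684354560 as int=-1610612736 (INT_MAX=2147483647) */
    return 0;
}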
Original date: 2014-05-12 16:20:49
It is far-fetched, but may be worth trying: yesterday I found a bug in the typedef for INT8 on GNI-based machines. Phil implemented a generic fix, which was just merged. Can Harshitha or Tom give it a shot with current master?
Original date: 2014-05-12 17:56:03
Not good: it doesn't compile:
../bin/charmc -O2 -I. -c -o RandCentLB.o RandCentLB.C
In file included from lbdb.h:9:0,
                 from LBDatabase.h:9,
                 from BaseLB.h:9,
                 from CentralLB.h:9,
                 from DummyLB.h:9,
                 from DummyLB.C:6:
converse.h:617:22: error: conflicting declaration 'typedef CmiUInt8 CmiIntPtr'
 typedef CmiUInt8 CmiIntPtr;
                      ^
git show:
commit e860e5b24cf003c9703578b726e5ca9beec8b5d0
Author: Phil Miller <mille121@illinois.edu>
Date:   Sun May 11 22:53:19 2014 -0500
cc --version: gcc (GCC) 4.8.2 20131016 (Cray Inc.)
cat /etc/SUSE-release:
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 2
Original date: 2014-05-12 18:33:05
Sorry about that. SMP compilation was not tested; Phil checked in a fix. Please try again.
Original date: 2014-05-14 16:35:13
I tested the dwf1.6144 dataset against the master branch of Charm++ and it still crashes.
Original author: Yanhua Sun Original date: 2014-05-14 16:47:54
Currently I have no account on Edison to debug this problem, and the turnaround time on Eos is pretty long.
Original author: Yanhua Sun Original date: 2014-05-20 05:15:07
I have run -O3 non-SMP ApoA1 NAMD with cmimemcpy_qpx and with plain memcpy on 1 core and on 16 cores. There is almost no difference in performance, so we can probably just use memcpy.
Original author: Yanhua Sun Original date: 2014-05-28 17:51:15
What is the status of this problem? It currently seems to be the only issue preventing a release.
Original date: 2014-05-28 18:00:09
Here is a clip of the latest email exchange. Bottom line: it works on EOS, but still fails on Piz Daint:
Looks like Piz Daint has a newer software stack than Eos. Here is what I have on Eos:
1) modules/3.2.6.7 2) eswrap/1.0.20-1.010200.643.0 3) switch/1.0-1.0500.41328.1.120.ari 4) craype-network-aries 5) cray-mpich/6.0.2 6) atp/1.6.3 7) rca/1.0.0-2.0500.41336.1.120.ari 8) dvs/2.3_0.9.0-1.0500.1522.1.180 9) csa/3.0.0-1_2.0500.41366.1.129.ari 10) job/1.5.5-0.1_2.0500.41368.1.92.ari 11) xpmem/0.1-2.0500.41356.1.11.ari 12) gni-headers/3.0-1.0500.7161.11.4.ari 13) dmapp/6.0.1-1.0500.7263.9.31.ari 14) pmi/4.0.1-1.0000.9725.84.2.ari 15) ugni/5.0-1.0500.0.3.306.ari 16) udreg/2.3.2-1.0500.6756.2.10.ari 17) cray-libsci/12.1.01 18) gcc/4.8.1 19) craype/1.06 20) craype-sandybridge 21) altd/1.0 22) lustredu/1.3 23) DefApps 24) PrgEnv-gnu/5.0.41 25) git/1.8.3.4 26) cray-hdf5/1.8.11 27) netcdf/4.3.0 28) fftw/3.3.0.4 29) craype-hugepages8M 30) stat/2.0.0.1
Gengbin
On 5/27/2014 1:18 PM, Tom Quinn wrote:
Here is what I have on Piz Daint:
module list Currently Loaded Modulefiles: 1) modules/3.2.6.7 2) eswrap/1.1.0-1.010400.915.0 3) switch/1.0-1.0501.47124.1.93.ari 4) craype-network-aries 5) craype/2.05 6) craype-sandybridge 7) slurm 8) cray-mpich/6.2.2 9) gcc/4.8.2 10) totalview-support/1.1.4 11) totalview/8.11.0 12) cray-libsci/12.1.3 13) udreg/2.3.2-1.0501.7914.1.13.ari 14) ugni/5.0-1.0501.8253.10.22.ari 15) pmi/5.0.2-1.0000.9906.117.2.ari 16) dmapp/7.0.1-1.0501.8315.8.4.ari 17) gni-headers/3.0-1.0501.8317.12.1.ari 18) xpmem/0.1-2.0501.48424.3.3.ari 19) job/1.5.5-0.1_2.0501.48066.2.43.ari 20) csa/3.0.0-1_2.0501.47112.1.91.ari 21) dvs/2.4_0.9.0-1.0501.1672.2.122.ari 22) alps/5.1.1-2.0501.8507.1.1.ari 23) rca/1.0.0-2.0501.48090.7.46.ari 24) atp/1.7.1 25) PrgEnv-gnu/5.1.29 26) craype-hugepages8M
Tom Quinn Astronomy, University of Washington Internet: trq@astro.washington.edu Phone: 206-685-9009
On Tue, 27 May 2014, Gengbin Zheng wrote:
Maybe we should look at and compare the module versions on both machines (Piz Daint and EOS). I am not sure how much time I have this week to look into this problem and reproduce the bug before I go to China next week (for a month). I also have some SC papers to review, due this Saturday.
Gengbin

On Tue, May 27, 2014 at 10:33 AM, Tom Quinn <trq@astro.washington.edu> wrote:
I've tried to use all the same command line arguments, and I still get:
...
Loading particles ... trying Tipsy ... took 2.490938 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ... Sorter: Histograms balanced after 25 iterations.
[1023] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 1023 Exiting: Called CmiAbort ------------
One thing that is different: Piz Daint only has 8 physical cores/node, although hyperthreading is enabled. Does EOS have 16 physical cores, or are you using hyperthreading?
Tom Quinn Astronomy, University of Washington Internet: trq@astro.washington.edu Phone: 206-685-9009
On Fri, 23 May 2014, Gengbin Zheng wrote:
I just ran this benchmark on 128 nodes of eos, it works for me. My command line:
aprun -n 256 -N 2 ./ChaNGa +stacksize 2000000 -D 3 -p 65536 -wall 30 -killat 2000 T_S.param ++ppn 15 +commap 0,16 +pemap 1-15,17-31 +setcpuaffinity +balancer MultistepLB_notopo +LBPeriod 0.0 +LBDebug 1 +noAnytimeMigration +LBCommOff
Gengbin
Original date: 2014-06-09 02:55:33
Turning on checksum_flag in machine.c produces a 'checksum doesn't agree' error on Eos and Piz Daint, even with smaller numbers of cores and smaller datasets.
On Piz Daint
Charm++ was built the following way, with the Intel compiler and checksum_flag set:
./build ChaNGa gni-crayxc hugepages smp -j8
ChaNGa was run on the cube300 dataset on 8 nodes in SMP mode:
./ChaNGa +stacksize 2000000 -D 1 -wall 5 cube300.param ++ppn 7 +commap 0 +pemap 1-7 +setcpuaffinity
Error thrown:
Checksum doesn't agree!
The checksum error happens in machine.c in the gni-crayxc layer, in the PumpLocalTransactions function at line 2932, where the checksum is checked.
I tried other examples such as hello and stencil3d; stencil3d throws the checksum error after a few load balancing steps:
./stencil3d 256 32 ++ppn 7 +commap 0 +pemap 1-7 +setcpuaffinity +balancer GreedyLB +LBDebug 1
Original date: 2014-06-12 22:42:04
Could there be a relation between this and bug #401? I see bits about negative message sizes and corruption in both reports.
Original date: 2014-06-18 19:12:04
The checksum crash happens for other programs, such as stencil3d, so it does not seem to be specific to ChaNGa.
Original date: 2014-06-22 21:25:13
I've reproduced this with a slightly simpler setting on Eos:
Currently Loaded Modulefiles:
1) modules/3.2.6.7
2) eswrap/1.0.20-1.010200.643.0
3) switch/1.0-1.0500.41328.1.120.ari
4) craype-network-aries
5) cray-mpich/6.0.2
6) netcdf/4.3.0
7) atp/1.6.3
8) rca/1.0.0-2.0500.41336.1.120.ari
9) dvs/2.3_0.9.0-1.0500.1522.1.180
10) csa/3.0.0-1_2.0500.41366.1.129.ari
11) job/1.5.5-0.1_2.0500.41368.1.92.ari
12) xpmem/0.1-2.0500.41356.1.11.ari
13) gni-headers/3.0-1.0500.7161.11.4.ari
14) dmapp/6.0.1-1.0500.7263.9.31.ari
15) pmi/4.0.1-1.0000.9725.84.2.ari
16) ugni/5.0-1.0500.0.3.306.ari
17) udreg/2.3.2-1.0500.6756.2.10.ari
18) cray-libsci/12.1.01
19) gcc/4.8.1
20) craype/1.06
21) craype-sandybridge
22) altd/1.0
23) lustredu/1.4
24) DefApps
25) git/1.8.3.4
26) subversion/1.8.3
27) hdf5/1.8.11
28) cray-parallel-netcdf/1.3.1.1
29) craype-hugepages8M
30) PrgEnv-gnu/5.0.41
Note this is with GCC and not Intel compilers.
The Charm++ build is gni-crayxc-hugepages (non-SMP), built as
./build charm++ gni-crayxc hugepages -j12 -g
with gni/machine.c modified to set checksum_flag = 1. The command run was:
aprun -n 8 -N 1 -d 1 ./stencil3d.mono 256 32 +balancer RotateLB +LBDebug 1
The checksum disagreements always seem to happen at load balancing time. I'll hypothesize that the higher message traffic at those points is part of the cause.
Original date: 2014-06-22 21:29:28
Simplifying even further, I can reproduce this on just two nodes, one PE on each, with smaller blocks, at the first LB:
> aprun -n 2 -N 1 -d 1 ./stencil3d.mono 128 16 +balancer RotateLB +LBDebug 3
Charm++> Running on Gemini (GNI) with 2 processes
Charm++> static SMSG
Charm++> SMSG memory: 9.9KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> Cray TLB page size: 8192K
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.6.0-rc3-81-g019db03
CharmLB> Verbose level 3, load balancing period: 0.5 seconds
CharmLB> Topology torus_nd_5 alpha: 3.500000e-05s beta: 8.500000e-09s.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 unique compute nodes (32-way SMP).
[0] RotateLB created
STENCIL COMPUTATION WITH BARRIERS
Running Stencil on 2 processors with (8, 8, 8) chares
Array Dimensions: 128 128 128
Block Dimensions: 16 16 16
[1] Time per iteration: 4.491426 4.561163
[2] Time per iteration: 6.003448 10.564633
[3] Time per iteration: 5.954097 16.518754
[4] Time per iteration: 5.948594 22.467371
[5] Time per iteration: 5.954061 28.421455
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!
[0] Stack Traceback:
[0:0] [0x2014e2ae]
[0:1] [0x2014e2f3]
[0:2] [0x201512e3]
[0:3] [0x20151e55]
[0:4] [0x2014e0ec]
[0:5] [0x2014e400]
[0:6] [0x20159648]
aborting job:
Fatal error: checksum doesn't agree!
[0:7] [0x2015a333]
[0:8] [0x20155894]
[0:9] [0x20155bad]
[0:10] [0x20155ab8]
[0:11] [0x2014e0b0]
[0:12] [0x2014dfcd]
[0:13] [0x2003e19a]
[0:14] [0x20257491]
[0:15] [0x20000629]
[NID 00564] 2014-06-22 17:28:00 Apid 554317: initiated application termination
Application 554317 exit codes: 255
Application 554317 exit signals: Killed
Application 554317 resources: utime ~30s, stime ~0s, Rss ~3524, inblocks ~34601, outblocks ~93155
Original author: Yanhua Sun Original date: 2014-06-23 10:56:22
Actually, it might be that the checksum code itself does not work. The one in git is definitely wrong, since it always checks whether the checksum is zero or not. However, even after I fixed it, it still fails, even on Hopper (Cray XE6). I am digging into the problem now.
Original date: 2014-06-23 14:07:59
The checksum code and its check in git seem correct. The checksum of the incoming message is compared to zero because it is the XOR of a freshly computed checksum of the message with the checksum the sender stored in the message, which is zero if nothing went wrong.
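To make the described check concrete, here is a minimal sketch of an XOR-style checksum whose verification reduces to a comparison with zero. The function and parameter names are hypothetical, not the ones in the gni machine layer:

#include <stdint.h>
#include <stddef.h>

/* XOR all bytes of the message (with the checksum field itself zeroed out). */
static uint8_t compute_checksum(const uint8_t *msg, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum ^= msg[i];
    return sum;
}

/* Receiver: recompute over the received bytes and XOR with the value the
 * sender stored in the header; an intact message yields exactly zero. */
static int checksum_ok(const uint8_t *msg, size_t len, uint8_t stored) {
    return (compute_checksum(msg, len) ^ stored) == 0;
}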
Original date: 2014-06-23 22:14:40
Reproduced on Edison for NAMD with Intel compilers.
1) modules/3.2.6.7 11) pmi/5.0.3-1.0000.9981.128.2.ari 21) PrgEnv-intel/5.1.29
2) nsg/1.2.0 12) dmapp/7.0.1-1.0501.8315.8.4.ari 22) craype-ivybridge
3) eswrap/1.1.0-1.010400.915.0 13) gni-headers/3.0-1.0501.8317.12.1.ari 23) cray-shmem/6.3.1
4) switch/1.0-1.0501.47124.1.93.ari 14) xpmem/0.1-2.0501.48424.3.3.ari 24) cray-mpich/6.3.1
5) craype-network-aries 15) job/1.5.5-0.1_2.0501.48066.2.43.ari 25) torque/4.2.7
6) craype/2.1.1 16) csa/3.0.0-1_2.0501.47112.1.91.ari 26) moab/7.2.7-e7c070d1-b3-SUSE11
7) intel/14.0.2.144 17) dvs/2.4_0.9.0-1.0501.1672.2.122.ari 27) altd/1.0
8) cray-libsci/12.2.0 18) alps/5.1.1-2.0501.8471.1.1.ari 28) usg-default-modules/1.0
9) udreg/2.3.2-1.0501.7914.1.13.ari 19) rca/1.0.0-2.0501.48090.7.46.ari 29) craype-hugepages8M
10) ugni/5.0-1.0501.8253.10.22.ari 20) atp/1.7.2
aprun -n 24 -d 2 ./namd2 /global/homes/r/ronak/data/jac/jac.namd
Charm++> Running on Gemini (GNI) with 24 processes
Charm++> static SMSG
Charm++> SMSG memory: 118.5KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 24, 1 worker threads per process
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.6.0-rc3-82-gfa9e047
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 2 unique compute nodes (48-way SMP).
[ eliding NAMD startup output ]
Info:
Info: Entering startup at 0.0792558 s, 1164.36 MB of memory in use
Info: Startup phase 0 took 0.000591237 s, 1164.36 MB of memory in use
------------- Processor 25 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!
aborting job:
Fatal error: checksum doesn't agree!
------------- Processor 36 Exiting: Called CmiAbort ------------
Reason: Fatal error: checksum doesn't agree!
aborting job:
Fatal error: checksum doesn't agree!
[NID 00598] 2014-06-23 15:11:30 Apid 5821863: initiated application termination
Application 5821863 exit codes: 255
Application 5821863 exit signals: Killed
Application 5821863 resources: utime ~1s, stime ~3s, Rss ~11300, inblocks ~36580, outblocks ~95358
Original author: Yanhua Sun Original date: 2014-06-24 03:12:10
I checked the code: after the checksum is set, some field (a seqID) is modified. Therefore it is almost certain not to agree. Since we had never used the checksum for testing before, it was not carefully examined. I am fixing this.
Original author: Yanhua Sun Original date: 2014-06-24 11:31:02
I found the problem with the checksum. On the sender side, we compute a checksum over the message. On the receiver side, for an RDMA transaction, the message size is aligned first, so the size becomes different and the checksum no longer matches. I fixed this problem and checked the fix into Gerrit. I tested it on stencil3d and it works. Please help check the other applications (ChaNGa, NAMD) and see whether the checksum problem still exists. http://charm.cs.uiuc.edu/gerrit/298
This also gives me a possible hint for the ChaNGa error. After alignment, the RDMA transfer size is different from the original message size, which might lead to the RDMA transaction errors. I am thinking about how to fix this.
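A minimal sketch of the mismatch described above, with hypothetical names and an assumed 64-byte alignment unit (the actual alignment in the gni layer may differ): the sender checksums the true message length, while the receiver, before the fix, effectively worked with the rounded-up RDMA transfer length.

#include <stdint.h>
#include <stddef.h>

#define RDMA_ALIGN 64  /* assumed alignment unit, for illustration only */
#define ALIGN_UP(x) (((x) + RDMA_ALIGN - 1) & ~(size_t)(RDMA_ALIGN - 1))

static uint8_t xor_checksum(const uint8_t *buf, size_t len) {
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) sum ^= buf[i];
    return sum;
}

/* Sender: checksum over the real message size. */
uint8_t sender_checksum(const uint8_t *msg, size_t size) {
    return xor_checksum(msg, size);
}

/* Receiver before the fix: the transfer length was rounded up for RDMA, so
 * the recomputed checksum also covered trailing padding bytes and no longer
 * matched the sender's value even for an intact message. */
uint8_t receiver_checksum_before_fix(const uint8_t *msg, size_t size) {
    return xor_checksum(msg, ALIGN_UP(size));
}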
Original author: Yanhua Sun Original date: 2014-06-24 14:45:14
I just checked in more changes on the same Gerrit change. It would be good to test whether ChaNGa is fixed by this check-in. My network is too slow; it is almost impossible to log in to the supercomputers (access to all US websites is slow, Gmail can only be accessed on my phone, and Google does not work).
Original date: 2014-06-24 18:43:28
I ran ChaNGa with your changes; it no longer throws the checksum error, but it still crashes.
Here is a run made on Piz Daint, with Charm++ compiled with the Intel compilers and hugepages, without production:
Charm++> Running on Gemini (GNI) with 128 processes
Charm++> static SMSG
Charm++> SMSG memory: 632.0KB
Charm++> memory pool init block size: 8MB, total memory pool limit 0MB (0 means no limit)
Charm++> memory pool registered memory limit: 200000MB, send limit: 100000MB
Charm++> only comm thread send/recv messages
Charm++> Cray TLB page size: 8192K
Charm++> Running in SMP mode: numNodes 128, 7 worker threads per process
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.6.0-rc3-83-g60ac501
CharmLB> Verbose level 1, load balancing period: 0.5 seconds
CharmLB> Load balancer assumes all CPUs are same.
CharmLB> Load balancing instrumentation for communication is off.
Charm++> cpu affinity enabled.
Charm++> cpuaffinity PE-core map : 1-7
Charm++> set comm 0 on node 0 to core #0
Charm++> Running on 128 unique compute nodes (16-way SMP).
[0] MultistepLB_notopo created
WARNING: bStandard parameter ignored; Output is always standard.
WARNING: bOverwrite parameter ignored.
ChaNGa version 3.0, commit v3.0-73-ge8a5b4b
Running on 896 processors/ 128 nodes with 65536 TreePieces
yieldPeriod set to 5
Prefetching...ON
Number of chunks for remote tree walk set to 1
Chunk Randomization...ON
cache 1
cacheLineDepth 4
Verbosity level 1
Domain decomposition...Oct
Created 65536 pieces of tree
Loading particles ... trying Tipsy ... took 2.119015 seconds.
N: 75870839
Input file, Time:0.100882 Redshift:1.647904 Expansion factor:0.377657
Simulation to Time:0.343669 Redshift:0.000000 Expansion factor:1.000000
WARNING: Could not open redshift input file: Eris_Core.red
Initial domain decomposition ...
bumping joinThreshold: 1157, size: 70186
bumping joinThreshold: 1272, size: 70186
bumping joinThreshold: 1399, size: 70186
bumping joinThreshold: 1538, size: 70186
bumping joinThreshold: 1691, size: 70186
bumping joinThreshold: 1860, size: 65566
Sorter: Histograms balanced after 52 iterations. Using 59755 chares.
histogramming 0.0859755 sec ...
[969] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 969 Exiting: Called CmiAbort ------------ Reason:
aborting job:
------------- Processor 154 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
[918] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 918 Exiting: Called CmiAbort ------------ Reason:
aborting job:
aborting job: Converse zero handler executed-- was a message corrupted?
glibc detected
------------- Processor 427 Exiting: Called CmiAbort ------------ Reason: Converse zero handler executed-- was a message corrupted?
aborting job: Converse zero handler executed-- was a message corrupted?
[932] Assertion "((CmiMsgHeaderBasic *)msg)->rank==0" failed in file machine-broadcast.c line 54.
------------- Processor 932 Exiting: Called CmiAbort ------------ Reason:
aborting job:
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
------------- Processor 432 Exiting: Called CmiAbort ------------ Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job: Registered idx is out of bounds-- is message or memory corrupted?
------------- Processor 433 Exiting: Called CmiAbort ------------ Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job: Registered idx is out of bounds-- is message or memory corrupted?
------------- Processor 428 Exiting: Called CmiAbort ------------ Reason: Registered idx is out of bounds-- is message or memory corrupted?
aborting job: Registered idx is out of bounds-- is message or memory corrupted?
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
register.h> CkRegisteredInfo<40,> called with invalid index 25 (should be less than 0)
40,> called with invalid index 25 (should be less than 0)
Warning: GNI_PostRdma: ioctl(GNI_IOC_POST_RDMA) returned error - Invalid argument at line 157 in file rdma_transfer.c
_pmiu_daemon(SIGCHLD): [NID 03399] [c7-1c2s1n3] [Tue Jun 24 20:36:36 2014] PE RANK 117 exit signal Segmentation fault
[NID 03399] 2014-06-24 20:36:37 Apid 2499099: initiated application termination
Original author: Yanhua Sun Original date: 2014-06-29 09:26:36
Is this crash reproducible on Edison? If so, can you give me detailed instructions on how to run this system? What is the minimum number of nodes needed to reproduce it?
Original date: 2014-06-30 04:21:57
I am not able to reproduce the crash on Edison. I tried on 96, 192, and 384 cores, with and without production, and with both the GNU and Intel compilers.
Original date: 2014-07-02 20:59:28
Does this message from the Edison operators suggest a variable that might have an effect on our observed results?
If your codes hang or run much slower after the maintenance on 6/25, please set the following environment variable in your job script:
setenv UGNI_CDM_MDD_DEDICATED 2   #for csh/tcsh users
export UGNI_CDM_MDD_DEDICATED=2   #for bash shell users
so as to disable the use of the shared Memory Domain Descriptors (MDDs) in your codes. For more details, please refer to our website.
Original date: 2014-07-04 13:32:40
For NAMD on Edison, Rafael reports: The option "setenv UGNI_CDM_MDD_DEDICATED 2" makes all the jobs that I tested crash in less than 10k steps.
Original date: 2015-02-19 21:15:59
Fixed with commit 48b15267a08acd66fe4e9ba7fc40224517bb5c13 / change http://charm.cs.uiuc.edu/gerrit/#/c/324/
Original issue: https://charm.cs.illinois.edu/redmine/issues/486
Per Tom, running on the Swiss machine Piz Daint: