SCOREC / EnGPar

dynamic load balancing
http://scorec.github.io/EnGPar/
BSD 3-Clause "New" or "Revised" License

failure running splitAndBalanceMesh on large mixed mesh #35

Open cwsmith opened 3 years ago

cwsmith commented 3 years ago

After splitting, during hypergraph creation (mesh elements become graph nodes, mesh vertices become hyperedges), the pin connection step aborts with a bad allocation here:

https://github.com/SCOREC/EnGPar/blob/68853710c2dc936006d0eb9446434724f8a21ce6/interfaces/apfGraph.cpp#L293

The following print statement was added before that line to help determine the cause:

fprintf(stderr, "%d nlp %d nle %d\n", PCU_Comm_Self(), nlp, nle);

The output from a run on Pleiades is pasted below. Several ranks print large negative values for nlp, including these two:

57 nlp -1146837624 nle 51071
56 nlp -841074879 nle 52681
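
A guard along the following lines could turn the bad allocation into a message that names the offending rank and value. `std::bad_array_new_length` is what a `new[]` expression throws when the requested length is negative or too large, so checking the count first gives a clearer failure. This is only a sketch: `checkedLength` and its `upper` bound are hypothetical names and are not part of EnGPar.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an EnGPar function): validate a per-rank count
// before it is used as an array length.
//   rank  - MPI rank, used only in the error message
//   count - value about to be passed to new[] (for example nlp)
//   upper - largest value the caller considers plausible for one part
inline std::size_t checkedLength(int rank, std::int64_t count,
                                 std::int64_t upper) {
  assert(upper >= 0);  // a negative bound would reject every count
  if (count < 0 || count > upper) {
    throw std::range_error("rank " + std::to_string(rank) +
                           ": implausible array length " +
                           std::to_string(count));
  }
  return static_cast<std::size_t>(count);
}
```

With a bound of a few million, `checkedLength(57, -1146837624, bound)` throws with the rank and value in the message, while a count such as 1319316 passes through unchanged.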

input mesh and model

/projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN

stdout/err

> mpiexec -n 120 ~/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2 
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 3.733180 seconds
number of tet 18685187 hex 0 prism 9682240 pyramid 0
mesh entity counts: v 8009471 e 41503118 f 61787347 r 28367427
planned Zoltan split factor 2 to target imbalance 1.100000 in 2.935652 seconds
mesh expanded from 60 to 120 parts in 1.240766 seconds
mesh migrated from 60 to 120 in 11.678687 seconds
PARMA_STATUS disconnected <max avg> 0 0.000
PARMA_STATUS neighbors <max avg> 19 9.983
PARMA_STATUS smallest side of max neighbor part 29
PARMA_STATUS num parts with max neighbors 1
PARMA_STATUS empty parts 0
PARMA_STATUS small neighbor counts 1:0 2:2 3:0 4:0 5:0 6:2 7:0 8:0 9:2 10:0 
PARMA_STATUS weighted vtx <tot max min avg> 8682895.0 112373.0 34920.0 72357.458
PARMA_STATUS weighted edge <tot max min avg> 43045935.0 435313.0 229585.0 358716.125
PARMA_STATUS weighted face <tot max min avg> 62730587.0 589500.0 355227.0 522754.892
PARMA_STATUS weighted rgn <tot max min avg> 28367427.0 290030.0 142546.0 236395.225
PARMA_STATUS owned bdry vtx <tot max min avg> 576843 12960 0 4807.025
PARMA_STATUS shared bdry vtx <tot max min avg> 1176539 16528 1510 9804.492

PARMA_STATUS model bdry vtx <tot max min avg> 266000 10172 0 2216.667
PARMA_STATUS sharedSidesToElements <max min avg> 0.111 0.010 0.068
PARMA_STATUS entity imbalance <v e f r>: 1.55 1.21 1.13 1.23
99 nlp 79993306 nle 39737
59 nlp -267509587 nle 50000
terminate called after throwing an instance of ‘std::bad_array_new_length’
 what(): std::bad_array_new_length
108 nlp 952788 nle 41193
57 nlp -1146837624 nle 51071
112 nlp 1046825 nle 45320
terminate called after throwing an instance of ‘std::bad_array_new_length’
 what(): std::bad_array_new_length
51 nlp 927740 nle 39326
102 nlp 1192183 nle 51189
109 nlp 838136 nle 34920
33 nlp 977030 nle 41241
92 nlp 1045943 nle 44215
93 nlp 1111342 nle 47043
118 nlp 1160873 nle 49411
113 nlp 1070532 nle 45309
117 nlp 1082932 nle 45770
MPT ERROR: Rank 59(g:59) received signal SIGABRT/SIGIOT(6).
    Process ID: 91212, Host: r575i6n4, Program: /home5/kjansen/buildengpar/test/splitAndBalanceMesh
    MPT Version: HPE MPT 2.17 11/30/17 08:08:29
MPT: --------stack traceback-------
36 nlp 1061060 nle 44968
37 nlp 1110348 nle 46773
104 nlp 997048 nle 42346
105 nlp 1089882 nle 46158
98 nlp 1240290 nle 53230
34 nlp 1147389 nle 48452
90 nlp 1083250 nle 45724
54 nlp 1159270 nle 48648
107 nlp 1133343 nle 47904
101 nlp 1915291810 nle 49541
119 nlp 1222668 nle 51540
100 nlp 1231264 nle 52035
27 nlp 1228810 nle 52584
43 nlp 1174608 nle 49490
52 nlp 1156365 nle 48880
103 nlp 1221699 nle 51715
91 nlp 1162613 nle 100807
47 nlp 1180836 nle 49838
39 nlp 1229520 nle 51761
55 nlp 1359121 nle 52562
106 nlp 1097030 nle 92104
96 nlp 1047691 nle 71326
58 nlp -1569407485 nle 52478
45 nlp 1156956 nle 48946
53 nlp 1238186 nle 52231
110 nlp 1140810 nle 48316
115 nlp 1232744 nle 52366
terminate called after throwing an instance of ‘std::bad_array_new_length’
46 nlp 1206098 nle 50625
48 nlp 1204564 nle 50627
32 nlp 1150728 nle 48542
83 nlp 990491 nle 50397
111 nlp 1124040 nle 46817
 what(): std::bad_array_new_length
42 nlp 1238392 nle 52403
31 nlp 1247900 nle 52705
50 nlp 1231181 nle 51662
38 nlp 1261028 nle 52832
56 nlp -841074879 nle 52681
44 nlp 1262759 nle 53363
30 nlp 1254895 nle 52612
35 nlp 1267186 nle 53494
terminate called after throwing an instance of ‘std::bad_array_new_length’
41 nlp 1229760 nle 51681
64 nlp 1256866 nle 75894
28 nlp 1253121 nle 52910
 what(): std::bad_array_new_length
114 nlp 1265824 nle 53089
24 nlp 1217616 nle 50886
49 nlp 1260068 nle 52762
82 nlp 890976 nle 68804
29 nlp 1265666 nle 53055
17 nlp 1211961 nle 62727
66 nlp 1097870 nle 93892
78 nlp 923136 nle 77637
116 nlp 1266674 nle 53357
40 nlp 1212876 nle 50810
80 nlp 971814 nle 82007
84 nlp 1239423 nle 105484
15 nlp 1290996 nle 95015
26 nlp 1268241 nle 52986
75 nlp 1067598 nle 73259
77 nlp 1116858 nle 79411
73 nlp 1038174 nle 87581
69 nlp 1223016 nle 104134
61 nlp 1242092 nle 81230
97 nlp 1175911 nle 74222
6 nlp 1276641 nle 108635
60 nlp 1260873 nle 108003
95 nlp 1265598 nle 108429
8 nlp 1203720 nle 102109
22 nlp 1282856 nle 70829
86 nlp 1124802 nle 94993
88 nlp 1283457 nle 90131
21 nlp 1297034 nle 91652
25 nlp 1300891 nle 74589
70 nlp 1161988 nle 97269
63 nlp 1293464 nle 84892
89 nlp 1295568 nle 110380
3 nlp 1227291 nle 104297
18 nlp 1307321 nle 96492
74 nlp 1299909 nle 110652
13 nlp 1263841 nle 105356
68 nlp 1268243 nle 82514
4 nlp 1208871 nle 103201
81 nlp 1314015 nle 102439
19 nlp 1289680 nle 89270
72 nlp 1243826 nle 81195
71 nlp 1315848 nle 110669
85 nlp 1256709 nle 106822
79 nlp 1252092 nle 106791
65 nlp 1251204 nle 106011
76 nlp 1221766 nle 95822
94 nlp 1321065 nle 111892
20 nlp 1310938 nle 92005
23 nlp 1310037 nle 88603
9 nlp 1261650 nle 106179
1 nlp 1309599 nle 111764
2 nlp 1317027 nle 112373
10 nlp 1307088 nle 110946
67 nlp 1228118 nle 94014
62 nlp 1227882 nle 88072
14 nlp 1276752 nle 103633
12 nlp 1210623 nle 95865
16 nlp 1331516 nle 96747
87 nlp 1343556 nle 111219
7 nlp 1304574 nle 109850
11 nlp 1275648 nle 106917
5 nlp 1325682 nle 111950
0 nlp 1319316 nle 107363

diamog commented 3 years ago

My first guess is that the problem is an integer overflow. Along with the negative values, this line is what troubles me: 101 nlp 1915291810 nle 49541

With the 55% vertex imbalance, I am assuming there are a few parts with a lot of vertices and those vertices are bounding a lot of elements (likely made worse by the mixed elements).

You can try running with the local id type as a 64-bit int and see if that fixes the problem.
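
To illustrate what a 32-bit overflow would look like, the sketch below keeps only the low 32 bits of a 64-bit count and reinterprets them as a signed value. The input 3148129672 is chosen purely because it wraps to the number rank 57 printed; nothing in this thread establishes that the real count had that value. `wrapTo32` and `fitsIn32` are hypothetical helper names.

```cpp
#include <cstdint>
#include <limits>

// Keep the low 32 bits of a 64-bit count and map [2^31, 2^32) onto the
// negative range, which is what a two's-complement 32-bit int would hold.
constexpr std::int32_t wrapTo32(std::int64_t count) {
  return (static_cast<std::uint32_t>(count) & 0x80000000u)
             ? static_cast<std::int32_t>(
                   static_cast<std::int64_t>(
                       static_cast<std::uint32_t>(count)) -
                   4294967296LL)
             : static_cast<std::int32_t>(static_cast<std::uint32_t>(count));
}

// True when the count can be stored in a 32-bit signed int without wrapping.
constexpr bool fitsIn32(std::int64_t count) {
  return count >= std::numeric_limits<std::int32_t>::min() &&
         count <= std::numeric_limits<std::int32_t>::max();
}
```

Under this assumption, `wrapTo32(3148129672)` gives -1146837624 and `fitsIn32(3148129672)` is false, whereas an ordinary count such as 1319316 is returned unchanged.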

KennethEJansen commented 3 years ago

I would expect this on the target case (278M verts), but this case is only 8M verts, so it is surprising. I will see if I can figure out how to rebuild the code with 64-bit ints.
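
A rough count backs up that surprise. The sketch below assumes a pin is one element-to-vertex adjacency (as in the hypergraph description at the top of the thread) and ignores any duplication across part boundaries. `MeshCounts` and `totalPins` are hypothetical names; the element counts come from the log above.

```cpp
#include <cstdint>

// Element counts as reported by the "number of tet ... hex ... prism ...
// pyramid" line of the log.
struct MeshCounts {
  std::int64_t tets;
  std::int64_t hexes;
  std::int64_t prisms;
  std::int64_t pyramids;
};

// Total element-to-vertex adjacencies: each element contributes one pin per
// vertex (tet 4, hex 8, prism 6, pyramid 5).
constexpr std::int64_t totalPins(const MeshCounts& m) {
  return 4 * m.tets + 8 * m.hexes + 6 * m.prisms + 5 * m.pyramids;
}
```

For this mesh, `totalPins(MeshCounts{18685187, 0, 9682240, 0})` is 132834188, or about 1.1 million per part over 120 parts. That matches the nlp values printed by the well-behaved ranks, and the whole-mesh total is still about 16 times smaller than the 32-bit limit of 2147483647.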

KennethEJansen commented 3 years ago

-DMDS_SET_MAX=1024 \
-DMDS_ID_TYPE=long \

I took these from chef's build configuration. Is either of them what I am looking for?

KennethEJansen commented 3 years ago

Hacking around I think I found the right flag:

kjansen@pfe25:~/buildengpar> more ../engpar/doconfig.sh 
#!/bin/bash
[ $# -ne 2 ] && echo "Usage: $0 <absolute path to source> <scorec core install>" && exit 1
src=$1
[ ! -e $src ] && echo "source dir $src does not exist!" && exit 1
core=$2
[ ! -e $core ] && echo "SCOREC core dir $core does not exist!" && exit 1
cmake $src \
    -DCMAKE_C_COMPILER="mpicc" \
    -DCMAKE_C_FLAGS="-g" \
    -DCMAKE_CXX_COMPILER="mpicxx" \
    -DCMAKE_CXX_FLAGS="-g -std=c++11 -Wl,--no-as-needed -ldl -pthread" \
    -DENABLE_ZOLTAN=OFF \
    -DENABLE_PARMETIS=ON \
    -DSCOREC_PREFIX=${core} \
    -DLONG_LOCAL_INDICES=ON \
    -DENABLE_PUMI=ON \
    -DIS_TESTING=OFF \
    -DCMAKE_INSTALL_PREFIX=$PWD/install
kjansen@pfe25:~/buildengpar> ../engpar/doconfig.sh ~/engpar/ /home5/kjansen/SCOREC-core/buildMT_2/install/
-- The CXX compiler identification is GNU 6.2.0
-- The C compiler identification is GNU 6.2.0
-- Check for working CXX compiler: /nasa/hpe/mpt/2.17r13/bin/mpicxx
-- Check for working CXX compiler: /nasa/hpe/mpt/2.17r13/bin/mpicxx -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /nasa/hpe/mpt/2.17r13/bin/mpicc
-- Check for working C compiler: /nasa/hpe/mpt/2.17r13/bin/mpicc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- CMAKE_VERSION: 3.5.2
-- EnGPar_VERSION: 1.1.0
-- BUILD_TESTING: OFF
-- CMAKE_INSTALL_PREFIX: /home5/kjansen/buildengpar/install
-- Could NOT find Doxygen (missing:  DOXYGEN_EXECUTABLE) 
-- IS_TESTING: OFF
-- CXX compilation command: g++ -I/nasa/hpe/mpt/2.17r13/include -lpthread /usr/lib64/libcpuset.so.1 /usr/lib64/libbitmask.so.1 -L/nasa/hpe/mpt/2.17r13/lib -lmpi++ -lmpi

-- MPIRUN: MPIRUN-NOTFOUND -np
-- ENGPAR_FORTRAN_INTERFACE: OFF
-- Local indices are 64 bytes
-- ENABLE_PARMETIS: ON
-- ENABLE_ZOLTAN: OFF
-- Found PARMETIS: /home5/kjansen/Utilities/parmetis/parmetis-4.0.3/installGnuMpt/lib/libparmetis.a  
-- Configuring done
-- Generating done
-- Build files have been written to: /home5/kjansen/buildengpar

Unfortunately, this did not fix the problem (or did it? I no longer see negative values):

PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> module load gcc/6.2
PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> module load mpi-sgi/mpt
PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> mpiexec -n 120 ~/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2 
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 5.491756 seconds
number of tet 18685187 hex 0 prism 9682240 pyramid 0
mesh entity counts: v 8009471 e 41503118 f 61787347 r 28367427
planned Zoltan split factor 2 to target imbalance 1.100000 in 3.146424 seconds
mesh expanded from 60 to 120 parts in 1.256287 seconds
mesh migrated from 60 to 120 in 11.700151 seconds

PARMA_STATUS  disconnected <max avg> 0 0.000
PARMA_STATUS  neighbors <max avg> 19 9.983
PARMA_STATUS  smallest side of max neighbor part 29
PARMA_STATUS  num parts with max neighbors 1
PARMA_STATUS  empty parts 0
PARMA_STATUS  small neighbor counts 1:0 2:2 3:0 4:0 5:0 6:2 7:0 8:0 9:2 10:0 
PARMA_STATUS  weighted vtx <tot max min avg> 8682895.0 112373.0 34920.0 72357.458
PARMA_STATUS  weighted edge <tot max min avg> 43045935.0 435313.0 229585.0 358716.125
PARMA_STATUS  weighted face <tot max min avg> 62730587.0 589500.0 355227.0 522754.892
PARMA_STATUS  weighted rgn <tot max min avg> 28367427.0 290030.0 142546.0 236395.225
PARMA_STATUS  owned bdry vtx <tot max min avg> 576843 12960 0 4807.025
PARMA_STATUS  shared bdry vtx <tot max min avg> 1176539 16528 1510 9804.492
PARMA_STATUS  model bdry vtx <tot max min avg> 266000 10172 0 2216.667
PARMA_STATUS  sharedSidesToElements <max min avg> 0.111 0.010 0.068
PARMA_STATUS  entity imbalance <v e f r>: 1.55 1.21 1.13 1.23

99 nlp 2122884056 nle 39737
59 nlp 1848095169 nle 50000
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
108 nlp 952788 nle 41193
51 nlp 927740 nle 39326
112 nlp 1046825 nle 45320
57 nlp 1177020 nle 51071
109 nlp 838136 nle 34920
92 nlp 1045943 nle 44215
93 nlp 1111342 nle 47043
118 nlp 1160873 nle 49411
33 nlp 977030 nle 41241
102 nlp 1192183 nle 51189
117 nlp 1082932 nle 45770
113 nlp 1070532 nle 45309
36 nlp 1061060 nle 44968
98 nlp 1240290 nle 53230
37 nlp 1110348 nle 46773
MPT ERROR: Rank 59(g:59) received signal SIGABRT/SIGIOT(6).
    Process ID: 48271, Host: r573i4n6, Program: /home5/kjansen/buildengpar/test/splitAndBalanceMesh
    MPT Version: HPE MPT 2.17  11/30/17 08:08:29

MPT: --------stack traceback-------
34 nlp 1147389 nle 48452
104 nlp 997048 nle 42346
101 nlp 1619436 nle 49541
105 nlp 1089882 nle 46158
43 nlp 1174608 nle 49490
54 nlp 1159270 nle 48648
45 nlp 1156956 nle 48946
52 nlp 1156365 nle 48880
27 nlp 1228810 nle 52584
90 nlp 1083250 nle 45724
119 nlp 1222668 nle 51540
47 nlp 1180836 nle 49838
58 nlp 1991683098 nle 52478
91 nlp 1162613 nle 100807
100 nlp 1231264 nle 52035
39 nlp 1229520 nle 51761
107 nlp 1133343 nle 47904
103 nlp 1221699 nle 51715
106 nlp 1097030 nle 92104
110 nlp 1140810 nle 48316
48 nlp 1204564 nle 50627
96 nlp 1047691 nle 71326
53 nlp 1238186 nle 52231
83 nlp 990491 nle 50397
55 nlp 1359121 nle 52562
42 nlp 1238392 nle 52403
46 nlp 1206098 nle 50625
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
115 nlp 1232744 nle 52366
56 nlp 1835271637 nle 52681
38 nlp 1261028 nle 52832
44 nlp 1262759 nle 53363
32 nlp 1150728 nle 48542
111 nlp 1124040 nle 46817
50 nlp 1231181 nle 51662
31 nlp 1247900 nle 52705
41 nlp 1229760 nle 51681
28 nlp 1253121 nle 52910
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
35 nlp 1267186 nle 53494
30 nlp 1254895 nle 52612
24 nlp 1217616 nle 50886
64 nlp 1256866 nle 75894
114 nlp 1265824 nle 53089
49 nlp 1260068 nle 52762
82 nlp 890976 nle 68804
17 nlp 1211961 nle 62727
29 nlp 1265666 nle 53055
78 nlp 923136 nle 77637
66 nlp 1097870 nle 93892
116 nlp 1266674 nle 53357
80 nlp 971814 nle 82007
40 nlp 1212876 nle 50810
15 nlp 1290996 nle 95015
84 nlp 1239423 nle 105484
26 nlp 1268241 nle 52986
61 nlp 1242092 nle 81230
77 nlp 1116858 nle 79411
69 nlp 1223016 nle 104134
75 nlp 1067598 nle 73259
60 nlp 1260873 nle 108003
73 nlp 1038174 nle 87581
97 nlp 1175911 nle 74222
4 nlp 1208871 nle 103201
18 nlp 1307321 nle 96492
8 nlp 1203720 nle 102109
95 nlp 1265598 nle 108429
68 nlp 1268243 nle 82514
86 nlp 1124802 nle 94993
88 nlp 1283457 nle 90131
22 nlp 1282856 nle 70829
21 nlp 1297034 nle 91652
25 nlp 1300891 nle 74589
63 nlp 1293464 nle 84892
74 nlp 1299909 nle 110652
13 nlp 1263841 nle 105356
72 nlp 1243826 nle 81195
89 nlp 1295568 nle 110380
19 nlp 1289680 nle 89270
79 nlp 1252092 nle 106791
3 nlp 1227291 nle 104297
6 nlp 1276641 nle 108635
70 nlp 1161988 nle 97269
81 nlp 1314015 nle 102439
76 nlp 1221766 nle 95822
85 nlp 1256709 nle 106822
2 nlp 1317027 nle 112373
1 nlp 1309599 nle 111764
9 nlp 1261650 nle 106179
65 nlp 1251204 nle 106011
94 nlp 1321065 nle 111892
20 nlp 1310938 nle 92005
23 nlp 1310037 nle 88603
10 nlp 1307088 nle 110946
12 nlp 1210623 nle 95865
62 nlp 1227882 nle 88072
67 nlp 1228118 nle 94014
71 nlp 1315848 nle 110669
14 nlp 1276752 nle 103633
5 nlp 1325682 nle 111950
16 nlp 1331516 nle 96747
11 nlp 1275648 nle 106917
87 nlp 1343556 nle 111219
7 nlp 1304574 nle 109850
0 nlp 1319316 nle 107363
MPT: Attaching to program: /proc/48271/exe, process 48271
MPT: (No debugging symbols found in /lib64/libdl.so.2)
MPT: (No debugging symbols found in /home5/kjansen/Utilities/simModSuite/14.0-180813dev/lib/x64_rhel7_gcc48/psKrnl/libpskernel.so)
MPT: (No debugging symbols found in /usr/lib64/libbz2.so.1)
MPT: (No debugging symbols found in /lib64/libpthread.so.0)
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (No debugging symbols found in /usr/lib64/libcpuset.so.1)
MPT: (No debugging symbols found in /usr/lib64/libbitmask.so.1)
MPT: (No debugging symbols found in /lib64/libm.so.6)
MPT: (No debugging symbols found in /lib64/libc.so.6)
MPT: (No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
MPT: (No debugging symbols found in /lib64/librt.so.1)
MPT: (No debugging symbols found in /usr/lib64/libibverbs.so.1)
MPT: (No debugging symbols found in /usr/lib64/libnl-route-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libnl-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libmthca-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx5-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx4-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libcxgb3-rdmav2.so)
MPT: 0x00002aaaaeeae7da in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/nasa/pkgsrc/sles12/2016Q4/gcc6/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
MPT: To enable execution of this file add
MPT:    add-auto-load-safe-path /nasa/pkgsrc/sles12/2016Q4/gcc6/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: To completely disable this security protection add
MPT:    set auto-load safe-path /
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
MPT:    info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-109.2.x86_64 libbz2-1-debuginfo-1.0.6-30.8.1.x86_64 libcxgb3-rdmav2-debuginfo-1.3.1-6.2.x86_64 libibverbs-debuginfo-41mlnx1-OFED.4.9.0.0.7.49017.x86_64 libmlx4-debuginfo-41mlnx1-OFED.4.7.3.0.3.49017.x86_64 libmlx5-debuginfo-41mlnx1-OFED.4.9.0.1.2.49017.x86_64 libmthca-rdmav2-debuginfo-1.0.6-5.2.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64
MPT: (gdb) #0  0x00002aaaaeeae7da in waitpid () from /lib64/libpthread.so.0
MPT: #1  0x00002aaaaf811906 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7fffffffcf40 "MPT ERROR: Rank 59(g:59) received signal SIGABRT/SIGIOT(6).\n\tProcess ID: 48271, Host: r573i4n6, Program: /home5/kjansen/buildengpar/test/splitAndBalanceMesh\n\tMPT Version: HPE MPT 2.17  11/30/17 08:08:"...) at sig.c:339
MPT: #3  0x00002aaaaf811b08 in first_arriver_handler (signo=signo@entry=6, 
MPT:     stack_trace_sem=stack_trace_sem@entry=0x2aaab6860080) at sig.c:488
MPT: #4  0x00002aaaaf811eeb in slave_sig_handler (signo=6, siginfo=<optimized out>, 
MPT: 


cwsmith commented 3 years ago

There are still a few really large values:

99 nlp 2122884056 nle 39737
59 nlp 1848095169 nle 50000
58 nlp 1991683098 nle 52478

KennethEJansen commented 3 years ago

Yep. It is hard to imagine that this is not a bug. How does an 8M-node mesh overflow 64-bit ints?
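
One thing to rule out when reading these numbers: the debug print at the top of the thread uses `%d`, which expects an `int`. If nlp and nle follow the local index type and are 64 bits wide after the rebuild (an assumption; the thread does not show their declarations), passing them to `%d` is a format mismatch and the printed values cannot be trusted. A width-safe version might look like the sketch below, where `formatCounts` is a hypothetical helper.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the rank and the two counts using a format
// specifier that matches 64-bit arguments.
inline std::string formatCounts(int rank, std::int64_t nlp, std::int64_t nle) {
  char buf[96];
  const int n = std::snprintf(buf, sizeof buf,
                              "%d nlp %" PRId64 " nle %" PRId64,
                              rank, nlp, nle);
  // snprintf returns the length it wanted to write; check it fit in buf.
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
  (void)n;  // silence the unused-variable warning when NDEBUG is defined
  return std::string(buf);
}
```

The result can be passed to `fprintf(stderr, "%s\n", ...)` in place of the original print, so values above the 32-bit range are shown in full.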

KennethEJansen commented 3 years ago

I will see if I can get this built with memory sanitizer on the viz nodes and see if I can make it fail going from 8 to 16 parts.

KennethEJansen commented 3 years ago

kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ ls
debug0.txt debug4.txt geom3D.cnn rendered STFM_MT_Coarsened_6_15.class
debug1.txt debug5.txt outlog.8part run.sh STFM_MT_Coarsened_6_15.crd
debug2.txt debug6.txt outMesh STFM_MT_Coarsened_6_15.2DLay STFM_MT_Coarsened_6_15.fathers2D
debug3.txt debug7.txt outModel.dmg STFM_MT_Coarsened_6_15.3DLay STFM_MT_Coarsened_6_15.match
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ ls outMesh/
0.smb 1.smb 2.smb 3.smb 4.smb 5.smb 6.smb 7.smb
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ mpirun -np 16 /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 160.162255 seconds
number of tet 18685187 hex 0 prism 9682240 pyramid 0
mesh entity counts: v 8009471 e 41503118 f 61787347 r 28367427
planned Zoltan split factor 2 to target imbalance 1.100000 in 156.480378 seconds
mesh expanded from 8 to 16 parts in 162.563581 seconds
mesh migrated from 8 to 16 in 1342.703543 seconds

PARMA_STATUS disconnected <max avg> 0 0.000
PARMA_STATUS neighbors <max avg> 9 5.750
PARMA_STATUS smallest side of max neighbor part 263
PARMA_STATUS num parts with max neighbors 1
PARMA_STATUS empty parts 0
PARMA_STATUS small neighbor counts 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0
PARMA_STATUS weighted vtx <tot max min avg> 8291314.0 740133.0 314254.0 518207.125
PARMA_STATUS weighted edge <tot max min avg> 42039662.0 3098407.0 2023751.0 2627478.875
PARMA_STATUS weighted face <tot max min avg> 62115791.0 4421966.0 2694395.0 3882236.938
PARMA_STATUS weighted rgn <tot max min avg> 28367427.0 2195205.0 1139129.0 1772964.188
PARMA_STATUS owned bdry vtx <tot max min avg> 205450 27257 0 12840.625
PARMA_STATUS shared bdry vtx <tot max min avg> 413565 41876 14418 25847.812
PARMA_STATUS model bdry vtx <tot max min avg> 258894 26553 6168 16180.875
PARMA_STATUS sharedSidesToElements <max min avg> 0.037 0.014 0.024
PARMA_STATUS entity imbalance <v e f r>: 1.43 1.18 1.14 1.24

12 nlp 7352372 nle 314254
5 nlp 8081439 nle 343794
7 nlp 2113618899 nle 361797
terminate called after throwing an instance of 'std::bad_array_new_length'
  what(): std::bad_array_new_length
[viz003:14126] Process received signal
[viz003:14126] Signal: Aborted (6)
[viz003:14126] Signal code: (-6)
13 nlp 8939591 nle 381961
[viz003:14126] [ 0] /usr/local/openmpi/1.10.6-gnu49-thread/lib/libopen-pal.so.13(+0x5bf0a)[0x7f228cb69f0a]
[viz003:14126] [ 1] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f228d634890]
[viz003:14126] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f228d2af067]
[viz003:14126] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f228d2b0448]
[viz003:14126] [ 4] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f228dde957d]
[viz003:14126] [ 5] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e556)[0x7f228dde7556]
[viz003:14126] [ 6] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e5a1)[0x7f228dde75a1]
[viz003:14126] [ 7] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e7b8)[0x7f228dde77b8]
[viz003:14126] [ 8] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8d542)[0x7f228dde6542]
[viz003:14126] [ 9] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraph13connectToPinsEiii+0xe2b)[0x547f41]
15 nlp 7518688 nle 317183
[viz003:14126] [10] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraphC1EPN3apf4MeshEPKcii+0x274)[0x5449b6]
[viz003:14126] [11] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi14createAPFGraphEPN3apf4MeshEPKcii+0x10e)[0x54448e]
[viz003:14126] [12] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_Z9splitMeshRPN3agi6NgraphERPN3apf5Mesh2EiPPc+0x212)[0x52f1e9]
[viz003:14126] [13] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(main+0x2d9)[0x52db7f]
[viz003:14126] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f228d29bb45]
[viz003:14126] [15] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh[0x52d7d3]
[viz003:14126] End of error message
6 nlp 2114268903 nle 376925
terminate called after throwing an instance of 'std::bad_array_new_length'
  what(): std::bad_array_new_length
[viz003:14125] Process received signal
[viz003:14125] Signal: Aborted (6)
[viz003:14125] Associated errno: Unknown error 32766 (32766)
[viz003:14125] Signal code: User function (kill, sigsend, abort, etc.) (0)
[viz003:14125] [ 0] /usr/local/openmpi/1.10.6-gnu49-thread/lib/libopen-pal.so.13(+0x5bf0a)[0x7f22e5aebf0a]
[viz003:14125] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f22e62310e0]
[viz003:14125] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f22e6231067]
[viz003:14125] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f22e6232448]
[viz003:14125] [ 4] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f22e6d6b57d]
[viz003:14125] [ 5] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e556)[0x7f22e6d69556]
[viz003:14125] [ 6] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e5a1)[0x7f22e6d695a1]
[viz003:14125] [ 7] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e7b8)[0x7f22e6d697b8]
[viz003:14125] [ 8] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8d542)[0x7f22e6d68542]
[viz003:14125] [ 9] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraph13connectToPinsEiii+0xe2b)[0x547f41]
[viz003:14125] [10] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraphC1EPN3apf4MeshEPKcii+0x274)[0x5449b6]
[viz003:14125] [11] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi14createAPFGraphEPN3apf4MeshEPKcii+0x10e)[0x54448e]
[viz003:14125] [12] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_Z9splitMeshRPN3agi6NgraphERPN3apf5Mesh2EiPPc+0x212)[0x52f1e9]
[viz003:14125] [13] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(main+0x2d9)[0x52db7f]
[viz003:14125] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f22e621db45]
[viz003:14125] [15] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh[0x52d7d3]
[viz003:14125] End of error message
4 nlp 9064349 nle 383166
14 nlp 6318221 nle 468486
10 nlp 8369562 nle 628787
8 nlp 8284026 nle 595566
9 nlp 9430550 nle 708127
11 nlp 9248040 nle 697600
1 nlp 9534642 nle 740133
3 nlp 9223696 nle 675492
0 nlp 9166728 nle 705247
2 nlp 8635808 nle 592796

mpirun noticed that process rank 7 with PID 14126 on node viz003 exited on signal 6 (Aborted).

This was built with the memory sanitizer, which does not report any issues.

KennethEJansen commented 3 years ago

The tiny case also crashes going from 1 to 2 parts:

kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-tiny/MGEN/mner $ mpirun -np 2 /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 7.193293 seconds
number of tet 392910 hex 0 prism 35720 pyramid 0
mesh entity counts: v 85800 e 557713 f 893140 r 428630
planned Zoltan split factor 2 to target imbalance 1.100000 in 8.085761 seconds
mesh expanded from 1 to 2 parts in 7.052858 seconds
mesh migrated from 1 to 2 in 37.440213 seconds

PARMA_STATUS disconnected <max avg> 0 0.000
PARMA_STATUS neighbors <max avg> 1 1.000
PARMA_STATUS smallest side of max neighbor part 556
PARMA_STATUS num parts with max neighbors 2
PARMA_STATUS empty parts 0
PARMA_STATUS small neighbor counts 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0
PARMA_STATUS weighted vtx <tot max min avg> 93760.0 51793.0 41967.0 46880.000
PARMA_STATUS weighted edge <tot max min avg> 559257.0 285988.0 273269.0 279628.500
PARMA_STATUS weighted face <tot max min avg> 894129.0 452760.0 441369.0 447064.500
PARMA_STATUS weighted rgn <tot max min avg> 428630.0 221457.0 207173.0 214315.000
PARMA_STATUS owned bdry vtx <tot max min avg> 556 556 0 278.000
PARMA_STATUS shared bdry vtx <tot max min avg> 1112 556 556 556.000
PARMA_STATUS model bdry vtx <tot max min avg> 19416 10003 9413 9708.000
PARMA_STATUS sharedSidesToElements <max min avg> 0.005 0.004 0.005
PARMA_STATUS entity imbalance <v e f r>: 1.10 1.02 1.01 1.03

1 nlp 506138444 nle 41967
0 nlp 1600956010 nle 51793
terminate called after throwing an instance of 'std::bad_array_new_length'
  what(): std::bad_array_new_length
terminate called after throwing an instance of 'std::bad_array_new_length'
  what(): std::bad_array_new_length
[viz003:17835] Process received signal
[viz003:17835] Signal: Aborted (6)
[viz003:17835] Signal code: (-6)
[viz003:17835] [ 0] /usr/local/openmpi/1.10.6-gnu49-thread/lib/libopen-pal.so.13(+0x5bf0a)[0x7f651d5f6f0a]
[viz003:17835] [ 1] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf890)[0x7f651e0c1890]
[viz003:17835] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f651dd3c067]
[viz003:17835] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f651dd3d448]
[viz003:17835] [ 4] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7f651e87657d]
[viz003:17835] [ 5] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e556)[0x7f651e874556]
[viz003:17835] [ 6] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e5a1)[0x7f651e8745a1]
[viz003:17835] [ 7] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e7b8)[0x7f651e8747b8]
[viz003:17835] [ 8] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8d542)[0x7f651e873542]
[viz003:17835] [ 9] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraph13connectToPinsEiii+0xe2b)[0x547f41]
[viz003:17835] [10] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraphC1EPN3apf4MeshEPKcii+0x274)[0x5449b6]
[viz003:17835] [11] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi14createAPFGraphEPN3apf4MeshEPKcii+0x10e)[0x54448e]
[viz003:17835] [12] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_Z9splitMeshRPN3agi6NgraphERPN3apf5Mesh2EiPPc+0x212)[0x52f1e9]
[viz003:17835] [13] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(main+0x2d9)[0x52db7f]
[viz003:17835] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f651dd28b45]
[viz003:17835] [15] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh[0x52d7d3]
[viz003:17835] End of error message
[viz003:17834] Process received signal
[viz003:17834] Signal: Aborted (6)
[viz003:17834] Signal code: User function (kill, sigsend, abort, etc.) (0)
[viz003:17834] [ 0] /usr/local/openmpi/1.10.6-gnu49-thread/lib/libopen-pal.so.13(+0x5bf0a)[0x7fe3513a4f0a]
[viz003:17834] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7fe351aea0e0]
[viz003:17834] [ 2] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fe351aea067]
[viz003:17834] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fe351aeb448]
[viz003:17834] [ 4] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x7fe35262457d]
[viz003:17834] [ 5] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e556)[0x7fe352622556]
[viz003:17834] [ 6] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e5a1)[0x7fe3526225a1]
[viz003:17834] [ 7] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8e7b8)[0x7fe3526227b8]
[viz003:17834] [ 8] /usr/local/gcc/6.3.0/lib64/libstdc++.so.6(+0x8d542)[0x7fe352621542]
[viz003:17834] [ 9] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraph13connectToPinsEiii+0xe2b)[0x547f41]
[viz003:17834] [10] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi8apfGraphC1EPN3apf4MeshEPKcii+0x274)[0x5449b6]
[viz003:17834] [11] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_ZN3agi14createAPFGraphEPN3apf4MeshEPKcii+0x10e)[0x54448e]
[viz003:17834] [12] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(_Z9splitMeshRPN3agi6NgraphERPN3apf5Mesh2EiPPc+0x212)[0x52f1e9]
[viz003:17834] [13] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh(main+0x2d9)[0x52db7f]
[viz003:17834] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fe351ad6b45]
[viz003:17834] [15] /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh[0x52d7d3]
[viz003:17834] End of error message

mpirun noticed that process rank 1 with PID 17835 on node viz003 exited on signal 6 (Aborted).

diamog commented 3 years ago

I did some sanity checks and did not find anything. Can I get a copy of the smallest failing case so that I can explore the issue further?

KennethEJansen commented 3 years ago

Hopefully Cameron can transfer the files to a machine you have access to.

This has become non-urgent, though still interesting, because I fixed a bug in chef's weight setting and can now get 9% node imbalance from chef with graph partitioning out to 16320 processes.