Open cwsmith opened 3 years ago
My first intuition is telling me the problem is an integer overflow. This line is what is troubling me along with the negatives:
101 nlp 1915291810 nle 49541
With the 55% vertex imbalance, I am assuming there are a few parts with a lot of vertices and those vertices are bounding a lot of elements (likely made worse by the mixed elements).
You can try running with the local id type as a 64-bit int and see if that fixes the problem.
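As a rough illustration of the overflow hypothesis, the C++ sketch below shows what a count looks like once it is squeezed into a 32-bit signed index. The function names are hypothetical and not part of EnGPar; the sketch only assumes that the default local id type is a 32-bit signed integer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// True when a 64-bit count is representable as a 32-bit signed index.
bool fitsInInt32(std::int64_t count) {
  return count >= std::numeric_limits<std::int32_t>::min() &&
         count <= std::numeric_limits<std::int32_t>::max();
}

// Value a 32-bit signed variable would show if only the low 32 bits
// of the true count were kept (two's complement wrap-around).
std::int32_t wrapToInt32(std::int64_t count) {
  const std::uint32_t low = static_cast<std::uint32_t>(count);  // modulo 2^32
  std::int32_t wrapped;
  std::memcpy(&wrapped, &low, sizeof wrapped);
  return wrapped;
}

// Checked narrowing: refuses counts that would wrap.
std::int32_t toInt32Checked(std::int64_t count) {
  assert(fitsInInt32(count));
  return static_cast<std::int32_t>(count);
}
```

For example, `wrapToInt32(3000000000LL)` yields `-1294967296`, the kind of negative value a wrapped count produces, while `fitsInInt32(1915291810)` is true: the value in the line quoted above is large but still below 2^31 - 1 = 2147483647.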
I would expect this on the target case (278M verts) but this case is only 8M verts so this would be surprising. I will see if I can figure out how to rebuild the code with 64 bit int.
-DMDS_SET_MAX=1024 \
-DMDS_ID_TYPE=long \
I took these from chef's configure script. Is either of them what I am looking for?
Hacking around I think I found the right flag:
kjansen@pfe25:~/buildengpar> more ../engpar/doconfig.sh
#!/bin/bash
[ $# -ne 2 ] && echo "Usage: $0 <absolute path to source> <scorec core install>" && exit 1
src=$1
[ ! -e $src ] && echo "source dir $src does not exist!" && exit 1
core=$2
[ ! -e $core ] && echo "SCOREC core dir $core does not exist!" && exit 1
cmake $src \
-DCMAKE_C_COMPILER="mpicc" \
-DCMAKE_C_FLAGS="-g" \
-DCMAKE_CXX_COMPILER="mpicxx" \
-DCMAKE_CXX_FLAGS="-g -std=c++11 -Wl,--no-as-needed -ldl -pthread" \
-DENABLE_ZOLTAN=OFF \
-DENABLE_PARMETIS=ON \
-DSCOREC_PREFIX=${core} \
-DLONG_LOCAL_INDICES=ON \
-DENABLE_PUMI=ON \
-DIS_TESTING=OFF \
-DCMAKE_INSTALL_PREFIX=$PWD/install
kjansen@pfe25:~/buildengpar> ../engpar/doconfig.sh ~/engpar/ /home5/kjansen/SCOREC-core/buildMT_2/install/
-- The CXX compiler identification is GNU 6.2.0
-- The C compiler identification is GNU 6.2.0
-- Check for working CXX compiler: /nasa/hpe/mpt/2.17r13/bin/mpicxx
-- Check for working CXX compiler: /nasa/hpe/mpt/2.17r13/bin/mpicxx -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working C compiler: /nasa/hpe/mpt/2.17r13/bin/mpicc
-- Check for working C compiler: /nasa/hpe/mpt/2.17r13/bin/mpicc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- CMAKE_VERSION: 3.5.2
-- EnGPar_VERSION: 1.1.0
-- BUILD_TESTING: OFF
-- CMAKE_INSTALL_PREFIX: /home5/kjansen/buildengpar/install
-- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
-- IS_TESTING: OFF
-- CXX compilation command: g++ -I/nasa/hpe/mpt/2.17r13/include -lpthread /usr/lib64/libcpuset.so.1 /usr/lib64/libbitmask.so.1 -L/nasa/hpe/mpt/2.17r13/lib -lmpi++ -lmpi
-- MPIRUN: MPIRUN-NOTFOUND -np
-- ENGPAR_FORTRAN_INTERFACE: OFF
**-- Local indices are 64 bytes**
-- ENABLE_PARMETIS: ON
-- ENABLE_ZOLTAN: OFF
-- Found PARMETIS: /home5/kjansen/Utilities/parmetis/parmetis-4.0.3/installGnuMpt/lib/libparmetis.a
-- Configuring done
-- Generating done
-- Build files have been written to: /home5/kjansen/buildengpar
Unfortunately, this did not fix the problem (or did it? I no longer see negative values):
PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> module load gcc/6.2
PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> module load mpi-sgi/mpt
PBS r573i3n12:/nobackup/kjansen/SeparatedBump/FlatPlateBumpDimensions/FPS-MTW-6-15/mner> mpiexec -n 120 ~/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 5.491756 seconds
number of tet 18685187 hex 0 prism 9682240 pyramid 0
mesh entity counts: v 8009471 e 41503118 f 61787347 r 28367427
planned Zoltan split factor 2 to target imbalance 1.100000 in 3.146424 seconds
mesh expanded from 60 to 120 parts in 1.256287 seconds
mesh migrated from 60 to 120 in 11.700151 seconds
PARMA_STATUS disconnected <max avg> 0 0.000
PARMA_STATUS neighbors <max avg> 19 9.983
PARMA_STATUS smallest side of max neighbor part 29
PARMA_STATUS num parts with max neighbors 1
PARMA_STATUS empty parts 0
PARMA_STATUS small neighbor counts 1:0 2:2 3:0 4:0 5:0 6:2 7:0 8:0 9:2 10:0
PARMA_STATUS weighted vtx <tot max min avg> 8682895.0 112373.0 34920.0 72357.458
PARMA_STATUS weighted edge <tot max min avg> 43045935.0 435313.0 229585.0 358716.125
PARMA_STATUS weighted face <tot max min avg> 62730587.0 589500.0 355227.0 522754.892
PARMA_STATUS weighted rgn <tot max min avg> 28367427.0 290030.0 142546.0 236395.225
PARMA_STATUS owned bdry vtx <tot max min avg> 576843 12960 0 4807.025
PARMA_STATUS shared bdry vtx <tot max min avg> 1176539 16528 1510 9804.492
PARMA_STATUS model bdry vtx <tot max min avg> 266000 10172 0 2216.667
PARMA_STATUS sharedSidesToElements <max min avg> 0.111 0.010 0.068
PARMA_STATUS entity imbalance <v e f r>: 1.55 1.21 1.13 1.23
99 nlp 2122884056 nle 39737
59 nlp 1848095169 nle 50000
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
108 nlp 952788 nle 41193
51 nlp 927740 nle 39326
112 nlp 1046825 nle 45320
57 nlp 1177020 nle 51071
109 nlp 838136 nle 34920
92 nlp 1045943 nle 44215
93 nlp 1111342 nle 47043
118 nlp 1160873 nle 49411
33 nlp 977030 nle 41241
102 nlp 1192183 nle 51189
117 nlp 1082932 nle 45770
113 nlp 1070532 nle 45309
36 nlp 1061060 nle 44968
98 nlp 1240290 nle 53230
37 nlp 1110348 nle 46773
MPT ERROR: Rank 59(g:59) received signal SIGABRT/SIGIOT(6).
Process ID: 48271, Host: r573i4n6, Program: /home5/kjansen/buildengpar/test/splitAndBalanceMesh
MPT Version: HPE MPT 2.17 11/30/17 08:08:29
MPT: --------stack traceback-------
34 nlp 1147389 nle 48452
104 nlp 997048 nle 42346
101 nlp 1619436 nle 49541
105 nlp 1089882 nle 46158
43 nlp 1174608 nle 49490
54 nlp 1159270 nle 48648
45 nlp 1156956 nle 48946
52 nlp 1156365 nle 48880
27 nlp 1228810 nle 52584
90 nlp 1083250 nle 45724
119 nlp 1222668 nle 51540
47 nlp 1180836 nle 49838
58 nlp 1991683098 nle 52478
91 nlp 1162613 nle 100807
100 nlp 1231264 nle 52035
39 nlp 1229520 nle 51761
107 nlp 1133343 nle 47904
103 nlp 1221699 nle 51715
106 nlp 1097030 nle 92104
110 nlp 1140810 nle 48316
48 nlp 1204564 nle 50627
96 nlp 1047691 nle 71326
53 nlp 1238186 nle 52231
83 nlp 990491 nle 50397
55 nlp 1359121 nle 52562
42 nlp 1238392 nle 52403
46 nlp 1206098 nle 50625
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
115 nlp 1232744 nle 52366
56 nlp 1835271637 nle 52681
38 nlp 1261028 nle 52832
44 nlp 1262759 nle 53363
32 nlp 1150728 nle 48542
111 nlp 1124040 nle 46817
50 nlp 1231181 nle 51662
31 nlp 1247900 nle 52705
41 nlp 1229760 nle 51681
28 nlp 1253121 nle 52910
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
35 nlp 1267186 nle 53494
30 nlp 1254895 nle 52612
24 nlp 1217616 nle 50886
64 nlp 1256866 nle 75894
114 nlp 1265824 nle 53089
49 nlp 1260068 nle 52762
82 nlp 890976 nle 68804
17 nlp 1211961 nle 62727
29 nlp 1265666 nle 53055
78 nlp 923136 nle 77637
66 nlp 1097870 nle 93892
116 nlp 1266674 nle 53357
80 nlp 971814 nle 82007
40 nlp 1212876 nle 50810
15 nlp 1290996 nle 95015
84 nlp 1239423 nle 105484
26 nlp 1268241 nle 52986
61 nlp 1242092 nle 81230
77 nlp 1116858 nle 79411
69 nlp 1223016 nle 104134
75 nlp 1067598 nle 73259
60 nlp 1260873 nle 108003
73 nlp 1038174 nle 87581
97 nlp 1175911 nle 74222
4 nlp 1208871 nle 103201
18 nlp 1307321 nle 96492
8 nlp 1203720 nle 102109
95 nlp 1265598 nle 108429
68 nlp 1268243 nle 82514
86 nlp 1124802 nle 94993
88 nlp 1283457 nle 90131
22 nlp 1282856 nle 70829
21 nlp 1297034 nle 91652
25 nlp 1300891 nle 74589
63 nlp 1293464 nle 84892
74 nlp 1299909 nle 110652
13 nlp 1263841 nle 105356
72 nlp 1243826 nle 81195
89 nlp 1295568 nle 110380
19 nlp 1289680 nle 89270
79 nlp 1252092 nle 106791
3 nlp 1227291 nle 104297
6 nlp 1276641 nle 108635
70 nlp 1161988 nle 97269
81 nlp 1314015 nle 102439
76 nlp 1221766 nle 95822
85 nlp 1256709 nle 106822
2 nlp 1317027 nle 112373
1 nlp 1309599 nle 111764
9 nlp 1261650 nle 106179
65 nlp 1251204 nle 106011
94 nlp 1321065 nle 111892
20 nlp 1310938 nle 92005
23 nlp 1310037 nle 88603
10 nlp 1307088 nle 110946
12 nlp 1210623 nle 95865
62 nlp 1227882 nle 88072
67 nlp 1228118 nle 94014
71 nlp 1315848 nle 110669
14 nlp 1276752 nle 103633
5 nlp 1325682 nle 111950
16 nlp 1331516 nle 96747
11 nlp 1275648 nle 106917
87 nlp 1343556 nle 111219
7 nlp 1304574 nle 109850
0 nlp 1319316 nle 107363
MPT: Attaching to program: /proc/48271/exe, process 48271
MPT: (No debugging symbols found in /lib64/libdl.so.2)
MPT: (No debugging symbols found in /home5/kjansen/Utilities/simModSuite/14.0-180813dev/lib/x64_rhel7_gcc48/psKrnl/libpskernel.so)
MPT: (No debugging symbols found in /usr/lib64/libbz2.so.1)
MPT: (No debugging symbols found in /lib64/libpthread.so.0)
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: (No debugging symbols found in /usr/lib64/libcpuset.so.1)
MPT: (No debugging symbols found in /usr/lib64/libbitmask.so.1)
MPT: (No debugging symbols found in /lib64/libm.so.6)
MPT: (No debugging symbols found in /lib64/libc.so.6)
MPT: (No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
MPT: (No debugging symbols found in /lib64/librt.so.1)
MPT: (No debugging symbols found in /usr/lib64/libibverbs.so.1)
MPT: (No debugging symbols found in /usr/lib64/libnl-route-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libnl-3.so.200)
MPT: (No debugging symbols found in /usr/lib64/libmthca-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx5-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libmlx4-rdmav2.so)
MPT: (No debugging symbols found in /usr/lib64/libcxgb3-rdmav2.so)
MPT: 0x00002aaaaeeae7da in waitpid () from /lib64/libpthread.so.0
MPT: warning: File "/nasa/pkgsrc/sles12/2016Q4/gcc6/lib64/libstdc++.so.6.0.22-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
MPT: To enable execution of this file add
MPT: add-auto-load-safe-path /nasa/pkgsrc/sles12/2016Q4/gcc6/lib64/libstdc++.so.6.0.22-gdb.py
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: To completely disable this security protection add
MPT: set auto-load safe-path /
MPT: line to your configuration file "/home5/kjansen/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
MPT: info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-109.2.x86_64 libbz2-1-debuginfo-1.0.6-30.8.1.x86_64 libcxgb3-rdmav2-debuginfo-1.3.1-6.2.x86_64 libibverbs-debuginfo-41mlnx1-OFED.4.9.0.0.7.49017.x86_64 libmlx4-debuginfo-41mlnx1-OFED.4.7.3.0.3.49017.x86_64 libmlx5-debuginfo-41mlnx1-OFED.4.9.0.1.2.49017.x86_64 libmthca-rdmav2-debuginfo-1.0.6-5.2.x86_64 libnl3-200-debuginfo-3.2.23-2.21.x86_64
MPT: (gdb) #0 0x00002aaaaeeae7da in waitpid () from /lib64/libpthread.so.0
MPT: #1 0x00002aaaaf811906 in mpi_sgi_system (
MPT: #2 MPI_SGI_stacktraceback (
MPT: header=header@entry=0x7fffffffcf40 "MPT ERROR: Rank 59(g:59) received signal SIGABRT/SIGIOT(6).\n\tProcess ID: 48271, Host: r573i4n6, Program: /home5/kjansen/buildengpar/test/splitAndBalanceMesh\n\tMPT Version: HPE MPT 2.17 11/30/17 08:08:"...) at sig.c:339
MPT: #3 0x00002aaaaf811b08 in first_arriver_handler (signo=signo@entry=6,
MPT: stack_trace_sem=stack_trace_sem@entry=0x2aaab6860080) at sig.c:488
MPT: #4 0x00002aaaaf811eeb in slave_sig_handler (signo=6, siginfo=<optimized out>,
MPT:
There are still a couple really large values:
99 nlp 2122884056 nle 39737
59 nlp 1848095169 nle 50000
58 nlp 1991683098 nle 52478
Yep. Hard to imagine that this is not a bug. How does an 8M-node mesh overflow 64-bit ints?
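One way to tell a genuine count from a garbage value is a plausibility bound. The sketch below assumes that each pin pairs a local element with one of its vertices, so the pin count on a part cannot exceed the element count times the largest number of vertices per element (6 for a mesh of tets and prisms). The function names are hypothetical, and the bound ignores any ghost entries.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on pins: each element contributes at most one pin per vertex.
std::int64_t maxPins(std::int64_t numElems, int maxVertsPerElem) {
  assert(numElems >= 0 && maxVertsPerElem > 0);
  return numElems * maxVertsPerElem;
}

// A pin count is plausible only if it is non-negative and within the bound.
bool pinCountPlausible(std::int64_t nlp, std::int64_t numElems,
                       int maxVertsPerElem) {
  return nlp >= 0 && nlp <= maxPins(numElems, maxVertsPerElem);
}
```

With the largest part in the 120-part run holding 290030 regions, the bound is 290030 * 6 = 1740180. The typical `nlp` values of about 1.2M pass this check, while 2122884056 exceeds it by roughly three orders of magnitude. That would point to something other than a real count, for example uninitialized memory, though that is only a guess.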
I will see if I can get this built with memory sanitizer on the viz nodes and see if I can make it fail going from 8 to 16 parts.
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ ls
debug0.txt debug4.txt geom3D.cnn rendered STFM_MT_Coarsened_6_15.class
debug1.txt debug5.txt outlog.8part run.sh STFM_MT_Coarsened_6_15.crd
debug2.txt debug6.txt outMesh STFM_MT_Coarsened_6_15.2DLay STFM_MT_Coarsened_6_15.fathers2D
debug3.txt debug7.txt outModel.dmg STFM_MT_Coarsened_6_15.3DLay STFM_MT_Coarsened_6_15.match
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ ls outMesh/
0.smb 1.smb 2.smb 3.smb 4.smb 5.smb 6.smb 7.smb
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN/mner $ mpirun -np 16 /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 160.162255 seconds
number of tet 18685187 hex 0 prism 9682240 pyramid 0
mesh entity counts: v 8009471 e 41503118 f 61787347 r 28367427
planned Zoltan split factor 2 to target imbalance 1.100000 in 156.480378 seconds
mesh expanded from 8 to 16 parts in 162.563581 seconds
mesh migrated from 8 to 16 in 1342.703543 seconds
PARMA_STATUS disconnected
mpirun noticed that process rank 7 with PID 14126 on node viz003 exited on signal 6 (Aborted).
This was with memory sanitizer and it is not reporting any issues
And the tiny case also crashes going from 1 to 2
kjansen@viz003: /projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-tiny/MGEN/mner $ mpirun -np 2 /projects/tools/EngPar/buildengpar/test/splitAndBalanceMesh outModel.dmg outMesh/ 0 2
ENGPAR Git hash 68853710c2dc936006d0eb9446434724f8a21ce6
mesh outMesh/ loaded in 7.193293 seconds
number of tet 392910 hex 0 prism 35720 pyramid 0
mesh entity counts: v 85800 e 557713 f 893140 r 428630
planned Zoltan split factor 2 to target imbalance 1.100000 in 8.085761 seconds
mesh expanded from 1 to 2 parts in 7.052858 seconds
mesh migrated from 1 to 2 in 37.440213 seconds
PARMA_STATUS disconnected
mpirun noticed that process rank 1 with PID 17835 on node viz003 exited on signal 6 (Aborted).
I did some sanity checks and did not find anything. Can I get a copy of the smallest failing case so that I can explore the issue further?
Hopefully Cameron can transfer the files to a machine you have access to.
This has become non-urgent but is still interesting: I fixed a bug in chef's weight setting and am now able to get 9% node imbalance from chef with graph partitioning out to 16320 processes.
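For reference, the imbalance figures in this thread are consistent with the ratio of the heaviest part to the average part, so 9% corresponds to a ratio of 1.09 and the 1.55 reported by PARMA_STATUS for vertices corresponds to 55%. A minimal sketch, assuming that definition (the function names are hypothetical):

```cpp
#include <cassert>

// Imbalance as the ratio of the heaviest part to the average part.
double imbalance(double maxWeight, double avgWeight) {
  assert(avgWeight > 0.0);
  return maxWeight / avgWeight;
}

// The same quantity expressed as percent above the average.
double percentOverAverage(double imbalanceRatio) {
  return (imbalanceRatio - 1.0) * 100.0;
}
```

Using the `weighted vtx` line from the 120-part run, `imbalance(112373.0, 72357.458)` is about 1.553, which matches the 1.55 in the `entity imbalance` line.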
After splitting, during hypergraph creation (mesh elements are graph nodes, mesh vertices are hyperedges), the pin connection process aborts with a bad allocation (std::bad_alloc) here:
https://github.com/SCOREC/EnGPar/blob/68853710c2dc936006d0eb9446434724f8a21ce6/interfaces/apfGraph.cpp#L293
The following print statement was added before that line to help determine the cause:
fprintf(stderr, "%d nlp %d nle %d\n", PCU_Comm_Self(), nlp, nle);
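One caveat about this print statement: if the build uses 64-bit local indices, `%d` no longer matches the argument type. The behavior is then undefined, and in practice the output may show only part of the value. The sketch below is a width-safe variant, assuming `nlp` and `nle` are (or can be converted to) 64-bit signed integers; the actual EnGPar types should be checked. `formatCounts` is a hypothetical helper, not an EnGPar function.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdio>
#include <string>

// Formats the debug line with 64-bit-safe conversion specifiers.
std::string formatCounts(int rank, std::int64_t nlp, std::int64_t nle) {
  char buf[96];  // large enough for one int and two 64-bit values
  const int n = std::snprintf(buf, sizeof buf,
                              "%d nlp %" PRId64 " nle %" PRId64,
                              rank, nlp, nle);
  assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
  return std::string(buf);
}
```

In the test program, the result could be written with `fprintf(stderr, "%s\n", formatCounts(PCU_Comm_Self(), nlp, nle).c_str());`.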
The output from a run on Pleiades is pasted below. Two ranks print large negative values for nlp:
input mesh and model
/projects/tools/Models/BoeingBump/LES_DNS_Meshing/FPS-MTW-6-15/MGEN
stdout/err