Closed Damilare06 closed 4 years ago
@Damilare06 I think you should try the print statements again, because the line you pointed to here is not from the latest commit (if I located it correctly). That line, as well as the two lines above it, is not right, because a dynamically allocated pointer is made to point to another allocated pointer. But if Valgrind works on our code, it would greatly reduce the debugging burden in the future.
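The pointer problem described in the comment above can be sketched as follows. This is a minimal illustration, not the coupler's code: the `Field` struct and `copyField` function are hypothetical names introduced only for this example. The buggy pattern is left as a comment so the sketch stays safe to compile and run.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a coupler buffer; not the project's actual type.
struct Field {
  std::vector<double> values;  // owns its storage and frees it exactly once
};

// Buggy pattern (comment only):
//   double* dst = new double[n];  // first allocation
//   double* src = new double[n];  // second allocation
//   dst = src;                    // first allocation is leaked; two names, one block
//   delete[] dst;
//   delete[] src;                 // same block freed twice
//
// Safer alternative: copy the contents instead of re-pointing the pointer.
Field copyField(const Field& src) {
  Field dst;
  dst.values = src.values;  // deep copy; dst and src own separate storage
  return dst;
}
```

Freeing the same block twice is one way glibc ends up reporting "double free or corruption", so replacing raw owning pointers with containers such as `std::vector` removes that class of error where it applies.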
Yes Shuangxi, I referenced those old locations from previous commits to show the solution logic and to better document the approach. Right now Valgrind works in identifying some errors, such as https://github.com/SCOREC/wdmapp_coupling/issues/86
At the moment, Cameron has advised that I run a simple adios2 test on Summit with Valgrind to check whether the same errors exist.
Can Valgrind be applied to WDMAPP without the coupler? If it can, changing Sst to BP4 does not look useful.
To WDMAPP? I don't understand the question.
Using the latest version of the stack -
coupler - https://github.com/phyboyzhang/wdmapp_coupling-1/tree/szDev
gene - https://github.com/phyboyzhang/gene/tree/rpi
xgc-devel - https://github.com/Damilare06/XGC-Devel/tree/passThrough
This stack currently executes correctly on AiMOS. However, on Summit, the coupler fails after receiving the gene density and returns this error;
Our solution
Our first attempt at resolving this was with print statements, and the investigation pointed at an MPI_Allgatherv that was located here. This method proved insufficient, as commenting this code out only resulted in multiple sets of non-deterministic segfaults.
To resolve this, we decided to switch to Valgrind. This second attempt revealed many MPI and adios2 warnings and errors, as the coupler hangs within the preprocessing stage, just after receiving the versurf array defined here. For adios2 support, I opened this issue: https://github.com/SCOREC/wdmapp_coupling/issues/84
To address the MPI leaks and warnings, I turned on Valgrind suppression using the spectrum-mpi suppression and a generated suppression.
On a more recent run using Valgrind, the error is shown below; the print statements show how far the coupler gets:
stdout (bp4/coupler/codar.workflow.stdout.coupler):
    mype, start count1 0 0 16 2160
    rank=1
    ../coupling/gene_density.bp
    creat engine for: gene_density
    engine parameters are set
    Shape 0 1=4320 16
    Shape 0 1=4320 16
    mype, start count3 0 2160 16 2160
    rank=3
    ../coupling/gene_density.bp
    creat engine for: gene_density
    engine parameters are set
    Shape 0 1=4320 16
    mype, start count2 0 2160 16 2160
    rank=2
    ../coupling/gene_density.bp
    creat engine for: gene_density
    engine parameters are set
    Shape 0 1=4320 16
    numiter,ranki densityfromGENE=0 0 (8.82229,-0.0714541)
    numiter,ranki densityfromGENE=0 3 (28.0649,-0.0714279)
    numiter,ranki densityfromGENE=0 2 (21.4586,0.0714541)
stderr (bp4/coupler/codar.workflow.stderr.coupler):
    0: receive gene_density done
    1: receive gene_density done
    2: receive gene_density done
    3: receive gene_density done
    Error in `/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl': double free or corruption (!prev): 0x00000000a76015a0
    ======= Backtrace: =========
    /lib64/libc.so.6(cfree+0x4a0)[0x200001ad9be0]
    /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1002a628]
    /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1000cfe4]
    /lib64/libc.so.6(+0x25200)[0x200001a65200]
    /lib64/libc.so.6(__libc_start_main+0xc4)[0x200001a653f4]
    ======= Memory map: ========
    10000000-10170000 r-xp 00000000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
    10180000-10190000 r--p 00170000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
    10190000-101a0000 rw-p 00180000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
    101a0000-10270000 rw-p 00000000 00:00 0
    217d0000-21a30000 rw-p 00000000 00:00 0 [heap]
    21a30000-21a40000 rw-p 00000000 00:00 0 [heap]
stdout (bp4_val/coupler/codar.workflow.stdout.coupler):
    engine parameters are set
    numiter,ranki densityfromGENE=0 3 (28.0649,-0.0714279)
    p3m3d.lj0*p3m3d.blockcount=5130432
    numiter,ranki densitytoXGC=0 1 543136 5130433
    sending Engine for cpl_density is created.
    rank=3 The cpl_density was written
    rank=3
    ../coupling/xgc_field.bp
    creat engine for: xgc_field
    engine parameters are set
stderr (bp4_val/coupler/codar.workflow.stderr.coupler):
    0.9
    0.9
    0.9
    0.9
    0: gene_density engine created
    10: receive : receive gene_densitygene_density done
    done
    23: receive : receive gene_densitygene_density done
    done
    ERROR: One or more process (first noticed rank 0) terminated with signal 12
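Since the first investigation pointed at an MPI_Allgatherv, one thing worth checking is that the receive buffer and displacement array agree with the per-rank counts. MPI_Allgatherv takes a `recvcounts` array and a `displs` array, and the receive buffer must be large enough for every block placed at its displacement. The sketch below shows only that bookkeeping in plain C++, without MPI. The helper names `displacementsFromCounts` and `totalCount` are hypothetical and do not come from the coupler.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Exclusive prefix sum: displs[i] is where rank i's block starts
// in a contiguously packed receive buffer.
std::vector<int> displacementsFromCounts(const std::vector<int>& counts) {
  std::vector<int> displs(counts.size(), 0);
  for (std::size_t i = 1; i < counts.size(); ++i) {
    displs[i] = displs[i - 1] + counts[i - 1];
  }
  return displs;
}

// Minimum number of elements the receive buffer must hold
// when the blocks are packed contiguously.
int totalCount(const std::vector<int>& counts) {
  return std::accumulate(counts.begin(), counts.end(), 0);
}
```

As an illustrative case, counts of {2160, 2160} give displacements {0, 2160} and a required buffer length of 4320 elements. A receive buffer allocated with fewer elements than `totalCount` returns would be written past its end, which is one possible way to corrupt the heap; whether that is what happens in the coupler is not established here.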