SCOREC / pcms

BSD 3-Clause "New" or "Revised" License

Coupler Double Corruption error on Summit #85

Closed: Damilare06 closed this issue 4 years ago

Damilare06 commented 4 years ago

Using the latest version of the stack:

- coupler - https://github.com/phyboyzhang/wdmapp_coupling-1/tree/szDev
- gene - https://github.com/phyboyzhang/gene/tree/rpi
- xgc-devel - https://github.com/Damilare06/XGC-Devel/tree/passThrough

This stack currently executes correctly on AiMOS. However, on Summit, the coupler fails after receiving the gene density and returns this error:

engine parameters are set                                                                               |0.9                                                                                                       
mype, start count3 0 2160 16 2160                                                                       |0.9                                                                                                       
rank=3                                                                                                  |0.9                                                                                                       
../coupling/gene_density.bp                                                                             |0.9                                                                                                       
creat engine for: gene_density                                                                          |0: gene_density engine created                                                                            
engine parameters are set                                                                               |3: receive gene_density done                                                                              
mype, start count2 0 2160 16 2160                                                                       |2: receive gene_density done                                                                              
rank=2                                                                                                  |1: receive gene_density done                                                                              
../coupling/gene_density.bp                                                                             |0: receive gene_density done                                                                              
creat engine for: gene_density                                                                          |*** Error in `/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl': double free or corruption (!prev):
engine parameters are set                                                                               | 0x00000000b3998760 ***                                                                                   
Shape 0 1=4320 16                                                                                       |======= Backtrace: =========                                                                              
Shape 0 1=4320 16                                                                                       |/lib64/libc.so.6(cfree+0x4a0)[0x200001ad9be0]                                                             
Shape 0 1=4320 16                                                                                       |/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1002a628]                                       
Shape 0 1=4320 16                                                                                       |/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1000cfe4]                                       
numiter,ranki densityfromGENE=0 0 (8.82229,-0.0714541)                                                  |/lib64/libc.so.6(+0x25200)[0x200001a65200]                                                                
numiter,ranki densityfromGENE=0 2 (21.4586,0.0714541)                                                   |*** Error in `/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl': double free or corruption (!prev):
numiter,ranki densityfromGENE=0 3 (28.0649,-0.0714279)                                                  | 0x000000009c8922c0 ***                                                                                   
double_corr/coupler/codar.workflow.stdout.coupler                                     105,1          Bot double_corr/coupler/codar.workflow.stderr.coupler  

Our solution


This points to another failure at the pre-processing stage - within the `dataprocess` class
- The third attempt was to switch from SST to BP4 for SST validation. Without Valgrind, using BP4 causes the coupler to die at the same spot as with SST (the Double Corruption Error), as seen below:

84 mype, start count1 0 0 16 2160 | 72 0: receive gene_density done
85 rank=1 | 73 1: receive gene_density done
86 ../coupling/gene_density.bp | 74 2: receive gene_density done
87 creat engine for: gene_density | 75 3: receive gene_density done
88 engine parameters are set | 76 Error in `/autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl': double free or corruption (!prev): 0x00000000a76015a0
89 Shape 0 1=4320 16 |
90 Shape 0 1=4320 16 | 77 ======= Backtrace: =========
91 mype, start count3 0 2160 16 2160 | 78 /lib64/libc.so.6(cfree+0x4a0)[0x200001ad9be0]
92 rank=3 | 79 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1002a628]
93 ../coupling/gene_density.bp | 80 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl[0x1000cfe4]
94 creat engine for: gene_density | 81 /lib64/libc.so.6(+0x25200)[0x200001a65200]
95 engine parameters are set | 82 /lib64/libc.so.6(__libc_start_main+0xc4)[0x200001a653f4]
96 Shape 0 1=4320 16 | 83 ======= Memory map: ========
97 mype, start count2 0 2160 16 2160 | 84 10000000-10170000 r-xp 00000000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
98 rank=2 |
99 ../coupling/gene_density.bp | 85 10180000-10190000 r--p 00170000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
100 creat engine for: gene_density |
101 engine parameters are set | 86 10190000-101a0000 rw-p 00180000 00:30 154231132 /autofs/nccs-svm1_home1/damilare/dev/build-Cpl/test/cpl
102 Shape 0 1=4320 16 |
103 numiter,ranki densityfromGENE=0 0 (8.82229,-0.0714541) | 87 101a0000-10270000 rw-p 00000000 00:00 0
104 numiter,ranki densityfromGENE=0 3 (28.0649,-0.0714279) | 88 217d0000-21a30000 rw-p 00000000 00:00 0 [heap]
105 numiter,ranki densityfromGENE=0 2 (21.4586,0.0714541) | 89 21a30000-21a40000 rw-p 00000000 00:00 0 [heap]
bp4/coupler/codar.workflow.stdout.coupler 105,1 Bot bp4/coupler/codar.workflow.stderr.coupler

A final attempt was to run the coupler with BP4 and Valgrind. The first run produced the result below:

129 engine parameters are set | 67 0.9
130 numiter,ranki densityfromGENE=0 3 (28.0649,-0.0714279) | 68 0.9
131 p3m3d.lj0*p3m3d.blockcount=5130432 | 69 0.9
132 numiter,ranki densitytoXGC=0 1 543136 5130433 | 70 0.9
133 sending Engine for cpl_density is created. | 71 0: gene_density engine created
134 rank=3 The cpl_density was written | 72 10: receive : receive gene_densitygene_density done
135 rank=3 | 73 done
136 ../coupling/xgc_field.bp | 74 23: receive : receive gene_densitygene_density done
137 creat engine for: xgc_field | 75 done
138 engine parameters are set | 76 ERROR: One or more process (first noticed rank 0) terminated with signal 12
bp4_val/coupler/codar.workflow.stdout.coupler 138,1 Bot bp4_val/coupler/codar.workflow.stderr.coupler



It can be seen that the coupler proceeds further with Valgrind, getting to its `xgc_field` calls, but fails after receiving the `gene_density` when Valgrind is not in use.
In summary, it appears that the following bugs exist:

- The coupler encounters an error on `Summit` that still needs debugging. With or without Valgrind and with or without BP4, the run never gets past the `gene_density` receive.
- There is a bug with the use of SST mode that causes the Valgrind runs to fail at the preprocessing phase.
- BP4 with Valgrind proceeds further than BP4 without Valgrind.

Please find attached a copy of the process output and the generated suppression file.

[58574.txt](https://github.com/SCOREC/wdmapp_coupling/files/4974223/58574.txt)
[58575.txt](https://github.com/SCOREC/wdmapp_coupling/files/4974224/58575.txt)
[58576.txt](https://github.com/SCOREC/wdmapp_coupling/files/4974225/58576.txt)
[58577.txt](https://github.com/SCOREC/wdmapp_coupling/files/4974226/58577.txt)
[cleaned_output.supp.txt](https://github.com/SCOREC/wdmapp_coupling/files/4974229/cleaned_output.supp.txt)

phyboyzhang commented 4 years ago

@Damilare06 I think you should try print statements again, because the line you identified here is not from the latest commit (if it was identified correctly). That line, as well as the two lines above it, is not right, because a dynamically allocated pointer is made to point to another allocated pointer. But if Valgrind works on our code, the debugging burden would be greatly reduced in the future.

Damilare06 commented 4 years ago

Yes Shuangxi, I referenced those old locations from previous commits to show the solution logic and to better document the approach. Right now Valgrind works in identifying some errors, such as https://github.com/SCOREC/wdmapp_coupling/issues/86.

At the moment, Cameron has advised that I run a simple ADIOS2 test on Summit with Valgrind to check whether the same errors exist.

phyboyzhang commented 4 years ago

Can Valgrind be applied to WDMAPP without the coupler? If it can, changing SST to BP4 does not look useful.

Damilare06 commented 4 years ago

To WDMAPP? I don't understand the question.