daniel-koehn / DENISE-Black-Edition

2D time-domain isotropic (visco)elastic FD modeling and full waveform inversion (FWI) code for P/SV-waves
GNU General Public License v2.0

Exception ("merge: can't read model file !") in mergemod.c #34

Closed pplotn closed 3 years ago

pplotn commented 3 years ago

Sometimes, while using DENISE PSV, I get the following error ("merge: can't read model file !") in mergemod.c. What could cause this? I am using 12 nodes with 32 CPUs each, with NPROCX=4 and NPROCY=4.

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files
writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin.??? ... finished.
Copying... ... finished.
Use
ximage n1=384 < ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin label1=Y label2=X title=./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_vs_stage_1_it_10.bin
to visualize model.

PE 0 is writing model to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files
writing merged model file to ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_51/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.???
Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

-rw-r--r-- 1 plotnips k1404 0 May 19 22:17 modelTest_rho_stage_1_it_10.bin
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.0.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.1.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.2.3
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.0
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.1
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.2
-rw-r--r-- 1 plotnips k1404 90K May 19 22:17 modelTest_rho_stage_1_it_10.bin.3.3

daniel-koehn commented 3 years ago

Hi Pavel,

Assuming that you used 16 CPU cores for the parallelization with domain decomposition, the remaining cores are used for shot parallelization. How many shots are you modelling in total? Is that number divisible by 24 without any remainder? Does the problem also occur when using fewer cores for the shot parallelization, or in the extreme case when using only the domain decomposition?

Best regards,

Daniel

pplotn commented 3 years ago

Hello Daniel, I am modeling 51 shots. As I understand it, I use 4*4=16 cores per shot. Overall, I have 12*32=384 cores, which means that I parallelize over 384/16=24 shots. So I need 3 iterations to go through all 51 shots.

This exception is very rare, I don't get it for other model size and number of shots.

20320209ws_fwi_3_strategy_51_Overthrust_true.err.txt 20320209ws_fwi_3_strategy_51_Overthrust_true.out.txt

daniel-koehn commented 3 years ago

Hi Pavel,

I suspect that one problem when using shot parallelization might be that the non-merged model files are removed in PSV/model_it_out_PSV:

https://github.com/daniel-koehn/DENISE-Black-Edition/blob/master/src/PSV/model_it_out_PSV.c

Try commenting out or deleting all remove() calls in model_it_out_PSV.c and recompile the source code before running the code again. If this is indeed the issue, similar problems will occur in gauss_filt.c and gauss_filt_var.c.

Best regards,

Daniel

pplotn commented 3 years ago

Ok, thanks Daniel. I recompiled the code and the problem still occurs on the same velocity model, though it does not happen on other models.

PE 0 is writing model to ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.0.0

**Message from mergemod (printed by PE 0):
PE 0 starts merge of 16 model files
writing merged model file to ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin
Opening model files: ./fwi/ws_fwi_3_strategy_55/Overthrust_true/fld/model/modelTest_rho_stage_1_it_10.bin.???
Message from PE 0
R U N - T I M E E R R O R:
merge: can't read model file !
...now exiting to system.

pplotn commented 3 years ago

Hello, in my experience, setting NPROCX and NPROCY helps to get rid of this error. It works with shot parallelization enabled.

pplotn commented 3 years ago

Increasing the stringsize variable in the fd.h file helped.

daniel-koehn commented 3 years ago

That makes sense. If the model name together with its directory path is longer than the predefined maximum stringsize in fd.h, the domain decomposition numbering might be missing from the file name extension of the model files. As a result, the mergemod function fails to merge the model files from the different sub-domains correctly. Thank you for finding this bug, Pavel.

pplotn commented 3 years ago

Yes, Daniel. The paths to my folders are somewhat complicated, so I increased STRINGSIZE to 150.