NOAA-EMC / WW3

WAVEWATCH III

unstructured PDLIB regtests not reproducing with different number of MPI tasks #322

Open JessicaMeixner-NOAA opened 3 years ago

JessicaMeixner-NOAA commented 3 years ago

When running the following regression tests with a different number of MPI tasks and then comparing with matrix.base, the regression tests do not reproduce:

./bin/run_test  -s MPI -s PDLIB -w work_c -g c -f -p mpirun -n 36 ../model ww3_tp2.17
./bin/run_test -s MPI -s PDLIB -w work_mb -m grdset_b -f -p mpirun -n 36 ../model ww3_tp2.17
./bin/run_test -s MPI -s PDLIB -w work_mc -m grdset_c -f -p mpirun -n 36 ../model ww3_tp2.17
./bin/run_test -s MPI -s PDLIB -w work_pdlib -g pdlib -f -p mpirun -n 36 ../model ww3_tp2.6
benoitp-cmc commented 2 years ago

Could it be related to not using INIT_GET_ISEA here: https://github.com/NOAA-EMC/WW3/blob/c6408fafde123ce37728824d1baa0d8471306e49/model/src/w3iorsmd.F90#L795
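For context, the kind of mapping this is about is sketched below. This is illustrative only, not the actual w3iorsmd.F90 code, and it assumes the usual WW3 conventions (JSEA = local sea-point index, ISEA = global index, NSEAL = number of local points, IAPROC/NAPROC = MPI rank and task count):

      ! Illustrative sketch only -- not the actual w3iorsmd.F90 loop.
      DO JSEA = 1, NSEAL
        ! PDLIB-aware local->global mapping (what the link above points at):
        CALL INIT_GET_ISEA ( ISEA, JSEA )
        ! Card-dealt mapping, valid only without PDLIB; if this form is used
        ! with an unstructured (PDLIB) decomposition, the restart contents
        ! end up depending on the number of MPI tasks:
        ! ISEA = IAPROC + (JSEA-1)*NAPROC
        ! ... read/write the restart record for global point ISEA ...
      END DO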

aronroland commented 1 year ago

Yes, it seems a good candidate. :)

Many thanks for looking at this and pointing out this issue. Good catch!

JessicaMeixner-NOAA commented 1 year ago

This has been solved for the block-explicit scheme (see issue #961 for more issues). Issues with restarts not being b4b remain.

Other schemes likely still have issues; some of them are known not to be b4b. A reassessment of what is needed to achieve b4b reproducibility for as many schemes as possible has not yet been done.

MatthewMasarik-NOAA commented 1 year ago

File-difference counts reported by matrix.comp when comparing runs with 24 MPI tasks vs. 36 MPI tasks:

ww3_tp2.17/./work_mb                     (4 files differ)               
ww3_tp2.17/./work_mc                     (7 files differ)               
ww3_tp2.17/./work_c                      (9 files differ)
ww3_tp2.6/./work_pdlib                   (13 files differ)
aliabdolali commented 1 year ago

Hi @MatthewMasarik-NOAA, this is expected: the tests you named all use the implicit scheme, and the implicit schemes in the model cannot be b4b identical with different numbers of CPUs. This is not unique to WW3; it holds for many other models that use implicit schemes, and it is the nature of the scheme (I gave a lecture on this topic a while ago). That is why we designed the GFSv17 case with the block-explicit scheme, which is b4b by nature. Hope this is helpful. @aronroland and @thesser1 were part of the conversation.
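For a standalone illustration of why this happens (a small sketch added here for clarity, not WW3 code): when the mesh is partitioned differently, the sums and iteration sweeps inside the implicit solver accumulate in a different order, and floating-point addition is not associative, so the result differs at the bit level and the difference grows over the solver iterations.

PROGRAM sum_order
  ! Summing the same numbers in two different orders, as happens when a
  ! domain decomposition changes, generally gives bitwise-different results.
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100000
  REAL(KIND=8) :: a(N), s_fwd, s_bwd
  INTEGER :: i
  DO i = 1, N
    a(i) = 1.0D0 / REAL(i, KIND=8)
  END DO
  s_fwd = 0.0D0
  DO i = 1, N          ! one partition ordering
    s_fwd = s_fwd + a(i)
  END DO
  s_bwd = 0.0D0
  DO i = N, 1, -1      ! another partition ordering
    s_bwd = s_bwd + a(i)
  END DO
  PRINT *, 'forward sum  = ', s_fwd
  PRINT *, 'backward sum = ', s_bwd
  PRINT *, 'difference   = ', s_fwd - s_bwd   ! generally nonzero
END PROGRAM sum_order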

MatthewMasarik-NOAA commented 1 year ago

Hi @aliabdolali, this is good to know about the implicit schemes, thanks for mentioning it. The end goal of this issue is to address restart reproducibility in the unstructured cases so that they satisfy the ORTs for coupling in UFS. From an offline conversation with @JessicaMeixner-NOAA, the restart reproducibility issues also affect the explicit unstructured cases, and resolving those is the first priority here. The lecture you mention sounds really interesting; if you have slides you're willing to share, please let me know here or offline.

DeniseWorthen commented 1 year ago

@MatthewMasarik-NOAA I'm confused. We have both a restart and an mpi reproducibility test in the current UFS RTs. Is there a new issue here?

JessicaMeixner-NOAA commented 1 year ago

> @MatthewMasarik-NOAA I'm confused. We have both a restart and an mpi reproducibility test in the current UFS RTs. Is there a new issue here?

The restart file itself is not reproducible, or has that issue been resolved?

JessicaMeixner-NOAA commented 1 year ago

@DeniseWorthen sorry I didn't tag you in the reply. Matt's going to be working on resolving the issue of the restart file itself not being reproducible, unless that problem was already solved and I was unaware, in which case that's great!

DeniseWorthen commented 1 year ago

@JessicaMeixner-NOAA OK, thanks. As far as I know the restart file for WW3 is often not reproducible in the UFS RTs---but that is true for both the structured and unstructured cases.

MatthewMasarik-NOAA commented 1 year ago

Hi @aliabdolali @aronroland I've been trying to run the PDLIB block explicit solver and have been hitting seg faults after integration starts.

The setup I'm using is basically a copy of ww3_tp2.21, invoked with this run command:

./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_b  -g b   -f -p srun -n 24 ../model ww3_tp2.21

The main changes are in ww3_grid_b.inp (I've tried both the .nml and .inp formats), which I believe should invoke the block-explicit scheme and turn off all implicit schemes:

&UNST                              
  UGBCCFL = T,                     
  UGOBCAUTO = F,                   
  UGOBCDEPTH= -10.,                
  EXPFSN = F,                      
  EXPFSPSI = F,                    
  EXPFSFCT = F,                    
  IMPFSN = F,                      
  EXPTOTAL = T,                    
  IMPTOTAL = F,                    
  IMPREFRACTION = F,               
  IMPFREQSHIFT = F,                
  IMPSOURCE = F,                   
  SETUP_APPLY_WLV = F,             
  SOLVERTHR_SETUP=1E-14,           
  CRIT_DEP_SETUP=0.1,              
  JGS_USE_JACOBI = T,              
  JGS_BLOCK_GAUSS_SEIDEL = T,      
  JGS_TERMINATE_MAXITER = T,       
  JGS_MAXITER = 1000,              
  JGS_TERMINATE_NORM = F,          
  JGS_TERMINATE_DIFFERENCE = T,    
  JGS_DIFF_THR = 1.E-8,            
  JGS_PMIN = 3.0,                  
  JGS_LIMITER = F,                 
  JGS_NORM_THR = 1.E-6 /


Could you advise whether I'm using a valid &UNST namelist block to turn on the block-explicit scheme? If not, what settings would you suggest (most stable configuration)?
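For reference, my current reading of the key flags above is sketched below; this is my own interpretation of the UNST namelist, so please correct anything that is off:

  ! My interpretation of the settings in ww3_grid_b.inp (to be confirmed):
  !   EXPTOTAL = T                           -> select the block-explicit solver
  !   IMPTOTAL, IMPREFRACTION,
  !   IMPFREQSHIFT, IMPSOURCE = F            -> all implicit pieces disabled
  !   EXPFSN, EXPFSPSI, EXPFSFCT, IMPFSN = F -> the per-scheme fluctuation-
  !                                             splitting options (N, PSI, FCT)
  !                                             are off
  !   JGS_*                                  -> iterative-solver controls, which
  !                                             I understand only matter when an
  !                                             implicit solve is active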

aliabdolali commented 1 year ago

@MatthewMasarik-NOAA what is the reason to try ww3_tp2.21 instead of ww3_ufs1.1/unstr? When we made tp2.21, our aim was to add unstructured capabilities for global applications with the obstruction option, which is beyond the scope of the block-explicit work. That setup is not appropriate for applications similar to GFS (which the block-explicit scheme was implemented for). We would need a deeper understanding of your setup; I'd recommend testing ww3_ufs1.1/unstr for your purpose. When I was designing the GFSv17 path towards utilizing unstructured meshes, I mentioned to EMC management (Vijay and Avichal) that the mesh is key. Unstructured meshes are far more complicated than curvilinear grids, and someone with adequate expertise needs to spend a decent amount of time and analysis on the setup. I hope this is helpful. We can provide more guidance as needed.

MatthewMasarik-NOAA commented 1 year ago

@aliabdolali thanks, that's very helpful. My first assumption was that tp2.21 would be best as a base for the block explicit. I understand from what you said though that ufs1.1/unstr is better for this purpose. I'll try again using that test as a template. Thanks again!

aliabdolali commented 1 year ago

@MatthewMasarik-NOAA glad to provide more info/guidance.