JessicaMeixner-NOAA opened this issue 3 years ago
Could it be related to not using INIT_GET_ISEA
here:
https://github.com/NOAA-EMC/WW3/blob/c6408fafde123ce37728824d1baa0d8471306e49/model/src/w3iorsmd.F90#L795
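For reference, here is a toy, self-contained illustration of the kind of local-to-global sea-point mapping INIT_GET_ISEA provides under the PDLIB domain decomposition (the iplg values and the get_isea stand-in below are made up for illustration and are not WW3 source). Without such a mapping, each rank would index the restart arrays by its local counter instead of the global sea-point index.

! Toy demo: under a PDLIB-style decomposition each rank owns a subset of
! sea points, and a local index jsea must be mapped to the global index
! isea before indexing global arrays such as the restart record.
program local_to_global_demo
  implicit none
  integer, parameter :: nseal = 4                 ! locally owned sea points
  integer :: iplg(nseal) = [3, 7, 11, 15]         ! local -> global map (made up)
  integer :: jsea, isea

  do jsea = 1, nseal
     call get_isea(isea, jsea)
     print '(a,i2,a,i2)', 'local jsea=', jsea, '  ->  global isea=', isea
  end do

contains

  subroutine get_isea(isea, jsea)
    ! Stand-in for the real INIT_GET_ISEA: return the global sea-point
    ! index for a local one.  Using isea = jsea instead would write each
    ! rank's fields to the wrong global slots of the restart file.
    integer, intent(out) :: isea
    integer, intent(in)  :: jsea
    isea = iplg(jsea)
  end subroutine get_isea

end program local_to_global_demo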
Yes, it seems a good candidate. :)
Many thanks for looking at this and pointing out this issue. Good catch!
This has been solved for the block-explicit scheme (see issue #961 for related issues). Issues with restarts not being b4b remain.
Other schemes likely still have issues, and some schemes are known not to be b4b. A reassessment is needed to determine what it would take to achieve b4b behavior for as many schemes as possible.
Counts of non-identical files reported by matrix.comp when run with 24 vs. 36 MPI tasks:
ww3_tp2.17/./work_mb (4 files differ)
ww3_tp2.17/./work_mc (7 files differ)
ww3_tp2.17/./work_c (9 files differ)
ww3_tp2.6/./work_pdlib (13 files differ)
Hi @MatthewMasarik-NOAA, this is expected (the tests you named here all use the implicit scheme). The implicit schemes in the model cannot be b4b identical with different numbers of CPUs. This is not specific to WW3; many other models that use implicit schemes behave the same way. I gave a lecture on this topic a while ago. It is the nature of the scheme, and that is the reason we designed the GFSv17 case with the block-explicit scheme, which is b4b by nature. Hope this is helpful. @aronroland and @thesser1 were part of the conversation.
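As a side note on why this happens: floating-point addition is not associative, so the order in which per-rank partial results are accumulated changes the last bits of the answer, and implicit/iterative solvers accumulate very many such sums. A minimal, self-contained sketch is below; the 36 chunks simply mimic 36 ranks, and all names and values are made up for illustration.

! Demo: the same numbers summed in one pass vs. in 36 per-"rank" partial
! sums that are then combined can differ in the last bits, because
! floating-point addition is not associative.
program fp_order_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000
  real(dp) :: x(n), s_forward, s_chunked, partial
  integer :: i, j, nchunk

  ! Fill with values of very different magnitudes.
  do i = 1, n
     x(i) = 1.0_dp / real(i, dp)**2
  end do

  ! "Serial" order: one pass front to back.
  s_forward = 0.0_dp
  do i = 1, n
     s_forward = s_forward + x(i)
  end do

  ! "Decomposed" order: 36 partial sums combined afterwards, mimicking
  ! per-rank partial results followed by a reduction.
  nchunk = 36
  s_chunked = 0.0_dp
  do j = 1, nchunk
     partial = 0.0_dp
     do i = ((j-1)*n)/nchunk + 1, (j*n)/nchunk
        partial = partial + x(i)
     end do
     s_chunked = s_chunked + partial
  end do

  print *, 'forward sum  = ', s_forward
  print *, 'chunked sum  = ', s_chunked
  print *, 'difference   = ', s_forward - s_chunked
end program fp_order_demo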
Hi @aliabdolali, this is good to know about the implicit schemes, thanks for mentioning it. The end goal of this issue is to address restart reproducibility in the unstructured cases, in order to satisfy ORTs for coupling in UFS. After an offline conversation with @JessicaMeixner-NOAA, the restart reproducibility issues we have seen also affect the explicit unstructured scheme. Resolving those is the first priority here. The lecture you mention sounds really interesting; if you have slides you're willing to share, please let me know here or offline.
@MatthewMasarik-NOAA I'm confused. We have both a restart and an mpi reproducibility test in the current UFS RTs. Is there a new issue here?
The restart file itself is not reproducible, or has that issue resolved itself?
@DeniseWorthen sorry, I didn't tag you in the reply. Matt is going to be working on resolving the issue of the restart file itself not being reproducible, unless that problem was already solved and I was unaware, in which case that's great!
@JessicaMeixner-NOAA OK, thanks. As far as I know the restart file for WW3 is often not reproducible in the UFS RTs---but that is true for both the structured and unstructured cases.
Hi @aliabdolali @aronroland I've been trying to run the PDLIB block explicit solver and have been hitting seg faults after integration starts.
The setup I'm using is basically a copy of ww3_tp2.21, calling this run test:
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -w work_b -g b -f -p srun -n 24 ../model ww3_tp2.21
The main changes are in ww3_grid_b.inp (I've tried both .nml and .inp), which I believe should invoke the block-explicit scheme and turn off any implicit schemes:
&UNST
UGBCCFL = T,
UGOBCAUTO = F,
UGOBCDEPTH= -10.,
EXPFSN = F,
EXPFSPSI = F,
EXPFSFCT = F,
IMPFSN = F,
EXPTOTAL = T,
IMPTOTAL = F,
IMPREFRACTION = F,
IMPFREQSHIFT = F,
IMPSOURCE = F,
SETUP_APPLY_WLV = F,
SOLVERTHR_SETUP=1E-14,
CRIT_DEP_SETUP=0.1,
JGS_USE_JACOBI = T,
JGS_BLOCK_GAUSS_SEIDEL = T,
JGS_TERMINATE_MAXITER = T,
JGS_MAXITER = 1000,
JGS_TERMINATE_NORM = F,
JGS_TERMINATE_DIFFERENCE = T,
JGS_DIFF_THR = 1.E-8,
JGS_PMIN = 3.0,
JGS_LIMITER = F,
JGS_NORM_THR = 1.E-6 /
Could you advise whether I'm using a valid &UNST namelist block to turn on block explicit? If not, what settings would you suggest (most stable configuration)?
@MatthewMasarik-NOAA what is the reason to try ww3_tp2.21 instead of ww3_ufs1.1/unstr? When we made tp2.21, our aim was to add unstructured capabilities for global applications with the obstruction option, which is beyond the scope of the block-explicit work. That setup is not appropriate for applications similar to GFS (which block explicit was implemented for). We would need a deeper understanding of your setup; I'd recommend testing ww3_ufs1.1/unstr for your purpose. When I was designing the GFSv17 path toward utilizing unstructured meshes, I mentioned to EMC management (Vijay and Avichal) that the mesh is key. Unstructured meshes are far more complicated than curvilinear grids, and someone with adequate expertise needs to spend a decent amount of time and analysis on the setup. I hope this is helpful. We can provide more guidance as needed.
@aliabdolali thanks, that's very helpful. My first assumption was that tp2.21 would be best as a base for the block explicit. I understand from what you said though that ufs1.1/unstr is better for this purpose. I'll try again using that test as a template. Thanks again!
@MatthewMasarik-NOAA glad to provide more info/guidance.
When running the following regression tests with a different number of MPI tasks and then comparing with matrix.base, the regression tests do not reproduce: