Closed TillRasmussen closed 6 months ago
I will check up on why it fails
I noticed these failures last week with GHActions - it may not be related to your change, but something in the dependencies/environment failing.
Looks like the runs with debugging flags are failing. Might still be an env problem...
I'll have a look at the code mods and debug github actions in the next day or two.
I have run the following test suites before and after the implementation: base_suite, decomp_suite, and gridsys_suite, with and without debug and with both the intel and gnu compilers. All runs without debug pass. Most of the runs with debug pass, except for 24 tests with gbox and boxchan1*. These fail both in the current upstream and in the pull request, so I don't think this is an issue with this pull request. Execution seems to stop when it tries to write the min/max of lon/lat at line 2579 of ice_grid.F90 in the version in the pull request.
@TillRasmussen, can you provide the list of the 24 tests that are failing when you force debug on.
The github actions failures are unrelated. github actions is throwing a floating point exception with no traceback or other information for a couple of the debug tests. I have not been able to duplicate elsewhere yet. Looks like our github actions testing recently switched automatically from MacOS 12.6.9 to 12.7.2. Clang changed from 15.0.7 to 16.0.6. gfortran is still using 12.3.0. I'm sure this is related. Frustrating that we constantly have to battle this stuff. I'll see if I can revert back to the prior version.
Error message in freya_intel_smoke_gbox80_8x1_boxchan1n_gridc_debug.grid_testfull_debug:

```
forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source
cice               00000000023171EE  Unknown               Unknown  Unknown
cice               0000000001CBF2A0  Unknown               Unknown  Unknown
cice               0000000000ED8606  ice_grid_mp_nelat        2579  ice_grid.F90
cice               0000000000E706C2  ice_grid_mpinit           668  ice_grid.F90
cice               000000000040A993  cice_initmod_mp_c         130  CICE_InitMod.F90
cice               000000000040A70B  cice_initmod_mp_c          54  CICE_InitMod.F90
cice               000000000040A391  MAIN__                     43  CICE.F90
cice               000000000040A31E  Unknown               Unknown  Unknown
cice               00000000023E48F9  Unknown               Unknown  Unknown
cice               000000000040A209  Unknown               Unknown  Unknown
_pmiu_daemon(SIGCHLD): [NID 00134] [c0-0c2s1n2] [Wed Dec 20 13:11:46 2023] PE RANK 0 exit signal Aborted
[NID 00134] 2023-12-20 13:11:46 Apid 84897883: initiated application termination
Application 84897883 exit codes: 134
Application 84897883 resources: utime ~0s, stime ~0s, Rss ~14372, inblocks ~0, outblocks ~0
```
@TillRasmussen, not the error message, just the list of the 24 tests that failed. Like freya_intel_smoke_gbox80_8x1_boxchan1n_gridc_debug
I'd like to add some tests to the test suite and fix the error.
See the link to errorlog.log in the top of the same comment
@TillRasmussen, perfect, thanks.
@TillRasmussen, you should be able to pull the latest from the current trunk to fix github actions if you like (not necessarily required as long as we tested the latest code elsewhere). I also fixed the problem you found with the single channel debug run and updated the test suite to include that test.
Separately, I ran a full test suite of these modifications on Derecho with the latest trunk code and it all looks fine, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#28cb2baff01e4338e73be5c07721d2bc99f682e2.
I still want to review the code a bit more. But first, there are several places where code is commented out, should that be cleaned up?
Just realized my testing did NOT include the newgrid mods. Am retesting now.
Everything is bit for bit in my tests after Tony fixed the single channel bug. I would like to remove the code that has been commented out, but I am not sure whether it should be kept for legacy reasons.
* line 203 in ice_dyn_shared. Why is this there? If it's a future placeholder for the 1d implementation, let's remove this commented-out line and add the 1d stuff at a later time cleanly. And fix the indentation of lines 204-210.
You are right, it should not be there; it was a leftover from a test with 1d. I decided to add it later. The indent will be fixed.
* in ice_dyn_shared, should we move the dxhy, dyhx, cyp, cxp, cym, cxm initialization to the big "block loop" above with the halo update outside. And do we want/need to initialize those arrays to c0 first (the halo values are uninitialized)? If not, lines 367-406 need to be indented properly.
We can add the initialization, but it was not initialized in the original version either. As a general rule, I don't think we should initialize to c0 from 1 to nx_block/ny_block; I would rather do it from ilo-nghost to ihi+nghost. For dxhy and dyhx a halo update is applied, so the halo zone gets a value. I would think the right thing to do is to run the loop from ilo to ihi+nghost and jlo to jhi+nghost and skip the halo update afterwards. This assumes that HTE and HTN exist on the halo, and it may cause the result to not be bit for bit. The four other parameters are calculated where they are used, according to the T grid.
* adjust indentation on the allocate at lines 215-220 in ice_dyn_evp to be consistent with allocate statements above. Also, the ratiod* variables are not initialized on the halo, is that OK? Again, should those arrays be set globally to c0 before being filled with valid values in ilo:ihi, jlo:jhi?
The indent is fixed in the next push. There was no initialization of these in the original code; I can add it.
* The new "use" statements in ice_dyn_vp at lines 2746 and 3148 are not needed, remove them.
They are needed because they have been removed from the module declaration. If they remain there, then I think we should remove them from the subroutine arguments (e.g., matvec).
* In ice_grid in subroutine latlongrid, the initialization of dxhy, dyhx, cxp, cyp, cxm, cym is removed. These values will now be set in ice_dyn_shared and end up with different values. I assume this is OK since they are presumably not important (latlongrid is NOT used with dynamics), but it's an option that is NOT tested in the standalone version of the code, a risk I was concerned about in making these changes.
They are set to a large number in latlongrid, which is only called when the cpp flag CESMCOUPLED is true. That is why I marked them with Till, and I would like @dabail10 to look at it, as I am not sure if or when this is used. It should not be used within the standalone version, and I assume it is not used when it is set to a value like this.
- An important question is whether xav, yav, xxav, and yyav should be scalars or arrays. There are several points. First, the current code suggests there may be cases where these variables vary in space (even though they don't currently). Is that likely to be implemented? Second, making them scalars requires that "limited_gradient" be overloaded with a completely redundant subroutine, just changing two arguments from arrays to scalars. This seems like NOT a great outcome. Third, somewhat associated with the first point, are the extra arrays xyav, xxxav, etc., which are now set to c0 and largely commented out, ever going to be used/needed? Should we keep that code around? I favor the following, if it makes sense:
Whether or not all of these variables will be used is something that others have to judge. This is the reason why I did not remove all the code that has been commented out. xxav and yyav have no limitations when they are scalars. xav and yav are arguments to the limited_gradient subroutine; however, this routine is also called with other arguments that vary in space. I made a choice with an interface. I am happy to change it back and make xav and yav 2d arrays. One thing I would like to do is to remove the calculations where xav and yav are used everywhere other than in the limited_gradient routine, as they add terms that are multiplied by c0. This also replies to the comment below.
* keep xav, yav, xxav, yyav arrays and have just a single version of limited_gradient. I understand this is a little inefficient for memory, but the cost is small versus the tradeoff of having two identical subroutines to maintain. Alternatively, xav, yav, xxav, yyav could be scalars, but copied into temporary arrays before calling limited_gradient. Memory use would be similar (although potentially dynamic), but I like that approach less than just making them arrays from the start
* remove xyav, xxxav, xxyav, xyyav, yyyav completely unless we anticipate them being useful in the next 5 years.
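To make the single-implementation option concrete, here is a minimal, self-contained sketch (the routine and argument names are illustrative only, not the actual CICE limited_gradient): one array-based routine serves both the constant and the spatially varying cases, so no overloaded interface with a redundant scalar twin is needed; a "scalar" caller simply broadcasts the value into an array first.

```fortran
! Hypothetical sketch, NOT CICE code: a single array-based routine
! avoids maintaining two nearly identical overloads.
module grad_demo
   implicit none
contains
   subroutine limited_gradient_demo(phi, xav, gx)
      ! placeholder body; the real limiter computes a monotone gradient
      real, intent(in)  :: phi(:,:), xav(:,:)
      real, intent(out) :: gx(:,:)
      gx = phi - xav
   end subroutine limited_gradient_demo
end module grad_demo

program demo
   use grad_demo
   implicit none
   real :: phi(4,4), xav(4,4), gx(4,4)
   phi = 1.0
   xav = 0.0          ! a spatially constant field stored as an array
   call limited_gradient_demo(phi, xav, gx)
   print *, maxval(gx)
end program demo
```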
I agree that the above variables should be removed. But I think that others should make the call. This also relates to your comment below.
- Finally, I agree that commented out code should be removed.
Thanks @TillRasmussen. Just a quick followup.
You are right that we need the use statements in ice_dyn_vp, my checking before must have been done improperly.
Traditionally, we initialize from 1:nx_block, 1:ny_block. By definition, this is identical to ilo-nghost:ihi+nghost, and I see no reason to change our current approach or make it more complicated. If an array has a halo update after ilo:ihi, jlo:jhi is initialized, then I think it's OK not to initialize the halo before the halo update. But there are open questions about whether we should initialize arrays when we allocate them or whether every memory location should be initialized in the code even if it's not used. I don't think we have a formal standard at this time.
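As a schematic sketch of that convention (not a drop-in patch; the block-bounds handling is elided and the dxhy formula is only indicative of the existing ice_dyn_shared code):

```fortran
! Sketch of the usual CICE convention. By construction,
! 1:nx_block equals ilo-nghost:ihi+nghost (likewise for j).
dxhy(:,:,:) = c0                    ! optional: define every location
do iblk = 1, nblocks
   ! block bounds (ilo,ihi,jlo,jhi) would be fetched here
   do j = jlo, jhi
      do i = ilo, ihi
         dxhy(i,j,iblk) = p5*(HTE(i,j,iblk) - HTE(i-1,j,iblk))
      enddo
   enddo
enddo
! the halo update fills the ghost cells, so pre-initializing the halo
! is not strictly required for correctness
call ice_HaloUpdate(dxhy, halo_info, field_loc_center, field_type_vector)
```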
The latlongrid should be OK as you have it implemented, just pointing out that it is changing.
I would like to hear others' thoughts about the xxav arrays. Could they be spatially varying at some point? Should we get rid of the arrays we're not using right now? @eclare108213? I still strongly favor having a single implementation of limited_gradient regardless.
New version of ice transport remap. Still bit for bit. Removed the limited_gradient interface and removed the multiplications with xav and yav (commented out, as they are equal to zero). They are only present as 2d arrays in the call to limited_gradient and are now local variables in the construct_fields subroutine.
We need to clarify a plan for xav, yav, xxav, yyav, etc. I'm not sure about making xxav and yyav parameters and making xav and yav local arrays in construct_fields is the way to go. Lets formulate the long term plan then decide where the fields should live and how they should be initialized. @eclare108213? @whlipscomb?
To be more specific, this block in subroutine init_remap defines (and comments out) a bunch of arrays that could more easily be treated as scalars and/or removed from the code (including in ice_grid, where they are defined but also commented out). There are some comments about the possibility that the arrays could be spatially varying and the need for additional terms for non-rectangular grids, but that has never been implemented. The question is: should we leave all this in place as it is now, or can we completely remove the arrays that are not used (like xxxav, xxyav, etc.) and make xav, yav, xxav, yyav scalar parameters if that makes the code cleaner? I'm not advocating one way or the other myself, but this PR basically does that and I just want to make sure it's OK to do so.
```fortran
!$OMP PARALLEL DO PRIVATE(iblk,i,j) SCHEDULE(runtime)
do iblk = 1, nblocks
   do j = 1, ny_block
      do i = 1, nx_block
         xav(i,j,iblk) = c0
         yav(i,j,iblk) = c0
!!! These formulas would be used on a rectangular grid
!!! with dimensions (dxT, dyT):
!!!      xxav(i,j,iblk) = dxT(i,j,iblk)**2 / c12
!!!      yyav(i,j,iblk) = dyT(i,j,iblk)**2 / c12
         xxav(i,j,iblk) = c1/c12
         yyav(i,j,iblk) = c1/c12
!        xyav(i,j,iblk) = c0
!        xxxav(i,j,iblk) = c0
!        xxyav(i,j,iblk) = c0
!        xyyav(i,j,iblk) = c0
!        yyyav(i,j,iblk) = c0
      enddo
   enddo
enddo
!$OMP END PARALLEL DO
```
Those arrays (and comments) have been in the code, commented out, for a LONG time. I think it's fine to clean it all up and get rid of them completely. @whlipscomb is there any conceivable reason why we might need them anytime soon?
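For reference, if the cleanup goes ahead, the init_remap block could reduce to something like the following (a sketch of the direction this PR takes; the exact declarations and their placement may differ):

```fortran
! Sketch only: xav/yav are identically zero here, so they and the
! terms they multiply can be dropped, while xxav/yyav become scalar
! parameters. Valid as long as the rectangular-cell assumption holds.
real (kind=dbl_kind), parameter :: &
   xxav = c1/c12, &   ! mean of x**2 over a unit cell
   yyav = c1/c12      ! mean of y**2 over a unit cell
```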
This PR now has all the "extra" grid metrics removed from the dynamics computations. @TillRasmussen, have you checked whether the compilers produce bit-for-bit results? There are enough code changes that it may not. I will fire off a test suite today as well and will post the results. Otherwise, I think this could be ready for merging. I encourage others to review the latest updates, maybe we can merge this week?
It is bit for bit. I have run the following suites: base_suite, decomp_suite, and gridsys_suite, with gnu and intel, and with and without debug.
Independent test results look good, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#f848367767b492ec41ba176e091818aef83ce6ea