CICE-Consortium / CICE

Development repository for the CICE sea-ice model
Other
57 stars 131 forks source link

Update pio and netcdf error checks #927

Closed apcraig closed 8 months ago

apcraig commented 8 months ago

PR checklist

Closes #858

apcraig commented 8 months ago

I will update the test results when complete.

Recommend reviewing the changes with "ignore whitespace" turned on. I ended up fixing a bunch of indentation issues (sorry about that, but it always feels good). To do that, go to "Files Changed" and under settings click the box.

This represents the first phase of the IO upgrade, https://docs.google.com/document/d/1yNK1MV2v51MF_P_qz0Mva34ZxbGSwoPnNvXkm5Q_Hs0. See also #862, #914

apcraig commented 8 months ago

Full test suite run on Derecho cray, gnu, intel including io_suite, results look good, https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#dc14ce98d71a742d9132d7ed8b7ecad3a39bd0a1

anton-seaice commented 8 months ago

Thanks Craig - well done. Did you mean to update the docs? You said yes in the description - but there are no changes included. (Possibly USE_PIO1 should be included, although maybe its not 'officially' supported as it looks quite old).

We could add a note about PIO2 vs PIO1 or supported/tested PIO versions in https://cice-consortium-cice.readthedocs.io/en/main/user_guide/ug_running.html#software-requirements ?

anton-seaice commented 8 months ago

I think we can check for the specific error code here, and then use ice_check_nc if it is a different error code?:

https://github.com/CICE-Consortium/CICE/blob/dc14ce98d71a742d9132d7ed8b7ecad3a39bd0a1/cicecore/cicedyn/infrastructure/ice_grid.F90#L951

(Possibly NC_ENOTVAR)

anton-seaice commented 8 months ago

See suggestions here : https://github.com/anton-seaice/CICE/commit/d1eece938f2a21f89a58617091815e665a7a678f , it didnt let me merge with this PR

dabail10 commented 8 months ago

I will try to take a look tomorrow. Today is a holiday for some of us in the USA (MLK day).

apcraig commented 8 months ago

@anton-seaice, thanks for the review. Lots of good comments. I have updated the branch. Will address your comments here.

anton-seaice commented 8 months ago

@anton-seaice, thanks for the review. Lots of good comments. I have updated the branch. Will address your comments here.

  • I cleaned up some long line lengths and some ' ERROR: ' syntax and error messages as suggested, just in the io/infrastructure routines though. We try to keep things consistent throughout as much as we can, but it's always a work in progress (like indentation).

  • I switched from colon to comma in the netcdf and pio error output. That's a good idea. I think it's OK to have an explicit message defined by CICE like "ERROR: ..." even if netcdf/pio will write "ERROR ...". And I'm trying to add file and line numbers on the abort calls. On an error, I think it's OK to lean towards extra and even redundant information in the error message especially if there is any chance it's easier to debug.

Thanks

  • I'm not inclined to use things like NC_ENOTATT unless they are absolutely necessary. I think it adds a bit of unnecessary risk and complexity. We only have a few places where we don't abort on an error and in those cases, I think it's fine to just check that there was a general error and take the next step, rather than differentiate for specific netcdf/pio errors. Willing to reconsider if others feel differently.

I guess the risk only checking for "NOERR" is that we mask some other error than the one we are expecting. e.g. a malformed restart file? Its a pretty minor risk because its likely the error would also be in some other part of the file than just say 'nmonth'.

  • You suggested a couple changes, to remove a redundant set_pioerrorhandling and to remove some output when opening a file, both in ice_pio.F90. I think both of those are OK as is. The pioerrorhandling redundancy is for CESMCOUPLED and I'm happy to let them manage this setting. The writing of the file opening is probably somewhat useful overall.

There are several sections of code with these

call pio_seterrorhandling(ice_pio_subsystem, PIO_RETURN_ERROR)
call pio_seterrorhandling(ice_pio_subsystem, PIO_INTERNAL_ERROR)

Around a section of code where we use the pio library. So it would need to be changed in multiple places if say, a group wanted to change it to Internal or Broadcast.

The way it is written, where the flag is changed on/off in multiple times in code is probably confusing to follow for developers. But there are some calls to the pio library which don't return a status code, so it needs to be PIO_INTERNAL_ERROR for those calls.

  • Regarding adding parallelio to the conda_macos env. Did you get this to work? Everything works fine for me when using netcdf (with the conda parallelio + -lpiof, but if I setup a case with pio2 on my mac (at home), it dumps core very early. I think it's OK to include parallelio for conda_macos as it doesn't seem to cause problems with netcdf, but if it doesn't work with pio, maybe we shouldn't include it? Along those lines, I modified the Macros file for conda_macos to turn on the -lpiof -lpioc only when pio2 is chosen just to make it even less likely that the parallelio will cause a problem later. But am curious what your experience is with parallelio with conda_macos.

Yes - this basically "just works":

$ conda env update -n cice --file=configuration/scripts/machines/environment.yml
$ conda activate cice
$ ./cice.setup --test smoke -m conda -e macos --set iopio2 --testid ioerrchk
$  cd conda_macos_smoke_gx3_4x1_iopio2.ioerrchk
$  ./cice.build
$  ./cice.run

These are the versions it installs: conda_cice_versions.txt. I set it up a while ago, but don't remember having to do anything weird.

anton-seaice commented 8 months ago

A few more ERROR messages to tidyup:

https://github.com/anton-seaice/CICE/commit/40a56b375aef0f6fcc7bf0771db8b9e4329a4493

apcraig commented 8 months ago

Thanks @anton-seaice, I cherry-picked your latest modifications. There are always more things to fix like this. Also, I meant to mention that while you can't push to my branches, you can PR to them. Generally, that's what we do if we're collaborating on development. Someone sets up a branch and others PR to it. Feel free to do that if it's useful at times.

I have not worked to improve

call pio_seterrorhandling(ice_pio_subsystem, PIO_RETURN_ERROR)
call pio_seterrorhandling(ice_pio_subsystem, PIO_INTERNAL_ERROR)

I think what we want is for the default to be "internal error" and for us to switch to "return error" in the sections where we actually check errors. In practice, we could probably just set "internal error" everywhere and let pio abort for us not worry about return codes too much. Then we would use "return error" ONLY when we want to ignore an error. Anyway, there are various appropriate strategies. But now that we've added all the error checking in our code, we have more options. The one important issue is when pio doesn't return error codes. I think we set "internal error" around those, but we certainly could look at all that more carefully.

apcraig commented 8 months ago

Also, glad to hear the pio is working for you with conda macos. I'm using an older version of the OS (and maybe hardware), so that might be part of my problem. If it works for you, I'm happy to leave it in, especially as it doesn't seem to impact the netcdf option even if the pio option doesn't work all the time. It's kind of cool there is conda pio support now, thanks for adding it.

DeniseWorthen commented 8 months ago

@apcraig We need to make a change in the cmeps driver related to #921 in order to compile and test this.

diff --git a/cicecore/drivers/nuopc/cmeps/ice_mesh_mod.F90 b/cicecore/drivers/nuopc/cmeps/ice_mesh_mod.F90
index 9493add..0c430f6 100644
--- a/cicecore/drivers/nuopc/cmeps/ice_mesh_mod.F90
+++ b/cicecore/drivers/nuopc/cmeps/ice_mesh_mod.F90
@@ -437,7 +437,8 @@ subroutine ice_mesh_init_tlon_tlat_area_hm()

     use ice_grid      , only : tlon, tlat, hm, tarea, ULON, ULAT, HTN, HTE, ANGLE, ANGLET
     use ice_grid      , only : uarea, uarear, tarear!, tinyarea
-    use ice_grid      , only : dxT, dyT, dxU, dyU, dyhx, dxhy, cyp, cxp, cym, cxm
+    use ice_grid      , only : dxT, dyT, dxU, dyU
+    use ice_dyn_shared, only : dxhy, dyhx, cxp, cyp, cxm, cym
     use ice_grid      , only : makemask
     use ice_boundary  , only : ice_HaloUpdate
     use ice_domain    , only : blocks_ice, nblocks, halo_info, distrb_info
apcraig commented 8 months ago

Thanks @DeniseWorthen. I have updated the calendar error messages. I also update the nuopc driver mesh file. The mesh file changes are related to a prior code mod, but happy to include it here. We try to keep the driver code up to date with changes to the standalone model, but don't always succeed. Let us know anytime if we can help resolve anything like that when it comes up.

anton-seaice commented 8 months ago

I think what we want is for the default to be "internal error" and for us to switch to "return error" in the sections where we actually check errors.

Ok - this is what I have done in this change.

anton-seaice commented 8 months ago

Thanks @DeniseWorthen. I have updated the calendar error messages. I also update the nuopc driver mesh file. The mesh file changes are related to a prior code mod, but happy to include it here. We try to keep the driver code up to date with changes to the standalone model, but don't always succeed. Let us know anytime if we can help resolve anything like that when it comes up.

I compiled, including these changes, using the nuopc driver and CESMCOUPLED set, and found one typo in ice_pio - sent you a PR Tony.

apcraig commented 8 months ago

Any additional review comments. If not, I'll run a final test suite and then maybe we can merge before the weekend? Thanks.

phil-blain commented 8 months ago

Hi Tony, I haven't had the time yet to review this branch. I'd like to take a look. I'll try to do it this afternoon.

apcraig commented 8 months ago

Thanks @phil-blain, I'm happy to delay the merge to next week too. Just looking to start the second phase of the IO updates. No rush though.

apcraig commented 8 months ago

@phil-blain. Thanks for your review. I have made the updates you suggest. Let me also add some discussion.

The question of whether to use

 status = nf90_put_att(ncid,nf90_global,'io_flavor','io_netcdf')
 call ice_check_nc(status, subname// ' ERROR: global attribute io_flavor', file=__FILE__, line=__LINE__)

or

call ice_check_nc(nf90_put_att(ncid,nf90_global,'io_flavor','io_netcdf'), subname// ' ERROR: global attribute io_flavor', file=__FILE__, line=__LINE__)

is a good one. I thought about it before updating the implementation. The pio code uses the latter and the netcdf code uses the former. For me, I prefer the former as I find it much more readable, especially if the netcdf has to be debugged. I agree that either way works. Maybe a bigger question is whether the pio and netcdf codes should use the same strategy. At this point, I think we can leave as is unless someone feels strongly we should use the same error checking approach in pio and netcdf.

With respect to maintaining the error checks, this is a bit of a challenge. Checking the icepack abort code consistently is another similar "requirement" as is use of "subname" and so forth. We just have to try to remember to review these things. Also, periodically, if I see an problem, I go through the code quickly and update as needed. Not ideal, but I don't know if we can automate.

I added a close #858 to the PR (in the PR description above).

I think I've addressed everyone's comments up to this point, feel free to review again and/or let me know if I've missed something.

apcraig commented 8 months ago

Also, I changed the missingvalue=1.234e30 to spval_dbl. I was thinking we wanted to avoid spval_dbl as the default because it might actually mean something in the fields that we might want to avoid (like maybe land vs water), but decided, after many questions, that it makes sense to use spval_dbl there.

anton-seaice commented 8 months ago

With respect to maintaining the error checks, this is a bit of a challenge. Checking the icepack abort code consistently is another similar "requirement" as is use of "subname" and so forth. We just have to try to remember to review these things. Also, periodically, if I see an problem, I go through the code quickly and update as needed. Not ideal, but I don't know if we can automate.

We could add a check item to the PR checklist? (i.e. do the changes follow with coding standard outlined in the documentation, including defining subname for each subrountine and function and calling icepack_warnings_flush after icepack calls?)

Also - there might be a bot that could respond to each PR with a reviever checklist (which would have the items from the coding standard on it).