Update IO formats and add new IO namelist controls

apcraig commented 10 months ago

PR checklist

[X] Short (1 sentence) summary of your PR: Update IO formats and add new IO namelist controls
[X] Developer(s): apcraig, anton-s
[X] Suggest PR reviewers from list in the column to the right.
[X] Please copy the PR test results link or provide a summary of testing completed below. Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.
How much do the PR code changes differ from the unmodified code?
- [X] bit for bit
- [ ] different at roundoff level
- [ ] more substantial
Does this PR create or have dependencies on Icepack or any other models?
- [ ] Yes
- [X] No
Does this PR update the Icepack submodule? If so, the Icepack submodule must point to a hash on Icepack's main branch.
- [ ] Yes
- [X] No
Does this PR add any new test cases?
- [X] Yes
- [ ] No
Is the documentation being updated? ("Documentation" includes information on the wiki or in the .rst files from doc/source/, which are used to create the online technical docs at https://readthedocs.org/projects/cice-consortium-cice/. A test build of the technical docs will be performed as part of the PR testing.)
- [X] Yes
- [ ] No, does the documentation need to be updated at a later time?
  - [ ] Yes
  - [ ] No
[X] Please provide any additional information or relevant details below:

Closes #914 Closes #862

This provides new features for CICE IO both thru netCDF and PIO. New namelist are added to control history and restart format, hdf5 compression and chunking, the PIO rearranger, and PIO IO task control. Separate controls are provided for history and restart files. The namelist changes are for

_history_format, restart_format history_rearranger, restart_rearranger history_iotasks, history_root, history_stride, restart_iotasks, restart_root, and restart_stride history_chunksize, history_deflate, restart_chunksize, restartdeflate.

In particular,

Update restart_format and history_format options to 'cdf1', 'cdf2', 'cdf5', 'hdf5', 'pnetcdf1', 'pnetcdf2', 'pnetcdf5', 'default'. The old options, 'default', 'pio_netcdf', and 'pio_pnetcdf' are still supported and backwards compatible with lcdf64, but are deprecated and no longer documented. The old options and old namelist lcdf64 are covered by the new options. Support of the old options should be removed in the future. Note that some problems were discovered when opening files with hdf5 format but reading non-hdf5 files with a spack built PIO/netCDF. As a result, the format specified for the restart read is always 'cdf1' which provides flexibility and robustness across software installs, although it may result in serial reads of hdf5 files when a parallel read could be done.
Deprecate lcdf64 namelist. This namelist is no longer needed and is covered by the new restart_format and history_format options. The namelist still exists and is backwards compatible with the old 'default', 'pio_netcdf', and 'pio_pnetcdf' format options, but is no longer documented. This should be removed in the future.
Add new namelist to control PIO pe/task setup (iotasks, root, stride) for history and restart. These settings control the PIO IO tasks. The root, stride, and iotasks are consistent with the MPI communicator. root=0 is the first MPI task. These control PIO IO performance and are usually a function of things like the IO and node hardware. See PIO for more information. CICE computes PIO iotask, root, and stride defaults for cases where -99 is passed in for some or all of these namelist. Those defaults are somewhat constrained by a bug in PIO, https://github.com/NCAR/ParallelIO/issues/1986. The current implementation avoids the bug by limiting the iotasks for some MPI task counts. This is noted in ice_pio.F90.
Add new namelist to control PIO rearranger (rearranger) for history and restart. Supports 'box', 'subset', and 'default'. These control how PIO rearrangment is carried out. default is equivalent to box and the box generally performs better. See PIO for more information.
Add new namelist to support hdf5 compression and chunking (deflate, chunksize) for history and restart. The deflate controls file compression and is an integer between 0 and 9 where 0 means no compression and 9 is maximum compression. Generally, the higher the number, the slower the IO and the smaller the file, but the optimal setting depends on the contents of the file. Chunksize provides a performance control for the hdf5 parallel writes. It is a 2d array and is associated with the size of the piece of the array written by hdf5. hdf5 can be read and written in parallel, but that depends on how netCDF and PIO are built. Note that prior version of PIO, including PIO1, do not support the hdf5 compression and chunking thru the PIO interface.
Add new namelist settings (set_nml files) and update the io_suite to cover the new IO options. Remove old namelist settings associated with the deprecated format options and the lcdf64 namelist. These deprecated feature are no longer tested.
Update documentation to add new namelist and IO features.
Update the nuopc/cmeps driver code to support the new features.
Update the default ice_in to add the new namelist.
Update the derecho netcdf module to a version that supports hdf5.
Clean up some code formatting (indentation)

apcraig commented 10 months ago

This is a draft. There are still some changes to come. Creating a draft PR to facilitate development, support early review, and allow for quick review of documentation changes.

anton-seaice commented 10 months ago

Thanks Tony - I will go through in detail next week.

One thing I have been thinking about:

I think we should only be calling pio_init once, and then using that set of mpi threads for all restart/history files. I think that is how the library is intended to be used, but haven't found a good way to confirm that.
At the moment, we call pio_init once per file written/read, so as a minimum we should call pio_finalize when we close the file.

anton-seaice commented 10 months ago

I think

https://github.com/apcraig/CICE/blob/d2799fc643a8113f20b8bf9b5daa8239807c9006/configuration/scripts/options/set_nml.iopio2p#L2

history_format = 'pio_netcdf'

needs updating / has a typo and should be pio_pnetcdf

anton-seaice commented 10 months ago

I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So ill need to figure out a better way to call the chunking definition.

apcraig commented 10 months ago

I think

https://github.com/apcraig/CICE/blob/d2799fc643a8113f20b8bf9b5daa8239807c9006/configuration/scripts/options/set_nml.iopio2p#L2

history_format = 'pio_netcdf'

needs updating / has a typo and should be pio_pnetcdf

I noticed this issue too. Our tests weren't quite doing what I expected, but we were still testing pio_pnetcdf via restarts for pio2. Anyway, this file will be removed before this PR is finalized. I'm keeping a few older files temporarily to do regression testing against the older options, but it's just while the PR is evolving.

apcraig commented 10 months ago

I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So ill need to figure out a better way to call the chunking definition.

OK, so the concern is the chunking stuff will abort in PIO if someone is linking to an older version of PIO? A quick workaround might be to have some default chunking size (say -99). If that value is set, then don't use the chunking implementation, use the non-chunking one. And when the chunking call returns with an error, add an extra warning message that this could be caused by an older version of PIO and to avoid the error, set the chunking value to -99. Just an idea. And in that case, I'd set the default to -99 in code and namelist. Make folks have to turn it explicitly.

anton-seaice commented 10 months ago

I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So ill need to figure out a better way to call the chunking definition.

OK, so the concern is the chunking stuff will abort in PIO if someone is linking to an older version of PIO?

It doesn't compile because the routine is not defined.

My suggestion is to import it like use pio_nf, only: pio_def_var_chunking which seems programmatically the same as the new versions where its included through use pio.

I will wrap the chunking and compression with #ifndef USE_PIO1 too.

A quick workaround might be to have some default chunking size (say -99). If that value is set, then don't use the chunking implementation, use the non-chunking one. And when the chunking call returns with an error, add an extra warning message that this could be caused by an older version of PIO and to avoid the error, set the chunking value to -99. Just an idea. And in that case, I'd set the default to -99 in code and namelist. Make folks have to turn it explicitly.

At the moment I've just wrapped the chunking and compression in if _format='hdf5' checks, not quite as comprehensive as setting to -99.

apcraig commented 10 months ago

@anton-seaice, do you think we should apply the chunking only if chunksize>0 in code like

      if (restart_format=='hdf5' .and. size(dims)>1) then
         if (dims(1)==dimid_ni .and. dims(2)==dimid_nj) then
            chunks(1)=restart_chunksize(1)
            chunks(2)=restart_chunksize(2)
            do i = 3, size(dims)
               chunks(i) = 0
            enddo
            status = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, chunksizes=chunks)
            call ice_check_nc(status, subname//' ERROR: chunking var '//trim(vname), file=__FILE__, line=__LINE__)
         endif
      endif

Should that first line be

if (restart_format=='hdf5' .and. size(dims)>1 .and. maxval(restart_chunksize)>0) then

What happens when the chunks are all zeros and you call nf90_def_var_chunking? Is there a cost to calling nf90_def_var_chunking when we don't need to?

anton-seaice commented 10 months ago

What happens when the chunks are all zeros and you call nf90_def_var_chunking? Is there a cost to calling nf90_def_var_chunking when we don't need to?

If chunks are all zeros then it will set the chunksize to the maximum (i.e. the size of the array) unless the variable is more than 4MB, at which point some 'Default' chunksize is used.

If we made the change suggested:

For vars without unlimited variables (i.e coordinates), the storage type will default to _NCCONTIGUOUS (which I guess is the practically the same as the current behaviour of setting storage type to _NCCHUNKED with the chunksize equal to the array size).
For data vars with unlimited variables (i.e. data vars with a time dimension), the storage type will default to _NCCHUNKED anyway.

There might be some edge cases, where 'hdf5' and contiguous storage are needed, or where the default chunking scheme doesn't work appropriately and its not desired to set the chunksizes in the configuration, but these seem obscure enough that we can look at them if needed? Otherwise we would need to make the time variable length 1 (rather than unlimited) or add another namelist option for hdf5 and _NCCONTIGUOUS (or use _chunksize=-99 for contiguous, and _chunksize=0 for default chunksizes) but it would only be meaningful for variables >4MB.

anton-seaice commented 10 months ago

I guess that was a long way of saying, I think what we doing now is OK, but we might get feedback suggesting something different is needed in bespoke configurations. Noting we are not changing any current behaviour, it only affects configurations which move to 'hdf5' output.

apcraig commented 10 months ago

I guess that was a long way of saying, I think what we doing now is OK, but we might get feedback suggesting something different is needed in bespoke configurations. Noting we are not changing any current behaviour, it only affects configurations which move to 'hdf5' output.

Sounds good. Will leave as is. Thanks.

apcraig commented 10 months ago

I think this is ready for final review and merging. I am running a full test suite now.

@anton-seaice, let me know if I've missed anything or if you think we still have a few things to do. Thanks!

@eclare108213 @dabail10 @DeniseWorthen @phil-blain, feel free to have a look, let me know if you have any suggested changes.

This adds support for many new netCDF formats and some new controls for PIO. It also adds compression for hdf5 netcdf output. I've tested compression and it's pretty awesome. Thanks @anton-seaice!

anton-seaice commented 10 months ago

All looks good to me Tony :) After any more review comments I will retest a few combinations.

@dabail10 - I snuck some related changes to the nuopc driver in. It gets a little bit messy because the io_type comes from nuopc, but the chunking and compression settings come from the cice namelist. (If its better to look at the nuopc driver as a seperate change let me know).

dabail10 commented 10 months ago

All looks good to me Tony :) After any more review comments I will retest a few combinations.

@dabail10 - I snuck some related changes to the nuopc driver in. It gets a little bit messy because the io_type comes from nuopc, but the chunking and compression settings come from the cice namelist. (If its better to look at the nuopc driver as a seperate change let me know).

We would definitely prefer to get all PIO related settings from the driver. Since we are specifying the pe layout for the whole system, it knows how many io tasks and so on to assign each component. I would override the namelist settings here.

apcraig commented 10 months ago

I have run a full test suite on derecho and results are as expected. I updated the PR,

Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.

anton-seaice commented 10 months ago

All looks good to me Tony :) After any more review comments I will retest a few combinations. @dabail10 - I snuck some related changes to the nuopc driver in. It gets a little bit messy because the io_type comes from nuopc, but the chunking and compression settings come from the cice namelist. (If its better to look at the nuopc driver as a seperate change let me know).

We would definitely prefer to get all PIO related settings from the driver. Since we are specifying the pe layout for the whole system, it knows how many io tasks and so on to assign each component. I would override the namelist settings here.

OK - The IO tasks etc are unchanged for the nuopc driver.

There are new CICE namelist parameters - history_chunksize, history_compression, restart_chunksize and restart_compression added in this PR. We could add corresponding variables through nuopc to override them if that is neater?

anton-seaice commented 10 months ago

I have run a full test suite on derecho and results are as expected. I updated the PR,

Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.

Ok - sounds like its not something that is feasible/worthwhile for us to address? Is the failure repeatable or a bit random?

apcraig commented 10 months ago

I have run a full test suite on derecho and results are as expected. I updated the PR, Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.

Ok - sounds like its not something that is feasible/worthwhile for us to address? Is the failure repeatable or a bit random?

The failures are repeatable. The differences in the files are random. So some "junk" ends up in the file that does not impact results but is random. You can only see the difference with a cmp or od. If you do ncdiff or cprnc, the data looks fine. I'm reluctant to move away from our simple "cmp" just for the pio hdf5 results at this point. netcdf hdf5 does NOT behave this way, PIO folks are aware of this, and are not concerned. I'm pretty sure this is a PIO thing which we have little control over. I could explore further. I do know that CICE PIO and netcdf handle fill data slightly differently. but I don't think it's related to that. I think we should just live with this for now. I don't expect it will cause problems for users, but we should keep our ears open.

DeniseWorthen commented 9 months ago

@anton-seaice I'm trying to test the new features using the hdf5 option. I'm getting an error, which I believe is the result of the spack-stack not having built PIO w/ netcdf4/hdf5 enabled. The error message I get is

74:  (abort_ice)ABORTED:
174:  (abort_ice) called from ice_pio.F90
174:  (abort_ice) line number          238
174:  (abort_ice) error =
174:  (ice_pio_check)NetCDF: Attempt to use feature that was not turned on when netCD
174:  F was built., (ice_pio_init) ERROR: Failed to open file cice_model.res.nc

The message part that says Attempt to use feature that was not turned on when netCDF was built is coming from PIO itself I believe. But we use netcdf4/hdf5 for the ATM model so we should have netCDF built correctly. Does this message really indicate that PIO itself was not built w/ netcdf4/hdf5 support?

apcraig commented 9 months ago

@anton-seaice I'm trying to test the new features using the hdf5 option. I'm getting an error, which I believe is the result of the spack-stack not having built PIO w/ netcdf4/hdf5 enabled. The error message I get is
74:  (abort_ice)ABORTED:
174:  (abort_ice) called from ice_pio.F90
174:  (abort_ice) line number          238
174:  (abort_ice) error =
174:  (ice_pio_check)NetCDF: Attempt to use feature that was not turned on when netCD
174:  F was built., (ice_pio_init) ERROR: Failed to open file cice_model.res.nc
The message part that says Attempt to use feature that was not turned on when netCDF was built is coming from PIO itself I believe. But we use netcdf4/hdf5 for the ATM model so we should have netCDF built correctly. Does this message really indicate that PIO itself was not built w/ netcdf4/hdf5 support?

So this is happening on a read? And I assume cice_model.res.nc is not hdf5? We shouldn't have to tell PIO the netcdf format when we're reading, I'll have a look at the code again on that. But this would just fail later on a PIO hdf5 write, so it doesn't really solve the problem.

My guess is that the atm and PIO are compiled with and writing thru different version of the netcdf lib. I suspect this is possible, but am not 100% sure. Some questions/ideas. Is PIO being built on the fly? How consistent are the builds for each component? Can you verify what version of netcdf is compiled with the atm and pio libs? If you are using a "netcdf" module, you might try loading the equivalent "netcdf-mpi" module. To get the hdf5 to work on Derecho, I had to do

  module unload netcdf
  module load netcdf-mpi/4.9.2

You can see that in the PR. Just some thoughts. Keep us posted.

DeniseWorthen commented 9 months ago

The file it is reading is netcdf-classic. But, like you, I thought the read should be fine.

We build UWM with a spack-stack on each platform---and the basic build should be the same (but the exact intel or gnu compiler might be different). I'm still learning my way around spack-stack, this is the package dependencies (Jim is the package maintainer).

https://github.com/JCSDA/spack/blob/03ecae6b77b974c7f364fbbca7ad06e2bc1ba8ab/var/spack/repos/builtin/packages/parallelio/package.py#L33-L41

DeniseWorthen commented 9 months ago

Indeed, with

    restart_format = 'cdf1'
    history_format = 'hdf5'

I can write a history file (ncdump -k = netCDF-4), and read a netcdf-classic restart file (ncdump -k = classic). But I cannot read the restart file with format=hdf5.

If I convert the restart using nccopy -k 4 cice_model.res.nc cice_model.res.nc4 (yielding ncdump -k = netCDF-4 classic model), then I can use

    restart_format = 'hdf5'
    history_format = 'hdf5'

anton-seaice commented 9 months ago

The file it is reading is netcdf-classic. But, like you, I thought the read should be fine.

We build UWM with a spack-stack on each platform---and the basic build should be the same (but the exact intel or gnu compiler might be different). I'm still learning my way around spack-stack, this is the package dependencies (Jim is the package maintainer).

https://github.com/JCSDA/spack/blob/03ecae6b77b974c7f364fbbca7ad06e2bc1ba8ab/var/spack/repos/builtin/packages/parallelio/package.py#L33-L41

Hi Denise. I haven't seen this before. I also thought that all the netcdf formats should be backwards compatible for read.

Where is your spack environment file? The "variant" may be changed in the environment file too. Maybe you would need parallel-netcdf turned on for the netcdf-c package? (https://github.com/JCSDA/spack/blob/03ecae6b77b974c7f364fbbca7ad06e2bc1ba8ab/var/spack/repos/builtin/packages/netcdf-c/package.py#L124)

DeniseWorthen commented 9 months ago

Oh, good point. What I see from the JCDSA issue for netcdf-c is (this happens to be on Dom's OSX system) is

[+]      ^netcdf-c@4.9.2%apple-clang@13.1.6+blosc~byterange+dap~fsync~hdf4~jna+mpi~nczarr_zip+optimize~parallel-netcdf+pic+shared~szip+zstd build_system=autotools patches=0161eb8 arch=darwin-monterey-m1

I'm sure the + is "with" but I'm not entirely sure that ~ is "without"?

apcraig commented 9 months ago

This is all a bit surprising. How about changing this in cicecore/cicedyn/infrastructure/io/io_pio2/ice_restart.F90 as a sensitivity test? At about line 81, change

      call ice_pio_init(mode='read', filename=trim(filename), File=File, &
           fformat=trim(restart_format), rearr=trim(restart_rearranger), &
           iotasks=restart_iotasks, root=restart_root, stride=restart_stride, &
           debug=first_call)

to

      call ice_pio_init(mode='read', filename=trim(filename), File=File, &
           rearr=trim(restart_rearranger), &
           iotasks=restart_iotasks, root=restart_root, stride=restart_stride, &
           debug=first_call)

remove the "fformat" argument. This should not be needed for the read anyway in general. Then try to set restart_format='hdf5' and try reading both the classic and hdf5 version of the grid file. Does that work?

anton-seaice commented 9 months ago

I can replicate this error also by attempting to read an initial conditions file in format "netCDF classic" with restart type of "netcdf4p" (set through the CMEPS with CESMCOUPLED). It's using a spack build with no special options for netcdf-c/netcdf-fortrain/pio.

Options to fix are:

Require users to convert initial conditions files to hdf5. If we assume that the restart_format doesn't get changed part way through a run, then this is only a problem for initial conditions file, so could be avoided by doing a nccopy.
Default to iotype=cdf5 for intial conditions files only. (Same caveat as 1, assumes we don't support changing restart_type during a run).
Retry opening in ice_pio.F90. If the read fails with the requested iotype, retry with iotype="cdf1".
Require building with pnetcdf on. I haven't tested this, but suspect you can parallel read then?

anton-seaice commented 9 months ago

Oh, good point. What I see from the JCDSA issue for netcdf-c is (this happens to be on Dom's OSX system) is
[+]      ^netcdf-c@4.9.2%apple-clang@13.1.6+blosc~byterange+dap~fsync~hdf4~jna+mpi~nczarr_zip+optimize~parallel-netcdf+pic+shared~szip+zstd build_system=autotools patches=0161eb8 arch=darwin-monterey-m1
I'm sure the + is "with" but I'm not entirely sure that ~ is "without"?

That all looks normal I think. ~ is indeed "without"

apcraig commented 9 months ago

Can someone explain what's going on. I don't know spack. As I understand it from @DeniseWorthen, if restart_format is set to hdf5 and the restart file is hdf5, then the code reads and writes hdf5 just fine. But if the restart file is not hdf5, then it fails. netcdf should be able to convert formats automatically on read, but it doesn't in the case. I'm pretty sure this does work fine on derecho, it reads a classic restart and writes hdf5. So why does this seem to only be a problem with a spack build when even the spack build supports hdf5 read and write fine (just not mixing).

In general, the code supports reading and writing different restart formats (with some limits). But this should be allowed, right?

anton-seaice commented 9 months ago

Can someone explain what's going on. I don't know spack. As I understand it from @DeniseWorthen, if restart_format is set to hdf5 and the restart file is hdf5, then the code reads and writes hdf5 just fine. But if the restart file is not hdf5, then it fails. netcdf should be able to convert formats automatically on read, but it doesn't in the case. I'm pretty sure this does work fine on derecho, it reads a classic restart and writes hdf5. So why does this seem to only be a problem with a spack build when even the spack build supports hdf5 read and write fine (just not mixing).

In general, the code supports reading and writing different restart formats (with some limits). But this should be allowed, right?

You are on it Tony :)

The only difference I can think of is possibly the derecho configuration of netcdf-c includes 'parallel_netcdf' support ? Which maybe is needed for parallel read of netcdf classic files but not netcdf4 files.

apcraig commented 9 months ago

OK. I'd be curious to see whether my suggestion of changing the ice_pio_init call for the restart read changes anything. Maybe if we don't force the read to be hdf5 (if we're even doing that), that will read the classic and hdf5 while still writing hdf5 or other formats. I'm testing that change on derecho and it works fine both ways on derecho. Just wondering whether it'll work around the spack problems.

anton-seaice commented 9 months ago

OK. I'd be curious to see whether my suggestion of changing the ice_pio_init call for the restart read changes anything. Maybe if we don't force the read to be hdf5 (if we're even doing that), that will read the classic and hdf5 while still writing hdf5 or other formats. I'm testing that change on derecho and it works fine both ways on derecho. Just wondering whether it'll work around the spack problems.

I hardcoded line 236 in ice_pio to use PIO_IOTYPE_NETCDF for read operations, and it will then read both netcdf classic and netcdf4 initial conditions files. But this means all reads of netcdf files will be single PE only.

apcraig commented 9 months ago

I hardcoded line 236 in ice_pio to use PIO_IOTYPE_NETCDF for read operations, and it will then read both netcdf classic and netcdf4 initial conditions files. But this means all reads of netcdf files will be single PE only.

But that means you only wrote classic files? Can you try changing just the ice_pio_init call on the restart read as suggested above. I'm curious if that supports reading different file types and writing hdf5 (in parallel?). If so, that might be a good solution. I think if we can only read serially, that's probably an OK trade off for flexibility. As long as the writes are in parallel using hdf5 or pnetcdf. Performance is generally more of an issue for output as there is usually more output than input.

anton-seaice commented 9 months ago

I hardcoded line 236 in ice_pio to use PIO_IOTYPE_NETCDF for read operations, and it will then read both netcdf classic and netcdf4 initial conditions files. But this means all reads of netcdf files will be single PE only.

But that means you only wrote classic files? Can you try changing just the ice_pio_init call on the restart read as suggested above. I'm curious if that supports reading different file types and writing hdf5 (in parallel?). If so, that might be a good solution. I think if we can only read serially, that's probably an OK trade off for flexibility. As long as the writes are in parallel using hdf5 or pnetcdf. Performance is generally more of an issue for output as there is usually more output than input.

Yeah - I only hardcoded the type for read operations. The output was still netcdf4 type.

Most likely parallel reads are not that important ... I don't have a good config to test that on at the moment. The other option would be try the parallel read, and if that fails, try a read for netcdf classic.

I will try and build the dependencies with pnetcdf enabled, well see!

apcraig commented 9 months ago

I think the most important thing is that it works robustly at this point. I think we should defer read performance, make sure it's not aborting when it probably shouldn't, and make sure the write performance is working as expected with netcdf, hdf5, pnetcdf and in parallel when we expect it should. It might be even that is too high a bar, keep me posted.

DeniseWorthen commented 9 months ago

Catching up here....I've now switched over to Derecho using the spack-stack built for UWM there. I have the same issue: using hdf5 as a restart_format fails when using a classic restart file. Converting it to a netcdf4-classic works for restart_format=hdf5. I'm now trying the code change that Tony suggested above for our build.

In terms of general solutions, I think yes, read performance is not as critical. I think suggestion 3 above seems a good compromise. Are we sure that would work in all cases?

Retry opening in ice_pio.F90. If the read fails with the requested iotype, retry with iotype="cdf1".

DeniseWorthen commented 9 months ago

Tony's modification to remove fformat=trim(restart_format) from the ice_pio_init on read works in my test case. I can read a classic restart when the format is set to hdf5.

What is happening under the covers when we remove the fformat? If you give it a netcdf-4 file, it will still read in parallel?

apcraig commented 9 months ago

I don't know enough about how hdf5 works to understand whether it's always reading in parallel or not. Plus we have the PIO interface, and I'm not totally sure whether there are some tricks happening under the covers there. Based on the current results, my speculation is as follows. If you open a file hdf5 thru PIO, it expects to read/write in parallel. If the file to be read is NOT an hdf5 file, then it can't read in parallel and an abort is triggered. So if you open a file as cdf1 thru PIO, it will read in serial and can read any filetype. The change I propose set the file open to cdf1 regardless of the setting of restart_format. That provides flexibility, still allows files to be written in hdf5 in parallel, but reads hdf5 in serial. That all just a guess at what's happening.

I think it makes sense to check in the change I proposed. I've tested it on Derecho (with non-spack builds of pio). Does anyone disagree?

@DeniseWorthen, do you think it's worth trying to set up standalone CICE testing using the spack builds? Would that be difficult? Also, maybe I'll try to generate different formats for the restart input files, so we add some test coverage for that.

DeniseWorthen commented 9 months ago

I've been wondering--Is there any way to use the netcdf inquire format function (nf_inq_format) to determine what to do for restart reads?

Add-on: Regarding spack-stack, I don't really know how hard this might be. Theoretically it seems you might be able to

module use /glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load ufs_derecho.intel

to provide all the same modules we use to build?

anton-seaice commented 9 months ago

Sorry for the Friday afternoon explanation ! ... Here goes. I will revisit Monday if needed :)

If you open a file hdf5 thru PIO, it expects to read/write in parallel. If the file to be read is NOT an hdf5 file, then it can't read in parallel and an abort is triggered. .

So the interesting thing here is, if the parallel read fails, the pio library should retry with a serial read and netcdf classic (aka PIO_TYPE_NETCDF). see link

I messed around for long enough to get extra PIO logging turned on (see cice.runlog.240209-153008.txt), and saw that the iotype when it retries to open the file using a serial open hadn't been updated. e.g.: logs like this:

1 retry nc_open(/g/data/ik11/inputs/CICE_data/ic/gx3/iced_gx3_v5.nc) : fd = -1, iotype = 4, do_io = 1, ierr = -128

Where nc_open is the serial read that is the retry, but the iotype = 4 (should be PIO_IOTYPE_NETCDF=2)

The parallelio changes this iotype before retrying:

            /* reset file markers for NETCDF on all tasks */
            file->iotype = PIO_IOTYPE_NETCDF;

I suspected that the variable file which is shared from fortran to the c binding, wasn't getting updated. So I made a local 'lFile' within ice_pio_init, and it fixed the problem! (See https://github.com/CICE-Consortium/CICE/commit/3ef71dd0c5f9cb7cac9983fd746c085b585bcbc3). I don't really understand what is going on here, but maybe you will Tony?

@DeniseWorthen, do you think it's worth trying to set up standalone CICE testing using the spack builds? Would that be difficult? Also, maybe I'll try to generate different formats for the restart input files, so we add some test coverage for that.

I have done this on Gadi - it was straightforward enough. Spack produces modulefiles, so as Denise says, you just need to module use and module load the modules produced via spack, rather than the default ones.

apcraig commented 9 months ago

Thanks @anton-seaice for all this work, very good sleuthing. So how should we handle things? Do you want to create a PR to the branch with something like https://github.com/CICE-Consortium/CICE/commit/3ef71dd0c5f9cb7cac9983fd746c085b585bcbc3 or should I update the branch based on your changes? Should we go with my solution which just sets the restart read to cdf1? I don't understand the fix, is whatever is going on here something we need to worry about elsewhere in PIO? We haven't run into many c/fortran issues with PIO, so I assume we're OK, but this is weird.

Separately, I'll try to get a spack thing going on Derecho, but maybe as a separate PR at a later time. It would be good to get this PR sorted out and merged if we can.

anton-seaice commented 9 months ago

Ok. There’s someone here I can ask about this sort of stuff, I’ll try that on Monday and see if it makes sense to him. Otherwise maybe to be safe we should just hardcode it to always do the serial read as you suggest.

DeniseWorthen commented 9 months ago

I also don't fully understand whether the issue is how we're using the PIO interface vs an issue in PIO itself.

I think it would be fine to commit this with forcing restart read to be serial and creating a followup issue (?). The PR already represents a substantial body of work.

EDIT: Also OK w/ waiting until Anton can ask his expert.

apcraig commented 9 months ago

I think it would be fine to commit this with forcing restart read to be serial and creating a followup issue (?). The PR already represents a substantial body of work.

I think this is a good plan. I will update, test, and push some changes onto the branch today, nothing particularly groundbreaking. I will also create an issue. If we decide to pursue other options next week on this branch, that's great. But it could also be reconsidered for a PR in the future if we decide to do that.

apcraig commented 9 months ago

I just updated the branch with a couple changes, https://github.com/CICE-Consortium/CICE/pull/928/commits/4f09d1f2c0abfa0c1ad499aed8d55695bdce1f06. Will run a full test suite this weekend and report/fix any problems found (I don't expect any).

I also added #932 and #933 in case we defer those issues.

Unless we have some quick resolution as to why or what is happening wrt #932, I recommend we merge what we have now (after final review) early next week and address #932 and #933 separately.

anton-seaice commented 9 months ago

I just updated the branch with a couple changes, 4f09d1f. Will run a full test suite this weekend and report/fix any problems found (I don't expect any).

I also added #932 and #933 in case we defer those issues.

Thankyou - sounds good.

Unless we have some quick resolution as to why or what is happening wrt #932, I recommend we merge what we have now (after final review) early next week and address #932 and #933 separately.

Yes ok. I raised https://github.com/NCAR/ParallelIO/issues/1985 and will updated in #932 as needed.

apcraig commented 9 months ago

Testing is complete and everything looks good. Anyone have any fine review comments before merging? Could a couple folks approve the PR?

dabail10 commented 9 months ago

This is pretty extensive and I am not sure when I can properly review this. I am at a meeting in Santa Fe this week. Doing some scouting for the CICE Consortium workshop in May. :)

DeniseWorthen commented 9 months ago

I've been testing this, and I'm wondering if I understand this correctly. I see in my ice_diag.d file

 (ice_pio_init) nprocs     =          720
 (ice_pio_init) pio_iotype =            1
 (ice_pio_init) iotasks    =          179
 (ice_pio_init) baseroot   =            1
 (ice_pio_init) stride     =            4
 (ice_pio_init) nmode      =           32
 (ice_pio_init) create file ./RESTART/iced.2013-04-01-10800.nc

Is the number of iotasks actually 179? Or does "1" get added downstream in the PIO interface to make it 180? That is, iotasks is set here as if the counting starts at 0, so 0:179.

apcraig commented 9 months ago

I've been testing this, and I'm wondering if I understand this correctly. I see in my ice_diag.d file
 (ice_pio_init) nprocs     =          720
 (ice_pio_init) pio_iotype =            1
 (ice_pio_init) iotasks    =          179
 (ice_pio_init) baseroot   =            1
 (ice_pio_init) stride     =            4
 (ice_pio_init) nmode      =           32
 (ice_pio_init) create file ./RESTART/iced.2013-04-01-10800.nc
Is the number of iotasks actually 179? Or does "1" get added downstream in the PIO interface to make it 180? That is, iotasks is set here as if the counting starts at 0, so 0:179.

iotasks is the number of tasks. So it is 179. What did you set in namelist? If you set iotasks to 180, I think it should use 180. But there also seems to be a problem in PIO that causes an abort with certain iotask counts in some cases when the iotasks is the "maximum", so I have coded around it. There is a whole section in icecore/cicedyn/infrastructure/io/io_pio2/ice_pio.F90 that sets the pio pe setup, lines 141-168. You can see at line 163 a comment about this. I end up reducing the iotasks to avoid this problem. The right equation is line 164 and that could be applied at line 144 as well. Again, had some aborts when I used line 164 even though it should work.

If you formally set iotasks to 180, does it work OK. Would it be better to use the right equation, giving the full iotasks and then sometimes finding the model aborts?

DeniseWorthen commented 9 months ago

@apcraig I am using all default settings so far. In this case, I am specifying (nprocs=720):

    restart_format      = 'pnetcdf5'
    restart_iotasks     = -99
    restart_rearranger  = 'default'
    restart_root        = -99
    restart_stride      = -99
    restart_chunksize   = 0,0
    restart_deflate     = 0

Edit: one of the reasons I asked is if I want to set it manually, I was wondering if I needed to account for a "1" somewhere.

CICE-Consortium / CICE

Update IO formats and add new IO namelist controls #928

PR checklist