apcraig closed this pull request 9 months ago.
This is a draft. There are still some changes to come. Creating a draft PR to facilitate development, support early review, and allow for quick review of documentation changes.
Thanks Tony - I will go through in detail next week.
One thing I have been thinking about:
I think history_format = 'pio_netcdf' needs updating / has a typo and should be pio_pnetcdf.
I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So I'll need to figure out a better way to call the chunking definition.
I think history_format = 'pio_netcdf' needs updating / has a typo and should be pio_pnetcdf.
I noticed this issue too. Our tests weren't quite doing what I expected, but we were still testing pio_pnetcdf via restarts for pio2. Anyway, this file will be removed before this PR is finalized. I'm keeping a few older files temporarily to do regression testing against the older options, but it's just while the PR is evolving.
I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So I'll need to figure out a better way to call the chunking definition.
OK, so the concern is the chunking stuff will abort in PIO if someone is linking to an older version of PIO? A quick workaround might be to have some default chunking size (say -99). If that value is set, then don't use the chunking implementation, use the non-chunking one. And when the chunking call returns with an error, add an extra warning message that this could be caused by an older version of PIO and to avoid the error, set the chunking value to -99. Just an idea. And in that case, I'd set the default to -99 in code and namelist. Make folks have to turn it on explicitly.
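For illustration only, a minimal sketch of that sentinel idea, written here with the serial netCDF call for concreteness (history_chunksize, chunks, status, nu_diag, and subname are borrowed from the surrounding discussion; the surrounding logic is assumed, not the actual implementation):

   ! skip the chunking definition entirely when the namelist is left at the sentinel
   if (history_chunksize(1) /= -99) then
      status = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, chunksizes=chunks)
      if (status /= nf90_noerr) then
         write(nu_diag,*) subname//' WARNING: def_var_chunking failed; an older', &
            ' PIO/netCDF may be the cause, set the chunksize to -99 to skip this call'
      endif
   endif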
I'm still working on the chunking ... I realised the module that I was going to get pio_def_var_chunking from only added the call in very recent versions (https://github.com/NCAR/ParallelIO/blame/6539ef05ae7584ec570a56fdab9f7dfb336c2b80/src/flib/pio.F90#L96), even though the functionality has existed for a long time. So I'll need to figure out a better way to call the chunking definition.

OK, so the concern is the chunking stuff will abort in PIO if someone is linking to an older version of PIO?
It doesn't compile because the routine is not defined.
My suggestion is to import it like use pio_nf, only: pio_def_var_chunking, which seems programmatically the same as the new versions where it's included through use pio.
I will wrap the chunking and compression with #ifndef USE_PIO1 too.
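For reference, that combination would look roughly like this (illustrative only, not the final diff):

#ifndef USE_PIO1
   ! older PIO2 releases expose this routine via pio_nf rather than the top-level pio module
   use pio_nf, only: pio_def_var_chunking
#endif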
A quick workaround might be to have some default chunking size (say -99). If that value is set, then don't use the chunking implementation, use the non-chunking one. And when the chunking call returns with an error, add an extra warning message that this could be caused by an older version of PIO and to avoid the error, set the chunking value to -99. Just an idea. And in that case, I'd set the default to -99 in code and namelist. Make folks have to turn it on explicitly.
At the moment I've just wrapped the chunking and compression in if _format='hdf5' checks, not quite as comprehensive as setting to -99.
@anton-seaice, do you think we should apply the chunking only if chunksize>0 in code like
if (restart_format=='hdf5' .and. size(dims)>1) then
   if (dims(1)==dimid_ni .and. dims(2)==dimid_nj) then
      chunks(1)=restart_chunksize(1)
      chunks(2)=restart_chunksize(2)
      do i = 3, size(dims)
         chunks(i) = 0
      enddo
      status = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, chunksizes=chunks)
      call ice_check_nc(status, subname//' ERROR: chunking var '//trim(vname), file=__FILE__, line=__LINE__)
   endif
endif
Should that first line be
if (restart_format=='hdf5' .and. size(dims)>1 .and. maxval(restart_chunksize)>0) then
What happens when the chunks are all zeros and you call nf90_def_var_chunking? Is there a cost to calling nf90_def_var_chunking when we don't need to?
If chunks are all zeros then it will set the chunksize to the maximum (i.e. the size of the array) unless the variable is more than 4MB, at which point some 'Default' chunksize is used.
If we made the change suggested:
There might be some edge cases where 'hdf5' and contiguous storage are needed, or where the default chunking scheme doesn't work appropriately and it's not desired to set the chunksizes in the configuration, but these seem obscure enough that we can look at them if needed? Otherwise we would need to make the time variable length 1 (rather than unlimited) or add another namelist option for hdf5 and _NCCONTIGUOUS (or use _chunksize=-99 for contiguous, and _chunksize=0 for default chunksizes), but it would only be meaningful for variables >4MB.
I guess that was a long way of saying, I think what we're doing now is OK, but we might get feedback suggesting something different is needed in bespoke configurations. Noting we are not changing any current behaviour, it only affects configurations which move to 'hdf5' output.
Sounds good. Will leave as is. Thanks.
I think this is ready for final review and merging. I am running a full test suite now.
@anton-seaice, let me know if I've missed anything or if you think we still have a few things to do. Thanks!
@eclare108213 @dabail10 @DeniseWorthen @phil-blain, feel free to have a look, let me know if you have any suggested changes.
This adds support for many new netCDF formats and some new controls for PIO. It also adds compression for hdf5 netcdf output. I've tested compression and it's pretty awesome. Thanks @anton-seaice!
All looks good to me Tony :) After any more review comments I will retest a few combinations.
@dabail10 - I snuck some related changes to the nuopc driver in. It gets a little bit messy because the io_type comes from nuopc, but the chunking and compression settings come from the cice namelist. (If it's better to look at the nuopc driver as a separate change let me know.)
We would definitely prefer to get all PIO related settings from the driver. Since we are specifying the pe layout for the whole system, it knows how many io tasks and so on to assign each component. I would override the namelist settings here.
I have run a full test suite on derecho and results are as expected. I updated the PR.
Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.
All looks good to me Tony :) After any more review comments I will retest a few combinations. @dabail10 - I snuck some related changes to the nuopc driver in. It gets a little bit messy because the io_type comes from nuopc, but the chunking and compression settings come from the cice namelist. (If it's better to look at the nuopc driver as a separate change let me know.)
We would definitely prefer to get all PIO related settings from the driver. Since we are specifying the pe layout for the whole system, it knows how many io tasks and so on to assign each component. I would override the namelist settings here.
OK - The IO tasks etc. are unchanged for the nuopc driver.
There are new CICE namelist parameters added in this PR - history_chunksize, history_compression, restart_chunksize and restart_compression. We could add corresponding variables through nuopc to override them if that is neater?
I have run a full test suite on derecho and results are as expected. I updated the PR.
Test results look good. https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks#1619d6ca92028d1c0341be06081d226882a0739a. There is a problem in the bit-for-bit hdf5 restart files generated from PIO. The model restarts exactly, but the check that compares restart files fails because there is some non-important data in the PIO hdf5 files that is not bit reproducible. Have discussed this with PIO folks and they say it's OK. I don't fully agree, netCDF hdf5 does not have this problem. But it means the PIO hdf5 restart tests will fail the "cmp" validation even though they restart the model exactly.
Ok - sounds like it's not something that is feasible/worthwhile for us to address? Is the failure repeatable or a bit random?
The failures are repeatable. The differences in the files are random. So some "junk" ends up in the file that does not impact results but is random. You can only see the difference with a cmp or od. If you do ncdiff or cprnc, the data looks fine. I'm reluctant to move away from our simple "cmp" just for the pio hdf5 results at this point. netcdf hdf5 does NOT behave this way, PIO folks are aware of this, and are not concerned. I'm pretty sure this is a PIO thing which we have little control over. I could explore further. I do know that CICE PIO and netcdf handle fill data slightly differently, but I don't think it's related to that. I think we should just live with this for now. I don't expect it will cause problems for users, but we should keep our ears open.
@anton-seaice I'm trying to test the new features using the hdf5 option. I'm getting an error, which I believe is the result of the spack-stack not having built PIO w/ netcdf4/hdf5 enabled. The error message I get is
74: (abort_ice)ABORTED:
174: (abort_ice) called from ice_pio.F90
174: (abort_ice) line number 238
174: (abort_ice) error =
174: (ice_pio_check)NetCDF: Attempt to use feature that was not turned on when netCD
174: F was built., (ice_pio_init) ERROR: Failed to open file cice_model.res.nc
The message part that says Attempt to use feature that was not turned on when netCDF was built is coming from PIO itself I believe. But we use netcdf4/hdf5 for the ATM model so we should have netCDF built correctly. Does this message really indicate that PIO itself was not built w/ netcdf4/hdf5 support?
So this is happening on a read? And I assume cice_model.res.nc is not hdf5? We shouldn't have to tell PIO the netcdf format when we're reading, I'll have a look at the code again on that. But this would just fail later on a PIO hdf5 write, so it doesn't really solve the problem.
My guess is that the atm and PIO are compiled with and writing thru different versions of the netcdf lib. I suspect this is possible, but am not 100% sure. Some questions/ideas: Is PIO being built on the fly? How consistent are the builds for each component? Can you verify what version of netcdf is compiled with the atm and pio libs? If you are using a "netcdf" module, you might try loading the equivalent "netcdf-mpi" module. To get the hdf5 to work on Derecho, I had to do
module unload netcdf
module load netcdf-mpi/4.9.2
You can see that in the PR. Just some thoughts. Keep us posted.
The file it is reading is netcdf-classic. But, like you, I thought the read should be fine.
We build UWM with a spack-stack on each platform, and the basic build should be the same (but the exact intel or gnu compiler might be different). I'm still learning my way around spack-stack; here are the package dependencies (Jim is the package maintainer).
Indeed, with
restart_format = 'cdf1'
history_format = 'hdf5'
I can write a history file (ncdump -k = netCDF-4), and read a netcdf-classic restart file (ncdump -k = classic). But I cannot read the restart file with format=hdf5.
If I convert the restart using nccopy -k 4 cice_model.res.nc cice_model.res.nc4
(yielding ncdump -k = netCDF-4 classic model), then I can use
restart_format = 'hdf5'
history_format = 'hdf5'
The file it is reading is netcdf-classic. But, like you, I thought the read should be fine.
We build UWM with a spack-stack on each platform, and the basic build should be the same (but the exact intel or gnu compiler might be different). I'm still learning my way around spack-stack; here are the package dependencies (Jim is the package maintainer).
Hi Denise. I haven't seen this before. I also thought that all the netcdf formats should be backwards compatible for read.
Where is your spack environment file? The "variant" may be changed in the environment file too. Maybe you would need parallel-netcdf turned on for the netcdf-c package? (https://github.com/JCSDA/spack/blob/03ecae6b77b974c7f364fbbca7ad06e2bc1ba8ab/var/spack/repos/builtin/packages/netcdf-c/package.py#L124)
Oh, good point. What I see from the JCSDA issue for netcdf-c (this happens to be on Dom's OSX system) is
[+] ^netcdf-c@4.9.2%apple-clang@13.1.6+blosc~byterange+dap~fsync~hdf4~jna+mpi~nczarr_zip+optimize~parallel-netcdf+pic+shared~szip+zstd build_system=autotools patches=0161eb8 arch=darwin-monterey-m1
I'm sure the + is "with" but I'm not entirely sure that ~ is "without"?
This is all a bit surprising. How about changing this in cicecore/cicedyn/infrastructure/io/io_pio2/ice_restart.F90 as a sensitivity test? At about line 81, change
call ice_pio_init(mode='read', filename=trim(filename), File=File, &
fformat=trim(restart_format), rearr=trim(restart_rearranger), &
iotasks=restart_iotasks, root=restart_root, stride=restart_stride, &
debug=first_call)
to
call ice_pio_init(mode='read', filename=trim(filename), File=File, &
rearr=trim(restart_rearranger), &
iotasks=restart_iotasks, root=restart_root, stride=restart_stride, &
debug=first_call)
remove the "fformat" argument. This should not be needed for the read anyway in general. Then try to set restart_format='hdf5' and try reading both the classic and hdf5 version of the grid file. Does that work?
I can replicate this error also by attempting to read an initial conditions file in format "netCDF classic" with restart type of "netcdf4p" (set through CMEPS with CESMCOUPLED). It's using a spack build with no special options for netcdf-c/netcdf-fortran/pio.
Options to fix are:
- If restart_format doesn't get changed part way through a run, then this is only a problem for the initial conditions file, so it could be avoided by doing an nccopy.
- Retry opening in ice_pio.F90. If the read fails with the requested iotype, retry with iotype="cdf1".
Oh, good point. What I see from the JCSDA issue for netcdf-c (this happens to be on Dom's OSX system) is
[+] ^netcdf-c@4.9.2%apple-clang@13.1.6+blosc~byterange+dap~fsync~hdf4~jna+mpi~nczarr_zip+optimize~parallel-netcdf+pic+shared~szip+zstd build_system=autotools patches=0161eb8 arch=darwin-monterey-m1
I'm sure the + is "with" but I'm not entirely sure that ~ is "without"?
That all looks normal I think. ~ is indeed "without"
Can someone explain what's going on? I don't know spack. As I understand it from @DeniseWorthen, if restart_format is set to hdf5 and the restart file is hdf5, then the code reads and writes hdf5 just fine. But if the restart file is not hdf5, then it fails. netcdf should be able to convert formats automatically on read, but it doesn't in this case. I'm pretty sure this does work fine on derecho, it reads a classic restart and writes hdf5. So why does this seem to only be a problem with a spack build when even the spack build supports hdf5 read and write fine (just not mixing)?
In general, the code supports reading and writing different restart formats (with some limits). But this should be allowed, right?
You are on it Tony :)
The only difference I can think of is that possibly the derecho configuration of netcdf-c includes 'parallel-netcdf' support, which maybe is needed for parallel reads of netcdf classic files but not netcdf4 files.
OK. I'd be curious to see whether my suggestion of changing the ice_pio_init call for the restart read changes anything. Maybe if we don't force the read to be hdf5 (if we're even doing that), it will read both classic and hdf5 while still writing hdf5 or other formats. I'm testing that change on derecho and it works fine both ways. Just wondering whether it'll work around the spack problems.
I hardcoded line 236 in ice_pio to use PIO_IOTYPE_NETCDF for read operations, and it will then read both netcdf classic and netcdf4 initial conditions files. But this means all reads of netcdf files will be single PE only.
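Presumably the hardcode has roughly this shape (pio_iotype and mode are the names used elsewhere in this thread; the actual code at line 236 may differ):

   if (trim(mode) == 'read') then
      pio_iotype = PIO_IOTYPE_NETCDF   ! serial classic netCDF; can read both classic and netCDF-4 files
   endif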
But that means you only wrote classic files? Can you try changing just the ice_pio_init call on the restart read as suggested above. I'm curious if that supports reading different file types and writing hdf5 (in parallel?). If so, that might be a good solution. I think if we can only read serially, that's probably an OK trade off for flexibility. As long as the writes are in parallel using hdf5 or pnetcdf. Performance is generally more of an issue for output as there is usually more output than input.
Yeah - I only hardcoded the type for read operations. The output was still netcdf4 type.
Most likely parallel reads are not that important ... I don't have a good config to test that on at the moment. The other option would be to try the parallel read, and if that fails, try a read for netcdf classic.
I will try and build the dependencies with pnetcdf enabled, we'll see!
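A sketch of that retry idea with the PIO Fortran API (ice_pio_subsystem, File, and pio_iotype are stand-ins for whatever ice_pio.F90 actually uses; error handling is simplified):

   status = pio_openfile(ice_pio_subsystem, File, pio_iotype, trim(filename), pio_nowrite)
   if (status /= PIO_NOERR) then
      ! the parallel/hdf5 open failed, e.g. on a classic file; retry with serial classic netcdf
      status = pio_openfile(ice_pio_subsystem, File, PIO_IOTYPE_NETCDF, trim(filename), pio_nowrite)
   endif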
I think the most important thing is that it works robustly at this point. I think we should defer read performance, make sure it's not aborting when it probably shouldn't, and make sure the write performance is working as expected with netcdf, hdf5, pnetcdf and in parallel when we expect it should. It might even be that that is too high a bar; keep me posted.
Catching up here....I've now switched over to Derecho using the spack-stack built for UWM there. I have the same issue: using hdf5 as a restart_format fails when using a classic restart file. Converting it to a netcdf4-classic works for restart_format=hdf5. I'm now trying the code change that Tony suggested above for our build.
In terms of general solutions, I think yes, read performance is not as critical. I think the retry suggestion above seems a good compromise. Are we sure that would work in all cases?
- Retry opening in ice_pio.F90. If the read fails with the requested iotype, retry with iotype="cdf1".
Tony's modification to remove fformat=trim(restart_format) from the ice_pio_init on read works in my test case. I can read a classic restart when the format is set to hdf5.
What is happening under the covers when we remove the fformat? If you give it a netcdf-4 file, will it still read in parallel?
I don't know enough about how hdf5 works to understand whether it's always reading in parallel or not. Plus we have the PIO interface, and I'm not totally sure whether there are some tricks happening under the covers there. Based on the current results, my speculation is as follows. If you open a file hdf5 thru PIO, it expects to read/write in parallel. If the file to be read is NOT an hdf5 file, then it can't read in parallel and an abort is triggered. So if you open a file as cdf1 thru PIO, it will read in serial and can read any filetype. The change I propose sets the file open to cdf1 regardless of the setting of restart_format. That provides flexibility, still allows files to be written in hdf5 in parallel, but reads hdf5 in serial. That's all just a guess at what's happening.
I think it makes sense to check in the change I proposed. I've tested it on Derecho (with non-spack builds of pio). Does anyone disagree?
@DeniseWorthen, do you think it's worth trying to set up standalone CICE testing using the spack builds? Would that be difficult? Also, maybe I'll try to generate different formats for the restart input files, so we add some test coverage for that.
I've been wondering: is there any way to use the netcdf inquire format function (nf_inq_format) to determine what to do for restart reads?
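For what it's worth, a rough sketch of that inquiry through the f90 interface (untested; in practice only one task would do this and broadcast the result, and fmt/ncid are placeholder locals):

   status = nf90_open(trim(filename), nf90_nowrite, ncid)
   status = nf90_inquire(ncid, formatNum=fmt)
   status = nf90_close(ncid)
   if (fmt == nf90_format_netcdf4 .or. fmt == nf90_format_netcdf4_classic) then
      ! safe to open through PIO with the netcdf4/hdf5 iotype
   else
      ! fall back to serial classic netcdf (or pnetcdf) for the read
   endif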
Add-on: Regarding spack-stack, I don't really know how hard this might be. Theoretically it seems you might be able to
module use /glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load ufs_derecho.intel
to provide all the same modules we use to build?
Sorry for the Friday afternoon explanation ! ... Here goes. I will revisit Monday if needed :)
If you open a file hdf5 thru PIO, it expects to read/write in parallel. If the file to be read is NOT an hdf5 file, then it can't read in parallel and an abort is triggered.
So the interesting thing here is, if the parallel read fails, the pio library should retry with a serial read and netcdf classic (aka PIO_IOTYPE_NETCDF). See link.
I messed around for long enough to get extra PIO logging turned on (see cice.runlog.240209-153008.txt), and saw that the iotype hadn't been updated when it retried to open the file using a serial open, e.g. logs like this:
1 retry nc_open(/g/data/ik11/inputs/CICE_data/ic/gx3/iced_gx3_v5.nc) : fd = -1, iotype = 4, do_io = 1, ierr = -128
where nc_open is the serial read that is the retry, but the iotype = 4 (it should be PIO_IOTYPE_NETCDF = 2).
The ParallelIO code changes this iotype before retrying:
/* reset file markers for NETCDF on all tasks */
file->iotype = PIO_IOTYPE_NETCDF;
I suspected that the variable file, which is shared from fortran to the c binding, wasn't getting updated. So I made a local 'lFile' within ice_pio_init, and it fixed the problem! (See https://github.com/CICE-Consortium/CICE/commit/3ef71dd0c5f9cb7cac9983fd746c085b585bcbc3.) I don't really understand what is going on here, but maybe you will, Tony?
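My reading of that commit, heavily simplified and possibly wrong in the details, is roughly:

   type(file_desc_t) :: lFile   ! local descriptor instead of operating directly on the caller's File

   status = pio_openfile(ice_pio_subsystem, lFile, pio_iotype, trim(filename), pio_nowrite)
   File = lFile                 ! copy back to the caller's descriptor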
@DeniseWorthen, do you think it's worth trying to set up standalone CICE testing using the spack builds? Would that be difficult? Also, maybe I'll try to generate different formats for the restart input files, so we add some test coverage for that.
I have done this on Gadi - it was straightforward enough. Spack produces modulefiles, so as Denise says, you just need to module use and module load the modules produced via spack, rather than the default ones.
Thanks @anton-seaice for all this work, very good sleuthing. So how should we handle things? Do you want to create a PR to the branch with something like https://github.com/CICE-Consortium/CICE/commit/3ef71dd0c5f9cb7cac9983fd746c085b585bcbc3 or should I update the branch based on your changes? Should we go with my solution which just sets the restart read to cdf1? I don't understand the fix, is whatever is going on here something we need to worry about elsewhere in PIO? We haven't run into many c/fortran issues with PIO, so I assume we're OK, but this is weird.
Separately, I'll try to get a spack thing going on Derecho, but maybe as a separate PR at a later time. It would be good to get this PR sorted out and merged if we can.
Ok. There’s someone here I can ask about this sort of stuff, I’ll try that on Monday and see if it makes sense to him. Otherwise maybe to be safe we should just hardcode it to always do the serial read as you suggest.
I also don't fully understand whether the issue is how we're using the PIO interface vs an issue in PIO itself.
I think it would be fine to commit this with forcing restart read to be serial and creating a followup issue (?). The PR already represents a substantial body of work.
EDIT: Also OK w/ waiting until Anton can ask his expert.
I think this is a good plan. I will update, test, and push some changes onto the branch today, nothing particularly groundbreaking. I will also create an issue. If we decide to pursue other options next week on this branch, that's great. But it could also be reconsidered for a PR in the future if we decide to do that.
I just updated the branch with a couple changes, https://github.com/CICE-Consortium/CICE/pull/928/commits/4f09d1f2c0abfa0c1ad499aed8d55695bdce1f06. Will run a full test suite this weekend and report/fix any problems found (I don't expect any).
I also added #932 and #933 in case we defer those issues.
Unless we have some quick resolution as to why or what is happening wrt #932, I recommend we merge what we have now (after final review) early next week and address #932 and #933 separately.
I just updated the branch with a couple changes, 4f09d1f. Will run a full test suite this weekend and report/fix any problems found (I don't expect any).
I also added #932 and #933 in case we defer those issues.
Thank you - sounds good.
Unless we have some quick resolution as to why or what is happening wrt #932, I recommend we merge what we have now (after final review) early next week and address #932 and #933 separately.
Yes ok. I raised https://github.com/NCAR/ParallelIO/issues/1985 and will update in #932 as needed.
Testing is complete and everything looks good. Anyone have any final review comments before merging? Could a couple folks approve the PR?
This is pretty extensive and I am not sure when I can properly review this. I am at a meeting in Santa Fe this week. Doing some scouting for the CICE Consortium workshop in May. :)
I've been testing this, and I'm wondering if I understand this correctly. I see in my ice_diag.d file
(ice_pio_init) nprocs = 720
(ice_pio_init) pio_iotype = 1
(ice_pio_init) iotasks = 179
(ice_pio_init) baseroot = 1
(ice_pio_init) stride = 4
(ice_pio_init) nmode = 32
(ice_pio_init) create file ./RESTART/iced.2013-04-01-10800.nc
Is the number of iotasks actually 179? Or does "1" get added downstream in the PIO interface to make it 180? That is, iotasks is set here as if the counting starts at 0, so 0:179.
iotasks is the number of tasks, so it is 179. What did you set in the namelist? If you set iotasks to 180, I think it should use 180. But there also seems to be a problem in PIO that causes an abort with certain iotask counts in some cases when the iotasks is the "maximum", so I have coded around it. There is a whole section in cicecore/cicedyn/infrastructure/io/io_pio2/ice_pio.F90 that sets the pio pe setup, lines 141-168. You can see at line 163 a comment about this. I end up reducing the iotasks to avoid this problem. The right equation is at line 164 and that could be applied at line 144 as well. Again, I had some aborts when I used line 164 even though it should work.
If you formally set iotasks to 180, does it work OK? Would it be better to use the right equation, giving the full iotasks and then sometimes finding the model aborts?
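Just to illustrate the kind of adjustment involved (this is an inference from the log above, not the actual expression in ice_pio.F90): a reduction along the lines of iotasks = (nprocs - baseroot) / stride gives (720 - 1) / 4 = 179 with integer division, which matches the reported iotasks.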
@apcraig I am using all default settings so far. In this case, I am specifying (nprocs=720):
restart_format = 'pnetcdf5'
restart_iotasks = -99
restart_rearranger = 'default'
restart_root = -99
restart_stride = -99
restart_chunksize = 0,0
restart_deflate = 0
Edit: one of the reasons I asked is that if I want to set it manually, I was wondering whether I needed to account for a "1" somewhere.
PR checklist
Closes #914 Closes #862
This provides new features for CICE IO through both netCDF and PIO. New namelists are added to control history and restart format, hdf5 compression and chunking, the PIO rearranger, and PIO IO task control. Separate controls are provided for history and restart files. The namelist changes are for
- history_format, restart_format
- history_rearranger, restart_rearranger
- history_iotasks, history_root, history_stride, restart_iotasks, restart_root, and restart_stride
- history_chunksize, history_deflate, restart_chunksize, restart_deflate
In particular,