E3SM-Project / E3SM

what ocean and sea ice initialization files should we be using? #1580

Closed - worleyph closed this 7 years ago

worleyph commented 7 years ago

This question first came up in issue #1547 (posed by @ndkeen), but I wanted to give it its own home. This is a question for the science and coupled groups, but it impacts the performance group (among others).

The current ocean (and sea ice) initialization files for the water cycle compsets are not recognized as pnetcdf files within PIO, and the netcdf file reads used instead take a long time, making initialization very expensive. The ocean group provided new files that do not have this problem, but these files generate warnings (for both the G and water cycle compsets) of the form

 ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.

when used with current master.

Is there a timeline for updating the water cycle compset definitions to use "true" pnetcdf files? And do the warnings about the ocean initial data impact the performance of the model? I would hate to be using data in these runs to optimize code performance when they are not representative of production runs.

rljacob commented 7 years ago

@jayeshkrishna is PIO somehow mis-identifying the file? I didn't think "pnetcdf" was a file format. There is cdf-2 or cdf-5, etc.

worleyph commented 7 years ago

Repeating relevant comments from issue #1547:

The PIO function open_nf (in file ionf_mod.F90) contains logic that tries to open a file using a given PIO type. If this fails, the PIO type is changed. For oRRS18to6v3.170111.nc, the PIO file type starts as '5' (== PIO_iotype_pnetcdf). However, the error return from

ierr = nfmpi_open(File%iosystem%IO_comm,fname,amode,File%iosystem%info,File%fh)

is -51, and the PIO type is then changed to '8' (== pio_iotype_netcdf4p).

... The error (-51) is

PARAMETER (NF_ENOTNC = -51) ! Not a netcdf file

(but this file can then be opened and read with the pio_iotype_netcdf4p option).
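
(Side note, not from PIO itself: a quick way to check which on-disk variant a given file actually is - and therefore whether pnetcdf should be able to open it - is the standard ncdump utility. Adjust the path to wherever your copy of the input file lives.)

# Report the on-disk format of the initialization file; pnetcdf handles
# "classic", "64-bit offset" (cdf-2), and "cdf5" kinds, but not
# netCDF-4/HDF5-based files. Exact kind strings vary with the netCDF version.
ncdump -k oRRS18to6v3.170111.nc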

ndkeen commented 7 years ago

Thanks Pat. Just adding a note: @vanroekel told me at the all-hands meeting that he would give me a streams file (hopefully both an ocean and an ice file) that he thought was consistent. What I thought I heard Luke say was that if we were going to use files such as these (in streams.ocean and streams.cice):

/global/cscratch1/sd/lvroekel/mpaso.rst.0039-01-01_00000.nc
/global/cscratch1/sd/lvroekel/mpascice.rst.0039-01-01_00000.nc

which seem to allow faster init times (at least for OCN), then there is another filename entry in one of those files that must be changed for consistency and to potentially avoid errors.

worleyph commented 7 years ago

Thanks for the update. The coupled group should also weigh in, and hopefully change the compset definitions. Faster initialization (to this degree) is also important for production runs.

PeterCaldwell commented 7 years ago

I 100% agree that we should be using pnetcdf files rather than netcdf, because the performance impact of this choice is huge. I also agree that we need to make sure the new files we have are producing credible answers. Thanks for bringing this issue up, @worleyph and @ndkeen! I don't know anything about which files would be correct vs. wrong. @jonbob, @vanroekel, @mark-petersen - could you talk amongst yourselves and assign someone to sort this out (as a "critical path" task)?

toddringler commented 7 years ago

I might be wrong here... but I think the ocean/ice team found that the choice of pnetcdf vs. netcdf depended on file size. @jonbob or @mark-petersen will be able to provide details.

jonbob commented 7 years ago

@toddringler is correct -- we've found that having large arrays in the IO requires us to use "pnetcdf,cdf5" for several of the ocn/ice streams. On some really small files, however, pnetcdf can be slower -- so we've moved to assigning pio types to individual files/streams. @PeterCaldwell, we have an issue to track this (#1451) that's been assigned to me -- so I'll also take it on as a "critical path" task.
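
(As a rough illustration - a hypothetical edit, using the streams file names from the run directory - switching streams to the CDF-5 variant is just a change to the io_type attribute:)

# Show which io_type each stream is currently assigned
grep -n 'io_type' streams.ocean streams.cice

# Switch every stream currently set to plain pnetcdf over to pnetcdf with
# CDF-5 support (in-place edit; .bak keeps a copy of the original)
sed -i.bak 's/io_type="pnetcdf"/io_type="pnetcdf,cdf5"/' streams.ocean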

worleyph commented 7 years ago

@jonbob and @toddringler, just to be clear, the problem is that the default ocean 18to6 initialization file for the G and A_WCYCL compsets cannot be read as a pnetcdf file (even using pnetcdf,cdf5), as described in the first comment. An alternative file provided by @mark-petersen does work, but generates many warnings when the model starts. As such, neither is appropriate for use in production runs, and something else is needed. Based on the above comment by @ndkeen, the alternative file can be made to work with some other changes in the streams.ocean file. Once this is confirmed, the coupled group should consider changing the definition of the relevant compsets.

worleyph commented 7 years ago

Correction - I think that @vanroekel provided the alternative ocean (and sea ice) files - sorry for the misattribution, Luke.

vanroekel commented 7 years ago

A quick note of clarity here. There is no such thing as a pnetcdf file. The issue is how PIO interfaces with the netcdf file. The main issue here is whether the file is cdf-2 or cdf-5. What was happening in @worleyph's and @ndkeen's runs was due to the ocean init file being cdf-2, but with some variables that violate the cdf-2 variable size constraint. This was not an issue when using parallel-netcdf/1.5.0, as a soft bug there meant variable sizes were not checked. When running with newer versions of parallel-netcdf, the netcdf interface was used to read the init files because the pnetcdf variable size check returned an error on the original files. The newer files I have sent do not have this issue, as they are cdf-5, which is not subject to the variable size constraints.
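
(Not part of Luke's instructions, but for reference: a cdf-2 file can be rewritten as cdf-5 with the stock nccopy utility, provided the netCDF build is new enough to know about CDF-5. The filenames below are placeholders.)

# Rewrite a cdf-2 file as cdf-5 so that large variables no longer violate
# the cdf-2 per-variable size limit (needs a reasonably recent netCDF, ~4.4+)
nccopy -k cdf5 init_cdf2.nc init_cdf5.nc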

@ndkeen I have placed new ocean and sea ice restart files you can use for your tests on KNL. I also have included a file with the necessary streams.cice changes for you. They are at

/global/cscratch1/sd/lvroekel/restart_files_for_noel

@worleyph these files are also on titan if you want to try as well at

/lustre/atlas1/cli115/scratch/vanroek/highres_restarts

worleyph commented 7 years ago

@vanroekel , thank you for clarifying and cleaning up some (my) sloppiness in the discussion. Again, I believe that it is important for the coupled group to be aware of this issue and to make sure that they are using ocean and sea ice initialization and restart files that are cdf5 files for high resolution runs. It is also important for the default initialization files for the G and A_WCYCL compsets for high resolution meshes to be cdf5 files. I do not know who should be responsible for this latter task, but it probably is not @ndkeen or myself.

worleyph commented 7 years ago

@vanroekel , with regard to (thanks for providing this)

 /lustre/atlas1/cli115/scratch/vanroek/highres_restarts

there is also a pointer to seaice.RRS18to6v3.restartFrom_titan0228.170321.nc in

 <stream name="landIceMasks"
         type="input"
         io_type="pnetcdf"
         input_interval="initial_only"
     filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ice/mpas-cice/oRRS18to6v3/seaice.RRS18to6v3.restartFrom_titan0228.170321.nc">

Should this also be replaced by mpascice.rst.0053-01-01_00000_noxtime.nc ?

ndkeen commented 7 years ago

I also had a question about the editing of the streams file.

I made a copy of the files and created the .rst53 streams files below:

cori12% pwd
/global/cscratch1/sd/ndk/mpas-restart-files
cori12% ls -l
total 35983660
drwxrwxr-x  2 ndk ndk        4096 Jun 15 20:26 ./
drwxr-xr-x 63 ndk ndk       20480 Jun 15 20:20 ../
-rw-r--r--  1 ndk ndk  8216369200 Jun 15 20:17 mpascice.rst.0053-01-01_00000_noxtime.nc
-rw-r--r--  1 ndk ndk 28630788832 Jun 15 20:19 mpaso.rst.0053-01-01_00000_noxtime.nc
-rw-r--r--  1 ndk ndk       11172 Jun 15 20:18 streams.cice.rst39
-rw-r--r--  1 ndk ndk       11542 Jun 15 20:26 streams.cice.rst53
-rw-r--r--  1 ndk ndk         718 Jun 15 20:19 streams.cice_additions
-rw-r--r--  1 ndk ndk       20122 Jun 15 20:18 streams.ocean.rst39
-rw-r--r--  1 ndk ndk       20166 Jun 15 20:21 streams.ocean.rst53

Can you verify I've made the correct edits? For the ocean, I simply used the mpaso rst53 file in 2 places. For the cice, I added the two entries you suggested, but a) one of them was already there and b) there was another that looks like it should also be changed to use the mpascice rst53 file as well.
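
(One generic way to double-check edits like this - nothing model-specific, just grep - is to pull out every filename_template and compare the rst39 and rst53 versions side by side:)

# Show every file each stream points at, for both versions of the streams files
grep -n 'filename_template' streams.ocean.rst39 streams.ocean.rst53
grep -n 'filename_template' streams.cice.rst39 streams.cice.rst53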

worleyph commented 7 years ago

@vanroekel, FYI - I am still getting the warnings ("Warning: abs(sum(h)-bottomDepth)>2m. ..."). I wasn't sure if the new restart files and changes to the streams files were supposed to eliminate these or not.

jayeshkrishna commented 7 years ago

@vanroekel: Are mpaso.rst.0039-01-01_00000.nc and mpascice.rst.0039-01-01_00000.nc available on Titan?

jonbob commented 7 years ago

@worleyph - The new files won't get rid of the warnings, but we should change the logging so it doesn't look like a fatal error.

worleyph commented 7 years ago

Luke gave me 0053 (not 0039) most recently. See the above comment from Luke as to the location of these files and example streams files on Titan.

jayeshkrishna commented 7 years ago

Does a run with 0053 files complete successfully (I am currently only looking at I/O performance, so warnings are ok for me as long as runs complete as expected)?

worleyph commented 7 years ago

@jonbob, do the warnings indicate that the simulation will not be realistic (performance-wise)? Mark Petersen was going to check on this at one point, but I never heard back what his conclusion was.

worleyph commented 7 years ago

@jayeshkrishna, yes, the jobs run successfully using 0053 (and 0039). You'll have to ask Luke why he had us update to 0053.

jonbob commented 7 years ago

@mark-petersen will know more, but I don't believe it means the simulation is not realistic -- it just has some inconsistencies in its initial conditions. I see those warnings in virtually every run lately. I'll open an MPAS issue so we don't forget about tracking this down.

vanroekel commented 7 years ago

@jonbob and @worleyph I have seen these warnings as well, and I have seen no visible issues with the simulation. I don't recall seeing them frequently. The last time I recall seeing them was when I updated to CIME53, which was a bit of a harder restart than usual. I had those warnings for one submission and then no issues since. @worleyph could you try a restart after your short test? i.e., do a 5-day simulation with warnings and then another 5 days from a restart, and see if the warnings remain?

Regarding year 53 vs. 39: there is no performance reason for this. I thought I'd pull the latest restart, as the ocean will have a more spun-up circulation. But I wouldn't expect any performance change.
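
(For concreteness, the suggested restart test would look roughly like this from the case directory - the same xmlchange/case.submit workflow used later in this thread; exact option syntax may differ a bit between CIME versions:)

# First 5-day segment from the initial conditions, writing restarts at day 5
./xmlchange STOP_OPTION=ndays,STOP_N=5,REST_OPTION=ndays,REST_N=5
./case.submit

# After it completes, continue another 5 days from the restart files and
# check whether the bottomDepth warnings show up again
./xmlchange CONTINUE_RUN=TRUE
./case.submit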

worleyph commented 7 years ago

  @worleyph could you try a restart after your short test?

I'm still traveling - home next week. Can't guarantee to do this until then. Perhaps someone else wants to volunteer.

My concern is based on @mark-petersen 's comment

 @worleyph that error message means the initial sea surface height is unphysical. It may indicate that variables are read in incorrectly, or initialized to zero.

and Mark was going to verify that the results from my run were reasonable.

ndkeen commented 7 years ago

I have been running with the new "rst.0053" files for G-case testing. run/streams.ocean and run/streams.cice are pointing to the 0053 files. (a) I still get the errors about bottomDepth, (b) I'm still not sure I edited the files correctly (see my last comment), and (c) I don't see any significant perf diffs (in init or simulation).

ndkeen commented 7 years ago

I just tried to use the restart files and got a failure. It ran and wrote restart files, then I changed CONTINUE_RUN=TRUE and resubmitted (is that the correct way?). Here is the failure and case dir:

0448: forrtl: severe (154): array index out of bounds
0448: Image              PC                Routine            Line        Source             
0448: acme.exe           0000000001E79231  Unknown               Unknown  Unknown
0448: acme.exe           0000000001E7736B  Unknown               Unknown  Unknown
0448: acme.exe           0000000001E20B74  Unknown               Unknown  Unknown
0448: acme.exe           0000000001E20986  Unknown               Unknown  Unknown
0448: acme.exe           0000000001DA06F9  Unknown               Unknown  Unknown
0448: acme.exe           0000000001DAC8A3  Unknown               Unknown  Unknown
0448: acme.exe           0000000001762160  Unknown               Unknown  Unknown
0448: acme.exe           0000000000E0DE8F  ocn_init_routines         198  mpas_ocn_init_routines.f90
0448: acme.exe           0000000000EBA732  ocn_forward_mode_         278  mpas_ocn_forward_mode.f90
0448: acme.exe           0000000000D9D26C  ocn_core_mp_ocn_c          89  mpas_ocn_core.f90
0448: acme.exe           0000000000AE7C8F  ocn_comp_mct_mp_o         529  ocn_comp_mct.f90
0448: acme.exe           0000000000429E0E  component_mod_mp_         227  component_mod.F90
0448: acme.exe           0000000000419BD9  cesm_comp_mod_mp_        1191  cesm_comp_mod.F90
0448: acme.exe           0000000000426DC0  MAIN__                     63  cesm_driver.F90
0448: acme.exe           000000000040BB0E  Unknown               Unknown  Unknown
0448: acme.exe           0000000001E99920  Unknown               Unknown  Unknown
0448: acme.exe           000000000040B9F7  Unknown               Unknown  Unknown

/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n065t02.m64.st53P-restart

jonbob commented 7 years ago

@ndkeen - something went wrong. The ocean error logs are all saying:

ERROR: Stream 'restart' attempted to read non-existent file 'mpaso.rst.0001-01-06_00000.nc'

ndkeen commented 7 years ago

OK, I may have tried to restart incorrectly. What are the appropriate steps?

jonbob commented 7 years ago

@ndkeen - I checked your directory and there is no ocean restart file? But I think you followed the correct steps. Can you redo the original run (meaning set CONTINUE_RUN back to false) and let's make sure all the restart files get output correctly?

ndkeen commented 7 years ago

OK, will do. I'm not sure what happened, as I set REST_OPTION/REST_N to be the same as STOP_OPTION/STOP_N:

    <entry id="REST_OPTION" value="ndays">
    <entry id="REST_N" value="5">

But I admit I'm not sure I verified the restart files were there, as I had tested it previously.
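
(A quick sanity check before flipping CONTINUE_RUN - generic commands, nothing specific to this case - would be:)

# Confirm the restart-frequency settings actually took effect
./xmlquery REST_OPTION,REST_N,STOP_OPTION,STOP_N

# Make sure the component and coupler restart files were written
ls -l run/*.rst.*.nc run/*.cpl.r.*.nc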

jonbob commented 7 years ago

I think you did everything correctly. I'm trying to see if something got messed up, but the ocn and acme logs both closed successfully. So nothing is jumping out -- other than running it again and making sure the restart file shows up. And I don't check for them either, so who knows? IO at high-res is just problematic...

ndkeen commented 7 years ago

As Cori scratch will start enforcing purging soon, I've been trying to back up to tape. And before doing that, I have been very busy trying to remove large files from directories, so it's possible I accidentally deleted the restart files. I can try in another directory that does have the restart files, but I also restarted a fresh experiment just now.

It would also be nice if the code could let us know a little sooner that the restart files are missing.

jonbob commented 7 years ago

That would make some sense -- everything looked set up to correctly output the files. And no sign in any log of any problems. So yes, please try again and let me know

jonbob commented 7 years ago

And I'll see if it's possible for restart files to get included in the check_input_data step. I'm not sure, because the code has to parse the rpointer files to fully extract the filenames. But it's always frustrating to hit issues with missing files after getting through the queues.
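
(Something like the following would catch a missing restart file before submitting - just a sketch of the idea, assuming the usual rpointer.* naming in the run directory and only checking entries ending in .nc:)

# From the run directory: verify that every restart file named in an
# rpointer file is actually present before resubmitting
for p in rpointer.*; do
  for f in $(grep -o '[^ ]*\.nc' "$p"); do
    [ -f "$f" ] || echo "MISSING: $f (listed in $p)"
  done
done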

ndkeen commented 7 years ago

I ran it again and here are the netcdf files I see:

-rw-rw-r-- 1 ndk ndk  8216369200 Jun 21 06:46 run/mpascice.rst.0001-01-06_00000.nc
-rw-rw-r-- 1 ndk ndk 28630788832 Jun 21 06:46 run/mpaso.rst.0001-01-06_00000.nc
-rw-rw-r-- 1 ndk ndk  3767851844 Jun 21 06:47 run/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n065t02.m64.st53P-restartb.cpl.r.0001-01-06-00000.nc

So it does appear that the larger one somehow got deleted, which was certainly an accident while I was cleaning up.

Trying to restart from this.

xmlchange CONTINUE_RUN=TRUE
case.submit

jonbob commented 7 years ago

That looks correct, though you should also have some restart files for the data models. We'll cross our fingers

ndkeen commented 7 years ago

OK, it worked. Read in restarts, ran another 5 days, and exited. And wrote another set of restarts.

g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n065t02.m64.st53P-restartb

Sorry about the scare.

worleyph commented 7 years ago

Did ocean warning messages go away for the continuation run?

ndkeen commented 7 years ago

No, they are still being written.

worleyph commented 7 years ago

So, @vanroekel, @ndkeen's restart test is a failure? Anything else we should try? Should we modify the code to output more details on the failure, e.g., what the values being read in that trigger the warning look like?

ndkeen commented 7 years ago

I did NOT remove them from the first run, but looking at the timestamps, most of them were written during the 2nd job. There are 32 log.ocean.*err files that have a timestamp of "5:30" and 572 files with a stamp of "14:00". So I think they are still being written, but maybe not by exactly the same ranks.
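
(For the record, a more direct count than eyeballing timestamps - assuming the warning text is exactly as quoted earlier - is to grep the ocean logs:)

# Count how many ocean error logs contain the layerThickness/bottomDepth warning
grep -l 'abs(sum(h)-bottomDepth)' log.ocean.*err | wc -l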

jonbob commented 7 years ago

@worleyph - I don't think that means @ndkeen's test is a failure at all. I filed an issue in mpas to make these warnings appear as warnings instead of errors. We looked more at what spawns the errors/warnings, and it's minor differences between the sum of level thicknesses and the sea surface height. So as soon as we have some extra cycles, we'll sort it out.

Do you want me to point you at the corresponding mpas issue?

ndkeen commented 7 years ago

Note: I missed an important "NOT" in my last comment. I did not remove the log files between the runs.

worleyph commented 7 years ago

 We looked more at what spawns the errors/warnings, and it's minor differences between the sum of level thicknesses and the sea surface height.

Thanks.

 Do you want me to point you at the corresponding mpas issue?

No thanks.

I still do want someone to update the compset definitions so that we can use @vanroekel's files as the defaults (and in production runs, @golaz?), so that we don't run into the slow ocean initialization problems again - unless @jayeshkrishna can work out why the fallback PIO method is so slow (and make it faster), which would make this less important.

jonbob commented 7 years ago

OK, that's a simple ACME thing. I'll try to do that today.

vanroekel commented 7 years ago

@jonbob I have a year 56 restart for an IC that you could use for ACME if you'd like:

/lustre/atlas1/cli115/world-shared/vanroekel/mpaso.rst.0056-01-01_0000.nc and mpascice.rst.0056-01-01.nc

worleyph commented 7 years ago

@jonbob, it won't be BFB :-). Will anyone else care about this?

jonbob commented 7 years ago

no - it's not being tested yet. Or ever...

jonbob commented 7 years ago

@vanroekel - thanks. Are those files already on the svn server? If so, I'll update to just point at them...

vanroekel commented 7 years ago

nope, I don't have push access to svn.

jonbob commented 7 years ago

@vanroekel - but xtime has been yanked out?