Yes, I can try, but I admit, I'm not sure how?
In your run directory you should have a streams.ocean file. At the top should be something like:
<immutable_stream name="mesh"
type="none"
io_type="pnetcdf,cdf5"
filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
/>
<immutable_stream name="input"
type="input"
io_type="pnetcdf,cdf5"
input_interval="initial_only"
filename_template="/lustre/atlas1/cli900/world-shared/cesm/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
/>
To test this, change the path in both filename templates to point to the ocean restart I mentioned above. It should be identical for both of these streams. Once you make these changes, copy the streams.ocean file to the SourceMods/src.mpaso folder in your case directory and submit the test.
OK, I copied run/streams.ocean to my.streams.ocean, made the change:
cori05% diff run/streams.ocean my.streams.ocean
6c6
< filename_template="/project/projectdirs/acme/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
---
> filename_template="/global/cscratch1/sd/lvroekel/mpaso.rst.0039-01-01_00000.nc"
12c12
< filename_template="/project/projectdirs/acme/inputdata/ocn/mpas-o/oRRS18to6v3/oRRS18to6v3.170111.nc"
---
> filename_template="/global/cscratch1/sd/lvroekel/mpascice.rst.0039-01-01_00000.nc"
And then just cp my.streams.ocean SourceMods/src.mpaso/streams.ocean
?
Hi @vanroekel , I am trying this on Titan. log.ocean.###.err files are being generated containing error messages like:
ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.
Ignoring that, model initialization dropped from 15 minutes to 3 minutes with this change in the initial condition file. The file read is now using the pnetcdf io type (instead of the netcdf4p io type), and is "fast". Looks like this solves the problem on Titan. Hopefully @ndkeen will see the same type of improvement on Cori-KNL.
@worleyph that error message means the initial sea surface height is unphysical. It may indicate that variables are read in incorrectly, or initialized to zero. Are there any statistics written to check that the model ran normally? Perhaps there is an output file written at the end that can be compared with a previous run.
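For reference, the check behind that message compares the vertical sum of layerThickness against bottomDepth with a 2 m tolerance. A minimal sketch of that kind of check (names follow the usual MPAS-Ocean conventions, but this is illustrative, not the actual source):
do iCell = 1, nCells
   ! accumulate the column thickness from the initial condition
   thicknessSum = 0.0_RKIND
   do k = 1, maxLevelCell(iCell)
      thicknessSum = thicknessSum + layerThickness(k, iCell)
   end do
   ! the summed layer thicknesses should reproduce the column depth to within 2 m
   if (abs(thicknessSum - bottomDepth(iCell)) > 2.0_RKIND) then
      write(0,*) 'initial layerThickness does not match bottomDepth at cell ', iCell
   end if
end do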
@ndkeen you are right on the process, assuming you are copying to your case directory's SourceMods/src.mpaso.
@mark-petersen, I won't be able to look at this for a few days. Perhaps someone else can look at this.
@worleyph Can you give me the path of your run directory? I can check if results are valid.
Sorry, I'm still having trouble verifying that I've done this right and that it's working -- should be soon. Ah, OK, the run I submitted last night seems to have failed with a segfault. I have been trying to clean up my additional timing/writes, but I don't think that would cause a fault here. I will try again.
Would I also need to edit streams.cice?
0000: forrtl: severe (174): SIGSEGV, segmentation fault occurred
0000: Image PC Routine Line Source
0000: acme.exe 0000000001DF7561 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001DF569B Unknown Unknown Unknown
0000: acme.exe 0000000001D9EEE4 Unknown Unknown Unknown
0000: acme.exe 0000000001D9ECF6 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001D21D06 Unknown Unknown Unknown
0000: acme.exe 0000000001D2D716 Unknown Unknown Unknown
0000: acme.exe (deleted 0000000001A0A490 Unknown Unknown Unknown
0000: acme.exe 0000000001E3F001 Unknown Unknown Unknown
0000: acme.exe 0000000001D6DBC7 Unknown Unknown Unknown
0000: acme.exe 0000000000E1839C ocn_vmix_mp_ocn_t 458 mpas_ocn_vmix.f90
0000: acme.exe 0000000000E1372D ocn_vmix_mp_ocn_v 572 mpas_ocn_vmix.f90
0000: acme.exe 0000000000B3D5F4 ocn_time_integrat 1721 mpas_ocn_time_integration_split.f90
0000: acme.exe 0000000000B33AA7 ocn_time_integrat 123 mpas_ocn_time_integration.f90
0000: acme.exe 0000000000B250B7 ocn_comp_mct_mp_o 916 ocn_comp_mct.f90
0000: acme.exe 000000000042AF5B component_mod_mp_ 705 component_mod.F90
0000: acme.exe 000000000040FBCB cesm_comp_mod_mp_ 3406 cesm_comp_mod.F90
0000: acme.exe 000000000042AC62 MAIN__ 68 cesm_driver.F90
0000: acme.exe (deleted 000000000040BB0E Unknown Unknown Unknown
0000: acme.exe 0000000001E17A20 Unknown Unknown Unknown
0000: acme.exe 000000000040B9F7 Unknown Unknown Unknown
@mark-petersen , the run directory is /lustre/atlas/scratch/worley/cli115/g18*pgi3/run . Note the wildcard in the case name - can't remember the full name. I think that it is open, but the log files typically are not readable by others, so that probably won't help. I won't be able to get on and change permissions until this evening. Sorry.
Running again, I can verify that it is beyond OCN init much quicker than it was before. Still need to get a complete timing value.
OK, it looks like the OCN init time has dropped drastically. However, I still see the same seg fault as posted above, which happens AFTER the model init is complete. The line in question is:
deallocate(A, B, C, tracersTemp)
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01printsd
@ndkeen that is very strange. The deallocate line is very standard:
vi src/core_ocean/shared/mpas_ocn_vmix.F
real (kind=RKIND), dimension(:), allocatable :: A,B,C
real (kind=RKIND), dimension(:,:), allocatable :: tracersTemp
...
allocate(A(nVertLevels),B(nVertLevels),C(nVertLevels),tracersTemp(num_tracers,nVertLevels))
...
deallocate(A, B, C, tracersTemp)
It's hard to imagine what the problem is. I looked for this error message in your directory but didn't find it. Did it give any clue about the cause of the error? Is this run with debug on, or OpenMP on?
@ndkeen and @mark-petersen - keep in mind this is also the last executable statement in the routine, and it will sometimes get tagged for a seg fault for anything that happens on the return from the subroutine. So it could be the deallocate, but it could also be something else...
OK. It should be the case that if I go back to using the older (slower) netcdf files in streams.ocean, it will simulate one day. Are there any other ways in which changing those two netcdf files might "conflict" with something else? Does streams.cice also need to be edited?
Another attempt (where I adjusted PE layouts) also failed in what looks like same location -- stack traces are a little more noisy. case dir here:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n050t01printse
I guess Pat's run with that change did not cause a problem.
I also have a clean checkout of current master and am trying the same launch script WITH the edited streams.ocean file.
@mark-petersen : you said you couldn't find the error message? It should be here:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m35wit-may18.n066t01printsd/run/acme.log.170601-095013
Using master as of today, I made another attempt using the edits to streams.ocean, and I get the same error (same location).
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m38-jun1.n050t01
This one uses 50 nodes (32 OCN and 15 ICE/CPL) to help it get through the queue faster.
Try returning the error from allocate/deallocate. Like this:
integer:: ier
deallocate(A, B, C, tracersTemp, stat=ier)
Although as Phil said the deallocate itself is probably not the problem.
OK, I'm trying what Rob suggested (making the change in the two deallocates that are similar). I'm also trying a rebuild using debug flags, as well as a run with the "slow" files to verify that those still work (with the June 1st master).
Using deallocate(A, B, C, tracersTemp, stat=ier)
did not have any different outcome.
And fwiw, I also just tried G60to30.T62_oEC60to30v3.GMPAS-IAF to make sure restarts work. They do.
@ndkeen I think Rob's suggestion of adding the error flag to deallocate was to also check the value of ier, to see whether the issue was actually an error deallocating or whether the error was just flagging the last statement of the routine due to a seg fault on exit.
OK, I am trying that.
I also see a failure with GNU. I built with some debug flags, and using ATP I get a different-looking stack. I'm not certain whether this is the same issue or not.
ATP Stack walkback for Rank 1622 starting:
_start@start.S:122
__libc_start_main@libc-start.c:285
main@0x40bb0d
MAIN__@cesm_driver.F90:63
cesm_comp_mod_mp_cesm_init_@cesm_comp_mod.F90:1184
component_mod_mp_component_init_cc_@component_mod.F90:239
ocn_comp_mct_mp_ocn_init_mct_@ocn_comp_mct.f90:529
ocn_core_mp_ocn_core_init_@mpas_ocn_core.f90:89
ocn_forward_mode_mp_ocn_forward_mode_init_@mpas_ocn_forward_mode.f90:280
ocn_time_integration_split_mp_ocn_time_integration_split_init_@mpas_ocn_time_integration_split.f90:2011
ATP Stack walkback for Rank 1622 done
The last line here is 2011 -- I assume it's a divide by zero? Copy/pasting the code and comments, as it looks like someone was wary of layerThicknessSum being 0.
! thicknessSum is initialized outside the loop because on land boundaries
! maxLevelEdgeTop=0, but I want to initialize thicknessSum with a
! nonzero value to avoid a NaN.
layerThicknessEdge1 = 0.5_RKIND*( layerThickness(1,cell1) + layerThickness(1,cell2) )
normalThicknessFluxSum = layerThicknessEdge1 * normalVelocity(1,iEdge)
layerThicknessSum = layerThicknessEdge1
do k=2, maxLevelEdgeTop(iEdge)
! ocn_diagnostic_solve has not yet been called, so compute hEdge
! just for this edge.
layerThicknessEdge1 = 0.5_RKIND*( layerThickness(k,cell1) + layerThickness(k,cell2) )
normalThicknessFluxSum = normalThicknessFluxSum &
+ layerThicknessEdge1 * normalVelocity(k,iEdge)
layerThicknessSum = layerThicknessSum + layerThicknessEdge1
enddo
normalBarotropicVelocity(iEdge) = normalThicknessFluxSum / layerThicknessSum ! ndk line 2011
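Just to illustrate the suspicion, a guard around that division would look something like the following (a hypothetical sketch to show what the divide-by-zero would mean, not a proposed fix):
if (layerThicknessSum > 0.0_RKIND) then
   normalBarotropicVelocity(iEdge) = normalThicknessFluxSum / layerThicknessSum
else
   ! layerThicknessSum should only end up zero here if layerThickness itself is
   ! zero at both cells, e.g. if the initial condition was read incorrectly
   normalBarotropicVelocity(iEdge) = 0.0_RKIND
   write(0,*) 'zero layerThicknessSum at edge ', iEdge
end if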
Compiling with GNU, I still see the error in the deallocate statement. To me, this points to a "memory error". With all of these runs there is a lot of stack trace noise, and I try to find one that shows the finest detail. Note the write of ier=0, so this error must be from the second time it goes through the function? And, yes, I see that it is failing at a different location than the error using Intel above -- it could be two different things. Ah -- but I also built/ran with DEBUG GNU, and that shows the same FP error as above.
0159: ocn_vel_vmix_tend_implicit deallocate ier= 0
0159: *** Error in `/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/acme.exe': corrupted double-linked list: 0x0000000020779490 ***
....
0159: at /home/abuild/rpmbuild/BUILD/glibc-2.19/malloc/malloc.c:4029
0159: #6 0xdc462e in __ocn_vmix_MOD_ocn_tracer_vmix_tend_implicit
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/shared/mpas_ocn_vmix.f90:459
0159: #7 0xdc58a5 in __ocn_vmix_MOD_ocn_vmix_implicit
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/shared/mpas_ocn_vmix.f90:574
0159: #8 0xa2cdc0 in __ocn_time_integration_split_MOD_ocn_time_integrator_split
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.agnu/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_time_integration_split.f90:1721
...
Here is the stack for the DEBUG GNU build:
0159: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
0159:
0159: Backtrace for this error:
0159: #0 0x1aff48f in ???
0159: at /home/abuild/rpmbuild/BUILD/glibc-2.19/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0159: #1 0xd7f8e5 in __ocn_time_integration_split_MOD_ocn_time_integration_split_init
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.bgnudebug/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_time_integration_split.f90:2011
0159: #2 0x143c6e7 in __ocn_forward_mode_MOD_ocn_forward_mode_init
0159: at /global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_gnu.m38-jun1.n050t01.bgnudebug/bld/ocn/source/core_ocean/mode_forward/mpas_ocn_forward_mode.f90:280
I think I see a problem with what I've been trying. Editing the streams.ocean file and using the SourceMods directory is new to me. I did copy/paste what I did and asked about cice. I now suspect that what you wanted me to try was changing only streams.ocean and using ONLY the new mpaso restart file -- replacing that string in two places, one for mesh and one for input. NOT what I did, which was to use the mpascice rst file in one of those. I'm trying again -- should run in debug.
To follow up on the deallocate error flag, yes after each line where you return the error, add a line like:
if(ier /= 0) write(0,*) "deallocate error", ier
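Putting the two pieces together, the instrumented block would look roughly like this (a diagnostic sketch only; ier is a local integer added for the test):
integer :: ier
...
! return the status instead of aborting inside deallocate, then report it, to
! separate a real deallocate failure from a seg fault that merely gets
! attributed to the routine's last executable statement
deallocate(A, B, C, tracersTemp, stat=ier)
if (ier /= 0) write(0,*) 'deallocate error in ocn_vmix, stat = ', ier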
Trying the correct edit to streams.ocean, I can now get the job to complete one day. The total time in Init is now 459 secs. This is for a 50-node run, where OCN has 32 nodes. Previous runs with this layout were measuring 6500 seconds for Init. So this is a 13x speedup. Also this is with June 1st master.
Sorry about the mistake, I acted too quickly. I saw 2 new files and 2 locations that needed to be changed, so that's what I did.
Should we copy these files into /project and make this change?
So you've reduced spin-up time from nearly 2 hrs to 8 min? NICE!
Well, the "Init Time", yes. And that was for 50-nodes. Previously, the Init time was somewhat scaling with nodes, so I'd like to also see a small Init time when I use more nodes.
I had several other jobs stop after running out of time -- and then realized I still had DEBUG=TRUE on from a previous experiment. Resubmitting. But, it also looks like we no longer see the same FP issues I described above -- surely related to my error in streams.ocean.
Regarding my earlier suggestion to "copy the files to /project" -- I really meant to the servers -- I was asking: is this an acceptable solution, or is this just a test to verify that this is the problem?
@mark-petersen , I changed permissions so that the directory /lustre/atlas1/cli115/scratch/worley/g18to6.T62_oRRS18to6v3.GMPAS-IAF.titan_pgi3/run is readable. I don't know whether this is permanent or not. If you try and can't get in, I'll copy this elsewhere next. I am going to run some more experiments in this directory, and I moved the output of the previous run, with the error messages, to
/lustre/atlas1/cli115/scratch/worley/g18to6.T62_oRRS18to6v3.GMPAS-IAF.titan_pgi3/run/170601-014551
@ndkeen , did your run with the new ocean initialization file generate log.ocean.####.err files?
Yes, there are always many log.ocean files. All with the same message:
ERROR: Warning: abs(sum(h)-bottomDepth)>2m. Most likely, initial layerThickness does not match bottomDepth.
I think that this needs to be resolved before starting to use these files for ocean (and sea ice) initialization. Perhaps there is version skew between master and the version of the code that @vanroekel and @mark-petersen are using?
OK, sounds good.
For a hires coupled case that uses 319 nodes total (a smaller overall PE layout than reported above, though still uses 150 nodes each for ICE and OCN), the "Init" time dropped from 8015 seconds to 1103 seconds.
Can someone explain again what the issue was? The original files were in a different netcdf format that caused ACME to read them using a different algorithm which was very slow? Could there be other files where this might happen?
I recall at one point that there were some MPAS folks requesting pnetcdf 1.5 be installed on Cori. I was unable to convince NERSC to do that, but I can try to help get around whatever the issue was. We are currently using pnetcdf 1.7 and have 1.8 installed (which I tried, but nothing different happens).
I can now use the debug Q to run hires G-case tests (at least for one day with no restarts). I might start a different github issue regarding the failing restarts with hires G case.
@ndkeen and @worleyph for your restart failures, are you running G-cases? If so, can you check that PIO_TYPENAME = netcdf for all data components and the coupler? I had G-case failures unless PIO_TYPENAME was changed for these components (see #1451). Note that netcdf was the default for G cases in CIME5.1.
I moved the restart conversation to https://github.com/ACME-Climate/ACME/issues/1574
@ndkeen regarding pnetcdf, our restart file violates the cdf2 file size constraints (> 4 GB for one variable). In pnetcdf/1.5.0 there was a bug that allowed us to write such files in spite of this (there was no variable size check). If master is used, there is no need to have version 1.5.0. The catch is that you need to use files produced by pnetcdf > 1.5.0 (the restarts I link to above). I can send the necessary streams changes for sea ice if you'd like to add tests for that.
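To make the size constraint concrete, here is a back-of-the-envelope check of the CDF-2 per-variable limit. The mesh and tracer counts below are hypothetical placeholders, not the actual oRRS18to6v3 dimensions:
program cdf2_limit_check
   implicit none
   integer, parameter :: i8 = selected_int_kind(18)
   integer(i8), parameter :: nCells      = 3700000_i8         ! placeholder cell count
   integer(i8), parameter :: nVertLevels = 80_i8              ! placeholder level count
   integer(i8), parameter :: nTracers    = 8_i8               ! placeholder tracer count
   integer(i8), parameter :: cdf2Limit   = 4_i8 * 1024_i8**3  ! ~4 GiB per variable
   integer(i8) :: varBytes

   ! one bundled double-precision tracers variable: nTracers x nVertLevels x nCells
   varBytes = nTracers * nVertLevels * nCells * 8_i8
   print *, 'variable size (bytes):', varBytes
   if (varBytes > cdf2Limit) then
      print *, 'exceeds the CDF-2 per-variable limit; a CDF-5 file (io_type "pnetcdf,cdf5") is needed'
   else
      print *, 'fits within the CDF-2 limit'
   end if
end program cdf2_limit_check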
@vanroekel , this failure is when creating and writing to a new restart file, not writing an old one?
yes, this was in writing a new cpl restart file.
Still confused. I am using both your ocean and sea ice restart files, and the problem is showing up in the cpl restart write. Not sure what you are telling me. Or does this have nothing to do with the cpl restart write error?
For me, in a G-case, when I set PIO_TYPENAME = netcdf for cpl in env_run.xml, I can successfully write coupler restarts.
Right -- I also see the restarts working in a G case when I do ./xmlchange CPL_PIO_TYPENAME=netcdf, which I note in github #1574. A coupled test has yet to run.
FYI, I also updated the streams.cice file to point to the file Luke provided and tried another hires G case. It runs OK, but I don't see any noticeable performance diff AND I still get the log.ocean* files. Luke mentioned there might be another mod to make in one of these files to make it more consistent.
@ndkeen - is this issue resolved?
Yes. Now there may still be an issue that the PIO code is not waving its arms more frantically when it drops down to a lesser (serial) method of reading large files. However, using MPAS restart files that were written with a newer version of pnetcdf seems to have fixed things for us and it's better going forward.
Generic issue, but I wanted a place to store notes, and we will surely have some code changes to help diagnose/address this.
I'm running on cori-knl, but I understand others have seen similar slow init times. I'm still trying to figure out whether this is highly dependent on the total number of MPI tasks, the number of MPI tasks for a given component, other settings, etc. The following is for a G-case where I'm giving 150 nodes to OCN and 150 to ICE+CPL. On each node, I'm using 64 MPI tasks (pure MPI mode) for a total of 9600 tasks on each of the MPAS components. I also see even slower init times for the coupled hi-res problem, which uses a similar MPAS setup (afaik) and the same number of nodes for each component, but more total nodes in the job -- however, as it fails in restart, I don't have complete timing files.
The timer comp_init_cc_ice reports 1467, which would indicate most of the time in OCN init. And o_i:PIO:pio_read_nfdarray_double has 898 seconds from the following file:
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01/run/timing.170519-000252/model_timing.00000
Assuming the value of pnetcdf is being honored for the PIO_TYPENAME that is set in env_run.xml, then it looks like this is the only call that could be causing the time. I have some MCT files that @worleyph has modified to add more timers (though I need to backport that with the incoming MCT 2.0 files). I have jobs in the Q currently.
/global/cscratch1/sd/ndk/acme_scratch/cori-knl/g18to6.T62_oRRS18to6v3.GMPAS-IAF.cori-knl_intel.m34-may17.n302t01
fyi, when I tried a 32-node hi-res problem, giving OCN 15 nodes and ICE+CPL 15 nodes, I did a top on each compute node while it was in init. The first 15 nodes are using about 53 GB, while the remaining nodes are using 5 GB.