exact restarts don't work when KPP is on

mark-petersen commented 9 years ago

Using tag v3.3, when I run 10 days versus 5+5 days with a restart, the results are bit-for-bit (bfb) identical when cvmix is off. When cvmix is on with the namelist below, the results are not bfb: there is a mismatch in digit 7 of global kinetic energy.
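
To put a number on a statement like "mismatch in digit 7", one option is to count how many leading significant digits two diagnostic values share. The sketch below is illustrative only: the function name `agreeing_digits` and the sample values are made up and are not taken from the runs in this issue.

```python
import math

def agreeing_digits(a, b):
    """Approximate count of leading significant digits shared by a and b.

    Returns math.inf when the two values compare equal.
    """
    if a == b:
        return math.inf
    scale = max(abs(a), abs(b))
    rel = abs(a - b) / scale          # relative difference
    return max(0, math.floor(-math.log10(rel)))

# Hypothetical global kinetic energy values from a 10-day run and a 5+5 run.
ke_straight = 1.0
ke_restart = 1.0000003
print(agreeing_digits(ke_straight, ke_restart))  # 6 digits agree, so digit 7 differs
```

A result of `math.inf` is necessary but not sufficient evidence for a bfb restart, since a single global diagnostic can agree while individual fields differ.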

I see these variables in the restart file for KPP:

                        <var name="surfaceWindStress"/>
                        <var name="seaSurfacePressure"/>
                        <var name="boundaryLayerDepth"/>

and I confirmed that they are in the restart file for my bfb test.

My runs are at:

/lustre/scratch1/turquoise/mpeterse/runs/c45s  10 days
/lustre/scratch1/turquoise/mpeterse/runs/c45t  first 5 days
/lustre/scratch1/turquoise/mpeterse/runs/c45u  restart, second 5 days

and here is my namelist for cvmix:

&cvmix
    config_use_cvmix = .true.
    config_cvmix_prandtl_number = 1.0
    config_use_cvmix_background = .true.
    config_cvmix_background_diffusion = 1.0e-5
    config_cvmix_background_viscosity = 1.0e-4
    config_use_cvmix_convection = .true.
    config_cvmix_convective_diffusion = 1.0
    config_cvmix_convective_viscosity = 1.0
    config_cvmix_convective_basedOnBVF = .true.
    config_cvmix_convective_triggerBVF = 0.0
    config_use_cvmix_shear = .true.
    config_cvmix_shear_mixing_scheme = 'KPP'
    config_cvmix_shear_PP_nu_zero = 0.005
    config_cvmix_shear_PP_alpha = 5.0
    config_cvmix_shear_PP_exp = 2.0
    config_cvmix_shear_KPP_nu_zero = 0.005
    config_cvmix_shear_KPP_Ri_zero = 0.7
    config_cvmix_shear_KPP_exp = 3
    config_use_cvmix_tidal_mixing = .false.
    config_use_cvmix_double_diffusion = .false.
    config_use_cvmix_kpp = .true.
    config_cvmix_kpp_niterate = 2
    config_cvmix_kpp_criticalBulkRichardsonNumber = 0.25
    config_cvmix_kpp_matching = 'SimpleShapes'
    config_cvmix_kpp_EkmanOBL = .false.
    config_cvmix_kpp_MonObOBL = .false.
    config_cvmix_kpp_interpolationOMLType = 'quadratic'
    config_cvmix_kpp_surface_layer_extent = 0.1
/

douglasjacobsen commented 9 years ago

@mark-petersen @toddringler could we use the CVMix test case to debug this? That might be faster than using the real world stuff, but I think we need this before we can reliably use CVMix in ACME runs to spin up the larger simulations (like the EC meshes).

mark-petersen commented 9 years ago

@toddringler if you could evaluate bfb restarts in the column cvmix test case, that would be very helpful. I am spinning up ACME now, but working on other issues as well. If you could work on this next week, then I can hopefully have bfb restarts in ACME with KPP on.

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen Using a single-column CVMix-KPP test case, I obtain bfb after restarting using 1 proc and with 8 procs. The test case is configured to run 15 days. I restarted a run after 5 days. I compared boundary layer depth at end of day 15. OBL depth = 42.189838715417 in every simulation.

We previously had an issue with lack of bfb restartability when OBL was smoothed horizontally. I searched the code to see if that smoothing was still present. I did not find any code related to this smoothing.

I will evaluate bfb using the baroclinic test case, since it contains structure in the horizontal.

I will evaluate bfb using the namelist you provided in the CVMix-KPP test case.

Please confirm that you are using the fixes provided in toddringler/MPAS/ocean/surface_layer_bug_fix. I can see how we might lose bfb without these fixes.

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen Using the name list you provided above in the single-column CVMix-KPP test case produces bfb restarts.

douglasjacobsen commented 9 years ago

Building on this, it might be good if we had a simple KPP test on the website as part of our downloadable test cases. Then I can add this to my testing suite to help ensure our use of CVmix is correct and tested as we continue.

I can almost guarantee that Mark's run is not using the bugfix you listed, since it's within ACME and I don't have that bugfix in the ACME branch I'm using. But that's good to know that it might be the fix we're looking for.

mark-petersen commented 9 years ago

Todd, Thanks for testing this so quickly.

I found this bug with ACME, but then tested yesterday with MPAS-O V3.3 ocean-only. When I repeated with surface_layer_bug_fix branch using QU.240km, 10 day and 5+5 day still have a mismatch in digit 7 of global avg KE.

I thought I tested this during the KPP review before V3.0, but I can't find it now, so maybe we've never had bfb restarts with KPP for global runs.

It is possible that there is some interaction with topography that prevents bfb restarts and that is not revealed by the column test.

I will do a quick restart bfb test with the baroclinic channel using kpp to test this.

Mark

On 02/13/15 09:45, Doug Jacobsen wrote:

Building on this, it might be good if we had a simple KPP test on the website as part of our downloadable test cases. Then I can add this to my testing suite to help ensure our use of CVmix is correct and tested as we continue.

I can almost guarantee that Mark's run is not using the bugfix you listed, since it's within ACME and I don't have that bugfix in the ACME branch I'm using. But that's good to know that it might be the fix we're looking for.

— Reply to this email directly or view it on GitHub https://github.com/MPAS-Dev/MPAS/issues/310#issuecomment-74284267.

douglasjacobsen commented 9 years ago

@mark-petersen Can we run the overflow test case with KPP?

mark-petersen commented 9 years ago

The baroclinic channel makes more sense because it is stratified. Todd is doing it now.

douglasjacobsen commented 9 years ago

@mark-petersen sure, but if you're thinking it's an issue with topography, then the baroclinic channel should have B4B restarts.

mark-petersen commented 9 years ago

Good point. As long as KPP does not have an error without stratification, that would be a simple way to test topography.

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen using executable from toddringler/MPAS/ocean/surface_layer_bug_fix and name list provided above by Mark, the baroclinic channel is bfb. Ran 10 km version for 10 days, restarted after 5 days. stats_avg file is identical after 10 days.

I will now try the overflow test case with KPP.

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen using executable from toddringler/MPAS/ocean/surface_layer_bug_fix and name list provided above by Mark, the overflow test case is bfb. Ran 10 km / 40 layer for 18 hours, restarted after 9 hours, stats_avg file is identical at end of 18 hours.

just because I am on a roll, I will try the 120 km real-world configuration.

douglasjacobsen commented 9 years ago

@toddringler Thanks for doing this.

Just a note: if you don't want to run for a long time, you can run for two time steps and then ncdiff the output files (for fields like temperature, layerThickness, and normalVelocity).

You should get zeros in the resulting fields. And if you want a python script to do this for you.... https://github.com/douglasjacobsen/dotfiles/blob/master/scripts/field_rms_errors.py
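
The check described here (difference the fields and expect zeros) can be sketched without any NetCDF tooling. This is a minimal illustration and is not the linked `field_rms_errors.py` script; reading the fields from the output files is left out, and the function names and sample values are hypothetical.

```python
import math
import struct

def rms_difference(field_a, field_b):
    """Root-mean-square of the element-wise difference of two equal-length fields."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must have the same length")
    if not field_a:
        return 0.0
    total = sum((x - y) ** 2 for x, y in zip(field_a, field_b))
    return math.sqrt(total / len(field_a))

def is_bit_for_bit(field_a, field_b):
    """True only if every pair of values has the same 64-bit representation."""
    if len(field_a) != len(field_b):
        return False
    return all(struct.pack("<d", x) == struct.pack("<d", y)
               for x, y in zip(field_a, field_b))

# Hypothetical temperature values from a straight run and a restarted run.
straight = [10.0, 12.5, 15.25]
restarted = [10.0, 12.5, 15.25 + 1e-12]

print(rms_difference(straight, straight))   # 0.0 for identical fields
print(is_bit_for_bit(straight, restarted))  # False
```

Comparing the packed bytes rather than using `==` keeps the test strict: it also distinguishes `+0.0` from `-0.0`, which a numerical difference would report as zero.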

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen worldOcean_QU_120km is bfb. started from restart at year 10. ran 2 days. restarted after 1 day. stats_avg identical at end of day 2. restart files at beginning of day 3 are identical for temperature. used the cvmix namelist provided by Mark. changed dt from 50 min to 30 min so I could hit day boundaries.

I am out of targets to fire at here.
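
The time-step change above follows from simple arithmetic: presumably a run can only stop and restart exactly on a day boundary if a whole number of steps fits in one day. A quick check using only the two step sizes mentioned (the helper name is hypothetical):

```python
MINUTES_PER_DAY = 24 * 60  # 1440

def steps_per_day(dt_minutes):
    """Return the whole number of steps in one day, or None if dt does not divide a day."""
    steps, remainder = divmod(MINUTES_PER_DAY, dt_minutes)
    return steps if remainder == 0 else None

print(steps_per_day(50))  # None: 1440 / 50 = 28.8 steps
print(steps_per_day(30))  # 48
```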

mark-petersen commented 9 years ago

I must have done something wrong with the 240km. I will look again.

toddringler commented 9 years ago

let me retest the 120 km. did not update the cvmix options properly.

toddringler commented 9 years ago

OK. i found a difference in bfb for global 120 km. let me dig into that a bit.

mark-petersen commented 9 years ago

Well, I hate to say that I'm glad to hear it, but I'm glad I didn't waste your morning chasing ghosts.

Mark

douglasjacobsen commented 9 years ago

So, these differences are only in the 120km? Can they be reproduced in the overflow as well?

toddringler commented 9 years ago

I went back and double checked the overflow .... it is bfb. i will check again, but right now i can only see it in the 120 km configuration. i have isolated it to the kpp section of cvmix.

toddringler commented 9 years ago

@mark-petersen @douglasjacobsen I traced the issue to this line of code:

    call cvmix_kpp_compute_OBL_depth( CVmix_vars = cvmix_variables)

If I comment out this line, I get bfb (but the OBL remains fixed in time).

I tried several fixes, e.g. calling the low-level version instead of the high-level version, looping over nCellsSolve, and passing in (1:maxLevelCell(iCell)) instead of (1:nVertLevels). No luck.

None of this has anything to do with halo updates, so my guess is this is an indexing out of bounds that is showing itself in this configuration.

Maybe someone can run this with bounds checking and/or debugger?

douglasjacobsen commented 9 years ago

@toddringler could you try running without PBCs (partial bottom cells)?

I can try to help out, but I might not get around to it until Tuesday.

toddringler commented 9 years ago

via

&partial_bottom_cells
    config_alter_ICs_for_pbcs = .false.

mark-petersen commented 9 years ago

Except that you need to start from a time zero initial condition, not a year 10 restart. The initial files are at

/turquoise/usr/projects/climate/mpeterse/grids_mpas/global/x1.120km/ocean.nc

and then use

    config_alter_ICs_for_pbcs = .true.    ! note: true
    config_pbc_alteration_type = 'full_cell'

because the time zero IC has actual topography in it.

mark-petersen commented 9 years ago

@toddringler the overflow test you already conducted used PBCs, so that is not the cause by itself. The standard test case on the overflow has:

&partial_bottom_cells
    config_alter_ICs_for_pbcs = .true.
    config_pbc_alteration_type = 'partial_cell'

mark-petersen commented 9 years ago

Update: testing on branch ocean/specify_obl_depth, now on ocean/develop and ocean/private.

The QU.240km is still not bit-for-bit across restarts, which is the same condition as on release V3.3. The error is small, in digit 7 after two days. It occurs with both split explicit and RK4.

I suspect that something occurs in a different order during the init process than during a normal timestep. For example, in mpas_init_block at line 499 of mpas_ocn_mpas_core.F:

    call ocn_diagnostic_solve(dt, statePool, forcingPool, meshPool, diagnosticsPool, scratchPool)

At that point, has boundaryLayerDepth been computed the same way as at the end of a time step?

mark-petersen commented 9 years ago

@toddringler Another clue here: cvmix with KPP does not produce bit-for-bit identical results when run on two different partitions. When I turn cvmix off, it does produce bit-for-bit identical results. This may be an easier way to investigate this bug.

I am using the QU.240km on ocean/develop after this last merge (#325), my run t02p. I used ncdiff on the restart files and they are different.

douglasjacobsen commented 9 years ago

I'm spending some time trying to debug this, because we really need to fix this...

A note to anyone else looking at it, I've determined that the problem is related to velocity and cvmix. It also seems to only happen when: config_use_cvmix_kpp = .true.
