ESCOMP / CAM

Community Atmosphere Model

MPASA restart failure #874

Open jedwards4b opened 11 months ago

jedwards4b commented 11 months ago

What happened?

I noticed when testing ERS_Ln9.mpasa7p5_mpasa7p5_mg17.QPC6.derecho_intel.cam-outfrq9s that CLUBB was generating an error on restart: Error in advance_xp2_xpyp. First, this leads to an intolerable amount of output to stdout, which will need to be addressed for high-resolution runs.

Second, I repeated this test with ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s, and it fails with the same error.

What are the steps to reproduce the bug?

Just run the test: ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s
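
For reference, a minimal sketch of launching the test with CIME's create_test (the checkout path and test root below are placeholders; adjust for your environment):

  cd <cam_checkout>/cime/scripts
  ./create_test ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s \
      --test-root /glade/derecho/scratch/$USER/tests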

What CAM tag were you using?

cam6_3_119 - cam6_3_122

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

Yes, but I will need some help

Extra info

No response

adamrher commented 11 months ago

I don't know that this will resolve the restart failure, but it's relevant to the massive restart files. When we updated the CLUBB external earlier this year, we switched the CLUBB PDF closure to run after the CLUBB solver, which requires adding a number of higher-order moments, generated by the PDF closure, to the restarts. You can experiment with putting the PDF closure back, which I believe trims the restarts to roughly the size they were with the old externals.

clubb_ipdf_call_placement=1

in user_nl_cam will switch it back. @Katetc can you verify whether this will trim the restarts?

jedwards4b commented 11 months ago

@adamrher - that didn't solve the problem, but thank you for the suggestion.

adamrher commented 11 months ago

OK. In case it helps: to trim the default I/O in the h0 tapes, I usually remove all the aerosol/chemistry species via:

 history_chemistry       = .false.
 history_chemspecies_srf = .false.

jedwards4b commented 11 months ago

I'm using empty_htapes=.true.; this is purely a restart issue.

jedwards4b commented 10 months ago

I was able to run ERS_Ln9.mpasa120_mpasa120.QPC6.derecho_intel.cam-outfrq9s and ERS_Ln9.mpasa60_mpasa60.QPC6.derecho_intel.cam-outfrq9s successfully, but ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s still gives the same error in spite of extreme changes in the timestep:

mpas_dt = 45.0, ATM_NCPL: 80, NCPL_BASE_PERIOD: hour
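
A sketch of how those settings can be applied in the case directory (ATM_NCPL and NCPL_BASE_PERIOD are standard CIME XML variables; putting mpas_dt in user_nl_cam is an assumption about where the MPAS dycore timestep is set):

  # Couple the atmosphere 80 times per hour, i.e. a 45 s coupling interval.
  ./xmlchange NCPL_BASE_PERIOD=hour
  ./xmlchange ATM_NCPL=80
  # MPAS dynamics timestep, matching the value quoted above.
  echo "mpas_dt = 45.0" >> user_nl_cam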

fvitt commented 10 months ago

@jedwards4b What are the values of P0 and pbuf_time_idx in the restart files of these failing cases? In my high-res (ne120) WACCM-X runs on derecho these are zero in the restart file, while they should be 100000 and 1, respectively.
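
One quick way to check those values, assuming standard NetCDF tools (the restart filename below is a placeholder for the actual .cam.r. file in the run directory):

  # Dump the pbuf_time_idx value and look for P0 in the file header.
  ncdump -v pbuf_time_idx <case>.cam.r.0001-01-01-00000.nc | tail
  ncdump -h <case>.cam.r.0001-01-01-00000.nc | grep -i " P0"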

jedwards4b commented 10 months ago

@fvitt thanks - there is no P0 in the file, but pbuf_time_idx is 0, where it is 1 in the lower-resolution files.

jedwards4b commented 10 months ago

@fvitt can you provide instructions to reproduce the ne120 case? I would like to add it to my testing.

jtruesdal commented 10 months ago

I'll be interested to see the setup as well. Looking through a current set of regression tests, it seems we haven't been testing aquaplanet MPAS, and I don't see an aquaplanet initial condition (APE). I tried configuring one using analytic atmospheric conditions but got an error failing to read the u wind, which should have been prescribed by the analytic setting; that seems to be a bug. Aquaplanet should also work given an aquaplanet initial condition file with PHIS set to 0. If you are testing high-resolution MPAS, it might be more straightforward to run a pure analytic case (FHS94) or a full F case like F2000climo, for which we have initial condition files and which are routinely run as part of the regression tests.

jedwards4b commented 10 months ago

@jtruesdal Note that the lower-resolution cases work and only the high-resolution cases fail. I can also print the value of, e.g., pbuf_time_idx in the initial case and see that the value that should be written is correctly passed to PIO, yet the value in the file is 0. I think this is a problem somewhere in the I/O stack and not in CAM.

fvitt commented 10 months ago

@jedwards4b Clone this case: /glade/derecho/scratch/fvitt/fx2000_ne120pg3L273_test08

jedwards4b commented 10 months ago

@fvitt I found that the ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s case works when I increase NTASKS, which confirms that the problem is memory-related. I will continue to work on trapping this error, but I would suggest that you try using a larger PE layout for your case. I see that you are currently using 7200; it would be better to use a multiple of 128. Maybe NTASKS=12800?

jedwards4b commented 10 months ago

I was able to run and pass ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s restart tests on 4480 tasks (35 nodes) and 1536 tasks (12 nodes). The original case that failed had 512 tasks. Working on the theory that this is a memory issue, I tried going back to NTASKS=512 but using fewer tasks per node: MAX_MPITASKS_PER_NODE=64 (8 nodes) FAILS, while MAX_MPITASKS_PER_NODE=32 (16 nodes) PASSES.
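
A sketch of the PE-layout changes described above, using standard CIME commands in the case directory:

  # Keep 512 tasks but spread them across more nodes so each task gets more memory.
  ./xmlchange NTASKS=512
  ./xmlchange MAX_MPITASKS_PER_NODE=32
  ./case.setup --reset
  ./case.build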

jedwards4b commented 10 months ago

I can reproduce this issue with: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel running on 128 tasks with REST_N=3,REST_OPTION=nsteps.
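
A sketch of reproducing that configuration, assuming a recent CIME (exact flags may differ):

  ./create_test SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel --no-build
  # Then, in the generated test case directory:
  ./xmlchange NTASKS=128
  ./xmlchange REST_OPTION=nsteps,REST_N=3
  ./case.build && ./case.submit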

cacraigucar commented 6 months ago

@jedwards4b - Is this still an issue for you?

jedwards4b commented 6 months ago

I think the question is: is it an issue for you? I would suggest rerunning this test to find out: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel

cacraigucar commented 23 hours ago

@briandobbins - @PeterHjortLauritzen suggested you might have some input on this.