E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Problems running e3sm_scream_v1_medres on summit #1610

Closed jgfouca closed 1 year ago

jgfouca commented 2 years ago

Cases: SMS_D_Ln2.ne30_ne30.F2000SCREAMv1 SMS_Ln2.ne30_ne30.F2000-SCREAMv1-AQP1

Error: cudaGetLastError() error( cudaErrorMemoryAllocation): out of memory /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_KernelLaunch.hpp:644

Potential solution: Use more nodes? Current config is:

CASE INFO:
  nodes: 4
  total tasks: 128
  tasks per node: 42
  thread count: 1
  ngpus per node: 0

...

jsrun -X 1 --nrs 24 --rs_per_host 6 --tasks_per_rs 7 -d plane:7 --cpu_per_rs 7 --gpu_per_rs 1 --bind packed:smt:1 ...

If I double tasks to 256, it runs longer and then crashes with a new error:

1: 18: terminate called after throwing an instance of 'std::logic_error'
1: 18:   what():  /gpfs/alpine/cli115/proj-shared/acmetest/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:250: FAIL:
1: 18: false
1: 18: Error! Failed post-condition check (cannot be repaired).
1: 18:   - Atm process name: Dynamics
1: 18:   - Property check name: ps within interval [40000, 110000]
1: 18:   - Atmosphere process MPI Rank: 18
1: 18:   - Error message: FieldWithinIntervalCheck failed; min = 90444.279867; max = 110738.319704
bartgol commented 2 years ago

I think we need to work on reporting mem usage on each rank into the atm log.

See #1611.
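
A minimal sketch of what that per-rank reporting could look like, assuming CUDA and MPI are already initialized (the helper name and log prefix are hypothetical, not existing SCREAM code):

#include <cstddef>
#include <cstdio>
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical helper: print free/total device memory for this rank.
// Assumes MPI_Init and cudaSetDevice have already been called.
void log_device_memory (const char* tag) {
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::size_t free_bytes = 0, total_bytes = 0;
  const cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::printf("[mem] rank %d (%s): cudaMemGetInfo failed: %s\n",
                rank, tag, cudaGetErrorString(err));
    return;
  }
  std::printf("[mem] rank %d (%s): %.1f MB free of %.1f MB\n",
              rank, tag, free_bytes/1048576.0, total_bytes/1048576.0);
}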

AaronDonahue commented 2 years ago

The error looks like surface pressure gets too big, but not by much. Would it be reasonable to bump up the max PS in the dynamics check?

bartgol commented 2 years ago

The error looks like surface pressure gets too big, but not by much. Would it be reasonable to bump up the max PS in the dynamics check?

I don't think we even have that check in master (at least I don't see it).

jgfouca commented 2 years ago

@bartgol , the sha1 I'm using:

commit 38f7d4e82f2a54397a2f383b89a2462d9cc7a853 (HEAD -> master, origin/master, origin/HEAD)
Merge: 2445ed6e7e 71978bebaa
Author: Autotester for E3SM related projects <56648600+E3SM-Autotester@users.noreply.github.com>
Date:   Thu May 5 07:06:32 2022 -0600

    Merge Pull Request #1606 from E3SM-Project/scream/bartgol/mach-env-setup-fix

    Automatically Merged using E3SM Pull Request AutoTester
    PR Title: Fix env setup in machine_specs.py
    PR Author: bartgol
AaronDonahue commented 2 years ago

https://github.com/E3SM-Project/scream/blob/master/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp#L399

The check is here.
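
Conceptually it's just a min/max test on the field against fixed bounds. A stripped-down sketch of that logic (not the actual FieldWithinIntervalCheck implementation; the struct and function names here are illustrative):

#include <algorithm>
#include <vector>

// Illustrative stand-in for the post-condition check discussed above:
// verify that a field's min and max stay inside [lower, upper].
struct IntervalCheckResult {
  bool pass;
  double min_val;
  double max_val;
};

IntervalCheckResult check_within_interval (const std::vector<double>& field,
                                            double lower, double upper) {
  // Assumes field is non-empty.
  const auto mm = std::minmax_element(field.begin(), field.end());
  IntervalCheckResult r{false, *mm.first, *mm.second};
  r.pass = (r.min_val >= lower) && (r.max_val <= upper);
  return r;
}

// With the values reported in this issue (min = 90444.28, max = 110738.32)
// and bounds [40000, 110000], pass comes out false because max > upper.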

bartgol commented 2 years ago

LoL. I fetched but forgot to update the branch. Sorry. :)

PeterCaldwell commented 2 years ago

Would it be reasonable to bump up the max PS in the dynamics check?

Yeah we could try that, but having a surface pressure of 1,107.8 mb seems pretty sketchy to me. @jgfouca - is this error from the debug-compiled or the opt-build run you're doing? Maybe - like Perlmutter - Summit is behaving poorly in opt mode?

AaronDonahue commented 2 years ago

Yeah, a quick Google search shows the highest surface pressure on earth is 1085 mb, close, but still...

jgfouca commented 2 years ago

@PeterCaldwell , this is for SMS_Ln2_P256.ne30_ne30.F2000-SCREAMv1-AQP1.summit_gnugpu , so not a debug case.

PeterCaldwell commented 2 years ago

Cool. And you ran an SMS_D case too, right? Did that complete without errors?

jgfouca commented 2 years ago

@PeterCaldwell , the SMS_D case ran quite a bit longer but also failed:

1: 210: PIO: FATAL ERROR: Aborting... An error occured, Writing variables (number of variables = 1) to file 
(SMS_D_Ln2_P256.ne30_ne30.F2000SCREAMv1.summit_gnugpu.20220505_170816_s31lih.INSTANT.INVALID_x2.0001-01
-01.010000.r.nc, ncid=45) using PIO_IOTYPE_PNETCDF iotype failed. Non blocking write for variable (v_dyn, varid=1) failed
 (Number of subarray requests/regions=1, Size of data local to this process = 9216). NetCDF: Start+count exceeds dimension
 bound (err=-57). Aborting since the error handler was set to PIO_INTERNAL_ERROR... (/gpfs/alpine/cli115/proj
-shared/acmetest/scream/externals/scorpio/src/clib/pio_darray_int.c: 392)
PeterCaldwell commented 2 years ago

Ok cool. In atm.log, how many steps did SMS_D get? I assume 2 since that's the default number of steps between output writes...

bartgol commented 2 years ago

To be clear, that was a restart file write. v_dyn is only written to restart files (for restarting dyn exactly). I don't know what the restart freq was though.

Edit: I hadn't noticed that it's also clear from the nc file name, which ends in .r.nc. By the way, did we recently update our version of scorpio? It's the second time I've noticed much more info in the error message (e.g., file name, and more).

PeterCaldwell commented 2 years ago

Oh, it makes so much more sense if that's a restart. So it sounds like Summit can't run in optimized mode (like #1557 for Perlmutter) and we also don't seem to be able to write restarts at ne30 on Summit. Can we write restarts at ne30 on Perlmutter, @ndkeen and @wlin7 ?

PeterCaldwell commented 2 years ago

@jgfouca - it would be worth setting the restart frequency to a huge number and seeing whether ne30 SMS_D can run for a while when freed from writing restarts. Though maybe knowing how many steps it got would be useful before doing anything at all...

jgfouca commented 2 years ago

There wasn't much in the atm log. Maybe most of that time I was waiting was just queue time:

************** CXX SimulationParams **********************

   time_step_type: 5
   moisture: moist
   remap_alg: 10
   test case: 14
   ftype: 1
   theta_adv_form: 1
   rsplit: 2
   qsplit: 1
   qsize: 10
   limiter_option: 9
   state_frequency: 9999
   dcmip16_mu: 0
   nu: 3.4e-08
   nu_p: 3.4e-08
   nu_q: 3.4e-08
   nu_s: 3.4e-08
   nu_top: 250000
   nu_div: 3.4e-08
   hypervis_order: 2
   hypervis_subcycle: 1
   hypervis_subcycle_tom: 1
   hypervis_scaling: 3
   nu_ratio1: 1
   nu_ratio2: 1
   use_cpstar: no
   transport_alg: 0
   disable_diagnostics: no
   theta_hydrostatic_mode: yes
   prescribed_wind: no
   rearth: 6.376e+06

**********************************************************

[EAMXX] initialize_atm_procs ... done!
Atmosphere step = 0
  model time = 0001-01-01 00:00:00

WARNING: no daytime columns found for this chunk!

I will try running again with restarts disabled.

bartgol commented 2 years ago

Ugh, that seems to hint at a problem during the first timestep. Maybe restart was set to every step?

Jim, you can also check the homme_atm.log.XYZ (or a name similar to that). You should only see homme output in there, but it may confirm that only the first timestep ran (homme runs N>1 substeps per atm step, so you may see something like nsteps = 3 in there).

jgfouca commented 2 years ago

@bartgol , looking at the homme log, the highest step output I see is:

nstep= 6 time= 1800.0000000000000 [s]

jgfouca commented 2 years ago

Note, this case is set to only run 2 steps: SMS_D_Ln2_P256.ne30_ne30.F2000SCREAMv1. The Ln2 means stop at step 2.

Both STOP_N and REST_N are set to 2. I will try setting REST_N to something higher so no restart files will be produced.

bartgol commented 2 years ago

Note, this case is set to only run 2 steps: SMS_D_Ln2_P256.ne30_ne30.F2000SCREAMv1. The Ln2 means stop at step 2.

Today I learned a new thing.

jgfouca commented 2 years ago

@bartgol , CIME offers a variety of neat little options that we call "test options" (as opposed to test mods, which come at the end and are custom scripts). These test options are just little strings appended to the test type (SMS in this case, so SMS_testopt1_testopt2) that make it convenient to tweak behavior.
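
As a concrete example, SMS_D_Ln2_P256.ne30_ne30.F2000SCREAMv1.summit_gnugpu breaks down (per the usual CIME convention) as:

  SMS            - test type (smoke test)
  _D             - debug build
  _Ln2           - run length of 2 steps
  _P256          - 256 tasks
  ne30_ne30      - grid
  F2000SCREAMv1  - compset
  summit_gnugpu  - machine_compiler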

jgfouca commented 2 years ago

It looks like I'm getting the same error with REST_N set to 10. Since STOP_N is set to 2, I'd expect no restart files at all. Are restart files compulsory at the end of the run?

PeterCaldwell commented 2 years ago

I'm pretty sure restarts aren't compulsory. That's really weird. Oh - I have some vague recollection of weird behavior where REST_N was getting overwritten to 2 unless you completely deleted the section of namelist_scream.xml related to restarts. Maybe try that? Or check whether REST_N is still 10 in the run that failed?

jgfouca commented 2 years ago

@PeterCaldwell

% ./xmlquery REST_N
    REST_N: 10
bartgol commented 2 years ago

An obvious check: did you check that this indeed translated into

Scorpio:
  Model Restart:
    Output Control:
      Frequency: 10

in the yaml file? It might be that our buildnml messed something up...

I also recall we wanted to write a restart file at the end of the run, regardless of the step count. I don't think we ever implemented that, though.

jgfouca commented 2 years ago

@bartgol , aha! That was the problem.

PeterCaldwell commented 2 years ago

I got ne30 to run for 1 day (48 steps) on Summit by completely deleting the "Model Restart:" chunk of code in namelist_scream.xml.

There were some weird minor issues though:

bartgol commented 2 years ago

What was the git sha you tested?

The latter might be because of the settings in config_machines.xml.

PeterCaldwell commented 2 years ago

b078784a2bab4f4b0e1b2d5201248fcbe350555a . Master as of this morning.

bartgol commented 2 years ago

Actually, I think I've seen "GPTLstopf thread 0: timer for a_f:Total had not been started" too. After a bit of digging, I think this is where it is generated.

@rljacob Does MCT expect ATM to start an "a_f" timer (perhaps during finalization)? I don't see anything similar in EAM, but maybe it's buried somewhere, hidden by a string concat that makes it impossible to grep...

rljacob commented 2 years ago

I'm not very familiar with the timer logic. Maybe @amametjanov or @sarats can help.

jgfouca commented 2 years ago

@PeterCaldwell , @bartgol , with restarts turned off, SMS_D_Ln2_P256.ne30_ne30.F2000SCREAMv1.summit_gnugpu runs better. It appears to make it to the end of the run but TestStatus is still saying that the RUN phase failed. I checked all the log files and the only indication that anything went wrong is in e3sm.log:

ERROR:  One or more process (first noticed rank 258) terminated with signal 12

Unfortunately, signal 12 is one of the more ambiguous ones: SIGUSR2 | User defined signal 2.

jgfouca commented 2 years ago

Ah, looks like I timed out!

Results in group case.test
    JOB_WALLCLOCK_TIME: 02:00
Started at Fri May  6 18:36:56 2022
Terminated at Fri May  6 20:37:40 2022

So it was probably hung on:

The run doesn't give back its cores when it finishes. I get a bunch of "GPTLstopf thread 0: timer for a_f:Total had not been started." complaints, then "[EAMXX] Finalize ... done!", but have to kill the job myself.

jgfouca commented 2 years ago

Latest Summit fun: an internal compiler segfault on version-of-the-day (votd) master in debug mode: scream_control.dir/surface_coupling.cpp.o

bartgol commented 2 years ago

Yay... We had something similar happen in surface coupling in the past, and ended up putting one of the methods in a separate cpp. Perhaps a similar trick would work here too. But we'd have to bisect the code to figure out which function trips the compiler.

ndkeen commented 2 years ago

Note that Summit is using gnu 7.5 -- if it's easy, it might be worth testing with whatever the default is on the machine (unless that's already it).

bartgol commented 2 years ago

Good point! Sometimes upgrading the compiler version fixes some ICEs.

jgfouca commented 2 years ago

It looks like gnu9 is not supported by the CUDA we are trying to use on summit:

    /sw/summit/cuda/10.1.168/bin/../targets/ppc64le-linux/include/crt/host_config.h:129:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
      129 | #error -- unsupported GNU version! gcc versions later than 8 are not supported!
          |  ^~~~~

There are no gcc8 modules available on summit, so we'd be forced to use a newer CUDA if we wanted a newer GCC. The current CUDA we are using is 10.1.168. I vaguely remember someone saying we did not support CUDA 11 yet; is that true? There are lots of CUDA 11 modules available on summit.

ndkeen commented 2 years ago

fyi, on Perlmutter, we've been using cudatoolkit/11.5 (with gnu 11.2 -- also works with 10.3)

jgfouca commented 2 years ago

Thanks @ndkeen , I will try bumping up the CUDA on summit then.

sarats commented 2 years ago

I'm not sure if you folks still have this issue: "GPTLstopf thread 0: timer for a_f:Total had not been started"

Typically, this happens because the corresponding timer start is either conditional (and the condition wasn't met) or was removed during refactoring, etc.

With the prefix you identified, do you have a t_startf('Total') somewhere corresponding to the t_stopf('Total') for the finalization stuff?
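
For what it's worth, t_startf/t_stopf are thin wrappers over GPTL, and the rule is just that every stop must be preceded by an unconditionally-reached start for the same timer name. A minimal sketch of the balanced pair using GPTL's C API (a standalone example; whether the a_f prefix comes from MCT or the driver is exactly the open question above):

#include <gptl.h>

int main () {
  GPTLinitialize();

  // A stop only succeeds if a start for the same timer name has already run.
  // Guarding the start behind a condition that the finalize path doesn't share
  // is exactly what produces "timer for a_f:Total had not been started".
  GPTLstart("a_f:Total");
  // ... timed work ...
  GPTLstop("a_f:Total");

  GPTLpr(0);        // write the timing report
  GPTLfinalize();
  return 0;
}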

jgfouca commented 2 years ago

Latest summit problems. With these modules:

Currently Loaded Modules:
  1) lsf-tools/2.0                4) xalt/1.2.1             7) subversion/1.14.0        10) nsight-systems/2021.3.1.54  13) git/2.31.1           16) spectrum-mpi/10.4.0.3-20210112  19) netcdf-fortran/4.4.5
  2) hsi/5.0.2.p5                 5) DefApps                8) essl/6.3.0               11) cuda/11.5.2                 14) cmake/3.20.2         17) hdf5/1.10.7                     20) parallel-netcdf/1.12.2
  3) darshan-runtime/3.3.0-lite   6) python/3.7-anaconda3   9) nsight-compute/2021.2.1  12) gcc/9.3.0                   15) netlib-lapack/3.8.0  18) netcdf-c/4.8.0

I get this error:

In file included from /sw/summit/cuda/11.5.2/bin/../targets/ppc64le-linux/include/thrust/system/cuda/detail/execution_policy.h:35,
                 from /sw/summit/cuda/11.5.2/bin/../targets/ppc64le-linux/include/thrust/iterator/detail/device_system_tag.h:23,
                 from /sw/summit/cuda/11.5.2/bin/../targets/ppc64le-linux/include/thrust/iterator/detail/iterator_facade_category.h:22,
                 from /sw/summit/cuda/11.5.2/bin/../targets/ppc64le-linux/include/thrust/iterator/iterator_facade.h:37,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/YAKL/cub/cub/device/../iterator/arg_index_input_iterator.cuh:48,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/YAKL/cub/cub/device/device_reduce.cuh:41,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/YAKL/cub/cub/cub.cuh:53,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/YAKL/src/YAKL_header.h:44,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/externals/YAKL/src/YAKL.h:4,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/components/scream/../eam/src/physics/rrtmgp/external/cpp/rrtmgp_const.h:4,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/kernels/mo_gas_optics_kernels.h:4,
                 from /gpfs/alpine/cli115/proj-shared/acmetest/scream/components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp:2:
/sw/summit/cuda/11.5.2/bin/../targets/ppc64le-linux/include/thrust/system/cuda/config.h:75:2: error: #error The version of CUB in your include path is not compatible with this release of Thrust. CUB is now included in the CUDA Toolkit, so you no longer need to use your own checkout of CUB. Define THRUST_IGNORE_CUB_VERSION_CHECK to ignore this.

It looks like the gnugpu_summit.cmake file should already be doing what this error message recommends:

string(APPEND CPPDEFS " -DTHRUST_IGNORE_CUB_VERSION_CHECK")
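
One thing worth confirming (this is an assumption on my part) is whether that CPPDEFS entry actually reaches the compile lines of the rrtmgp/YAKL translation units, since the macro only has an effect if it is defined before the Thrust/CUB headers are pulled in. A minimal illustration of the mechanism:

// The Thrust version check is skipped only when this macro is already defined
// at the point the Thrust/CUB headers are included, either in the source or
// via -DTHRUST_IGNORE_CUB_VERSION_CHECK on this file's compile line.
#define THRUST_IGNORE_CUB_VERSION_CHECK
#include <thrust/device_vector.h>

int main () { return 0; }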
ndkeen commented 2 years ago

I had to also add that flag to perlmutter.cmake

jgfouca commented 2 years ago

With the compiler bump, the only remaining issue is the hang upon resource deallocation, which then leads to a timeout. The models seem to run to completion just fine.

ambrad commented 1 year ago

I think we can close this since we're running fine on Summit and the Ascent nightlies have been in good shape for months.