esm-tools / esm_tools

Simple Infrastructure for Earth System Simulations
https://esm-tools.github.io/
GNU General Public License v2.0

AWICM3: very slow start of the first run #806

Closed shijian0702 closed 1 year ago

shijian0702 commented 2 years ago

Describe the problem you are facing

The awicm3-v3.1 setup always takes a long time to start the first run while it handles the partial restart files of fesom: over 30 minutes at high resolution (TCO319-BOLD) and over 20 minutes at low resolution (TCO95-CORE2). The second run takes only a few seconds to read the raw restart files. As far as I know, other climate models, such as awi-esm2, do not have this issue. I am not sure whether the problem comes from awicm3 or from esm_tools. Is this normal, or is something wrong with my setup?

Runscript and other relevant files

general:
    user: !ENV ${USER}
    setup_name: "awicm3"
    version: "v3.1"
    account: "ab0995"
    compute_time: "08:00:00"
    initial_date: "1850-01-01"
    final_date: "1900-01-01"
    base_dir: "/work/ab0246/${user}/runtime/${general.setup_name}-${general.version}/"
    nday: 0
    nmonth: 0
    nyear: 10

awicm3:
    postprocessing: false
    model_dir: "/home/a/${user}/model_codes/${general.setup_name}-${general.version}/"
    pool_dir: "/work/ab0246/a270092/input/"

fesom:
    resolution: "CORE2"
    pool_dir: "/work/ab0246/a270092/input/fesom2/"
    mesh_dir: "${pool_dir}/core2/"
    restart_rate: 10
    restart_unit: "y"
    restart_first: 1
    lresume: true
    time_step: 2400
    nproc: 384
    ini_parent_exp_id: "D00"
    ini_parent_date: "1849-12-31"
    ini_parent_dir: "/work/ab0246/a270196/input/fesom2/restart/CORE2/"
    choose_general.run_number:
        1:
            restart_in_sources:
                par_oce_restart: /work/ab0246/a270196/input/fesom2/restart/CORE2/fesom.1849.oce.restart.2400/*.nc
                par_ice_restart: /work/ab0246/a270196/input/fesom2/restart/CORE2/fesom.1849.ice.restart.2400/*.nc
    namelist_dir: "/home/a/a270196/runscripts/namelist/spinup/fesom2/"

oifs:
    resolution: "TCO95"
    levels: "L91"
    prepifs_expid: aack
    input_expid: awi3
    wam: true
    lresume: false
    time_step: 3600
    nproc: 384
    omp_num_threads: 8
    add_namelist_changes:
        fort.4:
            NAERAD:
                NSOLARSPECTRUM: "1"
                NCMIPFIXYR: "1990"

rnfmap:
    omp_num_threads: 128

oasis3mct:
    lresume: true # Set to false to generate the rst files for first leg
    time_step: 7200

xios:
    with_model: oifs
    nproc: 8
    omp_num_threads: 16

log file of the first run: pi_awicm3_tco95_core2_awicm3_compute_18500101-18591231_1074979.log

  0:  ==========================================
  0:  MODEL SETUP took on mype=0 [seconds]
  0:  runtime setup total         1337.456
  0:   > runtime setup mesh       1.450488
  0:   > runtime setup ocean     0.4124953
  0:   > runtime setup forcing   3.6961141E-03
  0:   > runtime setup ice       1.0040575E-02
  0:   > runtime setup restart    1331.384
  0:   > runtime setup other      4.194682
  0:  ============================================

log file of the second run: pi_awicm3_tco95_core2_awicm3_compute_18600101-18691231_1076863.log

  0:  ==========================================
  0:  MODEL SETUP took on mype=0 [seconds]
  0:  runtime setup total         6.276999
  0:   > runtime setup mesh       1.795660
  0:   > runtime setup ocean     0.3997563
  0:   > runtime setup forcing   3.7931339E-03
  0:   > runtime setup ice       8.6037908E-03
  0:   > runtime setup restart   0.1019567
  0:   > runtime setup other      3.967230
  0:  ============================================

System (please complete the following information):

Actually, this problem appears with almost every awicm3 version, esm_tools version, and HPC system.

JanStreffing commented 2 years ago

Hello Jian, this is normal: during the first run OASIS generates its remapping weight files (rmp_*). After doing this once, you can store them in a pool directory (e.g. `/p/project/chhb19/shi4/input/oasis/cy43r3/TCO319-HR/${nprocfesom}/rmp`) and link them in for all subsequent runs. Be aware that you need a separate set of rmp_* files if you change the number of fesom cores.
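For illustration, a minimal runscript sketch of how such pre-generated weights could be wired in. The `input_sources` key and the pool path below are assumptions made for the example, not verified esm_tools syntax; check the oasis3mct component config for the actual file-dictionary keys on your machine.

oasis3mct:
    # Hypothetical sketch: reuse pre-generated remapping weights from a pool
    # directory instead of letting oasis recompute them on the first leg.
    # The "input_sources" key name and the path are placeholders, not
    # verified esm_tools syntax.
    input_sources:
        rmp_files: "/work/ab0246/a270092/input/oasis/TCO95-CORE2/384/rmp/rmp_*.nc"

Note that such a set of weights is only valid for one combination of atmosphere grid, ocean mesh, and fesom core count, so the pool path should encode all three.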

You can also send the link to me and @pgierz, and we will add them to the default pool dir.

AWI-ESM2 runs at a much lower resolution, so generating these remapping files is correspondingly faster there.

If you go to much higher resolution still, you can follow https://awi-cm3-documentation.readthedocs.io/en/latest/how_to.html#generate-oasis3mct-remapping-weights-for-large-grids-offline-and-mpi-omp-parallel for a faster but more work-intensive solution.

pgierz commented 2 years ago

Moin,

> As far as I know, other climate models, such as awi-esm2, do not have this issue.

This happens in AWIESM-2 as well; you probably just do not notice it because the "extra" time is considerably shorter than in the high-res AWICM3 case, since far fewer re-gridding weights have to be calculated for the typical AWIESM-2 resolution (normally T63 plus CORE2).

@JanStreffing: if we want to store these things in the pool, that is in principle no problem, but we should think about a strategy to ensure that the regridding weights actually match the atmosphere/ocean grids being used. Maybe some kind of checksum? That solution will need some brainstorming, though...

JanStreffing commented 2 years ago

I just saw that your runscript has `lresume: true` for oasis. Does this work for you on the first run? I would have thought that for an initial run you need to set this to false, so that it does not try to restart oasis and instead goes into LEG=0 mode.
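A minimal sketch of what that could look like in the runscript, reusing the `choose_general.run_number` pattern already used above for the fesom restart files (untested, and assuming `lresume` can be overridden inside such a choose block):

oasis3mct:
    time_step: 7200
    lresume: true            # later legs restart oasis from the rst files
    choose_general.run_number:
        1:
            lresume: false   # first leg: LEG=0 mode, oasis writes the rst files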

github-actions[bot] commented 1 year ago

This issue has been inactive for the last 365 days. It will now be marked as stale and closed after 30 days of further inactivity. Please add a comment to reset this automatic closing of this issue or close it if solved.