NOAA-EMC / JEDI-T2O

JEDI Transition to Operations activities.

Out-Of-Memory (OOM) in gdas-validation end-to-end validation for IASI #101

Closed emilyhcliu closed 7 months ago

emilyhcliu commented 7 months ago

@RussTreadon-NOAA and @CoryMartin-NOAA I ran the end-to-end testing for IASI and got the Out-Of-Memory (OOM) error message:

285: slurmstepd: error: poll(): Bad address
srun: error: Orion-22-39: task 288: Killed
srun: Terminating StepId=15808233.0
  0: slurmstepd: error: *** STEP 15808233.0 ON Orion-03-59 CANCELLED AT 2023-11-24T20:18:45 ***
170: slurmstepd: error: Detected 1 oom_kill event in StepId=15808233.0. Some of the step tasks have been OOM Killed.
srun: error: Orion-22-16: task 173: Out Of Memory

The current resource configuration for the atmanlrun process is the following:

elif [[ "${step}" = "atmanlrun" ]]; then

    # make below case dependent later
    export layout_x=8
    export layout_y=8

    export wtime_atmanlrun="00:30:00"
    npe_atmanlrun=$(echo "${layout_x} * ${layout_y} * 6" | bc)
    export npe_atmanlrun
    npe_atmanlrun_gfs=$(echo "${layout_x} * ${layout_y} * 6" | bc)
    export npe_atmanlrun_gfs
    export nth_atmanlrun=1
    export nth_atmanlrun_gfs=${nth_atmanlrun}
    npe_node_atmanlrun=$(echo "${npe_node_max} / ${nth_atmanlrun}" | bc)
    if [[ ${machine} == "HERA" ]]; then
       npe_node_atmanlrun=24
    fi
    export npe_node_atmanlrun
    export is_exclusive=True
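
For reference, a quick check of the totals these settings imply (a sketch only, assuming npe_node_max=40 on Orion, consistent with the ppn=40 in the XML below):

# Sketch: totals implied by the config.resources excerpt above
layout_x=8; layout_y=8
npe_atmanlrun=$(echo "${layout_x} * ${layout_y} * 6" | bc)                 # 384 MPI tasks
npe_node_atmanlrun=$(echo "40 / 1" | bc)                                   # npe_node_max / nth_atmanlrun = 40 tasks per node
echo $(( (npe_atmanlrun + npe_node_atmanlrun - 1) / npe_node_atmanlrun ))  # 10 nodes -> <nodes>10:ppn=40:tpp=1</nodes>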

Here is the node configuration in XML:


        <command>&JOBS_DIR;/atmanlrun.sh</command>

        <jobname><cyclestr>&PSLOT;_gdasatmanlrun_@H</cyclestr></jobname>
        <account>da-cpu</account>
        <queue>batch</queue>
        <partition>orion</partition>
        <walltime>00:30:00</walltime>
        <nodes>10:ppn=40:tpp=1</nodes>
        <native>--export=NONE</native>

        <join><cyclestr>&ROTDIR;/logs/@Y@m@d@H/gdasatmanlrun.log</cyclestr></join>

I tried a few things:

  1. increase the nodes from 10:ppn=40:tpp=1 to 40:ppn=40:tpp=1 ----> still got OOM
  2. add a memory request in resources: export memory_atmanlrun="4000GB" ----> still got OOM
  3. add a memory request in resources: export memory_atmanlrun="0" (use max) ----> still got OOM
  4. do (1) and also increase the layout from 8 to 12 ----> still got OOM

Do you have suggestions for resolving the OOM problem?

Note: I have the IASI YAML and Python script ready for end-to-end testing in the workflow.
The test in the workflow hits the OOM problem described above, so I tested the YAML and Python script in a separate configuration using the fv3jedi_hofx_nomodel.x executable. The run completed successfully with a 5x5x6 layout:

srun -n 150 --immediate ${GDASAppDir}/build/bin/fv3jedi_hofx_nomodel.x ./YAML/QC/$cdate/gfs_hofx_nomodel_${obstyp}_${cdate}.yaml
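
For reference, the 5x5x6 layout corresponds to 5 * 5 * 6 = 150 MPI tasks, which matches the -n 150 in the srun command above.
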
RussTreadon-NOAA commented 7 months ago

@emilyhcliu , thank you for reporting this error. Would you please add to this issue the paths to your EXPDIR and COMROT? It's hard to debug without the log files and config files. Thank you.

emilyhcliu commented 7 months ago

EXPDIR: /work2/noaa/da/eliu/gdas-validation/expdir/gdas_eval_iasi_JEDI
COMROT: /work2/noaa/da/eliu/gdas-validation/comrot/gdas_eval_iasi_JEDI/

The log file: /work2/noaa/da/eliu/gdas-validation/comrot/gdas_eval_iasi_JEDI/logs/2021080100/gdasatmanlrun.log

RussTreadon-NOAA commented 7 months ago

While /work2/noaa/da/eliu/gdas-validation/comrot/gdas_eval_iasi_JEDI/logs/2021080100/gdasatmanlrun.log is from an iasi run, the run directory it points at, /work/noaa/stmp/eliu/RUNDIRS/gdas_eval_iasi_JEDI/gdasatmanl_00, is missing the fv3jedi yaml and contains only an atms obs file. It seems you are working on atms.

emilyhcliu commented 7 months ago

@RussTreadon-NOAA I am re-running the IASI case. /work/noaa/stmp/eliu/RUNDIRS/gdas_eval_iasi_JEDI/gdasatmanl_00 will be updated soon.

I am about to open a draft PR for IASI to add the YAML files.

emilyhcliu commented 7 months ago

@RussTreadon-NOAA I re-ran the iasi case but ran into an OSError: [Errno 122] Disk quota exceeded. STMP is full. I cleaned up my STMP and am waiting for others to clean up theirs. I will try to run again later.

RussTreadon-NOAA commented 7 months ago

@emilyhcliu , g-w assumes that we fully pack compute nodes. This is a known limitation of the g-w xml generator. When we specify 1 thread for atmanlrun, g-w sets the number of tasks per node on Orion to 40. This is too many tasks per node when processing iasi. I modified config.resources and hand-edited my xml file to run fv3jedi_var.x with 1 task per node. fv3jedi_var.x ran up to an ioda exception:

  0: QC iasi_metop-a brightnessTemperature_138: 10120 passed out of 323980 observations.
 24: Exception:         Reason: An exception occurred inside ioda while opening a variable.
 24:    name:   MetaData/sensorCentralWavenumber
 24:    source_column:  0
 24:    source_filename:        /work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/gdas.cd/ioda/src/engines/ioda/src/ioda/Has_Variables.cpp
 24:    source_function:        ioda::Variable ioda::detail::Has_Variables_Base::open(const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> &) const
 24:    source_line:    108
 24:
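
The thread does not show the exact edits, but a minimal sketch of the kind of change described above, reusing the variable names from the config.resources excerpt earlier, could look like the following (values are illustrative only, not necessarily what was run):

# Illustrative sketch: spread the same 8x8x6 = 384 tasks over more nodes so each
# task has a full node's memory to itself (1 task per node)
export nth_atmanlrun=1
npe_node_atmanlrun=1        # override the default npe_node_max / nth_atmanlrun packing
export npe_node_atmanlrun
# Corresponding hand-edited Rocoto request for 384 tasks:
#   <nodes>384:ppn=1:tpp=1</nodes>
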
emilyhcliu commented 7 months ago

The IASI yaml in the current gdas-validation is outdated. I am preparing an IASI PR to add the latest updates.

RussTreadon-NOAA commented 7 months ago

Where may I find the updated iasi yaml? I'd like to test it in my gdas-validation workspace.

emilyhcliu commented 7 months ago

They can be found in the following PR https://github.com/NOAA-EMC/GDASApp/pull/769

A companion UFO PR (feature/satrad) is required to test IASI.

RussTreadon-NOAA commented 7 months ago

Thank you @emilyhcliu . Testing underway.

emilyhcliu commented 7 months ago

Thanks so much! @RussTreadon-NOAA

RussTreadon-NOAA commented 7 months ago

Able to process metop-a and metop-b iasi when fv3jedi_var.x is run with `96:ppn=4:tpp=1`.
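
For reference, 96 nodes with 4 tasks per node gives 96 * 4 = 384 MPI tasks, the same total as the 8x8x6 layout in config.resources above; the difference is that only 4 tasks share each node's memory instead of 40, which is presumably what avoids the OOM kills.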