RussTreadon-NOAA commented 11 months ago

run_job.sh comments

We start cycling with gdasprep. run_job.sh -c ./config.sh -t gdasprep works but run time printout may confuse users. rocotocheck detects that some gdasprep pre-requisites are not satisfied. The script then forces the job to run via rocotoboot. The gdasprep job successfully runs to completion.


Orion-login-2:/work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI$ /work2/noaa/da/rtreadon/git/jedi-t2o/pr89/GDAS-validation/run_job.sh -c /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI/config.sh -t gdasprep
===============================================================================
===============================================================================

Task: gdasprep account: da-cpu command: /work2/noaa/da/rtreadon/gdas-validation/global-workflow/jobs/rocoto/prep.sh cores: 4 cycledefs: gdas final: false jobname: gdas_eval_satwind_GSI_gdasprep_00 join: /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_GSI/logs/2021080100/gdasprep.log maxtries: 2 memory: 40G name: gdasprep nodes: 2:ppn=2:tpp=1 partition: orion queue: batch throttle: 9999999 walltime: 00:30:00 environment CDATE ==> 2021080100 CDUMP ==> gdas COMROOT ==> /work/noaa/global/glopara/com DATAROOT ==> /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI EXPDIR ==> /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI HOMEgfs ==> /work2/noaa/da/rtreadon/gdas-validation/global-workflow NET ==> gfs PDY ==> 20210801 RUN ==> gdas RUN_ENVIR ==> emc cyc ==> 00 dependencies AND is not satisfied SOME is not satisfied gdaspost_anl of cycle 202107311800 is not SUCCEEDED gdaspost_f000 of cycle 202107311800 is not SUCCEEDED gdaspost_f003 of cycle 202107311800 is not SUCCEEDED gdaspost_f006 of cycle 202107311800 is not SUCCEEDED gdaspost_f009 of cycle 202107311800 is not SUCCEEDED

Cycle: 202108010000 Valid for this task: YES State: active Activated: 2023-10-13 14:21:14 UTC Completed: - Expired: -

Job: This task has not been submitted for this cycle

Task can not be submitted because: Dependencies are not satisfied

=============================================================================== Rewinding and booting gdasprep for cycle=2021080100 202108010000: Rewind tasks for 202108010000 in state "activated" since 2023-10-13 14:21:14 202108010000: No tasks to rewind. task 'gdasprep' for cycle '202108010000' has been booted

===============================================================================

Task: gdasprep account: da-cpu command: /work2/noaa/da/rtreadon/gdas-validation/global-workflow/jobs/rocoto/prep.sh cores: 4 cycledefs: gdas final: false jobname: gdas_eval_satwind_GSI_gdasprep_00 join: /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_GSI/logs/2021080100/gdasprep.log maxtries: 2 memory: 40G name: gdasprep nodes: 2:ppn=2:tpp=1 partition: orion queue: batch throttle: 9999999 walltime: 00:30:00 environment CDATE ==> 2021080100 CDUMP ==> gdas COMROOT ==> /work/noaa/global/glopara/com DATAROOT ==> /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI EXPDIR ==> /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI HOMEgfs ==> /work2/noaa/da/rtreadon/gdas-validation/global-workflow NET ==> gfs PDY ==> 20210801 RUN ==> gdas RUN_ENVIR ==> emc cyc ==> 00 dependencies AND is not satisfied SOME is not satisfied gdaspost_anl of cycle 202107311800 is not SUCCEEDED gdaspost_f000 of cycle 202107311800 is not SUCCEEDED gdaspost_f003 of cycle 202107311800 is not SUCCEEDED gdaspost_f006 of cycle 202107311800 is not SUCCEEDED gdaspost_f009 of cycle 202107311800 is not SUCCEEDED

Cycle: 202108010000 Valid for this task: YES State: active Activated: 2023-10-13 14:21:14 UTC Completed: - Expired: -

Job: druby://Orion-login-2.HPC.MsState.Edu:45257 State: SUBMITTING (SUBMITTING) Exit Status: - Tries: 0 Unknown count: 0 Duration: 0.0

===============================================================================


2. Surprisingly, execution of `run_job.sh -c config.sh -t gdasprep` also submits 2021080100 gdasfcst.   Script `run_job.sh` executes rocotorun before rocotocheck.

let's check the status of the job you are attempting to run

just for your own information

echo "===============================================================================" echo "===============================================================================" rocotorun $ROCOTOEXP rocotocheck $ROCOTOEXP -c ${PDY}${cyc}00 -t $TaskName

The rocotorun submits gdasfcst due to the `or` logic in the xml dependencies with the `not` check being satisfied.

    <dependency>
            <or>
                    <and>
                            <taskdep task="gdassfcanl"/>
                    </and>
                    <not><cycleexistdep cycle_offset="-06:00:00"/></not>
            </or>
    </dependency>

A rocotocheck of 2021080100 gdasfcst returns

dependencies OR is satisfied AND is not satisfied gdassfcanl of cycle 202108010000 is not SUCCEEDED NOT is satisfied cycle 202107311800 does not exist


I think the `gdas_half` cycledef is responsible to triggering submission of the 2021080100 gdasfcst.   

The submitted gdasfcst job dies in fcst.sh.   This isn't critical since gdas_validation does not run the forecast step.   It could, however, confuse users.   They may wonder why gdasfcst run.

RussTreadon-NOAA commented 11 months ago

@CoryMartin-NOAA , I made two of my suggested changes in my working copy on Orion

config.anal: reset info and error files to fix version
run_job.sh: remove db if-block and 2nd rocotorun. retain sequence of rocotocheck, rocotorewind, rocotoboot, and rocotocheck.

Shall I commit and push these changes to feature/gdasapp-sprint?

CoryMartin-NOAA commented 11 months ago

@RussTreadon-NOAA sure! I appreciate your detailed testing

RussTreadon-NOAA commented 11 months ago

JEDI testing run_job.sh successfully runs gdasprep and gdasprepatmiodaobs.

It also submits gdasatmanlinit but this job fails. The failure is due to the init job not finding expected ioda dump files. gdas.cd/parm/atm/obs/lists/gdas_prototype_3d.yaml specifies amsua_n19 and sondes. Neither obsfile specified in these yamls exists in COMROT. This failure is not a feature/gdasapp-sprint issue. Users will need to properly configure yamls in gdas.cd/parm prior to executing atmanlinit and atmanlrun.

CoryMartin-NOAA commented 11 months ago

Thank you @RussTreadon-NOAA ! I will mark this as ready for review and wait for either @emilyhcliu or @ADCollard to review before merging this in. Thanks again for your testing, comments, and contributions!

RussTreadon-NOAA commented 11 months ago

Question

While the gdas-validation GSI backgrounds are on a 3072 x 1536 gaussian grid, gsi.x performs the innovation calculations on the 1536 x 768+2 grid. The 3072 x 1536 gaussian grid maps to C768. The 1536 x 768 grid maps to C384.

If we are to compare gsi.x and fv3jedi_var.x innovations, it seems the fv3jedi_var.x innovations should be computed on a C384 grid. If true, the variational section of gdas.t00z.atmvar.yaml would be

      npx: '385'
      npy: '385'

instead of

      npx: '769'
      npy: '769'

Am I correctly thinking through what we are trying to do through the gdas-validation?

CoryMartin-NOAA commented 11 months ago

@RussTreadon-NOAA indeed, if we wanted to do the best quantitative analysis possible, that is what we would do. It's frustrating that GSI performs innovations at the analysis resolution. I have (somewhere) an interpolated berror_stats file that allows for the full resolution. Which do people think is better to do? I'm not sure it's possible to have a C768 background in JEDI but only compute innovations at C384.

CoryMartin-NOAA commented 11 months ago

@RussTreadon-NOAA I have on Orion: /work2/noaa/da/cmartin/UFO_eval/GSI/fix/Big_Endian/global_berror.l127y1538.f77 that we can try to use

RussTreadon-NOAA commented 11 months ago

@CoryMartin-NOAA , gsi.x ran to completion on the 3072 x 1536 grid using your berror file. Let me submit fv3jedi_var.x using this berror file. A quicker alternative is to run fv3jedi_var.x with the identity B. The gdas-validation is not looking at the analysis, right?

If we find that using the 3072 x 1536 grid is too expensive either in terms of wall time or memory usage, we could try operational resolution global EnKF files. Operations runs the global EnKF at C384 (1536 x 768 grid). Operational C384 global EnKF tiles and 1536 x 768 gaussian grid atmf006 could be used to run fv3jedi_var.x and gsi.x, respectively, at the same analysis resolution. One drawback of this approach is that operational run history tapes do not save member atmfXXX files. Operations only writes ensemble mean atmfXXX files to tape. We would need to run g-w enkfgdasefcsXX to generate member atmfXXX files to use with the 2021080100 analysis. If we jump to a real-time case we can grab both tiles and atmfXXX files from operational COM directories.

CoryMartin-NOAA commented 11 months ago

The gdas-validation is not looking at the analysis, right?

Correct, not yet, but we will eventually (early 2024?)

RussTreadon-NOAA commented 11 months ago

A 2021080100 run of fv3jedi_var.x processing goes-16 satwind using layout {8,8} with C768 tiles & the 1536 x 768 berror file finished with the following timing and memory usage statistics:

  0: OOPS_STATS ------------------------------------------------------------------------------------------------------------------
  0: OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
  0: OOPS_STATS ------------------------------------------------------------------------------------------------------------------
  0: OOPS_STATS Name                                                :     min (ms)    max (ms)    avg (ms)     % total   imbal (%)
  0: OOPS_STATS oops::Covariance::SABER::inverseMultiply            :    227591.12   227669.43   227612.77        3.71        0.03
  0: OOPS_STATS oops::Covariance::SABER::multiply                   :    261105.49   261303.41   261220.50        4.26        0.08
  0: OOPS_STATS oops::Geometry::Geometry                            :      5492.71     7686.21     7351.81        0.12       29.84
  0: OOPS_STATS oops::Increment::write                              :   5820505.47  5821278.51  5821100.66       94.83        0.01
  0: OOPS_STATS oops::LinearVariableChange::changeVarAD             :      7447.42     8316.85     7951.56        0.13       10.93
  0: OOPS_STATS oops::State::State                                  :      7533.95     8385.33     7869.69        0.13       10.82
  0: OOPS_STATS util::Timers::Total                                 :   6138258.94  6138353.75  6138308.32      100.00        0.00
  0: OOPS_STATS util::Timers::measured                              :   6132992.73  6134195.91  6133637.63       99.92        0.02
  0: OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
  0:
  0: OOPS_STATS Run end                                  - Runtime:   6147.38 sec,  Memory: total:  6404.06 Gb, per task: min =    16.54 Gb, max =    26.81 Gb
  0: Run: Finishing oops::Variational<FV3JEDI, UFO and IODA observations> with status = 0
  0: OOPS Ending   2023-10-16 16:14:11 (UTC+0000)

Nearly 95% of the run time was spent writing the analysis increment file.

Turn off the increment write in the yaml and rerun. The run time reduction impact is dramatic. The reduction in memory usage is significant.

  0: OOPS_STATS ------------------------------------------------------------------------------------------------------------------
  0: OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
  0: OOPS_STATS ------------------------------------------------------------------------------------------------------------------
  0: OOPS_STATS Name                                                :     min (ms)    max (ms)    avg (ms)     % total   imbal (%)
  0: OOPS_STATS oops::Covariance::SABER::Constructor                :       393.32      449.00      404.00        0.13       13.78
  0: OOPS_STATS oops::Covariance::SABER::inverseMultiply            :    231624.99   231654.01   231632.81       74.89        0.01
  0: OOPS_STATS oops::Covariance::SABER::multiply                   :    261491.26   261619.93   261604.61       84.58        0.05
  0: OOPS_STATS oops::Geometry::Geometry                            :      6824.49     7243.58     7076.65        2.29        5.92
  0: OOPS_STATS oops::GeometryData::setGlobalTree                   :      3157.38     3623.08     3227.16        1.04       14.43
  0: OOPS_STATS oops::GetValues::GetValues                          :       161.52      611.11      354.64        0.11      126.77
  0: OOPS_STATS oops::GetValues::fillGeoVaLs                        :        72.31     1630.76     1538.95        0.50      101.27
  0: OOPS_STATS oops::GetValues::fillGeoVaLsTL                      :        19.83      421.54      393.46        0.13      102.10
  0: OOPS_STATS oops::GetValues::process                            :       208.52     2352.99      382.93        0.12      560.02
  0: OOPS_STATS oops::Increment::Increment                          :      2913.73     3663.99     2999.27        0.97       25.01
  0: OOPS_STATS oops::Increment::axpy                               :      1406.65     1835.24     1464.22        0.47       29.27
  0: OOPS_STATS oops::Increment::diff                               :       330.74      436.77      342.83        0.11       30.93
  0: OOPS_STATS oops::Increment::dot_product_with                   :      2086.46     3309.14     2996.95        0.97       40.80
  0: OOPS_STATS oops::Increment::fromFieldSet                       :       482.21      739.29      509.64        0.16       50.44
  0: OOPS_STATS oops::Increment::operator=                          :      1573.98     1795.83     1602.51        0.52       13.84
  0: OOPS_STATS oops::Increment::print                              :       299.39      946.30      662.25        0.21       97.68
  0: OOPS_STATS oops::Increment::toFieldSet                         :      2080.85     2746.18     2435.03        0.79       27.32
  0: OOPS_STATS oops::LinearVariableChange::changeVarAD             :      7625.12     8595.81     8166.47        2.64       11.89
  0: OOPS_STATS oops::LinearVariableChange::changeVarTL             :      4544.07     4910.49     4863.78        1.57        7.53
  0: OOPS_STATS oops::LinearVariableChange::changeVarTraj           :       801.66     1450.11     1317.71        0.43       49.21
  0: OOPS_STATS oops::ObsSpace::ObsSpace                            :      1745.61     1925.97     1888.52        0.61        9.55
  0: OOPS_STATS oops::ObsSpace::save                                :      2885.85     3153.15     3020.30        0.98        8.85
  0: OOPS_STATS oops::Parameters::deserialize                       :       300.38      546.35      427.25        0.14       57.57
  0: OOPS_STATS oops::State::State                                  :      4553.60     4981.70     4743.92        1.53        9.02
  0: OOPS_STATS oops::State::toFieldSet                             :       682.81      994.63      786.53        0.25       39.65
  0: OOPS_STATS oops::VariableChange::changeVar                     :      2145.15     2602.78     2375.62        0.77       19.26
  0: OOPS_STATS util::Timers::Total                                 :    309258.09   309313.59   309287.21      100.00        0.02
  0: OOPS_STATS util::Timers::measured                              :    303973.39   304525.54   304187.74       98.35        0.18
  0: OOPS_STATS ---------------------------------- Parallel Timing Statistics ( 384 MPI tasks) -----------------------------------
  0:
  0: OOPS_STATS Run end                                  - Runtime:    317.12 sec,  Memory: total:  3748.38 Gb, per task: min =     9.42 Gb, max =    10.00 Gb
  0: Run: Finishing oops::Variational<FV3JEDI, UFO and IODA observations> with status = 0
  0: OOPS Ending   2023-10-16 17:53:13 (UTC+0000)

RussTreadon-NOAA commented 11 months ago

@CoryMartin-NOAA , @ADCollard , and @emilyhcliu - two items

First, with Cory's 3072 x 1536 berror file we can remove

export JCAP_A=766
export JCAP_ENKF=766
export LONA=1536
export LATA=768

from GDAS-validation/gdas_config/config.anal. I can push this change to feature/gdasapp-sprint

Before doing so, we need to decide where to stage the 3072 x 1536 berror file. Two options

ask EID to add it to their directory of staged GSI files
DAD stages this file on Orion and Hera. We add BERROR=/path/to/global_berror.l127y1538.f77 to GDAS-validation/gdas_config/config.anal

Option 1 is easier for us in the long run since the file will always be accessible via g-w. My concern is that developers will see the 3072 x 1536 berror file and perform experiments. We have not extensively tested this berror file to see how well it performs. Option 1 takes longer to implement since we need to need to work through the g-w procedure to add a new fix file.

Option 2 can be quickly implemented. We place the 3072 x 1536 file in a fixed location and add `BERROR to config.anal.

I can work on either option, just let me know path to take.

Second, we should change GDASApp parm/atm/variational/3dvar_dripcg.yaml in order to speed up fv3jedi_var.x for gdas-validation

reduce to 1 outer loop with 1 iteration
turn off creation of analysis increment file

If you are OK with these changes, I'll commit them to feature/gdasapp-sprint.

We can see further reductions in fv3jedi_var.x wall time and resource usage if we change the STATICB_TYPE in config.atmanl from gsibec to identity. The C768 2021080100 satwind goes-16 case completes in 42.20 seconds when using the identity matrix for B instead of the 3072 x 1536 gsibec. As an added benefit using the identity matrix B reduces the compute footprint to 10 nodes.

The 3072 x 1536 gsibec configuration using 96 nodes ran to completion in 317.12 seconds. I was able to reduce the compute footprint down to 26 nodes but the run time increased to 550.89 seconds.

If you'd like me to switch to the identity matrix for the gdas-validation berror, let me know and I'll commit this change to feature/gdasapp-sprint.

CoryMartin-NOAA commented 11 months ago

My concern is that developers will see the 3072 x 1536 berror file and perform experiments.

I would be very surprised if we have any developers who will be running GSI in that specific configuration. That would require either 1) 3DVar at the full resolution or 2) the ensemble to be C768. While I agree that would be a concern that it hasn't been vetted, I don't think it's one to worry too much about.

Having said that, do we bother with an 'official' staging of this file? It'll likely never be used for cycling, except for JEDI prototype evaluation. Because of this, I vote for option 2.

I also am on board with: reducing the number of iterations, removing the increment write (or having it write native grid?) and going to use the identity B. For where we are now, end to end testing really means 'can we assimilate?' and not 'are we getting identical increments', so I think all of these concessions are fine.

RussTreadon-NOAA commented 11 months ago

Thanks @CoryMartin-NOAA for your reply. 4e668bb commits the following changes

GDAS-validation/gdas_config/3dvar_dripcg.yaml - remove analysis increment write
GDAS-validation/gdas_config/config.anal - set BERROR to point at C768 static B
GDAS-validation/gdas_config/config.atmanl - set STATICB_TYPE to identity
GDAS-validation/setup_workspace.sh - copy yaml and fix file to appropriate locations

The updated setup_workspace.sh includes the lines

  # copy C768 berror file for GSI gdas-validation
  cp -rf $ICSDir/global_berror.l127y1538.f77 $workdir/global-workflow/fix/gsi/Big_Endian/

config.anal picks up the C768 berror file from BERROR=${FIXgsi}/Big_Endian/global_berror.l127y1538.f77

Script setup_workspace.sh sets ICSDir based on the machine. The script currently has

orion: ICSDir=${ICSDir:-/work2/noaa/da/cmartin/UFO_eval/data/para/output_ufo_eval_aug2021}
hera: ICSDir=${ICSDir:-/scratch1/NCEPDEV/da/Cory.R.Martin/blah/blah}

On Orion we need to copy /work2/noaa/da/cmartin/UFO_eval/GSI/fix/Big_Endian/global_berror.l127y1538.f77 to /work2/noaa/da/cmartin/UFO_eval/data/para/output_ufo_eval_aug2021/

CoryMartin-NOAA commented 11 months ago

On Orion we need to copy /work2/noaa/da/cmartin/UFO_eval/GSI/fix/Big_Endian/global_berror.l127y1538.f77 to /work2/noaa/da/cmartin/UFO_eval/data/para/output_ufo_eval_aug2021/

Done

RussTreadon-NOAA commented 11 months ago

Thank you @CoryMartin-NOAA for copying the C768 berror file to the Orion ICSDir.

I forget a critical detail. g-w link_workflow.sh links the EID GSI fix directory to the local g-w GSI fix. Therefore, we can not copy (or link) the C768 berror file into g-w GSI fix (bummer!).

We need to change the BERROR path in config.anal. We can not use ICSDir in config.anal because g-w does not know about ICSDir. g-w knows about EXPDIR.

Given this, I changed the C768 berror copy in setup_workspace.sh to a link from ICSDir to $EXPDIR/${PSLOT}_GSI. The definition of BERROR was updated accordingly in config.anal. Done at 138d20a.

Let me now retest everything starting from a clean install of this PR.

RussTreadon-NOAA commented 11 months ago

Orion test

Clone feature/gdasapp-sprint in /work2/noaa/da/rtreadon/git/jedi-t2o/pr89. Execute, in sequence,

cd /work2/noaa/da/rtreadon/git/jedi-t2o/pr89/GDAS-validation
./setup_workspace.sh -c
./setup_workspace.sh -b
./setup_workspace.sh -s

This installed g-w, created two expdir, one for GSI and one for JEDI, and populated comrot.

GSI test Execute the following in sequence

cd /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI
cp /work2/noaa/da/rtreadon/git/jedi-t2o/pr89/GDAS-validation/config_example_gsi.sh ./config_gsi.sh
./run_job.sh -c ./config_gsi.sh -t gdasprep - this submits gdasprep. User needs to enter y when prompted to start the cycle. Note: I soft linked /work2/noaa/da/rtreadon/git/jedi-t2o/pr89/GDAS-validation/run_job.sh to the GSI expdir.
./run_job.sh -c ./config_gsi.sh -t gdasanal - this runs gsi.x
./run_job.sh -c ./config_gsi.sh -t gdasanaldiag - this creates diagnostic file tarballs

The gdas prep, anal, and analdiag jobs successfully ran to completion. Output is in /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_GSI

JEDI test For JEDI I only processed satwind_goes-16 so I manually edited

/work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/gdas.cd/parm/atm/obs/config/satwind_goes-16.yaml
/work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/gdas.cd/parm/atm/obs/config/satwind_goes-16.yaml

After this execute the following in sequence

cd /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_JEDI
cp /work2/noaa/da/rtreadon/git/jedi-t2o/pr89/GDAS-validation/config_example_jedi.sh ./config_jedi.sh
./run_job.sh -c ./config_jedi.sh -t gdasprep - this runs gdasprep
./run_job.sh -c ./config_jedi.sh -t gdasprepatmiodaobs - this creates ioda format dump files from bufr dump files
./run_job.sh -c ./config_jedi.sh -t gdasatmanlinit - run setup for fv3jedi_var.x
./run_job.sh -c ./config_jedi.sh -t gdasatmanlrun - this runs fv3jedi_var.x. The job died with an OOM kill. We can not run the C768L127 analysis on one node with 40 tasks.

Manually edit the following files in the JEDI expdir

config.atmanl - set layout_x and layout_y both to 8
config.resources - set layout_x and layout_y both to 8 in atmanlrun section
gdas_eval_satwind_JEDI.xml - set <nodes>10:ppn=40:tpp=1</nodes> in the atmanlrun section

After this, execute

./run_job.sh -c ./config_jedi.sh - gdasatmanlinit - need to rerun this to update gdas.t00z.atmvar.yaml
./run_job.sh -c ./config_jedi.sh -t gdasatmanlrun

This time fv3jedi_var.x ran to completion. Output is in /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI

Orion compute nodes have 192 Gb. Hera compute nodes have 96 Gb. Add logic to configure.resources to lower Hera ppn (variable npe_node_atmanlrun) for atmanlrun. The Hera change has not been tested.

The JEDI test highlights the need for additional changes to config files in feature/gdasapp-sprint. The above mentioned changes to config.atmanl and config.resources have been committed to feature/gdasapp-sprint. Done at bb0a2d8.

RussTreadon-NOAA commented 11 months ago

Update working copy of feature/gdasapp-sprint to https://github.com/NOAA-EMC/JEDI-T2O/commit/bb0a2d8a355219deac154a5eef1292aa01ea1784 and repeat Orion test for GSI and JEDI. Jobs to be exercised in GDAS-validation sprint run to completion.

RussTreadon-NOAA commented 11 months ago

@CoryMartin-NOAA , GDASApp issue #680 and g-w issue #1936 will impact gdas-validation. It will be a while before PRs are opened for these issues, reviewed, approved, and merged. Just a heads up.

I noticed config.ufs in GDAS-validation/gdas_config. Comments in config.ufs state hack to make spurious fcst job go quickly. Which forecast job will developers be running when working on gdas-validation?

CoryMartin-NOAA commented 11 months ago

Thanks @RussTreadon-NOAA . They will not be running any forecast jobs any time soon, but the previous version of the run_job script was kicking off gdasfcst from rocotorun. With just boot, I guess that does not happen anymore? If that's the case, we can remove those changes from this PR.

RussTreadon-NOAA commented 11 months ago

Yes, we do not need fcst when we rocotoboot.if we start with ./run_job.sh -c config.sh -t gdasprep

Thanks @RussTreadon-NOAA . They will not be running any forecast jobs any time soon, but the previous version of the run_job script was kicking off gdasfcst from rocotorun. With just boot, I guess that does not happen anymore? If that's the case, we can remove those changes from this PR.

Yes, rocotoboot forces submission of the specified job. Since ICSDir contains the files we need to warm start we do not need to start cycling with fcst. This is consistent with how setup_workspace.sh executes setup_expt.py. We pass --start warm to setup_expt.py. I can remove config.ufs from feature/gdasapp-sprint

CoryMartin-NOAA commented 11 months ago

That would be great, thank you!

RussTreadon-NOAA commented 11 months ago

@CoryMartin-NOAA , @ADCollard , and @emilyhcliu - two potential additions to setup_workspace.sh to consider. Both additions are related to the use of run_job.sh

@CoryMartin-NOAA added run_job.sh to simplify submission of g-w jobs. This script takes two arguments. The first argument is a configuration file. Sample config files are provided in feature/gdasapp-sprint/GDAS-validation.

For my tests of PR #89, I manually copy the config_example*sh files to the appropriate expdir created by setup_workspace.sh -s I also soft link run_job.sh to each expdir. This allows me to be in expdir and simply type, for example,

./run_job.sh -c config_gsi.sh -t gdasanal
./run_job.sh -c config_jedi.sh -t gdasatmanlrun

We could add the following to the -s section of setup_workspace.sh

  # link run_job script to both EXPDIR
  ln -fs $mydir/run_job.sh $EXPDIR/${PSLOT}_GSI/run_job.sh
  ln -fs $mydir/run_job.sh $EXPDIR/${PSLOT}_JEDI/run_job.sh
  # copy run_job configuration to each EXPDIR
  cp $mydir/config_example_gsi.sh $EXPDIR/${PSLOT}_GSI/config_gsi.sh
  cp $mydir/config_example_jedi.sh $EXPDIR/${PSLOT}_JEDI/config_jedi.sh

to make it easier for users to directly work from expdir.

I can commit this change to setup_workspace.sh and push it to feature/gdasapp-sprint. Just let me know.

CoryMartin-NOAA commented 11 months ago

@RussTreadon-NOAA that works for me. Question/suggestion: If doing this, do we modify the run_job script to change the convention from: $EXPDIR/$PSLOT to $EXPDIR and then the config script to EXPDIR=$(pwd)? Or keep it as is since it's more flexible (but albeit more manual editing for users)?

RussTreadon-NOAA commented 11 months ago

@RussTreadon-NOAA that works for me. Question/suggestion: If doing this, do we modify the run_job script to change the convention from: $EXPDIR/$PSLOT to $EXPDIR and then the config script to EXPDIR=$(pwd)? Or keep it as is since it's more flexible (but albeit more manual editing for users)?

Good question. I like the flexibility of the current structure. With this approach users can populate a given $EXPDIR with multiple $PSLOT.

For example, if I first validate satwind, my $EXPDIR contains gdas_eval_satwind directories for GSI and JEDI. After I finish satwind validation, I move onto scatwind. I change PSLOT in the gsi and jedi config_example*sh, rerun setup_workspace.sh -s and get GSI and JEDI gdas_eval_scatwind directories in $EXPDIR.

emilyhcliu commented 11 months ago

@RussTreadon-NOAA @CoryMartin-NOAA I am sorry that I did not get much time to look into this. This is merged. I will check it out and test it. I will report back with any issues I may find.

NOAA-EMC / JEDI-T2O

First round of merging in code for GDASApp end to end sprint #89

Task can not be submitted because: Dependencies are not satisfied

Job: druby://Orion-login-2.HPC.MsState.Edu:45257 State: SUBMITTING (SUBMITTING) Exit Status: - Tries: 0 Unknown count: 0 Duration: 0.0

let's check the status of the job you are attempting to run

just for your own information