Create a GSI branch with configuration to use CRTM-2.4.1-jedi.1

emilyhcliu commented 8 months ago

For the upcoming end-to-end code sprint, we would like to have a GSI branch with a configuration to use CRTM 2.4.1-jedi.1, which is consistent with the CRTM used in GDASApp.

GSI: The GSI branch created for this PR is in the following forked GSI repository: GSI-crtm_v2.4.1-jedi.1

These are changes to GSI.

CRTM: CRTM version 2.4.1-jedi.1 is built on ORION in the following location: /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022

The associated coefficients files are in the following location: /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian

There are three sets of CRTM coefficients we can use:

CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Little_Endian
CRTM_FIX=/work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0
CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian

I tested three sets, and they all worked fine.
The first one contains the coefficients we use in operations plus the N21 coefficients.
The second one is the one linked to GDASApp from run_ufo_hofx_test.sh.
The third one is the set of coefficients packaged with the crtm_v2.4.1-jedi.1 tag.

The first set of coefficients is good for our purpose for UFO evaluation with GSI.

emilyhcliu commented 8 months ago

@CoryMartin-NOAA @RussTreadon-NOAA The GSI branch (see above) in this PR will be used for the end-to-end code sprint. This GSI branch uses CRTM 2.4.1-jedi.1, which is consistent with GDASApp.
Please see the description above for more details.

RussTreadon-NOAA commented 8 months ago

Thank you @emilyhcliu for creating a GSI fork which can use CRTM-2.4.1-jedi.1.

I updated a working copy of feature/gdasapp-sprint to clone the forked GSI-crtm_v2.4.1-jedi.1. CRTM_FIX has been defined to point at /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian in gdas_config/config.anal and gdas_config/config.atmanl.

I am currently working through the sequence of steps to clone, build, setup, and run jobs.

RussTreadon-NOAA commented 8 months ago

@emilyhcliu , where may I find the run script you used to test the gsi.x which you built with CRTM-2.4.1-jedi.1? I'm encountering crtm library errors when I execute gsi.x from g-w.

CoryMartin-NOAA commented 8 months ago

@RussTreadon-NOAA is the issue that a shared library is missing? If so, I think $LD_LIBRARY_PATH needs to be modified at runtime.
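
A minimal sketch of one way to make that runtime change, assuming a POSIX-style shell. The helper name `prepend_ld_path` is hypothetical, and the directory in the usage comment is a placeholder rather than a real install location.

```shell
# Hypothetical helper: put a directory at the front of LD_LIBRARY_PATH
# unless it is already listed, so repeated sourcing does not grow the variable.
prepend_ld_path() {
  dir=$1
  case ":${LD_LIBRARY_PATH:-}:" in
    *":${dir}:"*) ;;   # already present, nothing to do
    *) LD_LIBRARY_PATH="${dir}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
  esac
  export LD_LIBRARY_PATH
}

# Usage (placeholder path):
#   prepend_ld_path /path/to/crtm/build/lib
#   ldd ./gsi.x | grep 'not found'   # lists shared libraries that still cannot be resolved
```

Prepending rather than appending means the intended library is found first if another copy with the same name is already on the search path.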

RussTreadon-NOAA commented 8 months ago

@CoryMartin-NOAA , yes the initial problem was the shared library. Thank you for the pointer. I added the crtm path to LD_LIBRARY_PATH. The updated config.anal now has two additions:

export CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/lib

With the addition of LD_LIBRARY_PATH executable gsi.x starts running. Execution, however, aborts with

55:  SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 10, file /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI/anal.101081/./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
55:  CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
55:  READ_BUFRTOVS:  ***ERROR*** crtm_spccoeff_load error_status=           3
55:   despite file ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
55:   existing,   TERMINATE PROGRAM EXECUTION

A check of ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin shows that this local file is correctly linked to @emilyhcliu's little endian fix.

Orion-login-4:/work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI/anal.101081$ ls -l  ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
lrwxrwxrwx 1 rtreadon stmp 92 Oct 19 16:42 ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin -> /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian/amsua_metop-b.SpcCoeff.bin
Orion-login-4:/work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI/anal.101081$ ls -lL ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
-rw-r----- 1 eliu da 12196 Oct 13 13:34 ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin

The log file contains 690927 lines of AntCorr printout. Of these, 663809 lines are Apply_AntCorr(FAILURE) : Input iFOV inconsistent with AC data. 27118 lines are Remove_AntCorr(FAILURE) : Input iFOV inconsistent with AC data. This printout is not present when I build and run gsi.x with crtm/2.4.0.

I built gsi.x with BUILD_VERBOSE=YES. I see Emily's library modules being included. This is good. Something potentially not good is -convert big_endian in the compiler options. Building with -convert big_endian & trying to read little endian seems problematic.

It would be helpful to examine a gsi.x build with the new CRTM along with a run script.
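
One quick way to check the byte order of a coefficient file is to look at its leading bytes. The sketch below assumes the file is a Fortran sequential unformatted file with 4-byte record markers, which is not confirmed anywhere in this thread; the function name `guess_byte_order` is hypothetical. It reads the first record marker both ways and reports whichever byte order gives the smaller, more plausible record length.

```shell
# Guess the byte order of a Fortran sequential unformatted file from its
# first 4-byte record marker (assumption: 4-byte record markers).
guess_byte_order() {
  file=$1
  # Read the first four bytes as unsigned decimal values.
  set -- $(od -An -tu1 -N4 "$file")
  [ $# -eq 4 ] || { echo "unknown"; return 1; }
  le=$(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))   # marker read as little endian
  be=$(( $4 + ($3 << 8) + ($2 << 16) + ($1 << 24) ))   # marker read as big endian
  # The correct interpretation gives the smaller record length.
  if [ "$le" -lt "$be" ]; then
    echo "little"
  elif [ "$be" -lt "$le" ]; then
    echo "big"
  else
    echo "unknown"
  fi
}

# Usage:
#   guess_byte_order ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
```

With the Intel compiler, `-convert big_endian` makes the runtime treat unformatted data as big endian, so a file reported as "little" here would be expected to fail when read by an executable built with that flag.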

RussTreadon-NOAA commented 8 months ago

I found the following in the EMC JEDI Discussions google space

(2) need to remove HIRS4 from GSI obs namelist. There is a problem reading the HIRS4 coefficients. We do not use HIRS4, so I removed the HIRS4 from the obs namelist

This is a g-w change since EID version controls exglobal_atmos_analysis.sh. I removed HIRS4 from my working copy of the script and reran 2021080100 gdasanal. gsi.x still aborts.

emilyhcliu commented 8 months ago

@emilyhcliu , where may I find the run script you used to test the gsi.x which you built with CRTM-2.4.1-jedi.1? I'm encountering crtm library errors when I execute gsi.x from g-w.

@RussTreadon-NOAA I was out this morning to see the dentist.
You can find the scripts I used to run GSI in the following directory on ORION: /work/noaa/da/eliu/git/GSI-emilyhcliu/GSI/scripts/gsi

There are three scripts (provided by Cory for our previous code sprint): gsi_observer.sh, iodaconv.sh, and submit_gsi_observer.sh

You just need to modify the path to GSI in submit_gsi_observer.sh and submit the script. It will trigger gsi_observer.sh

ps. I already turned off iodaconv.sh.

RussTreadon-NOAA commented 8 months ago

Thank you @emilyhcliu for pointing me at the scripts you use to run gsi.x. As a first step let me try my gsi.x with your scripts.

RussTreadon-NOAA commented 8 months ago

@emilyhcliu , this is very odd. Using my gsi.x in your script fails in the same way as running it from g-w. I took a step back and used your gsi.x. Same failure. I recopied your gsi_observer.sh and submit_gsi_observer.sh to my space and resubmitted. Same failure. This suggests that something in my Orion environment differs from your environment.

@CoryMartin-NOAA , have you run gsi.x built with CRTM-2.4.1-jedi.1 using little endian coefficients?

CoryMartin-NOAA commented 8 months ago

@RussTreadon-NOAA no I have not, I thought you had to use big endian, since presumably the GSI is compiled with big endian and then the BERROR_STATS file, and others, will also be big endian.

RussTreadon-NOAA commented 8 months ago

@RussTreadon-NOAA no I have not, I thought you had to use big endian, since presumably the GSI is compiled with big endian and then the BERROR_STATS file, and others, will also be big endian.

Agreed! The gsi code is compiled with big endian compiler flags. The GSI static-B file is big endian. The CRTM coefficients being provided to gsi.x in the above runs are little endian. The endianness mismatch seems to be the problem.

The EMC JEDI Discussions note and comments above indicate that we need to use little endian coefficients

Cory Martin - NOAA Federal Andrew Collard - NOAA Federal Good news. We made the crtm v2.4.1-jedi.1 compiled with GSI develop with Cory's fix in CMakeList for GSI. I tested a single cycle (2021080100) and found the following: (1) need to use Little Endian coefficients (2) need to remove HIRS4 from GSI obs namelist. There is a problem reading the HIRS4 coefficients. We do not use HIRS4, so I removed the HIRS4 from the obs namelist

RussTreadon-NOAA commented 8 months ago

Change CRTM_FIX from /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian to /work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0.

With this change gsi.x ran to completion, but it took a long time (1531.006869 seconds) and produced a huge gdasatmanal.log (82 MB, 1719951 lines). Many lines of "AntCorr(FAILURE)" and "Using 5 OpenMP threads = 1 for profiles and" printout are written to the log file.
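
A small sketch of how the repeated messages in such a log could be tallied. The function name `count_log_messages` is hypothetical, and the patterns in the usage comment are fragments of the messages quoted in this thread.

```shell
# Count how many lines of a log contain each given message fragment.
count_log_messages() {
  log=$1
  shift
  for pattern in "$@"; do
    # -F treats the pattern as a fixed string, so the parentheses in
    # "Apply_AntCorr(FAILURE)" need no escaping. grep -c still prints 0
    # when nothing matches; "|| true" only hides its non-zero exit status.
    n=$(grep -c -F -- "$pattern" "$log" || true)
    printf '%s\t%s\n' "$n" "$pattern"
  done
}

# Usage:
#   count_log_messages gdasatmanal.log 'Apply_AntCorr(FAILURE)' 'Remove_AntCorr(FAILURE)' 'OpenMP threads'
```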

CoryMartin-NOAA commented 8 months ago

@RussTreadon-NOAA @emilyhcliu can we use /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Big_Endian?

RussTreadon-NOAA commented 8 months ago

@RussTreadon-NOAA @emilyhcliu can we use /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Big_Endian?

Good suggestion. I tried. gsi.x aborted with a byte-swapped error message

63:  Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
59:  Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
63:  Open_Binary_File(FAILURE) : Error checking ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin file byte order
50:  Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.

emilyhcliu commented 8 months ago

@RussTreadon-NOAA @CoryMartin-NOAA

I added the Big_Endian files for crtm-v2.4.0_emc.3. So, we have big and little endian coefficient files for crtm-v2.4.0_emc.3 and crtm_v2.4.1-jedi.1

CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Big_Endian
CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Big_Endian

The crtm_v2.4.1-jedi.1 fix files have j2 in the filename for NOAA-21 instruments. The crtm-v2.4.0_emc.3-fix files have n21 in the filename for NOAA-21 instruments.

We want n21 in the filename. And, we also use amsua_metop-a_v2.SpcCoeff.bin in our operational GFS. N21 and the amsua_metop-a_v2 files are in crtm-v2.4.0_emc.3-fix only.
I suggest we use the coefficients from crtm-v2.4.0_emc.3-fix.
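
To see which coefficient files one fix directory has that another lacks, a comparison along these lines could be used. This is only a sketch: the function name `only_in_first` and the two directory variables in the usage comment are hypothetical placeholders.

```shell
# List file names that exist in directory $1 but not in directory $2.
only_in_first() {
  for path in "$1"/*; do
    [ -e "$path" ] || continue   # skip the unexpanded glob when $1 is empty
    name=${path##*/}             # strip the directory part
    [ -e "$2/$name" ] || echo "$name"
  done
}

# Usage (placeholder variables for the two fix directories):
#   only_in_first "$FIX_EMC" "$FIX_JEDI"
```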

RussTreadon-NOAA commented 8 months ago

@emilyhcliu , unfortunately, setting export CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Big_Endian in config.anal did not result in a successful gsi.x run.

The executable aborted with the previously mentioned byte-swapped error. Here's the error message from a representative task, 36

 36:  Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
 36:  Open_Binary_File(FAILURE) : Error checking ./crtm_coeffs/cris-fsr_npp.SpcCoeff.bin file byte order
 36:  SpcCoeff_ReadFile(Binary)(FAILURE) : Error opening ./crtm_coeffs/cris-fsr_npp.SpcCoeff.bin
 36:  CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/cris-fsr_npp.SpcCoeff.bin
 36:  READ_CRIS:  ***ERROR*** crtm_spccoeff_load error_status=           3
 36:     TERMINATE PROGRAM EXECUTION
 36: Abort(71) on node 36 (rank 36 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 36
 36: In: PMI_Abort(71, application called MPI_Abort(MPI_COMM_WORLD, 71) - process 36)

RussTreadon-NOAA commented 8 months ago

Reverting gsi.x back to NOAA-EMC/GSI develop at f76d8728 runs to completion when using CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Big_Endian.

Perhaps the issue is with the gsi.x executable built from feature/GSI-crtm_v2.4.1-jedi.1

RussTreadon-NOAA commented 8 months ago

Can we ask the library team to install CRTM 2.4.1-jedi.1 on Orion?

Then we could directly load crtm/2.4.1-jedi.1 from gsi_orion.lua. Additionally, the crtm/2.4.1-jedi.1 module would define CRTM_FIX, thereby removing the need to redefine CRTM_FIX in config.anal.

Just a thought.

RussTreadon-NOAA commented 8 months ago

@emilyhcliu and @CoryMartin-NOAA , I will pause work on this issue until we have a clear path forward.

CoryMartin-NOAA commented 8 months ago

This is very strange, why would it work for @emilyhcliu but not you, @RussTreadon-NOAA . Am I correct in understanding we seem to get the same error regardless of big or little endian coefficients?

RussTreadon-NOAA commented 8 months ago

Yes, @CoryMartin-NOAA , your understanding is correct.

Given your comment, I did the following this morning

Rebuild feature/GSI-crtm_v2.4.1-jedi.1 in /work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/gsi_enkf.fd/ using ush/build.sh. I set BUILD_VERBOSE=YES prior to executing build.sh. File build.log in the ush directory captured the build. gsi.x was built using modules from /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/module/crtm/Intel/2021.5.0.20211109. Library /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/lib/libcrtm.so was linked.
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/lib was added to config.anal in /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI/.
./run_job.sh -c config_gsi.sh -t gdasanal was executed for each of the following CRTM_FIX settings (toggled in config.anal), with the indicated results.

With CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Little_Endian the GSI aborts with

101:  SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 10, file /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_GSI/anal.34500/./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
101:  CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
101:  READ_BUFRTOVS:  ***ERROR*** crtm_spccoeff_load error_status=           3
101:   despite file ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
101:   existing,   TERMINATE PROGRAM EXECUTION
101: Abort(71) on node 101 (rank 101 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 101
101: In: PMI_Abort(71, application called MPI_Abort(MPI_COMM_WORLD, 71) - process 101)

With CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Big_Endian the GSI aborts with

 13:  Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
 13:  Open_Binary_File(FAILURE) : Error checking ./crtm_coeffs/iasi_metop-a.SpcCoeff.bin file byte order
 13:  SpcCoeff_ReadFile(Binary)(FAILURE) : Error opening ./crtm_coeffs/iasi_metop-a.SpcCoeff.bin
 13:  CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/iasi_metop-a.SpcCoeff.bin
 13:  READ_IASI:  ***ERROR*** crtm_spccoeff_load error_status=           3
 13:     TERMINATE PROGRAM EXECUTION
 13: Abort(71) on node 13 (rank 13 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 71) - process 13
 13: In: PMI_Abort(71, application called MPI_Abort(MPI_COMM_WORLD, 71) - process 13)

With CRTM_FIX=/work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0 the GSI runs to completion with the following caveats

thousands of lines of Apply_AntCorr(FAILURE) : Input iFOV inconsistent with AC data and Remove_AntCorr(FAILURE) : Input iFOV inconsistent with AC data printout
thousands of lines of Using 5 OpenMP threads = 1 for profiles and printout

no metop-c amsua or mhs data assimilated. gdas.t00z.gsistat has

o-g 01 rad  metop-c   amsua           1379610            0            0    0.0000       0.0000       0.0000       0.0000
o-g 01 rad  metop-c   mhs             2859175            0            0    0.0000       0.0000       0.0000       0.0000

A run using gsi.x built from crtm/2.4.0 has

o-g 01 rad  metop-c   amsua           1379610       133621        94992   0.11806E+06  0.11806E+06   1.2429       1.2429
o-g 01 rad  metop-c   mhs             2859175        46345        17200    3566.9       3566.9      0.20738      0.20738
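
A sketch of how rows like these could be flagged automatically. The column meanings are an assumption inferred from the lines above (field 4 satellite, field 5 instrument, field 6 number read, field 7 number kept, field 8 number assimilated), and the function name `find_unassimilated_radiances` is hypothetical.

```shell
# Print satellite/instrument pairs whose radiances were read but never assimilated.
# Assumed layout of "o-g ... rad" lines: $4 satellite, $5 instrument,
# $6 number read, $7 number kept, $8 number assimilated.
find_unassimilated_radiances() {
  awk '$1 == "o-g" && $3 == "rad" && $6 > 0 && $8 == 0 { print $4, $5 }' "$1"
}

# Usage:
#   find_unassimilated_radiances gdas.t00z.gsistat
```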

RussTreadon-NOAA commented 8 months ago

As an additional test, do the following

Recopy gsi_observer.sh and submit_gsi_observer.sh from /work/noaa/da/eliu/git/GSI-emilyhcliu/GSI/scripts/gsi to /work2/noaa/da/rtreadon/gdas-validation/expdir/gdas_eval_satwind_GSI
execute ./submit_gsi_observer.sh (no change to copied file)
job 15488640 submitted. Job log file is /work2/noaa/da/rtreadon/GSI-develop2/GSIobserver/2021080100/GSIobserver.o15488640. According to log file, CRTM coefficients were copied from /work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian/

check gsi.stdout in /work2/noaa/da/rtreadon/GSI-develop2/GSIobserver/2021080100/gsi/. GSI aborted with

SpcCoeff_ReadFile(Binary)(FAILURE) : Error reading channel data. input statement requires too much data, unit 10, file /work2/noaa/da/rtreadon/GSI-develop2/GSIobserver/2021080100/gsi/./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
READ_BUFRTOVS:  ***ERROR*** crtm_spccoeff_load error_status=           3
despite file ./crtm_coeffs/amsua_metop-b.SpcCoeff.bin
existing,   TERMINATE PROGRAM EXECUTION

Go back to gsi_observer.sh and add CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Big_Endian. Edit local copy of submit_gsi_observer.sh to use my modified copy of gsi_observer.sh. Execute ./submit_gsi_observer.sh. GSI aborts with

Check_Binary_File(FAILURE) : Data file needs to be byte-swapped.
Open_Binary_File(FAILURE) : Error checking ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin file byte order
SpcCoeff_ReadFile(Binary)(FAILURE) : Error opening ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin
SpcCoeff_ReadFile(Binary)(FAILURE) : Error opening ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin
CRTM_SpcCoeff_Load(FAILURE) : Error reading SpcCoeff file #1, ./crtm_coeffs/cris-fsr_n20.SpcCoeff.bin
READ_CRIS:  ***ERROR*** crtm_spccoeff_load error_status=           3
TERMINATE PROGRAM EXECUTION

Change to CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm-v2.4.0_emc.3-fix/Big_Endian. GSI fails with the same byte-swapped error message as in the previous step.
Change to CRTM_FIX=/work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0. GSI runs to completion. Many AntCorr(Failure) and Using 1 OpenMP threads = 1 for profiles and messages written to gsi.stdout

Summary: g-w and stand-alone script behavior is consistent when run from my Orion account.

CoryMartin-NOAA commented 8 months ago

Hmm, ugh, I suggest we wait for @emilyhcliu before digging further as she apparently has the magic touch to get this working

RussTreadon-NOAA commented 8 months ago

Hmm, ugh, I suggest we wait for @emilyhcliu before digging further as she apparently has the magic touch to get this working

Agreed!

RussTreadon-NOAA commented 8 months ago

Issues sorted out in GSI gdas-validation test. I was using the wrong CRTM_FIX.

gsi.x built from @emilyhcliu feature/GSI-crtm_v2.4.1-jedi.1 runs when

add export CRTM_FIX=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix_gdasapp/fix to config.anal
add export LD_LIBRARY_PATH=/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/lib:${LD_LIBRARY_PATH} to config.anal

gsi.x was processing radiances based on log file printout when it seg faulted. I'm guessing that the seg fault is related to memory. The failed gdasanal was running gsi.x with 84 tasks on 11 nodes (ppn=8) with 5 threads per task. The GSIObserver tests run gsi.x with 200 tasks, ppn=8, threads=1. Resubmit job with GSIObserver configuration. Job immediately died with oom kill. However, in looking at the log file the problem may be a system issue and not a job issue. Will resubmit later to see what happens.
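
The two job layouts mentioned above can be compared with a little arithmetic. This is a sketch only; the function names are hypothetical, and the memory used per task is not known from this thread, so it is not estimated here.

```shell
# Nodes needed for a given task count and tasks per node (ppn), rounded up.
nodes_needed() {
  tasks=$1
  ppn=$2
  echo $(( (tasks + ppn - 1) / ppn ))
}

# Threads running on each full node: one per thread of every task on the node.
threads_per_node() {
  ppn=$1
  threads=$2
  echo $(( ppn * threads ))
}

# Usage:
#   nodes_needed 84 8; threads_per_node 8 5     # failed gdasanal layout
#   nodes_needed 200 8; threads_per_node 8 1    # GSIObserver layout
```

For the failed gdasanal layout this gives 11 nodes with 40 threads per node; the GSIObserver layout gives 25 nodes with 8 threads per node, so the same analysis is spread over more than twice as many nodes.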

@emilyhcliu , your working copy feature/GSI-crtm_v2.4.1-jedi.1 in Orion /work/noaa/da/eliu/git/GSI-emilyhcliu/GSI contains two modified files

        modified:   modulefiles/gsi_common.lua
        modified:   modulefiles/gsi_orion.lua

I recommend that we do not modify gsi_common.lua. This file is used for GSI builds on all platforms. Instead of commenting out the crtm load in gsi_common.lua, we can unload crtm in gsi_orion.lua. I did so in the above mentioned test.

Look in Orion /work2/noaa/da/rtreadon/gdas-validation-test/global-workflow/sorc/gsi_enkf.fd. This is your feature/GSI-crtm_v2.4.1-jedi.1 branch with only one modified file

        modified:   modulefiles/gsi_orion.lua

I retained your local modification to CRTM_FIX and added an unload for crtm

 load("gsi_common")
+unload("crtm/2.4.0")
 setenv("crtm_ROOT","/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build")
 setenv("crtm_VERSION","2.4.1-jedi.1")
 setenv("CRTM_INC","/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/module")
 setenv("CRTM_LIB","/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-intel2022/build/lib/libcrtm_static.a")
-setenv("CRTM_FIX","/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix/Little_Endian")
+setenv("CRTM_FIX","/work/noaa/da/eliu/JEDI-GDAS/crtm_v2.4.1-jedi.1-fix_gdasapp/fix")
 whatis("Name: crtm")
 whatis("Version: 2.4.1-jedi.1")
 whatis("Category: library")

emilyhcliu commented 8 months ago

@RussTreadon-NOAA I updated the branch with your suggestion and did a single-cycle run (observer only). It ran to completion without issues.

RussTreadon-NOAA commented 8 months ago

Thank you @emilyhcliu . Let me keep debugging gsi.x inside the workflow. I can successfully run gsi.x using your scripts. Something odd is going on in the workflow.

RussTreadon-NOAA commented 8 months ago

Debugging found differences in some fix files, gsi namelist settings, and processing of HIRS dump files.

The issue with HIRS dump files and CRTM-2.4.1-jedi.1 was noted above. So as to not touch g-w exglobal_atmos_analysis.sh, add the following to the expdir config.anal
```
export B1HRS2=/dev/null
export B1HRS3=/dev/null
export B1HRS4=/dev/null
```
Emily's stand-alone test uses fix files from GSIFIX=/work2/noaa/da/cmartin/UFO_eval/geovals/GSI/fix. The g-w test takes GSI fix from FIXgsi=/work2/noaa/da/rtreadon/gdas-validation-test/global-workflow/fix/gsi. The following fix files differ between these two directories: ANAVINFO, CONVINFO, and OZINFO. Given this, add the following to expdir config.anal
```
export GSIFIX=/work2/noaa/da/cmartin/UFO_eval/geovals/GSI/fix
export ANAVINFO=$GSIFIX/global_anavinfo.l127.txt
export CONVINFO=$GSIFIX/global_convinfo.txt
export OZINFO=$GSIFIX/global_ozinfo.txt
```
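
A sketch of how two fix directories might be compared file by file. The function name `report_differing_files` is hypothetical, and it is an assumption that both directories use the same file names.

```shell
# Report which of the named files differ between two directories.
report_differing_files() {
  dir_a=$1
  dir_b=$2
  shift 2
  for name in "$@"; do
    if cmp -s "$dir_a/$name" "$dir_b/$name"; then
      echo "same    $name"
    else
      echo "differ  $name"   # also reported when a file is missing on either side
    fi
  done
}

# Usage (file names as exported above):
#   report_differing_files "$GSIFIX" "$FIXgsi" global_anavinfo.l127.txt global_convinfo.txt global_ozinfo.txt
```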

Add the following to the expdir config.anal to address gsi namelist differences

export SETUP="gpstop=55.,${SETUP:-}"
export imp_physics=11
export cao_check=".false."
export ta2tb=".false."
export GRIDOPTS="nlayers(63)=1,nlayers(64)=1,${GRIDOPTS:-}"
export NST_GSI=0

With the above additions to expdir config.anal, the 2021080100 gdasanal job successfully ran to completion.

@emilyhcliu explained that the HIRS change is required when using CRTM-2.4.1-jedi.1. What about the fix file and gsi namelist changes? Which of these changes do we need or want to include in gdas-validation?

RussTreadon-NOAA commented 8 months ago

@emilyhcliu , @ADCollard , & @CoryMartin-NOAA

Additional differences between run using Emily's stand-alone script and the gdas-validation g-w gdasanal job:

while both runs use an anavinfo file which includes the Rcov section, the stand-alone run does not copy the Rcov files to the run directory. The gdas-validation copies the Rcov files to the run directory and gsi.x uses them.
thin4d=.true. in the stand-alone job. thin4d=.false. in the gdasanal job. This changes cpen for several observation types
the obs_input namelist differs considerably between the two runs.
- the stand-alone run has dsfcalc=0 for all obs types. The gdasanal run sets dsfcalc=1 for numerous radiance datasets.
- the two runs set dthin differently for some radiance datasets. For example, the gdasanal has dthin=50 km for avhrr whereas the stand-alone run has dthin=145 km
- the gdasanal job assimilates ompstc8_n20. This dataset is not in the stand-alone job obs_input namelist

If someone can point me at the GSI configuration we want to use for gdas-validation, I can create a JEDI-T2O branch in which config.anal is updated to replicate the target configuration.

emilyhcliu commented 8 months ago

We are using the focus cycle (2021080100) for our evaluation and should be using the focus cycle for the code sprint. So, people can use the 2021080100 geoval and obs files from the UFO evaluation as a reference. These files contain GSI output (e.g. HofX and some derived variables).

The GSI workflow for the code sprint provides a way for people to re-run the GFS processing for the focus cycle, from the prep step to the observer part of the first outer loop. People can also configure it to run with different configurations or cycles. For example, since I work on radiances, I can change the all-sky related namelist and configuration files (anavinfo, satinfo, etc.) to match the latest update in the operational system (ta2tb is true, updated anavinfo, etc.).

@RussTreadon-NOAA I think we should keep the namelist settings and configuration the same as those we used to run 2021080100. For the code sprint, we are not seeking bit-identical results, since we will be checking the end-to-end comparison between GDAS and JEDI. There are some fundamental differences between the two in their current state. So, bottom line, we need the following setup in the GSI workflow:
nvqc = .false.
FGAT should be off.

We should also turn off the Hilbert curve for aircraft data. @CoryMartin-NOAA and @ADCollard found that the switch for the Hilbert curve is hardwired in GSI. So, we need to turn it off in the code, not from the script. @CoryMartin-NOAA and @ADCollard, could you give guidance on this?

ADCollard commented 8 months ago

@emilyhcliu @RussTreadon-NOAA The Hilbert Curve code starts at line 3007 of read_prepbufr.f90.

! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv') then
     vmin=-10.00_r_kind
     vmax=18000.00_r_kind
     nor=0
     ...

The entire if-block starting with if(obstype == 'uv') then should be commented out for now.

RussTreadon-NOAA commented 8 months ago

Namelist OBSQC contains logical variable hilbert_curve. gsi.x defaults this variable to .false.. What if we allow gsi.x to default hilbert_curve to .false. and change line 3010 to read

! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv' .and. hilbert_curve) then

Would this suffice?

CoryMartin-NOAA commented 8 months ago

Namelist OBSQC contains logical variable hilbert_curve. gsi.x defaults this variable to .false.. What if we allow gsi.x to default hilbert_curve to .false. and change line 3010 to read
! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv' .and. hilbert_curve) then
Would this suffice?

Not only would it suffice, but we should probably add this to the develop branch....

RussTreadon-NOAA commented 8 months ago

One question, one suggestion, and one request

Do we want the baseline gdas-validation gsi.x to reproduce the operational GFS v16.3.x or NOAA-EMC/GSI develop (looking ahead to GFS v17)?
Currently I build gsi.x from the forked GSI-crtm_v2.4.1-jedi.1. It seems preferable to merge this forked branch into a NOAA-EMC/GSI gdas-validation branch, tag it, and update JEDI-T2O to clone and build the tag. What do you think?
Would one of you run 2021080100 with exactly the baseline configuration we want all developers to start from? Point me at the script(s), log file, and run directory for the baseline run and I will update JEDI-T2O gdas-validation to reproduce the baseline.

RussTreadon-NOAA commented 8 months ago

Namelist OBSQC contains logical variable hilbert_curve. gsi.x defaults this variable to .false.. What if we allow gsi.x to default hilbert_curve to .false. and change line 3010 to read
! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv' .and. hilbert_curve) then
Would this suffice?
Not only would it suffice, but we should probably add this to the develop branch....

OK, we can open an issue and get it into develop

RussTreadon-NOAA commented 8 months ago

Namelist OBSQC contains logical variable hilbert_curve. gsi.x defaults this variable to .false.. What if we allow gsi.x to default hilbert_curve to .false. and change line 3010 to read
! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv' .and. hilbert_curve) then
Would this suffice?
Not only would it suffice, but we should probably add this to the develop branch....
OK, we can open an issue and get it into develop

I tested this change in GSI tag gfsda.v16.3.10 using the operational 2023103100 gdas cycle. Adding hilbert_curve to the logical test increased the initial (obs-ges) uv penalty by 39.7%. With only if (obstype == 'uv') then the o-g uv penalty is 0.252290097411315393E+06. After adding .and. hilbert_curve to the logical test the uv penalty increased to 0.352427974894839805E+06.
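
The quoted percentage can be reproduced from the two penalties. A small sketch, with a hypothetical function name:

```shell
# Percent change from a baseline value to a new value, to one decimal place.
percent_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

# Usage with the two uv penalties above:
#   percent_change 0.252290097411315393E+06 0.352427974894839805E+06
```

The difference between the two penalties is about 100137.9, which is 39.7% of the original 252290.1.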

Operations run the global GSI with logical hilbert_curve=.false.. Thus, by adding hilbert_curve to the logical test, the uv block in question is not entered.

I'm confused. The original code in read_prepbufr.f90 reads

! the following is gettin the types which will be applied hilbert curve to                                               
!  estimate the density                                                                                                  

  if(obstype == 'uv') then

The comment in the code along with @ADCollard 's guidance suggest that this block should only be executed when hilbert_curve=.true. I assumed this is how we run gsi.x in the GFS. This isn't the case. We execute this block in operations for all uv observations processed by read_prepbufr.f90. Is this what we want to happen?

RussTreadon-NOAA commented 8 months ago

Seems my understanding of logical hilbert_curve is not correct. gsimod.F90 contains the comment

!     hilbert_curve - option for hilbert-curve based cross-validation. works only                                        
!                     with twodvar_regional=.true.

Logical hilbert_curve is for cross-validation in 2DVar regional mode. It's not a variable for global GSI runs. Given this, my suggestion to add hilbert_curve to the logical test in read_prepbufr.f90 is wrong. We must do as @ADCollard said. The entire uv block needs to be commented out. Alternatively, we could add a new logical to bypass the block. Is there any benefit from adding a new logical apart from ease during gdas-validation?

RussTreadon-NOAA commented 8 months ago

The GDAS-validation sprint begins Monday, 11/13. Next week (6-10 Nov) is a short work week (Friday, 11/10, is the Veterans Day holiday). I'm not available 11/10 through 11/12.

Work remains to prepare GDAS-validation for easy use by developers. Some of this work involves others (e.g., how to lower gsi.x wall times on Orion following the 10/23 PM). Other work is ours.

Here's a partial listing of our work items:

remove diagnostic prints from CRTM library
ensure CRTM fix contains all desired CRTM coefficients
finalize GSI configuration for GDAS-validation. Need to sort out fix files, script, & source code updates
create GSI branch, including fix, with target GSI configuration
configure GDASApp fv3jedi_var.x to use the same CRTM coefficients as gsi.x. Seems we should also build fv3jedi_var.x with same CRTM module as gsi.x, right?
create a GDASApp branch with the GDAS-validation configuration. This is the branch to which developers commit updates to yamls and bufr2ioda converters during the sprint.

What other pre-sprint work should be added to the above list?

CoryMartin-NOAA commented 8 months ago

Given that Orion is slow, do we move to Hera? I think the savings in runtime will be offset by the longer job queues though...

Thanks @RussTreadon-NOAA, I think this is a good list. The biggest one is item 3. @ADCollard and @emilyhcliu, what all should we turn off in GSI? I know we need to turn off FGAT, the time thinning error inflation, and VarQC. But what else?

RussTreadon-NOAA commented 8 months ago

I'm wrestling with the move to Hera, too. The 11/13 sprint is step one of gdas validation in that it will focus on the observer (ufo), right? If true, we will have a step two gdas validation at a later date where we compare gsi.x & fv3jedi_var.x minimization (solver) including varbc. Not all this work will be done on Orion (or Hercules). So long term I think we want to extend setup_workspace.sh to Hera.

The gdas validation sprint isn't the only thing DAD staff are working on. Are we using Hera for GFS v17 tests? GFS v17 includes JEDI based marine, land, and aerosol DA. Maybe we reserve Hera for GFS v17 & related JEDI work and keep gdas validation on Orion for the time being.

Thoughts? Comments?

CoryMartin-NOAA commented 8 months ago

I think the extension to Hera is fairly straightforward, the only real sticking point would be mirroring the input data to Hera from Orion. We would need to stage FMS restarts, Gaussian history files, bias correction files, (and observations should be in the glopara space already). This isn't difficult, but it does take up space. Hera space is at a premium compared to Orion.

I also agree that Hera is probably better spent on the GFS T2O specific tasks and use Orion for this lower readiness level testing. GSI may be running slow on Orion, but it still runs.

RussTreadon-NOAA commented 8 months ago

Agreed. The GSI still runs on Orion ... it's just slow. One suggestion from the Orion helpdesk is to recompile the stack we use to build GSI. I asked in g-w issue #1996 about getting this done. GSI slowness on Orion will be addressed. We also have the possibility of using Hercules in the future, though it seems some executables also run slow on Hercules.

RussTreadon-NOAA commented 8 months ago

Conduct the following test on Orion.

1. clone feature/GSI-crtm_v2.4.1-jedi.1 from https://github.com/emilyhcliu/GSI.git
2. edit modulefiles/gsi_orion.lua to replicate as much as possible GDASApp modulefiles/GDAS/orion.lua. Look at /work2/noaa/da/rtreadon/gdas-validation-test/global-workflow/sorc/gsi_enkf.fd_jedi/modulefiles/gsi_orion.lua to see the modified gsi_orion.lua.
3. add required bufr/12 changes to src/gsi/CMakeLists.txt and src/gsi/read_prepbufr.f90
4. execute ush/build.sh. Both gsi.x and enkf.x were built.
5. change GSIDIR in submit_gsi_observer.sh to point at the above mentioned gsi_enkf.fd_jedi
6. comment out LD_LIBRARY_PATH in gsi_observer.sh which pointed at Emily's crtm_v2.4.1-jedi.1
7. execute ./submit_gsi_observer.sh

The job ran to completion. The run directory is /work2/noaa/da/rtreadon/ufoeval/GSIobserver/2021080100/gsi_spack_build_cory_crtmfix. I also ran gsi.x with Emily's original configuration. The run directory for that run is /work2/noaa/da/rtreadon/ufoeval/GSIobserver/2021080100/gsi_hpc_build_emily_crtm.

The fort.2* stats are identical between the two runs with the exception of fort.207. The total radiance penalties differ in the 14th printed digit. There are no differences in the counts for assimilated radiance observations.
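For anyone who wants to repeat this kind of comparison, here is a minimal Python sketch. It is not the procedure used above, just one way to do it. The function names (`compare_fit_files`, `max_relative_diff`) are made up for illustration, and the script assumes the fort.2* files are plain text with floating point values that contain a decimal point.

```python
import filecmp
import re
from pathlib import Path

# Decimal numbers with an optional exponent, e.g. 0.123, -4.5E+03, .5D-02.
# Integers (such as observation counts) are deliberately not matched.
_FLOAT = re.compile(r"[-+]?(?:\d+\.\d*|\.\d+)(?:[eEdD][-+]?\d+)?")


def _floats(path):
    """Return every floating point value found in a text file, in order."""
    text = Path(path).read_text(errors="replace")
    return [float(tok.replace("D", "E").replace("d", "e"))
            for tok in _FLOAT.findall(text)]


def max_relative_diff(file_a, file_b):
    """Largest relative difference between floats at matching positions.

    Returns None when the two files hold a different number of floats.
    """
    a, b = _floats(file_a), _floats(file_b)
    if len(a) != len(b):
        return None
    worst = 0.0
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        if scale > 0.0:
            worst = max(worst, abs(x - y) / scale)
    return worst


def compare_fit_files(dir_a, dir_b, pattern="fort.2*"):
    """Compare files matching `pattern` in dir_a against same-named files in dir_b.

    Each file name maps to "identical", "missing in second run", or the
    largest relative float difference (None if the float counts differ).
    """
    report = {}
    for path_a in sorted(Path(dir_a).glob(pattern)):
        if not path_a.is_file():
            continue
        path_b = Path(dir_b) / path_a.name
        if not path_b.is_file():
            report[path_a.name] = "missing in second run"
        elif filecmp.cmp(path_a, path_b, shallow=False):
            report[path_a.name] = "identical"
        else:
            report[path_a.name] = max_relative_diff(path_a, path_b)
    return report
```

Calling `compare_fit_files` with the two run directories listed above would return one entry per fort.2* file. A difference in the 14th printed digit would show up as a very small relative difference rather than as "identical".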

With the above changes gsi.x is built using the same spack-stack and crtm library as fv3jedi_var.x. The CRTM_FIX used for the spack-stack gsi.x run above was /work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0. This is the same CRTM_FIX used by fv3jedi_var.x when run by g-w.

emilyhcliu commented 8 months ago

I am also using feature/GSI-crtm_v2.4.1-jedi.1, crtm_v2.4.1-jedi.1, and the CRTM coefficients from GDASApp: /work2/noaa/da/cmartin/GDASApp/fix/crtm/2.4.0 to generate test data for UFO Evaluation.

RussTreadon-NOAA commented 8 months ago

Great! Shall we update gsi_orion.lua in a gdas-validation specific branch for the purpose of the upcoming sprint?

RussTreadon-NOAA commented 8 months ago

The following branches have been created in the following repositories for possible use in the GDAS validation sprint:

JEDI-T2O:feature/gdas-validation. To see differences with respect to develop, click here.
- update config.anal to configure GSI for GDAS validation
- update config.atmanl to be consistent with recent changes to g-w develop
- update config.prep to be consistent with recent changes to g-w develop
- update config.resources to be consistent with recent changes to g-w develop
- reorder entries in config_jedi.yaml to be consistent with order used in config_gsi.yaml
- update GSI clone in setup_workspace to checkout feature/gdas-validation, add error trapping to GSI and GDASApp builds
GSI:feature/gdas-validation. To see differences with respect to develop, click here.
- update Orion modulefile to use Emily's crtm_v2.4.1-jedi.1

It is not clear if all the changes in JEDI-T2O branch feature/gdas-validation need to be present for GDAS validation. How do we want gsi.x configured for GSI validation? The current settings in config.anal may not be correct or complete.

Two additional considerations:

- We may want to create a GDASApp feature/gdas-validation branch for GDAS validation.
- We may also need to create a g-w feature/gdas-validation branch. For example, g-w exglobal_atmos_analysis.sh copies (links) CRTM fix file CloudCoeff.GFDLFV3.-109z-1.bin to local file CloudCoeff.bin. In contrast, g-w parm/gdas/atm_crtm_coeff.yaml copies CRTM fix file CloudCoeff.bin to local file CloudCoeff.bin. These are different cloud coefficient files. If we want the same cloud coefficient file used by gsi.x and fv3jedi_var.x, we need to edit g-w exglobal_atmos_analysis.sh or atm_crtm_coeff.yaml.
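One quick way to check whether two coefficient files are byte-for-byte the same is to compare checksums. The sketch below is a generic Python helper, not part of g-w or GDASApp. The function names `sha256_of` and `same_content` are made up for illustration.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_content` could be called on CloudCoeff.GFDLFV3.-109z-1.bin and CloudCoeff.bin under whichever CRTM_FIX directory is in use. A `False` result would confirm that gsi.x and fv3jedi_var.x are reading different cloud coefficients.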

RussTreadon-NOAA commented 8 months ago

@emilyhcliu @RussTreadon-NOAA The Hilbert Curve code starts at line 3007 of read_prepbufr.f90.
! the following is gettin the types which will be applied hilbert curve to
!  estimate the density

  if(obstype == 'uv') then
     vmin=-10.00_r_kind
     vmax=18000.00_r_kind
     nor=0
     ...
The entire if-block starting with if(obstype == 'uv') then should be commented out for now.

Lines 3007 to 3165 have been commented out in the snapshot of read_prepbufr.f90 in feature/gdas-validation. Done at 7ef942c3.
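For reference, commenting out a range of lines in a free-form Fortran source file can be scripted. The Python sketch below shows one way to do it. It is not how commit 7ef942c3 was made, and the function name `comment_out_lines` is made up for illustration.

```python
def comment_out_lines(text, first, last, marker="!"):
    """Prefix lines first..last (1-based, inclusive) with a comment marker.

    Blank lines are left untouched. Raises ValueError for a bad range.
    """
    lines = text.splitlines(keepends=True)
    if not 1 <= first <= last <= len(lines):
        raise ValueError(f"invalid range {first}-{last} for {len(lines)} lines")
    for index in range(first - 1, last):
        if lines[index].strip():
            lines[index] = marker + lines[index]
    return "".join(lines)
```

Applied to the contents of read_prepbufr.f90 with `first=3007` and `last=3165`, this would disable the whole block, provided the line numbers still match the snapshot in question. Because every non-blank line in the range is prefixed, continuation lines inside the block are disabled along with the statements they belong to.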

NOAA-EMC / JEDI-T2O

Create a GSI branch with configuration to use CRTM-2.4.1-jedi.1 #93