ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues
Apache License 2.0

Creation of ACCESS-TX data assimilation Rose-cylc suite on Raijin #170

Open penguian opened 9 years ago

penguian commented 9 years ago

| by pag548@nci.org.au


As part of the ITF project, create a data assimilation suite to provide initial conditions for Xingbao's 4.4km UM model for the domain encompassing the North West shelf.

Data assimilation will use the APS2 set of observations, using the APS2 ACCESS-R perturbation forecast model.


Issue migrated from trac:170 at 2024-01-31 18:10:51 +1100

penguian commented 9 years ago

@scott.wales@bom.gov.au changed status from new to assigned

penguian commented 9 years ago

@scott.wales@bom.gov.au changed owner to `pag548`

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Attempt to get Wenming's UM task xbnmy to output _ca0 files for input into VAR.

This task is run within the rose suite access_8.2_1.5k_r12.

Initially, STASHC for xbnmy is void of all OPS and VAR macros. Strangely, STASHC for xbnms (which is used for reconfiguration) contains entries for CX fields macro and LS fields macro.

First trial is to copy the STASH entries from xbnms into xbnmy. Append the following to /short/dp9/pag548/access_8.2_1.5k_r12/bin/um_run.sh

export PPVAR=$DATAW/$PREFIXW$RUNID.ppvar

The forecast task succeeds and creates two output files

prefixwS.cxbkgerr

prefixwS.ppvar

The first file seems to be a cx-background file. It contains data at times 2015010103/05/06/08Z

The second file seems to be a linearisation file. It contains two entries at 2015010108/17Z

Question 1) How can separate linearisation files of the form jobid_ca006/07/08/09/10/11/12 be created, as required by VAR?
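
A quick way to see which of the files VAR expects are actually present is sketched below. It assumes the files are named `<runid>_ca006` to `<runid>_ca012`, as in the question above; the directory and the two files created in it are temporary stand-ins for illustration only.

```shell
#!/bin/bash
# Sketch: report which linearisation-state files are missing.
# ASSUMPTION: files are named <runid>_caNNN with NNN = 006..012, as in the
# question above. The directory below is a hypothetical stand-in.
runid=xbnmy
ls_dir=$(mktemp -d)

# Pretend the UM wrote only two of the seven files.
touch "$ls_dir/${runid}_ca006" "$ls_dir/${runid}_ca007"

missing=""
for n in 006 007 008 009 010 011 012; do
    f="$ls_dir/${runid}_ca$n"
    if [[ ! -f $f ]]; then
        missing="$missing ca$n"
    fi
done
missing=${missing# }   # drop the leading space

echo "missing: ${missing:-none}"
rm -rf "$ls_dir"
```

Pointing `ls_dir` at the real output directory (and removing the `touch` line) would give the list for an actual run.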

Attempt to change the output times of the linearisation states by replacing

 &TIME_NAME="LSTimes", ITYP=1
 IOPT=2
 ITIMES=3,UNT3="H ",
 ISER= 5, 14, 26,
 /

with

 &TIME_NAME="LSTimes", ITYP=1
 IOPT=1,
 ISTR=1,IEND=6,IFRE=1,UNT3="H",
/

These STASHC parameters are sourced from the R12 Ngamai UM task xbnkq.

Using this modified STASHC file, the UM job fails on the 5th forecast hour.

/short/dp9/pag548/access_8.2_1.5k_r12_S/2015010100/SW> tail um.fort6.pe/um.fort6.pe0 

********************************************************************************

MPPIO: Open: xbnmya_pa005 on unit  60                              
 Between timestep    216 and    432 average iterations =       5.815
 Iterations: Max #   7 at timestep    220. Min   5 at timestep    219
MPPIO: Open: xbnmya_pc005 on unit  62                              
MPPIO: Open: xbnmya_pe005 on unit  64                     

From /short/dp9/pag548/access_8.2_1.5k_r12_S/2015010100/SW/um_output

 vt params    1042.18000000000        1.00000000000000     
   14.2611000000000       0.416351000000000     
 vt params    1042.18000000000        1.00000000000000     
   14.2611000000000       0.416351000000000     
OPEN:  Claimed 4194304 Bytes (524288 Words) for Buffering
OPEN:  Buffer Address is                   F20E4040
OPEN:  Claimed 4194304 Bytes (524288 Words) for Buffering
OPEN:  Buffer Address is                   F2BB5040
OPEN:  Claimed 4194304 Bytes (524288 Words) for Buffering
OPEN:  Buffer Address is                   F4176040
gc_abort (Processor     3): Job aborted from ereport.
gc_abort (Processor     2): Job aborted from ereport.
gc_abort (Processor    11): Job aborted from ereport.

How can we configure task xbnmy to produce correct cx.background and ls-fields-dir information?

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Attempt to use the SREP suite as the basis for ACCESS-TCX. Begin with au-aa202, which is my version of au-aa168 with minor changes (e.g. project is set to dp9 instead of dp7).

Rosie-go - copy au-aa202 to au-aa314.

Change the definitions of UM_BIN and UM_RECON_EXE to use Xingbao's executables. Use svn diff to confirm that these are the only differences between the two suites.

[pag548@accessdev au-aa314]$ svn diff --old=^/a/a/2/0/2/trunk --new=.
Index: bin/envfile
===================================================================
--- bin/envfile (svn+ssh://accessdev.nci.org.au/home/access-svn/roses_au_svn/a/a/2/0/2/trunk)   (revision 1057)
+++ bin/envfile (working copy)
@@ -10,6 +10,7 @@
 #                   ODB_VERSION (compatible with odb/30.0.2) and OPS_SATWINDNL_DIR
 #  0.13  05/02/15   Updates to allow suite.rc to call ODB tasks to seperate ODB generation from
 #                   get_bufr task. Use reference to AddHoursScript and AddDayScript
+#  0.14  07/05/15   rosie copy of au-aa202@[977] ;this suite is au-aa314@[1056]
 #----------------------
 # Shared Repositories
 #----------------------
@@ -747,9 +748,11 @@
 #
 # UM tasks
 #
-export UM_BIN=/short/du7/ycx548/data/um/bin/UKV/PS32/bin
+#export UM_BIN=/short/du7/ycx548/data/um/bin/UKV/PS32/bin #Original SREP
+export UM_BIN=/short/dp9/pag548/roses/access_8.2_1.5k_r12/beans/um #Test using Xingbao's executable
 export UM_EXE=$UM_BIN/UM8.2_UKV_PS32.exe
-export UM_RECON_EXE=$UM_BIN/qxreconf
+#export UM_RECON_EXE=$UM_BIN/qxreconf #Original SREP
+export UM_RECON_EXE=/short/dp9/pag548/roses/access_8.2_1.5k_r12/beans/umr/qxreconf
 export UM_82_ENV=/home/548/ycx548/.um_82_env.163
 # does not seem to work for ConfigLS
 #export VAR_RC_PROG=$UM_RECON_EXE
Index: rose-suite.info
===================================================================
--- rose-suite.info     (svn+ssh://accessdev.nci.org.au/home/access-svn/roses_au_svn/a/a/2/0/2/trunk)   (revision 1057)
+++ rose-suite.info     (working copy)
@@ -1,5 +1,4 @@
 access-list=*
 owner=pag548
 project=dp9
-my_project=dp9
-title=Copy of au-aa168: rt_xdm3d_rw
+title=Copy of au-aa202: ACCESS TCX suite based on SREP

Attempt to cold start for 2014101204.

Reconfiguration seems correct after task L2H: /home/548/pag548/cylc-run/au-aa314/log/job> xxdiff L2H.2014101204.3.UM/xbjis000.xbjis.d15127.t143338.rcf.leave ../../../au-aa202/log/job/L2H.2014101204.1.UM/xbjis000.xbjis.d15127.t100638.rcf.leave

iau-start-dumps are identical in size but I can't view them in xconv (because they are stretched grids?)

/short/dp9/pag548/work/au-aa202/2014101203/run/staging> ls -lt iau-start-dump 
-rw-rw----+ 1 pag548 dp9 3033870336 May  7 10:17 iau-start-dump

/short/dp9/pag548/work/au-aa314/2014101203/run/staging> ls -lt iau-start-dump 
-rw-rw----+ 1 pag548 dp9 3033870336 May  7 15:33 iau-start-dump
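
Matching sizes do not guarantee matching contents. A byte-wise comparison with `cmp` is one way to check without needing xconv; the sketch below uses two small temporary files as stand-ins for the two iau-start-dump files.

```shell
#!/bin/bash
# Sketch: byte-wise comparison of two files of equal size.
# The temporary files are stand-ins for the two iau-start-dump files.
work=$(mktemp -d)
printf 'AAAA' > "$work/dump_aa202"
printf 'AAAB' > "$work/dump_aa314"   # same size, different last byte

size_a=$(wc -c < "$work/dump_aa202")
size_b=$(wc -c < "$work/dump_aa314")

# cmp -s is silent and returns 0 only if the files are byte-identical.
if cmp -s "$work/dump_aa202" "$work/dump_aa314"; then
    result=identical
else
    result=different
fi

echo "sizes: $size_a vs $size_b, contents: $result"
rm -rf "$work"
```

Note that two dumps can differ in a few header words (for example timestamps) and still hold the same fields, so a "different" result here would only be a prompt to look further.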

InitialFC for au-aa314 runs. The UM job outputs are almost the same:

/home/548/pag548/cylc-run/au-aa314/log/job/InitialFC.2014101204.7.UM> xxdiff xbjiu000.xbjiu.d15127.t150806.leave ../../../../au-aa202/log/job/InitialFC.2014101204.1.UM/xbjiu000.xbjiu.d15127.t101227.leave &
The difference is in the time taken

I diff the output from both jobs and the main differences are
1. au-aa314 contains no OpenMP specification
2. au-aa314 contains less IOS Async information

/short/dp9/pag548/work/au-aa314/2014101203/run/um-output/dataw> xxdiff pe_output/xbjiu.fort6.pe0 ../../../../../au-aa202/2014101203/run/um-output/dataw/pe_output/xbjiu.fort6.pe0

In file "pe_output/xbjiu.fort6.pe0":
------------------------------
11: I am PE     0 on [155]

In file "../../../../../au-aa202/2014101203/run/um-output/dataw/pe_output/xbjiu.fort6.pe0":
------------------------------
11: I am PE     0 on [163]
12: I am running with  1 thread(s).
13: OpenMP Specification: 201107

In file "../../../../../au-aa202/2014101203/run/um-output/dataw/pe_output/xbjiu.fort6.pe0":
------------------------------
40: IOS: Info: Async Stash Dispatch slots =  10
41: IOS: Info: Async fields/levs per pack =  76
42: IOS: Info: Async send empty tiles     = F
43: IOS: Info: Async stats profiling      = F

In conclusion, the job output seems identical apart from the runtime, which is presumably because the SREP executable was compiled with asynchronous I/O. However, I don't know how to view the pp forecast files, as xconv doesn't work on the variable grid.

penguian commented 9 years ago

@paul.gregory@bom.gov.au changed _comment0 which not transferred by tractive

penguian commented 9 years ago

@paul.gregory@bom.gov.au _uploaded file xbjiua_pa005_aa202-10mwindU.png (234.7 KiB)_

au-aa202 10m winds

penguian commented 9 years ago

@paul.gregory@bom.gov.au _uploaded file xbjiua_pa005_aa202-temp1-5M.png (203.6 KiB)_

au-aa202 1.5 m temperature

penguian commented 9 years ago

@paul.gregory@bom.gov.au _uploaded file xbjiua_pa005_aa314-10mwindU.png (234.7 KiB)_

au-aa314 10m winds

penguian commented 9 years ago

@paul.gregory@bom.gov.au _uploaded file xbjiua_pa005_aa314-temp1-5M.png (203.6 KiB)_

au-aa314 1.5 m temperature

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Examination of forecast file xbjiua_pa005 from both suites aa202 and aa314, using xconv/1.92 on ngamai, shows that it is identical in both.

xbjiua_pa005_aa202-10mwindU.png,800px au-aa202 10m Winds (U component)

xbjiua_pa005_aa314-10mwindU.png,800px au-aa314 10m Winds (U component)

xbjiua_pa005_aa314-temp1-5M.png,800px au-aa314 1.5 m temperature

xbjiua_pa005_aa202-temp1-5M.png,800px au-aa202 1.5 m temperature

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Update to au-aa314/bin/envfile to attempt to run reconfiguration (task L2H) using ACCESS-TCX grids specified in /short/dp9/pag548/roses/access_8.2_1.5k_r12/beans/um/SW.um.nl

Original SREP UM executables are used.

Differences between the envfile and a standard SREP envfile can be viewed here: https://accessdev.nci.org.au/trac/changeset?reponame=roses&new=1062%40a%2Fa%2F3%2F1%2F4%2Ftrunk%2Fbin%2Fenvfile&old=977%40a%2Fa%2F2%2F0%2F2%2Ftrunk%2Fbin%2Fenvfile

L2H fails with a segmentation fault:

/short/dp9/pag548/work/au-aa314/2014101204/run/um-output/dataw/recon_atm.parexe.22738: line 417: 22783 Aborted                 (core dumped) /short/du7/ycx548/data/um/bin/UKV/PS32/bin/qxreconf

No other error messages present in log:

/home/548/pag548/cylc-run/au-aa314/log/job/L2H.2014101204.9.UM/xbjis000.xbjis.d15131.t103314.rcf.leave

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
qxreconf           0000000000526223  rcf_h_int_init_bl         705  rcf_h_int_init_bl_mod.f90
qxreconf           00000000004775B3  rcf_init_h_interp         105  rcf_init_h_interp_mod.f90
qxreconf           000000000046CD5F  rcf_control_mod_m         109  rcf_control_mod.f90
qxreconf           0000000000426128  MAIN__                     87  reconfigure.f90
qxreconf           000000000042602C  Unknown               Unknown  Unknown
libc.so.6          00002AC454F7FD5D  Unknown               Unknown  Unknown
qxreconf           0000000000425F39  Unknown               Unknown  Unknown
/short/dp9/pag548/work/au-aa314/2014101204/run/um-output/dataw/recon_atm.parexe.13489: line 417: 13537 Aborted 

The error occurs at line 705 of rcf_h_int_init_bl_mod.f90. Code header:

!+ Initialises bilinear interpolation weights

Module Rcf_H_Int_Init_BL_Mod

!  Subroutine Rcf_H_Int_Init_BL
!
! Description:
!   Sets up the interpolation and rotation weights for bilinear
!   horizontal interpolation.
!
! Method:
!   Weights stored in the interp_weights_mod module
!   Seperate calculations for P, U, V and P zonal points.
!
!   Based (with many changes) on UM4.5 code.
!
! Code Owner: See Unified Model Code Owner's HTML page
! This file belongs in section: Reconfiguration

Relevant section

!--------------------------------------------------------------
! 5: Weights and indices for zonal mean P points:
!--------------------------------------------------------------
! 5.1: Lat and lon of target grid

!If output grid is variable resolution 
!for phi_out for OZONE_ZONAL. C.Wang 17/05/07
If (L_vargrid_target) Then
  lambda_out(1) = hdr_out % ColDepC(1,1)
  Do j=1, grid_out % glob_p_rows
    phi_out(j)   =hdr_out % RowDepC(j,1)
  End Do
Else
  lambda_out(1) = start_lon_target
  Do j=1, grid_out % glob_p_rows
    phi_out(j) = start_lat_target + delta_lat_target * (j+p_offset_out)
  End Do
End If

penguian commented 9 years ago

@paul.gregory@bom.gov.au changed _comment0 which not transferred by tractive

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Segmentation fault fixed.

In output file /home/548/pag548/cylc-run/au-aa314/log/job/L2H.2014101204.12.UM/xbjis000.xbjis.d15132.t144049.rcf.leave

 Sizes namelists file: 
 /short/dp9/pag548/work/au-aa314/2014101204/tmp/xbjis.sizes                     

 &NSUBMODL
 N_INTERNAL_MODEL        =                     1,
 N_SUBMODEL_PARTITION    =                     0,
 INTERNAL_MODEL_LIST     =                     1,                     0,
 SUBMODEL_FOR_IM =                     1,                     0
 /
 &NLSIZES
 GLOBAL_ROW_LENGTH       =                     0,
 GLOBAL_ROWS     =                     0,

Examination of the subroutine in rcf_h_int_init_bl_mod.f90 suggested that the variables GLOBAL_ROWS and GLOBAL_ROW_LENGTH are important.

The comparable section for an SREP run (file /home/548/pag548/cylc-run/au-aa202/log/job/L2H.2014101204.1.UM/xbjis000.xbjis.d15127.t100638.rcf.leave) contained

 GLOBAL_ROW_LENGTH       =                     648,
 GLOBAL_ROWS     =                     720,

Original SREP grid definition in the envfile contained

export UM_ROWLEN=648
export UM_ROWS=720

For our suite envfile we copied grid data from /short/dp9/pag548/roses/access_8.2_1.5k_r12/beans/um/SW.um.nl which contained

export GLOBAL_ROW_LENGTH=1112
export GLOBAL_ROWS=696

These replaced the original SREP grid definition environment variables, so UM_ROWLEN and UM_ROWS were no longer defined.

Appending the following to our suite envfile

#Add variables UM_ROWS and UM_ROWLEN to be consistent with SREP
export UM_ROWS=$GLOBAL_ROWS
export UM_ROWLEN=$GLOBAL_ROW_LENGTH

Gives the correct readout in /home/548/pag548/cylc-run/au-aa314/log/job/L2H.2014101204.13.UM/xbjis000.xbjis.d15132.t163605.rcf.leave

 GLOBAL_ROW_LENGTH       =                    1112,
 GLOBAL_ROWS     =                     696,

The reconfiguration no longer aborts with a segmentation fault.
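
To catch this kind of silent zero earlier, the mapping could guard against unset source variables. The following is only a sketch of that idea, not what the suite currently does; the grid values are the ones quoted above.

```shell
#!/bin/bash
# Sketch: derive the SREP-style names from the GLOBAL_* names and stop early
# if either source variable is unset or empty.
export GLOBAL_ROW_LENGTH=1112
export GLOBAL_ROWS=696

# ${var:?msg} aborts a non-interactive shell with msg if var is unset or empty,
# instead of passing an empty value on to the sizes namelist.
export UM_ROWS=${GLOBAL_ROWS:?GLOBAL_ROWS is not set}
export UM_ROWLEN=${GLOBAL_ROW_LENGTH:?GLOBAL_ROW_LENGTH is not set}

echo "UM_ROWLEN=$UM_ROWLEN UM_ROWS=$UM_ROWS"
```

With this form, a missing grid definition would stop the envfile with a message rather than surfacing later as GLOBAL_ROWS = 0 in the reconfiguration log.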

To do : Fix ancillary files

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


make_bc task exceeds the default SREP wall clock time.

The settings used in Wenming's suite.rc for his make_lbc task are

-l = 'ncpus=8,walltime=02:00:00,mem=16GB'

These have been copied to the suite.rc in au-aa314

   [[ make_bc ]]
       inherit = remote_cycling
       retry delays = 2
       title = create lbc
       description = "create lbc"
       [[[job submission]]]
            method = pbs
       [[[directives]]]
            -q = normal
            -P = [ environ['PROJECT'] ]
            -l = 'ncpus=8,walltime=02:00:00,mem=16GB'

However, they appear to be overridden by another section of the suite.

In the actual job submission script, the walltime, ncpus and memory options have not changed:

/home/548/pag548/cylc-run/au-aa314/log/job> more make_bc.2015041200.39
#!/bin/bash

# ++++ THIS IS A CYLC TASK JOB SCRIPT ++++
# Task 'make_bc.2015041200' in suite 'au-aa314'
# Job submission method: 'pbs'

# DIRECTIVES:
#PBS -e cylc-run/au-aa314/log/job/make_bc.2015041200.39.err
#PBS -l ncpus=1,mem=15G,walltime=00:50:00
#PBS -o cylc-run/au-aa314/log/job/make_bc.2015041200.39.out
#PBS -N make_bc.2015041
#PBS -q normal
#PBS -P dp9

So the task still fails because it exceeds the walltime limit:

/home/548/pag548/cylc-run/au-aa314/log/job> more make_bc.2015041200.36.err 
Currently Loaded Modulefiles:
  1) pbs                   5) xxdiff/4.0            9) intel-cc/12.1.8.273
  2) dot                   6) cylc/5.4.14          10) openmpi/1.6.3
  3) python/2.7.3          7) rose/2014-05
  4) idl/8.2               8) intel-fc/12.1.8.273
Currently Loaded Modulefiles:
  1) pbs                   5) xxdiff/4.0            9) intel-cc/12.1.8.273
  2) dot                   6) cylc/5.4.14          10) openmpi/1.6.3
  3) python/2.7.3          7) rose/2014-05
  4) idl/8.2               8) intel-fc/12.1.8.273
=>> PBS: job killed: walltime 3029 exceeded limit 3000
Terminated

Are there any suite or environment settings that are overriding the make_bc task definition in suite.rc?

au-aa314/bin/make_bc.ksh calls au-aa314/bin/scripts/make_lbc_78.ksh which calls au-aa314/bin/scripts/qsub_makelbc_78.ksh
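
The request that actually reached PBS can be read straight from the generated job script. The sketch below works on a temporary copy of the directive lines quoted above (not the real file) and converts the walltime to seconds; the result matches the "limit 3000" in the error log, which confirms that the 02:00:00 request in suite.rc was not applied.

```shell
#!/bin/bash
# Sketch: read the resource request from a generated job script and convert
# its walltime to seconds. The job script here is a temporary copy of the
# directive lines quoted above, not the real file.
job=$(mktemp)
cat > "$job" <<'EOF'
#PBS -l ncpus=1,mem=15G,walltime=00:50:00
#PBS -q normal
#PBS -P dp9
EOF

# Pull out the value after "walltime=" on the "#PBS -l" line.
walltime=$(sed -n 's/^#PBS -l .*walltime=\([0-9:]*\).*/\1/p' "$job")

# Split HH:MM:SS and convert; 10# forces base 10 so "08" is not read as octal.
IFS=: read -r hh mm ss <<< "$walltime"
seconds=$(( 10#$hh * 3600 + 10#$mm * 60 + 10#$ss ))

echo "walltime=$walltime ($seconds s)"
rm -f "$job"
```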

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Attempt to use frames instead of standard make_bc scripts.

Scripts point to export frame_nml=/home/548/ycx548/keep/scripts/makebc_template.nml

which no longer exists.

Swap with export frame_nml=$lbc_namelist. Generates error

[2575] cannot access MDSS.  Please run mdss on a host which can.
cylc (scheduler - 2015/05/29 12:01:11): CRITICAL  make_lbc.sh failed at 2015041200 at 2015-05-29T12:01:11
[FAIL] make_bc.ksh $CYLC_SUITE_DEF_PATH/bin/envfile # return-code=1
Received signal ERR

Relevant section of the script

#
# prepare sam archive dir
#
mdss_frame_dir=${USER}/frames/${MY_LAM}
mdss mkdir -p ${mdss_frame_dir}

Peter Steinle says the options -q copyq and -lother=mdss should be passed to qsub when transferring data to/from mdss directories. Amend script to

qrsh -q copyq -lother=mdss "mdss mkdir -p ${mdss_frame_dir}"

Still fails

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Robin says that netmv will automatically create target file/directory on mdss according to the man pages for netmv and netcp.

Attempting to run frames using the command

jobid=`qsub -q copyq -lother=mdss -P ${PROJECT} -V -o $frame_dir -e $frame_dir $frame_dir/frames.qsb`

fails. According to stderr

/short/dp9/pag548/work/au-aa314/2015041200/frame/SW> more 864002.r-man2.ER 
+ ulimit -s unlimited
+ echo frame_infile=/short/dp9/pag548/work/au-aa314/2015041200/frame/SW/qwxbjva_pi002_2015041112_utc_fc.um
+ echo frame_ofile=/short/dp9/pag548/work/au-aa314/2015041200/frame/SW/frame_qwxbjva_pi002_2015041112_utc_fc.um
+ echo frame_nl=/short/dp9/pag548/work/au-aa314/2015041200/frame/SW/frame.nml
+ /short/du7/ycx548/keep/um/vn8.2/normal/utils/framesx -p -n /short/dp9/pag548/work/au-aa314/2015041200/frame/SW/frame.nml -i /short/dp9/pag548/work/au-aa314/2015041200/frame/SW/qwxbjva_pi002_2015041112_utc_fc.um -ow /short/dp9/pag548/work/au-aa314/2015041200/frame/SW/frame_qwxbjva_pi002_2015041112_utc_fc.um
Error - no temporary directory

How do I specify the temporary directory? According to UM8.2 frames documentation there is no need to specify this

http://ngamai04.bom.gov.au/~access/umdoc_systems/umdoc_system_8.2/UM_docs/papers/html/F57/node3.html

However this page suggests that $TMPDIR must be specified http://ngamai04.bom.gov.au/~access/umdoc_systems/umdoc_system_8.2/UM_docs/papers/html/F57/node4.html

From frames launcher script

# Directory for intermediate files
UM_TMPDIR=${UM_TMPDIR:-${SCRATCH:-$TMPDIR}}
if [[ ! -d $UM_TMPDIR ]] ; then
  echo "Error - no temporary directory" >&2
  exit 1
fi
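
The launcher check above falls back from UM_TMPDIR to SCRATCH to TMPDIR, so the error means none of the three pointed at an existing directory inside the job. The sketch below reproduces that logic in a small function to show both outcomes; the function name and the temporary directory are illustrative only, and exporting one of these variables in frames.qsb is an assumption about a possible fix, not something confirmed here.

```shell
#!/bin/bash
# Sketch of the fallback chain used by the launcher check quoted above:
# UM_TMPDIR, else SCRATCH, else TMPDIR.
resolve_tmpdir() {
    # Run in a subshell so the caller's variables are left untouched.
    (
        UM_TMPDIR=${UM_TMPDIR:-${SCRATCH:-$TMPDIR}}
        if [[ ! -d $UM_TMPDIR ]]; then
            echo "Error - no temporary directory" >&2
            exit 1
        fi
        echo "$UM_TMPDIR"
    )
}

# With none of the three set, the check fails as in the job's stderr.
unset UM_TMPDIR SCRATCH TMPDIR
if resolve_tmpdir >/dev/null 2>&1; then before=ok; else before=fail; fi

# Setting any one of them to an existing directory satisfies the check.
scratch_dir=$(mktemp -d)
after=$(TMPDIR=$scratch_dir resolve_tmpdir)

echo "unset: $before, with TMPDIR: $after"
rmdir "$scratch_dir"
```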

penguian commented 9 years ago

@paul.gregory@bom.gov.au changed _comment0 which not transferred by tractive

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Frames code can run but fails in reading namelist.

/short/dp9/pag548/work/au-aa314/2015041200/tmp> more frames_out.pag548.d15153.t091415.17169
...
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: check_iostat
? Error Code:    17
? Error Message:  Error reading namelist INTFCNSTA. Please check input list against code.
? Error generated from processor:     0
? This run generated   1 warnings
????????????????????????????????????????????????????????????????????????????????
...
/short/du7/ycx548/keep/um/vn8.2/normal/utils/framesx: line 299: 17176: Abort

I have no source code available for the frames code, so it's difficult to see where the error is.

Note the original namelist provided in the SREP envfile :

export frame_nml=/home/548/ycx548/keep/scripts/makebc_template.nml

No longer exists.

I have copied the namelist used to make the LBCs, but without a reference I'm unsure whether it is suitable for the frames executable.

export lbc_namelist=/short/dp9/pag548/roses/access_8.2_1.5k_r12/beans/makebc/makebc.$MY_LAM.nml #Xingbao's job
export frame_nml=$lbc_namelist

penguian commented 9 years ago

@paul.gregory@bom.gov.au uploaded file L2H-output-2014111200.png (26.3 KiB)

Surface temperature fields for reconfigured iau-start-dump

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Attempt to run InitialFC using LBCs created by Xingbao using make_lbc.

Reconfiguration step L2H succeeded and domain looks correct, although surface temperatures are not.

L2H-output-2014111200.png,800px

InitialFC fails with message just after opening the LBC file

OPEN:  File /short/dp9/pag548/work/au-aa314/2015041200/input/lbc6.2015041200 to be Opened on Unit 125 Exists
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Comparing the output to that of an SREP InitialFC task shows a difference in the header information read from the LBC file.

For the ACCESS TCX run

MPPIO: Open: /short/dp9/pag548/work/au-aa314/2015041200/input/lbc6.2015041200 on unit 125                              
MPPIO: from environment variable ALABCIN1

 FIXED LENGTH HEADER
 -------------------
 Dump format version-32768
 UM Version No         802
 Atmospheric data
 Charney-Phillips on radius levels
 Over global domain
 Boundary dataset
 Exp No =-32768 Run Id =     0
 Gregorian calendar
 Arakawa C grid
                       Year  Month Day Hour Min  Sec  DayNo  
 First Validity time = 2015    4   11   12    0    0    101
 Last  Validity time = 2015    4   16   14    0    0    106
 Interval            =    0    0    0    0   60    0      0
                        Start     1st dim    2nd dim    1st parm    2nd parm
 Integer Consts           257         46                    46
 Real Consts              303         38                    38
 Level Dep Consts         341         71          4         71          4
 Row Dep Consts        -32768     -32768     -32768     -32768     -32768
 Column Dep Consts     -32768     -32768     -32768     -32768     -32768
 Fields of Consts      -32768     -32768     -32768          1          1
 Extra Consts          -32768     -32768                     1
 History Block         -32768     -32768                     1
 CFI No 1              -32768     -32768                     1
 CFI No 2              -32768     -32768                     1
 CFI No 3              -32768     -32768                     1
 Lookup Tables            625         64       1477         64       1477
 Model Data            524289          0                     0

 LEVEL DEPENDENT CONSTANTS
      284 64-bit words long

Compared to an SREP run

MPPIO: Open: /short/dp9/pag548/work/au-aa202/2014101204/input/lbc6.2014101204 on unit 125                              
MPPIO: from environment variable ALABCIN1

 FIXED LENGTH HEADER
 -------------------
 Dump format version-32768
 UM Version No         707
 Atmospheric data
 Charney-Phillips on radius levels
 Over rotated LAM domain
 Boundary dataset
 Exp No =-32768 Run Id =     0
 Gregorian calendar
 Arakawa C grid
                       Year  Month Day Hour Min  Sec  DayNo  
 First Validity time = 2014   10   11   21    0    0    284
 Last  Validity time = 2014   10   14    2    0    0    287
 Interval            =    0    0    0    1    0    0      0
                        Start     1st dim    2nd dim    1st parm    2nd parm
 Integer Consts           257         46                    46
 Real Consts              303         38                    38
 Level Dep Consts         341         71          4         71          4
 Row Dep Consts           625        720          2        720          2
 Column Dep Consts       2065        648          2        648          2
 Fields of Consts      -32768     -32768     -32768          1          1
 Extra Consts          -32768     -32768                     1
 History Block         -32768     -32768                     1
 CFI No 1              -32768     -32768                     1
 CFI No 2              -32768     -32768                     1
 CFI No 3              -32768     -32768                     1
 Lookup Tables           3361         64        649         64        649
 Model Data            524289          0                     0

 LEVEL DEPENDENT CONSTANTS
      284 64-bit words long

 ROW DEPENDENT CONSTANTS
     1440 64-bit words long

 COLUMN DEPENDENT CONSTANTS
     1296 64-bit words long

Values for Row and Column Dependent Constants in our ACCESS-TCX LBC are not defined.
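
The same comparison can be made mechanically on the log text. This sketch flags Row/Column Dep Consts entries whose start address is -32768, the value that appears wherever an entry is not defined in the printouts above; the sample lines are copied from those printouts and the function name is illustrative.

```shell
#!/bin/bash
# Sketch: flag fixed-length-header entries whose start address is -32768,
# the value shown for undefined entries in the printouts above.
check_header() {
    # Prints the names of Row/Column Dep Consts entries that are undefined,
    # space-separated on one line (empty line if all are defined).
    awk '/^ *(Row|Column) Dep Consts/ && $4 == -32768 {
             out = out (out ? " " : "") $1
         }
         END { print out }'
}

tcx_header=' Row Dep Consts        -32768     -32768     -32768     -32768     -32768
 Column Dep Consts     -32768     -32768     -32768     -32768     -32768'
srep_header=' Row Dep Consts           625        720          2        720          2
 Column Dep Consts       2065        648          2        648          2'

tcx_missing=$(printf '%s\n' "$tcx_header" | check_header)
srep_missing=$(printf '%s\n' "$srep_header" | check_header)

echo "ACCESS-TCX undefined: ${tcx_missing:-none}"
echo "SREP undefined:       ${srep_missing:-none}"
```

Piping a full `.leave` or `fort6` file through `check_header` would work the same way, since only the two matching lines are inspected.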

UM Documentation available at : http://ngamai04.bom.gov.au/~access/umdoc_systems/umdoc_system_8.2/UM_docs/papers/html/S11/node5.html provides further information regarding these constants.

However, it gives no indication of how to set these variables. Neither does the LBC documentation: http://ngamai04.bom.gov.au/~access/umdoc_systems/umdoc_system_8.2/UM_docs/papers/html/F54/node4.html#SECTION00041000000000000000

The ACCESS-TCX run fails at the point

************ This run uses a variable horizontal grid: l_regular = F ***********

  calling Set_coeff_lagrange for lambda_p/u

  calling Set_coeff_lagrange for phi_p/v

Note that the ACCESS-TCX run is still trying to run on a variable grid, whereas the domain has a fixed resolution of 0.036 degrees in lat/lon.

I have attempted to change this by setting L_REGULAR=.TRUE. in the UMUI job file CNTLATM. However, the Rose and Cylc versions have been upgraded on accessdev and raijin, so I have not been able to re-submit the job.

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Attempt to run Xingbao's suite with global inputs.

Task SW_recon_canht fails with error

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: Rcf_Set_Data_Source
? Error Code:    30
? Error Message: Section   0 Item    57 : Required field is not in input dump!
? Error generated from processor:     0
? This run generated   1 warnings
????????????????????????????????????????????????????????????????????????????????

From rcf.fort6.pe0

 Input grid

    0   70  307840   1     0     2     1  U COMPNT OF WIND AFTER TIMESTEP     
    0   70  307200   1     0     3    71  V COMPNT OF WIND AFTER TIMESTEP     
    0   70  307840   1     0     4   141  THETA AFTER TIMESTEP                
    1    4  104538   1     0     9   211  SOIL MOISTURE CONTENT IN A LAYER    
    0   70  307840   1     0    10   215  SPECIFIC HUMIDITY AFTER TIMESTEP    
    0   70  307840   1     0    12   285  QCF AFTER TIMESTEP                  
    0    1  307840   2     0    14   355  CONV CLOUD BASE LEVEL NO. AFTER TS  
    0    1  307840   2     0    15   356  CONV CLOUD TOP LEVEL NO. AFTER TS   
    0    1  307840   1     0    16   357  CONV CLOUD LIQUID WATER PATH        
    1    1  104538   1     0    17   358  SILHOUETTE OROGRAPHIC ROUGHNESS     
    1    1  104538   1     0    18   359  HALF OF  (PEAK TO TROUGH HT OF OROG)
    1    4  104538   1     0    20   360  DEEP SOIL TEMP AFTER TIMESTEP       
    0    1  307840   2     0    21   364  CCRad : Lowest conv. cld base layer 
    1    1  104538   1     0    22   365  CANOPY WATER AFTER TIMESTEP    KG/M2
    0    1  307840   1     0    23   366  SNOW AMOUNT OVER LAND AFT TSTP KG/M2
    0    1  307840   1     0    24   367  SURFACE TEMPERATURE AFTER TIMESTEP  
    0    1  307840   1     0    25   368  BOUNDARY LAYER DEPTH AFTER TIMESTEP 
    0    1  307840   1     0    26   369  ROUGHNESS LENGTH AFTER TIMESTEP     
    0    1  307840   1     0    28   370  SURFACE ZONAL CURRENT AFTER TIMESTEP
    0    1  307200   1     0    29   371  SURFACE MERID CURRENT AFTER TIMESTEP
    0    1  307840   3     0    30   372  LAND MASK (No halo) (LAND=TRUE)     
    0    1  307840   1     0    31   373  FRAC OF SEA ICE IN SEA AFTER TSTEP  
    0    1  307840   1     0    32   374  SEA ICE DEPTH (MEAN OVER ICE)      M
    0    1  307840   1     0    33   375  OROGRAPHY (/STRAT LOWER BC)         
    1    1  104538   1     0    34   376  STANDARD DEVIATION OF OROGRAPHY     
    1    1  104538   1     0    35   377  OROGRAPHIC GRADIENT XX COMPONENT    
    1    1  104538   1     0    36   378  OROGRAPHIC GRADIENT XY COMPONENT    
    1    1  104538   1     0    37   379  OROGRAPHIC GRADIENT YY COMPONENT    
    1    1  104538   1     0    40   380  VOL SMC AT WILTING AFTER TIMESTEP   
    1    1  104538   1     0    41   381  VOL SMC AT CRIT PT AFTER TIMESTEP   
    1    1  104538   1     0    43   382  VOL SMC AT SATURATION AFTER TIMESTEP
    1    1  104538   1     0    44   383  SAT SOIL CONDUCTIVITY AFTER TIMESTEP
    1    1  104538   1     0    46   384  THERMAL CAPACITY AFTER TIMESTEP     
    1    1  104538   1     0    47   385  THERMAL CONDUCTIVITY AFTER TIMESTEP 
    1    1  104538   1     0    48   386  SATURATED SOIL WATER SUCTION      **
    0    1  307840   1     0    49   387  SEA-ICE TEMPERATURE AFTER TIMESTEP  
    0   70     481   1     0    60   388  OZONE                             **
....

Note that when an access-r start dump is used as the input grid, there are more fields (including field 57)

    0   70  773952   1     0     2     1  U COMPNT OF WIND AFTER TIMESTEP     
    0   70  772840   1     0     3    71  V COMPNT OF WIND AFTER TIMESTEP     
    0   70  773952   1     0     4   141  THETA AFTER TIMESTEP                
    1    1  322683   1     0     5   211  OROGRAPHIC GRADIENT  X COMPONENT    
    1    1  322683   1     0     6   212  OROGRAPHIC GRADIENT  Y COMPONENT    
    0    1  773952   1     0     7   213  UNFILTERED OROGRAPHY                
    1    4  322683   1     0     9   214  SOIL MOISTURE CONTENT IN A LAYER    
    0   70  773952   1     0    10   218  SPECIFIC HUMIDITY AFTER TIMESTEP    
    0   70  773952   1     0    12   288  QCF AFTER TIMESTEP                  
    0    1  773952   1     0    13   358  CONV CLOUD AMOUNT AFTER TIMESTEP    
    0    1  773952   2     0    14   359  CONV CLOUD BASE LEVEL NO. AFTER TS  
    0    1  773952   2     0    15   360  CONV CLOUD TOP LEVEL NO. AFTER TS   
    0    1  773952   1     0    16   361  CONV CLOUD LIQUID WATER PATH        
    1    1  322683   1     0    17   362  SILHOUETTE OROGRAPHIC ROUGHNESS     
    1    1  322683   1     0    18   363  HALF OF  (PEAK TO TROUGH HT OF OROG)
    1    4  322683   1     0    20   364  DEEP SOIL TEMP AFTER TIMESTEP       
    1    1  322683   1     0    22   368  CANOPY WATER AFTER TIMESTEP    KG/M2
    0    1  773952   1     0    23   369  SNOW AMOUNT OVER LAND AFT TSTP KG/M2
    0    1  773952   1     0    24   370  SURFACE TEMPERATURE AFTER TIMESTEP  
    0    1  773952   1     0    25   371  BOUNDARY LAYER DEPTH AFTER TIMESTEP 
    0    1  773952   1     0    26   372  ROUGHNESS LENGTH AFTER TIMESTEP     
    0    1  773952   1     0    28   373  SURFACE ZONAL CURRENT AFTER TIMESTEP
    0    1  772840   1     0    29   374  SURFACE MERID CURRENT AFTER TIMESTEP
    0    1  773952   3     0    30   375  LAND MASK (No halo) (LAND=TRUE)     
    0    1  773952   1     0    31   376  FRAC OF SEA ICE IN SEA AFTER TSTEP  
    0    1  773952   1     0    32   377  SEA ICE DEPTH (MEAN OVER ICE)      M
    0    1  773952   1     0    33   378  OROGRAPHY (/STRAT LOWER BC)         
    1    1  322683   1     0    34   379  STANDARD DEVIATION OF OROGRAPHY     
    1    1  322683   1     0    35   380  OROGRAPHIC GRADIENT XX COMPONENT    
    1    1  322683   1     0    36   381  OROGRAPHIC GRADIENT XY COMPONENT    
    1    1  322683   1     0    37   382  OROGRAPHIC GRADIENT YY COMPONENT    
    1    1  322683   1     0    40   383  VOL SMC AT WILTING AFTER TIMESTEP   
    1    1  322683   1     0    41   384  VOL SMC AT CRIT PT AFTER TIMESTEP   
    1    1  322683   1     0    43   385  VOL SMC AT SATURATION AFTER TIMESTEP
    1    1  322683   1     0    44   386  SAT SOIL CONDUCTIVITY AFTER TIMESTEP
    1    1  322683   1     0    46   387  THERMAL CAPACITY AFTER TIMESTEP     
    1    1  322683   1     0    47   388  THERMAL CONDUCTIVITY AFTER TIMESTEP 
    1    1  322683   1     0    48   389  SATURATED SOIL WATER SUCTION      **
    0    1  773952   1     0    49   390  SEA-ICE TEMPERATURE AFTER TIMESTEP  
    0   70  773952   1     0    57   391  TOTAL AEROSOL EMISSIONS (FOR VIS)   
    0   35  773952   1     0    60   461  OZONE                             **
    0   70  773952   1     0    90   496  TOTAL AEROSOL (FOR VISIBILITY)

Can the SW_recon_canht task run with an access-g input grid?

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Cannot get suite to detect end of task InitialFC.

Task is launched using wrapper script au-aa314/bin/UM-wrapper.sh which contains the following logic

cat >> $SCRIPT <<EOF
if (( RC != 0 )); then
    cylc task failed "UM-wrapper.sh: JOB FAILED"
else
    #
    # move *ca* files into place --- until knows how to do this in UM
    # 

    #ls -lrt $UM_DATAW/pe_output/${JOBID}.fort6.pe0
    #grep "PBS: job killed" $UM_DATAW/pe_output/${JOBID}.fort6.pe0 
    #RC1=$?
    #if (( RC1 = 0 )); then
    #  cylc task failed "UM-wrapper.sh: JOB FAILED due to PBS"
    #fi

    ls $UM_DATAW/*cxbkgerr*
    if [[ $? -eq 0 ]]; then
      echo "mv $UM_DATAW/*cxbkgerr*  $PP7/."
      mv $UM_DATAW/*cxbkgerr*  $PP7
    fi
    if [[ $fc_type = InitialFC ]]; then
      echo "mv $UM_DATAM/*ca*  $PPVAR/."
      mv $UM_DATAM/*_ca0*  $PPVAR/.
      mv $UM_DATAM/*da000* $PREV_STAGING_DIR/iau-start-dump
      #ln -s $ASTART $PREV_STAGING_DIR/iau-start-dump
    else
      echo "mv $UM_DATAM/*_ca0*  $PPVAR/."
      mv $UM_DATAM/*_ca0*  $PPVAR/.
      mv $UM_DATAM/*da000 $STAGING_DIR/iau-start-dump
      #mv $UM_DATAM/*dz003 $ARCDIR/cycle_dumps/iau-start-dump.0.${CYLC_TASK_CYCLE_TIME}
    fi
    touch $UM_DATAW/done_fc.${CYLC_TASK_CYCLE_TIME}
    cylc task succeeded
fi
EOF

In this case the file SCRIPT=/short/dp9/pag548/work/au-aa314/2015041200/tmp/xbgjk.27327/SCRIPT

The suite cannot detect job completion, so the subsequent commands that move UM output files to the relevant locations for cx-background, ls-fields-dir and iau-start-dump never run.
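One detail in the commented-out PBS check in the wrapper: `(( RC1 = 0 ))` is an assignment in shell arithmetic, not a comparison, so that branch would never be taken if the lines were re-enabled. A minimal sketch of an equivalent check that tests grep's exit status directly; `pbs_killed` is a hypothetical helper name, not something that exists in UM-wrapper.sh:

```shell
# Sketch only: pbs_killed is a hypothetical helper, not part of UM-wrapper.sh.
# Returns 0 (true) when the given pe0 log contains the PBS kill message.
pbs_killed() {
    # -q: no output, exit status only; -s: stay quiet if the file is missing
    grep -qs "PBS: job killed" "$1"
}

# Intended use inside the wrapper (left commented so the sketch has no side effects):
# if pbs_killed "$UM_DATAW/pe_output/${JOBID}.fort6.pe0"; then
#     cylc task failed "UM-wrapper.sh: JOB FAILED due to PBS"
# fi
```

Testing the command's exit status directly avoids the intermediate RC1 variable altogether.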

I have compared the relevant files in the job submission, i.e.

/short/dp9/pag548/work/au-aa314/2015041200/tmp/xbgjk.27327/SCRIPT
/short/dp9/pag548/work/au-aa314/2015041200/tmp/xbgjk.27327/SUBMIT
/short/dp9/pag548/work/au-aa314/2015041200/tmp/xbgjk.27327/UMSUBMIT

with their equivalents in the SREP suite au-aa202, i.e.

/short/dp9/pag548/work/au-aa202/2014101204/tmp/xbjiu.30154/SCRIPT
/short/dp9/pag548/work/au-aa202/2014101204/tmp/xbjiu.30154/SUBMIT
/short/dp9/pag548/work/au-aa202/2014101204/tmp/xbjiu.30154/UMSUBMIT

I can find no relevant differences that would explain why suite au-aa202 is able to detect job completion for task InitialFC but suite au-aa314 cannot.
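The three-file comparison could be scripted along these lines. This is only a sketch: `compare_job_files` is a hypothetical helper, and because run IDs and paths differ between the two suites, the raw diffs will contain some expected noise.

```shell
# Sketch only: compare_job_files is a hypothetical helper.
# Prints a unified diff for each of the three submission files and
# returns non-zero if any pair differs.
compare_job_files() {
    local dir_a=$1 dir_b=$2 rc=0 name
    for name in SCRIPT SUBMIT UMSUBMIT; do
        echo "== $name =="
        diff -u "$dir_a/$name" "$dir_b/$name" || rc=1
    done
    return $rc
}
```

It would be called with the two job directories, e.g. the xbgjk.27327 directory under au-aa314 and the xbjiu.30154 directory under au-aa202.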

penguian commented 9 years ago

@paul.gregory@bom.gov.au changed _comment0 which not transferred by tractive

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Housekeeping time conflict occurs in OPS for the first analysis cycle, 2015041200.

OPS output (Ops_amv)

OPS_BIN                        = /projects/access/da/source/ops/ops30.0.0/build_raijin@2688/build/bin
OPS_BACKERR                    = /short/dp9/pag548/work/au-aa314/2015041200/run/staging/bgerr.2015041200
OPS_BACKERROLD                 = /short/dp9/pag548/work/au-aa314/2015041118/run/staging/bgerr.new
OPS_BACKERRUM                  = /short/dp9/pag548/work/au-aa314/2015041118/run/staging/cx.background
Mismatch between model background and housekeeping times:
Housekeeping time: 2015  4 12  0  Model VT: 2015  4 12  3

Valid times for OPS_BACKERRUM (taken from xconv)

2015/04/11:22.00 / 0.041667
2015/04/11:23.00 / 0.083333
2015/04/12:00.00 / 0.125000
2015/04/12:01.00 / 0.166667
2015/04/12:02.00 / 0.208333
2015/04/12:03.00 / 0.250000

Valid times for OPS_BACKERR (taken from xconv)

2015/04/12:00.00 / 0.250000

Surely the valid times are correct for an analysis cycle at 2015041200?
The cx.background file was copied from /short/dp9/pag548/work/au-aa314/2015041118/run/um-output/dataw/xbgjk.cxbkgerr

This was created using task InitialFC.

I'm not sure why this cx.background file has a model background valid time of 2015041203.

Is there a value in the header of the cx.background file that has been set to the wrong reference time?
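The OPS message reports a 3 hour gap between the housekeeping time (2015041200) and the model VT (2015041203), and that model VT matches the last valid time listed for OPS_BACKERRUM. A minimal sketch of the arithmetic, assuming GNU `date`; `to_epoch` and `hours_between` are hypothetical helpers, not part of the suite:

```shell
# Sketch only: to_epoch and hours_between are hypothetical helpers and
# assume GNU date (-d). Arguments are YYYYMMDDHH timestamps.
to_epoch() {
    # Rewrite YYYYMMDDHH as "YYYY-MM-DD HH:00" and convert to seconds since the epoch
    date -u -d "$(echo "$1" | sed 's/^\(....\)\(..\)\(..\)\(..\)$/\1-\2-\3 \4:00/')" +%s
}

hours_between() {
    echo $(( ($(to_epoch "$2") - $(to_epoch "$1")) / 3600 ))
}

hours_between 2015041200 2015041203   # housekeeping time to model VT; prints 3
```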

penguian commented 9 years ago

@paul.gregory@bom.gov.au changed _comment0 which not transferred by tractive

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


The cx.background file also seems to cause additional issues. Various errors are flagged when task bgerr runs. Here is the stdout from task bgerr:

Running /short/dp9/pag548/work/au-aa314/2015041200/ops/build/bin/OpsProg_BackErrCreate.exe on 1 PEs, mode 20
---------------------------------------------------------

OPS_BACKERRUM               = /short/dp9/pag548/work/au-aa314/2015041118/run/staging/cx.background
OPS_BACKERROLD              = /short/dp9/pag548/work/au-aa314/2015041118/run/staging/bgerr.new
OPS_BACKERR                 = /short/dp9/pag548/work/au-aa314/2015041200/run/staging/bgerr.new

 =====================================================
 GCOM Version 4.4
 MPP
 Using precision : 64bit INTEGERs and 64bit REALs
 Built at 48248
 =====================================================

<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/OpsProg_BackErrCreate.html">OpsProg_BackErrCreate</a>
=========================================
OpsProg_BackErrCreate : Execution starts
 at 09:49:55 on 26/06/2015
=========================================
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
Lookup header has been modified to correct bgerr field orientation.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background Pressure  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background 1000mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  850mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  700mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  500mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  300mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  200mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background  100mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background   50mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background   20mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_BECWind.html">Ops_BECWind</a>
<span style="color: maroon">WARNING</span>
T+6 u-wind field not found in B/G file
so returning to read in existing BGE field.

 Background   10mb Wind  field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background 1000mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  850mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  700mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  500mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  300mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  200mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background  100mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background   50mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background   20mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

 Background   10mb Temp. field either corrupt or not on fields-file :
 using existing Background Error field from previous run.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

4 Surface BGE fields updated, and 7 P-level BGE fields updated.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/OpsProg_BackErrCreate.html">OpsProg_BackErrCreate</a>
=========================================
OpsProg_BackErrCreate ends normally
 at 09:49:55 on 26/06/2015
=========================================
cylc (scheduler - 2015/06/26 09:49:58): bgerr.2015041200 succeeded at 2015-06-26T09:49:58
JOB SCRIPT EXITING (TASK SUCCEEDED)
======================================================================================

Previous bgerr generation (for N512 global DA runs) has the following output

---------------------------------------------------------
Running /short/dp9/pag548/work/au-aa146/2015013000/ops/build/bin/OpsProg_BackErrCreate.exe on 1 PEs, mode 20
---------------------------------------------------------

OPS_BACKERRUM               = /short/dp9/pag548/work/au-aa146/2015012918/run/staging/cx.background
OPS_BACKERROLD              = /short/dp9/pag548/work/au-aa146/2015012918/run/staging/bgerr.new
OPS_BACKERR                 = /short/dp9/pag548/work/au-aa146/2015013000/run/staging/bgerr.new

 =====================================================
 GCOM Version 4.4
 MPP
 Using precision : 64bit INTEGERs and 64bit REALs
 Built at 48248
 =====================================================

<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/OpsProg_BackErrCreate.html">OpsProg_BackErrCreate</a>
=========================================
OpsProg_BackErrCreate : Execution starts
 at 06:53:29 on 23/04/2015
=========================================
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
Lookup header has been modified to correct bgerr field orientation.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/Ops_OpenEnv_inner.html">Ops_OpenEnv_inner</a>
OPS_BACKERRCONTROL_NL_DIR not set

5 Surface BGE fields updated, and 27 P-level BGE fields updated.
<a href="http://www-nwp/~opsrc/OPS/view/dev/doc/OpsProg_BackErrCreate.html">OpsProg_BackErrCreate</a>
=========================================
OpsProg_BackErrCreate ends normally
 at 06:53:30 on 23/04/2015
=========================================
cylc (scheduler - 2015/04/23 06:53:34): bgerr.2015013000 succeeded at 2015-04-23T06:53:33
JOB SCRIPT EXITING (TASK SUCCEEDED)
======================================================================================
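A quick way to compare the two bgerr logs is to count how often the fallback message appears and to pull out the final summary line. A sketch, where `summarise_bgerr` is a hypothetical helper and the log file is whatever the task stdout was saved as:

```shell
# Sketch only: summarise_bgerr is a hypothetical helper.
# Counts how often OpsProg_BackErrCreate fell back to the previous
# background error field, then prints the final "fields updated" line.
summarise_bgerr() {
    local log=$1
    printf 'fallbacks: %s\n' \
        "$(grep -c 'using existing Background Error field from previous run' "$log")"
    grep 'BGE fields updated' "$log"
}
```

Running it on both logs puts the fallback counts next to the "Surface BGE fields updated" totals, which makes the difference between the TCX and N512 runs easier to see than scrolling through the repeated Ops_OpenEnv_inner lines.
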
penguian commented 9 years ago

@paul.gregory@bom.gov.au uploaded file xbgjk.fort6.pe0 (253.4 KiB)

penguian commented 9 years ago

@paul.gregory@bom.gov.au uploaded file xbgjk000.xbgjk.d15177.t170148.leave (590.9 KiB)

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Task InitialFC seems to be exiting abnormally. FC_HOURS for the task is set to 12.

/short/dp9/pag548/work/au-aa314/2015041200/tmp/umui_runs/xbgjk-180103000> grep 'RUN_TARGET_END' *
CNTLALL: RUN_TARGET_END= 0 , 0 , 0 , 12 , 0 , 0 ,
CONTCNTL: RUN_TARGET_END= 0 , 0 , 0 , 12 , 0 , 0 ,
SIZES: RUN_TARGET_END= 0 , 0 , 0 , 12 , 0 , 0 ,
SIZES_orig: RUN_TARGET_END= 0 , 0 , 0 , 38 , 5 , 0 ,

But the task seems to exit after the 6th hour.

/short/dp9/pag548/work/au-aa314/2015041118/run/um-output/datam> ls
history_archive  xbgjka_da003  xbgjka_pb002  xbgjka_pc005  xbgjka_pe004  xbgjka_pg001  xbgjka_pj002
xbgjka_ca001     xbgjka_pa002  xbgjka_pb003  xbgjka_pd000  xbgjka_pe005  xbgjka_pg002  xbgjka_pj003
xbgjka_ca002     xbgjka_pa003  xbgjka_pc000  xbgjka_pd001  xbgjka_pf002  xbgjka_pg003  xbgjka_pj004
xbgjka_ca003     xbgjka_pa004  xbgjka_pc001  xbgjka_pd002  xbgjka_pf003  xbgjka_pg004  xbgjka_pj005
xbgjka_ca004     xbgjka_pa005  xbgjka_pc002  xbgjka_pd003  xbgjka_pf004  xbgjka_pg005
xbgjka_ca005     xbgjka_pb000  xbgjka_pc003  xbgjka_pe002  xbgjka_pf005  xbgjka_pj000
xbgjka_ca006     xbgjka_pb001  xbgjka_pc004  xbgjka_pe003  xbgjka_pg000  xbgjka_pj001
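
To compare the last linearisation file written against the requested forecast length, something like the following could be used. It is a sketch only: `last_ca_index` is a hypothetical helper, and it assumes the numeric suffix of the `_caNNN` files tracks the forecast hour, which the listing suggests but which has not been confirmed.

```shell
# Sketch only: last_ca_index is a hypothetical helper. It assumes the numeric
# suffix of each *_caNNN file tracks the forecast hour.
last_ca_index() {
    local dir=$1 f last=""
    for f in "$dir"/*_ca[0-9][0-9][0-9]; do
        [ -e "$f" ] || continue      # the glob matched nothing
        last=${f##*_ca}              # globs expand in sorted order, so the last one wins
    done
    if [ -n "$last" ]; then
        # Strip leading zeros but always keep at least one digit
        echo "$last" | sed 's/^0*\([0-9]\)/\1/'
    fi
}
```

For the datam directory listed here this would report 6, against an FC_HOURS of 12.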

The pe0 file says that the UM dump file xbgjka_da003 is valid at 2015041200

DUMPCTL: Opening new file xbgjka_da003         on unit  22
MPPIO: Open: xbgjka_da003 on unit  22                              

 WRITING UNIFIED MODEL DUMP ON UNIT 22
 #####################################

 FIXED LENGTH HEADER
 -------------------
 Dump format version    20
 UM Version No         802
 Atmospheric data
 Charney-Phillips on radius levels
 Over LAM domain with no wrap around
 Instantaneous dump
 Exp No =-32768 Run Id =     0
 Gregorian calendar
 Arakawa C grid
                 Year  Month Day Hour Min  Sec  DayNo  
 Data time     = 2015    4   11   21    0    0    101
 Validity time = 2015    4   12    0    0    0    102
 Creation time = 2015    6   29   10   35   42 -32768

The tail of the pe0 file says

  Maximum vertical velocity at timestep                    432 
       Max w this run 
    w_max   level  proc         position           run w_max level timestep
   0.199E+02  56  856   76.4% East    70.3% North  0.254E+02   57   409
MPPIO: Synchronising unit:  60 with disk.
MPPIO: Synchronising unit:  62 with disk.
MPPIO: Synchronising unit:  64 with disk.
MPPIO: Synchronising unit:  65 with disk.
MPPIO: Synchronising unit:  66 with disk.
MPPIO: Synchronising unit:  69 with disk.
MPPIO: Synchronising unit: 102 with disk.
MPPIO: Synchronising unit: 150 with disk.
 JobCtl : Timestep                    432  Job No                      1 
  released. CREQUEST : %%% A_JOB_MEMBER_1  REL NET
JOB CONTROL:   533.071:%%% A_JOB_MEMBER_1  REL NET

I've attached the full output files xbgjk.fort6.pe0 and xbgjk000.xbgjk.d15177.t170148.leave

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Our InitialFC task never completes correctly, which is why the suite never catches the exit status. This may also be why OPS fails and the VAR reconfiguration fails, because the output files have never been closed correctly.

I have diffed the output between our InitialFC task and the SREP equivalent.

/short/dp9/pag548/work/au-aa314/2015041118/run/um-output/dataw/pe_output> xxdiff xbgjk.fort6.pe0 /short/dp9/pag548/work/au-aa202/2014101203/run/um-output/dataw/pe_output/xbjiu.fort6.pe0&

Both tasks successfully run for 432 model timesteps (6 hours) and are essentially identical until this point

tm_Step: Timestep      432   Model time:   2015-04-12 03:00:00
Maximum upward Courant number = limited at 17500 points by reseting w and w_adv
Vertical Courant number =  limited in  1359 columns by reseting w and w_adv
MPPIO: Open: xbgjka_pa005 on unit  60                              
 Between timestep    216 and    432 average iterations =       5.491
 Iterations: Max #   7 at timestep    222. Min   5 at timestep    224
MPPIO: Open: xbgjka_pc005 on unit  62                              
MPPIO: Open: xbgjka_pe005 on unit  64                              
MPPIO: Open: xbgjka_ca006 on unit 150                              
MPPIO: Open: xbgjka_pf005 on unit  65                              

 Minimum theta level 1 for timestep                    432
                This timestep                         This run
   Min theta1     proc          position            Min theta1 timestep
      292.68     499    59.9% East     39.2% North     286.45   126
  Largest negative delta theta1 at minimum theta1 
 This timestep #    -4.47K. At min for run    -0.60K

  Maximum vertical velocity at timestep                    432 
       Max w this run 
    w_max   level  proc         position           run w_max level timestep
   0.199E+02  56  856   76.4% East    70.3% North  0.254E+02   57   409
MPPIO: Synchronising unit:  60 with disk.
MPPIO: Synchronising unit:  62 with disk.
MPPIO: Synchronising unit:  64 with disk.
MPPIO: Synchronising unit:  65 with disk.
MPPIO: Synchronising unit:  66 with disk.
MPPIO: Synchronising unit:  69 with disk.
MPPIO: Synchronising unit: 102 with disk.
MPPIO: Synchronising unit: 150 with disk.
 JobCtl : Timestep                    432  Job No                      1 
  released. CREQUEST : %%% A_JOB_MEMBER_1  REL NET
JOB CONTROL:   515.697:%%% A_JOB_MEMBER_1  REL NET

At this point the output for the TCX InitialFC task stops. The job sits in the cylc queue but nothing happens. The older SREP task continues with

 U_MODEL: Warning: exiting at a period that is not a dump period
 Therefore continuing the run will rerun preceding timesteps
 This is inefficient and can cause restart problems
MPPIO: Close: xbjiua_pa005 on unit  60
MPPIO: Close: xbjiua_pb002 on unit  61
MPPIO: Close: xbjiua_pc005 on unit  62
MPPIO: Close: xbjiua_pd002 on unit  63
MPPIO: Close: xbjiua_pe005 on unit  64
MPPIO: Close: xbjiua_pf005 on unit  65
MPPIO: Close: xbjiua_pg005 on unit  66
MPPIO: Close: xbjiua_pj005 on unit  69
MPPIO: Close: /short/dp9/pag548/work/au-aa202/2014101203/run/um-output/dataw/xbjiu.cxbkgerr on unit 102
MPPIO: Originally from environment variable CXBKGERR
MPPIO: Close: /short/dp9/pag548/work/au-aa202/2014101204/input/lbc6.2014101204 on unit 125
MPPIO: Originally from environment variable ALABCIN1
MPPIO: Close: xbjiua_ca0100 on unit 150
????????????????????????????????????????????????????????????????????????????????
!!???????????????????????????????? ATTENTION ?????????????????????????????????!!
? This run generated  46 warnings
????????????????????????????????????????????????????????????????????????????????

*******************************************************************************
***************** End of UM RUN Job : 16:23:18 on 19/05/2015 ******************
*******************************************************************************

 ******************************************

 END OF RUN - TIMER OUTPUT
 Timer information is for whole run
 PE                      0  Elapsed CPU Time:    344.810582000000     
 PE                      0   Elapsed Wallclock Time:    359.273842000000     

 Total Elapsed CPU Time:    427517.321484000     
 Maximum Elapsed Wallclock Time:    359.318986000000     
 Speedup:    1189.79886435503     
 --------------------------------------------
                    Non-Inclusive Timer Summary for PE    0
   ROUTINE              CALLS  TOT CPU    AVERAGE   TOT WALL  AVERAGE  % CPU    % WALL    SPEED-UP
  1 UM_SHELL                1    344.81    344.81    359.27    359.27  100.00  100.00      0.96

MPP Timing information : 
1216 processors in atmosphere configuration  32 x  38
 Number of OMP threads :                      1

 MPP : None Inclusive timer summary

 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 UM_SHELL                359.22   359.21     0.03       0.01%   359.32 ( 376)   359.16 (1165)

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 UM_SHELL                351.58   351.53     0.46       0.13%   354.28 (  33)   344.81 (   0)

 PARALLEL SPEEDUP SUMMARY (sorted by wallclock times)
    ROUTINE                 CPU TOTAL   WALLCLOCK MAX   SPEEDUP   PARALLEL EFFICIENCY
  1 UM_SHELL                427517.32          359.32   1189.80                  0.98

 END OF RUN - TIMER OUTPUT
 Timer information is for whole run
                    Inclusive Timer Summary for PE    0
   ROUTINE              CALLS  TOT CPU    AVERAGE   TOT WALL  AVERAGE  SPEED-UP
  1 TIMER                   1      0.00      0.00      0.00      0.00      1.94

MPP Timing information : 
1216 processors in atmosphere configuration  32 x  38
 Number of OMP threads :                      1

 MPP : Inclusive timer summary

 WALLCLOCK  TIMES
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 TIMER                     0.00     0.00     0.00     244.18%     0.01 ( 376)     0.00 ( 592)

 CPU TIMES (sorted by wallclock times)
    ROUTINE                   MEAN   MEDIAN       SD   % of mean      MAX   (PE)      MIN   (PE)
  1 TIMER                     0.00     0.00     0.00     367.10%     0.00 (   0)     0.00 (   1)

 PARALLEL SPEEDUP SUMMARY (sorted by wallclock times)
    ROUTINE                 CPU TOTAL   WALLCLOCK MAX   SPEEDUP   PARALLEL EFFICIENCY
  1 TIMER                        0.08            0.01     13.82                  0.01

Process     0 has exited.

So although our ACCESS-TCX InitialFC forecast finishes and outputs the correct files, it never closes them and the task never finishes.

This explains why the suite can't pick up on the exit status. It may also explain why OPS fails (because the model valid time hasn't been written correctly to the cx.background file) and why the VAR reconfiguration fails (we generate a bus error).
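The difference between the two logs can be checked mechanically: the SREP log contains `MPPIO: Close:` lines and an `End of UM RUN Job` banner, while the TCX log appears to stop before either. A sketch of both checks; `um_run_finished` and `unclosed_files` are hypothetical helpers, and the awk field positions assume the `MPPIO: Open: <file> on unit <n>` layout shown in the excerpts:

```shell
# Sketch only: both functions are hypothetical helpers.

# Returns 0 when the pe0 log contains the end-of-run banner seen in the SREP output.
um_run_finished() {
    grep -qs "End of UM RUN Job" "$1"
}

# Prints every file that has an "MPPIO: Open:" line with no later "MPPIO: Close:" line.
unclosed_files() {
    awk '/MPPIO: Open:/  { opened[$3] = 1 }
         /MPPIO: Close:/ { delete opened[$3] }
         END             { for (f in opened) print f }' "$1" | sort
}
```

A wrapper could use `um_run_finished` on the pe0 file as an extra completion test, and `unclosed_files` gives a list of the outputs that should be treated with suspicion.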

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


Var task ConfigLS fails with bus error.

Namelists for Var have been copied from the R12 ngamai suite sanbe. The locations on ngamai are

/g/sc/ophome/access/nwpdir/share/APS2/VAR/control/VarGrid/my/v5/VarUMGrid 
/g/sc/ophome/access/nwpdir/share/APS2/VAR/control/VarGrid/my/v5/VarGrid_2

The lat/lon spacing for the Var grid is the same as this, but the initial lat/lon and no. of points has been altered in the envfile and the relevant namelists to ensure the Var grid lies within the UM grid.
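
The containment requirement can be checked one axis at a time from the start value, spacing and number of points. A sketch using awk for the floating-point arithmetic; `grid_inside` and its argument order are hypothetical, it assumes a positive spacing, and the actual values would have to be read from the envfile and namelists:

```shell
# Sketch only: grid_inside is a hypothetical helper; argument names are illustrative.
# Usage: grid_inside inner_start inner_step inner_n outer_start outer_step outer_n
# Checks one axis (latitude or longitude) and returns 0 when the inner axis
# lies within the outer one. Assumes both steps are positive.
grid_inside() {
    awk -v s1="$1" -v d1="$2" -v n1="$3" -v s2="$4" -v d2="$5" -v n2="$6" 'BEGIN {
        end1 = s1 + (n1 - 1) * d1      # last point of the inner axis
        end2 = s2 + (n2 - 1) * d2      # last point of the outer axis
        eps  = 1e-9                    # tolerance for floating-point rounding
        exit !(s1 >= s2 - eps && end1 <= end2 + eps)
    }'
}
```

Calling it once for latitude and once for longitude, with the Var grid as the inner axis and the UM grid as the outer axis, gives a simple pass or fail before ConfigLS is run.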

The updated namelists for ACCESS-TCX are now located on raijin at

/projects/access/da/nwp_input/ops_var/VAR/control/VarGrid/tcx

Var task fails at this point according to stdout

      1 Processors initialised.
I am PE     0
PrintStatus is set to       1
MPPIO: Open: /short/dp9/pag548/work/au-aa314/2015041118/run/staging/ls-fields-dir/xbgjka_ca001 on unit  10                              
MPPIO: from environment variable AINITIAL
MPPIO: Open: /short/dp9/pag548/work/au-aa314/2015041200/run/staging/ls-dir/xbgjka_ca001 on unit  11                              
MPPIO: from environment variable ASTART
MPPIO: Open: /jobfs/local/1247936.r-man2/J3vo51appY/recontmp.0 on unit  10                              
MPPIO: from environment variable RECONTMP

????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????? WARNING ????????????????????????????????????
? Warning in routine: Setup_LSM_Out
? Warning Code:   -10
? Warning Message: Land-Sea Mask is not in input file
? Warning generated from processor:     0
????????????????????????????????????????????????????????????????????????????????

 Land Frac is not in output dump => setting to dummy
 Land T* is not in output dump => setting to dummy

(ends)

Stderr reports a bus error when reading the ls-fields-dir files

/projects/access/da/source/var/var30.0.0/build_raijin/build/bin/VarScr_ReconParallel: line 318: 8308: Bus error(coredump)
VarScr_ReconParallel: Failure reconfiguring /short/dp9/pag548/work/au-aa314/2015041118/run/staging/ls-fields-dir/xbgjka_ca001
VarScr_ConfigureLS: VarScr_ReconParallel failed.

penguian commented 9 years ago

@paul.gregory@bom.gov.au _uploaded file xbgjk_xbjim_TCX_Global_v82_InitialFC_diff.txt (152.9 KiB)_

Job diff between xbgjk (ACCESS-TCX InitialFC) and xbjim (APS2 ACCESS-G InitialFC)

penguian commented 9 years ago

@paul.gregory@bom.gov.au commented


In an attempt to determine why the cx.background produced by UM job xbgjk in ACCESS-TCX causes errors in get_bgerr and OPS, I diffed the job with the APS2 ACCESS-G InitialFC UM job xbjim.

I've attached the full diff, but the relevant sections for OPS and VAR are

Difference in window atmos_STASH_Macros_OPS
 -> Model Selection
   -> Atmosphere
     -> STASH
       -> STASH macros
         -> OPS interface macros
Differences in Table Output time list
 1,2c1
<  1
<  2
---
>  0
4,5d2
<  4
<  5
7d3
<  7

Entry box: Number of times in list
 Job xbgjk: Entry is set to '7'
 Job xbjim: Entry is set to '3'
Radio button: CX fields macro:
 Job xbgjk: Entry is set to 'Standard Macro'
 Job xbjim: Entry is set to 'Development Macro'
Radio button: Background error fields macro (global only):
 Job xbgjk: Entry is inactive
 Job xbjim: Entry is set to 'Standard Macro'

Difference in window atmos_STASH_Macros_VAR
 -> Model Selection
   -> Atmosphere
     -> STASH
       -> STASH macros
         -> VAR interface macro
Check box: WGDOS pack fields
 Job xbgjk: Entry is set to 'ON'
 Job xbjim: Entry is set to 'OFF'
Check box: Include aerosol (90)
 Job xbgjk: Entry is set to 'ON'
 Job xbjim: Entry is set to 'OFF'
Entry box: Ending
 Job xbgjk: Entry is set to '6'
 Job xbjim: Entry is set to '5'

Is there any entry here which could cause the errors in the bgerr creation and the mismatch of the model valid time?

Or perhaps it is a hand-edit somewhere else in the job?
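To make the widget differences in the diff above easier to scan (and to apply the same check to the full attached diff), the "Job ...: Entry is ..." lines can be tabulated per setting. This is a rough sketch that assumes the diff text follows exactly the layout quoted in this comment; `parse_job_diff` is a hypothetical helper, and it keys on the widget label only, so identically named settings in different windows would overwrite each other.

```python
import re

# "Entry box: Ending", "Radio button: CX fields macro:", "Check box: ..."
WIDGET = re.compile(r"^(Entry box|Radio button|Check box): (.*?):?\s*$")
# " Job xbgjk: Entry is set to '7'"  or  " Job xbgjk: Entry is inactive"
ENTRY = re.compile(r"^\s*Job (\w+): Entry is (?:set to '(.*)'|(inactive))\s*$")


def parse_job_diff(text):
    """Return {setting name: {job id: value}} for each widget in the diff."""
    settings = {}
    current = None
    for line in text.splitlines():
        widget = WIDGET.match(line)
        if widget:
            current = widget.group(2)
            settings[current] = {}
            continue
        entry = ENTRY.match(line)
        if entry and current is not None:
            job, value, inactive = entry.groups()
            settings[current][job] = value if inactive is None else "inactive"
    return settings


# Lines copied from the diff quoted above.
sample = """\
Entry box: Number of times in list
 Job xbgjk: Entry is set to '7'
 Job xbjim: Entry is set to '3'
Radio button: Background error fields macro (global only):
 Job xbgjk: Entry is inactive
 Job xbjim: Entry is set to 'Standard Macro'
"""

for name, jobs in parse_job_diff(sample).items():
    print(name, jobs)
```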