NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

DART freezes before computing prior observation values #707

Closed 664787022 closed 1 month ago

664787022 commented 3 months ago

Hi everyone,

I am using WACCM+DART and can successfully run the model to assimilate MLS temperature. However, sometimes DART gets stuck at the "before computing prior observation values" stage, as indicated in the da.log file. Other times, the run completes without issues.

I am not sure what's causing this inconsistency. Has anyone else encountered this problem and can offer some advice?

Error Message

Sat Jul 27 23:09:09 CST 2024 -- BEGIN CAM_ASSIMILATE
valid time of model is 2012 12 5 21600 (seconds)
valid time of model is 2012 12 5 6 (hours)
/usr/bin/ls: cannot access ../Hide*: No such file or directory
most recent log is cesm.log.14394469.240727-230022
oldest      log is cesm.log.14394469.240727-230022
entire log list is cesm.log.14394469.240727-230022

Sat Jul 27 23:09:15 CST 2024 -- BEGIN COPY BLOCK
Sat Jul 27 23:09:15 CST 2024 -- END COPY BLOCK
stages_except_output = {}
stages_all = {,output}
OBS_FILE = /data2/share/elzd_2024_000125/xhw/downloadData/MLS/2013/Level2/DART_seq/201212_6H_CESM/obs_seq.2012-12-05-21600
inf_flavor(1) = 5, using namelist values.
Posterior Inflation not requested for this assimilation.
Sat Jul 27 23:09:15 CST 2024 -- BEGIN FILTER
srun: ROUTE: split_hostlist: hl=a3106n05,a3401n[10,13,16],a3405n01,a3408n13,b2104r4n[3,6],b3202r2n[3,5,7],b3202r3n1,b3306r8n[1,3],b3309r3n[5,7] tree_width 0
srun: ROUTE: split_hostlist: hl=b3309r3n6 tree_width 0
srun: ROUTE: split_hostlist: hl=a3106n06,a3401n09 tree_width 0
srun: ROUTE: split_hostlist: hl=b3202r3n2 tree_width 0
srun: ROUTE: split_hostlist: hl=b3202r2n4 tree_width 0
srun: ROUTE: split_hostlist: hl=b3202r2n6 tree_width 0
srun: ROUTE: split_hostlist: hl=a3405n[02-03] tree_width 0
srun: ROUTE: split_hostlist: hl=b2104r4n7,b3202r2n2 tree_width 0
srun: ROUTE: split_hostlist: hl=b2104r4n[4-5] tree_width 0
srun: ROUTE: split_hostlist: hl=a3401n[14-15] tree_width 0
srun: ROUTE: split_hostlist: hl=b3306r8n2 tree_width 0
srun: ROUTE: split_hostlist: hl=b3202r2n8 tree_width 0
srun: ROUTE: split_hostlist: hl=b3306r8n4 tree_width 0
srun: ROUTE: split_hostlist: hl=b3309r3n8 tree_width 0
srun: ROUTE: split_hostlist: hl=a3401n[11-12] tree_width 0
srun: ROUTE: split_hostlist: hl=a3401n[17-18] tree_width 0
srun: ROUTE: split_hostlist: hl=a3408n[14-15] tree_width 0

 --------------------------------------
 Starting ... at YYYY MM DD HH MM SS = 
                 2024  7 27 23  9 19
 Program Filter
 --------------------------------------

  set_nml_output Echo NML values to log file only
 PE 0: initialize_mpi_utilities:  Running with         2560  MPI processes.

 Assimilate_these_obs_types:
    AURAMLS_TEMPERATURE
 Evaluate_these_obs_types:
    none
 Use the precomputed Prior Forward Operators for these obs types:
    none

 PE 0: location_mod: using code with optimized cutoffs
 PE 0: location_mod: Including vertical separation when computing distances:
 PE 0: location_mod:        # pascals ~ 1 horiz radian:       20000.00000
 PE 0: location_mod:         # meters ~ 1 horiz radian:       10000.00000
 PE 0: location_mod:   # model levels ~ 1 horiz radian:          20.00000
 PE 0: location_mod:  # scale heights ~ 1 horiz radian:           1.50000
 PE 0: location_mod: Using table-lookup approximation for distance computations
 PE 0: init_discard_high_obs Discarding observations higher than model level    
  5
 PE 0: init_discard_high_obs  ... which is equivalent to pressure level  0.44041
 E-02 Pascals
 PE 0: init_discard_high_obs  ... which is equivalent to height         114178.4
 9250 meters
 PE 0: init_discard_high_obs  ... which is equivalent to scale height       16.9
 3814
 PE 0: quality_control_mod: Will reject obs with Data QC larger than    3
 PE 0: quality_control_mod: Will reject obs values more than    3.000000 sigma f
 rom mean
 PE 0: init_algorithm_info_mod:  No QCF table file listed in namelist, using def
 ault values for all QTYs
 PE 0: assim_tools_init: The cutoff namelist value is           0.150000
 PE 0: assim_tools_init: ... cutoff is the localization half-width parameter,
 PE 0: assim_tools_init: ... so the effective localization radius is           0
 .300000
 PE 0: assim_tools_init: Using Sampling Error Correction
 PE 0: assim_tools_init: Replicating a copy of the ensemble mean on every task
 PE 0: assim_tools_init: ... uses more memory per task but may run faster if doi
 ng vertical
 PE 0: assim_tools_init: ... coordinate conversion; controlled by namelist item 
 "distribute_mean"
 PE 0: assim_tools_init: Doing vertical localization, vertical coordinate conver
 sion may be required
 PE 0: assim_tools_init: ... Converting all state vector verticals to localizati
 on coordinate first.
 PE 0: assim_tools_init: ... Converting all observation verticals to localizatio
 n coordinate first.
 PE 0:  filter trace: Filter start
 Filter start TIME: 2024/07/27 23:09:21
 PE 0: filter_main: running with an ensemble size of     5
 PE 0:  filter trace: Before initializing inflation
 PE 0: filter_main: Prior inflation damping of     0.900000 will be used
 PE 0:  filter trace: After  initializing inflation
 PE 0: parse_stages_to_write:  filter will write stage : output
 PE 0:  filter trace: Before setting up space for observations
 Before setting up space for observations TIME: 2024/07/27 23:09:21
 After  setting up space for observations TIME: 2024/07/27 23:09:21
 PE 0:  filter trace: After  setting up space for observations
 PE 0:  filter trace: Before setting up space for ensembles
 PE 0: filter_main: running with distributed state; model states stay distribute
 d across all tasks for the entire run
 PE 0:  filter trace: After  setting up space for ensembles
 PE 0:  filter trace: Before reading in ensemble restart files
 Before reading in ensemble restart files TIME: 2024/07/27 23:09:21
 PE 0: Prior inflation: deterministic, deflation permitted, enhanced time-adapti
 ve, time-rate adaptive, spatially-varying, state-space
 PE 0: Prior inflation: inf mean   from namelist, value:    1.000
 PE 0: Prior inflation: inf stddev from namelist, value:    0.600
 PE 0: Prior inflation: inf stddev max change:    1.050
 PE 0: Posterior inflation: None
 PE 0: filter_main: Reading in initial condition/restart data for all ensemble m
 embers from file(s)
 After  reading in ensemble restart files TIME: 2024/07/27 23:09:28
 PE 0:  filter trace: After  reading in ensemble restart files
 PE 0:  filter trace: Before initializing output files
 Before initializing output files TIME: 2024/07/27 23:09:28
 After  initializing output files TIME: 2024/07/27 23:09:28
 PE 0:  filter trace: After  initializing output files
 PE 0:  filter trace: Before trimming obs seq if start/stop time specified
 PE 0:  filter trace: After  trimming obs seq if start/stop time specified
 PE 0:  filter trace: Top of main advance time loop
 PE 0:
 PE 0: filter: Main assimilation loop, starting iteration    0
 PE 0:  filter trace: Before move_ahead checks time of data and next obs
 PE 0: shortest_time_between_assimilations:  assimilation period is            0
   days        21600  seconds
 PE 0: move_ahead Current model data time            is:  day=  150453 sec= 2160
 0
 PE 0: move_ahead Current assimilation window starts at:  day=  150453 sec= 1080
 1
 PE 0: move_ahead Next available observation time    is:  day=  150453 sec= 1082
 4
 PE 0: move_ahead Current assimilation window ends   at:  day=  150453 sec= 3240
 0
 PE 0: shortest_time_between_assimilations:  assimilation period is            0
   days        21600  seconds
 PE 0: move_ahead Next available observation time    is:  day=  150453 sec= 1082
 4
 PE 0: move_ahead Within current assimilation window, model does not need advanc
 e.
 PE 0: move_ahead Next assimilation window contains up to    35670 observations
 PE 0:  filter trace: After  move_ahead checks time of data and next obs
 PE 0: filter: Model does not need to run; data already at required time
 PE 0:  filter trace: Before setup for next group of observations
 PE 0:  filter trace: Number of observations to be assimilated  35670
 filter trace: Time of first observation in window day=150453, sec=10824
 filter trace: Time of last  observation in window day=150453, sec=32384
 PE 0:  filter trace: After  setup for next group of observations
 PE 0:  filter trace: Before prior inflation damping and prep
 PE 0:  filter trace: After  prior inflation damping and prep
 PE 0:  filter trace: Before computing prior observation values
 Before computing prior observation values TIME: 2024/07/27 23:09:28
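The window bounds printed by `move_ahead` above (10801 to 32400 seconds) are consistent with a window centred on the model time (21600 s) and one assimilation period (21600 s) wide. The sketch below only reproduces that arithmetic from the logged numbers; the function name is hypothetical, it is not DART code, and it assumes the window does not cross a day boundary.

```python
def assimilation_window(model_sec: int, period_sec: int) -> tuple[int, int]:
    """Return (start, end) in seconds of day for a window centred on model_sec.

    Assumption: the window starts one second after the midpoint with the
    previous window, which matches the values in the log above.
    """
    half = period_sec // 2
    return model_sec - half + 1, model_sec + half


# Values taken from the log: model time 21600 s, assimilation period 21600 s.
start, end = assimilation_window(21600, 21600)

# First and last observation times reported by the filter trace.
first_obs, last_obs = 10824, 32384
all_inside = start <= first_obs <= last_obs <= end
print(start, end, all_inside)
```

With the logged values, every one of the 35670 observations falls inside the window, so the stall is unlikely to be caused by the observation times themselves.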

Which model(s) are you working with?

WACCM

Version of DART

Which version of DART are you using? v11.5.1

Have you modified the DART code?

No

Build information

Please describe:

  1. The machine you are running on (e.g. windows laptop, NSF NCAR supercomputer Derecho): cluster
  2. The compiler you are using (e.g. gnu, intel): Intel

mjs2369 commented 3 months ago

Hello, thanks for reaching out. Could you please share with us the contents of the input.nml file you are using? @664787022

mjs2369 commented 3 months ago

As an initial suggestion, I would try running with more ensemble members (for example, 50). An ensemble of 5 is very small, and this has caused the issue to occur in the past.

Are you using a different number of MPI processes across your runs, or are you consistently running on 2560?
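One way to compare a hung run with a successful one is to pull the MPI task count, the ensemble size, and the last `filter trace` message out of each da.log. The following is a minimal sketch, not a DART utility: the function name is hypothetical, and the regular expressions are assumptions based only on the log lines quoted in this issue. Messages that the log wraps across two lines may not match.

```python
import re


def summarize_filter_log(text: str) -> dict:
    """Extract a few run parameters from the text of a DART filter log."""
    tasks = re.search(r"Running with\s+(\d+)\s+MPI processes", text)
    ens = re.search(r"ensemble size of\s+(\d+)", text)
    # Every trace message; the last one shows where the run stopped.
    traces = re.findall(r"filter trace:\s*(.+)", text)
    return {
        "mpi_tasks": int(tasks.group(1)) if tasks else None,
        "ens_size": int(ens.group(1)) if ens else None,
        "last_trace": traces[-1].strip() if traces else None,
    }


# Sample lines copied from the log in this issue.
sample = """\
 PE 0: initialize_mpi_utilities:  Running with         2560  MPI processes.
 PE 0: filter_main: running with an ensemble size of     5
 PE 0:  filter trace: After  prior inflation damping and prep
 PE 0:  filter trace: Before computing prior observation values
"""
summary = summarize_filter_log(sample)
print(summary)
```

Running this over the logs of several cycles would show whether the stalled runs differ from the completed ones in task count or ensemble size.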

I would also recommend emailing this question to dart@ucar.edu. That is where we handle user support requests; GitHub issues are meant for explicit bugs or feature requests.

hkershaw-brown commented 1 month ago

Hi @664787022, I am closing this issue since we have not heard from you. Please email dart@ucar.edu if you are still having problems. Cheers, Helen