NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Implementing CLM-DART within CESM2.3 #519

Open braczka opened 11 months ago

braczka commented 11 months ago

Use case

Successfully run CLM compsets within DART using CESM2.3. Currently CLM-DART is vetted using release-cesm2.2.0, but the goal here is to anticipate software changes required for cesm2.3 tags.

Is your feature request related to a problem?

This relates to challenges with implementing DART within CESM2.3, as reported by the CMCC group. Specifically, CMCC reports a shape mismatch error when running fill_inflation_restart or filter. This could be related to structural differences in the clm_restart.nc, clm_history.nc and clm_vector_history.nc files.

There are related issues, including the implementation of NUOPC within CESM2.3 and its impact on the CAM6-Reanalysis files: see #474 and #463.

Problem Description

[Screenshot: Screen Shot 2023-07-19 at 11 11 47 AM]

braczka commented 11 months ago

New dimensions in cesm2.3.0 include levmaxurbgrnd = 25, mxsowings = 1 and mxharvests = 2. The levmaxurbgrnd dimension replaces the levgrnd dimension for some variables.

There are also many differences in the restart file variables (history and vector_history only show numeric differences), but most of these are just diagnostic variables that should not affect how model_mod functions. The list of differences is provided for reference.
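
A dimension-by-dimension comparison like this can be scripted. The sketch below is a minimal Python illustration, assuming the dimension tables of the two restart files have already been loaded into dictionaries. The three new dimensions and their sizes come from the comment above; `column`, `levgrnd` and their sizes are placeholders, and `diff_dimensions` is a hypothetical helper name.

```python
def diff_dimensions(old, new):
    """Return (added, removed, resized) dimension tables."""
    added = {name: size for name, size in new.items() if name not in old}
    removed = {name: size for name, size in old.items() if name not in new}
    resized = {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }
    return added, removed, resized


# Placeholder tables: only the three new dimensions are taken from this thread.
cesm22_dims = {"column": 10, "levgrnd": 25}
cesm23_dims = {
    "column": 10,
    "levgrnd": 25,
    "levmaxurbgrnd": 25,
    "mxsowings": 1,
    "mxharvests": 2,
}

added, removed, resized = diff_dimensions(cesm22_dims, cesm23_dims)
print("added:", added)
print("removed:", removed)
print("resized:", resized)
```

This only compares the dimension tables. It does not show which variables switched from levgrnd to levmaxurbgrnd; that needs a per-variable comparison.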

[Screenshot: Screen Shot 2023-07-19 at 11 22 43 AM]

braczka commented 11 months ago

I was not able to reproduce the error (all tests ran as expected) using MCT driver implementations of cesm2.3.0 with the tests below. Note: cesm2.3.0_beta08 was the branch-off point for CMCC development. I used DART v10.7.0 for testing; CMCC was using DART v10.5.3, but this older tag includes all the important CLM changes.

1) cesm2.3.0_beta12 output tested against fill_inflation_restart and model_mod_check (1-4)
2) cesm2.3.0_beta08 output tested against fill_inflation_restart and model_mod_check (1-4)
3) cesm2.3.0_beta12 output tested against CLM Tutorial setup for 1 assimilation time step
4) cesm2.3.0_beta08 output tested against CLM Tutorial setup for 1 assimilation time step

Side note: the cesm2.3.0 version required methane to be turned on (use_lch4 = TRUE) if use_nitrif_denitrif = TRUE. This caused a mass balance failure in the subsequent model integration after the initial DART update. Something to consider for cesm2.3.0 SourceMods.

hkershaw-brown commented 11 months ago

Hi Brett, this could be a bug in how DART reads the state, e.g. if some variables have an unlimited dimension and some don't, this would cause the counts length to be 0 (instead of 1).

braczka commented 11 months ago

@hkershaw-brown So that possibility has been on my radar. In addition to the new dimensions I mentioned above, there is a 'cohort' dimension within the clm_restart.nc file which has an explicit value for the cesm2.2.0 version, but for cesm2.3.0 the cohort dimension is defined as `cohort = UNLIMITED ; // (0 currently)`. If that's the problem, I don't understand why my tests didn't fail -- the DART state I tested includes even more variables than what Gustavo was testing.
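
One way to check this possibility is to list which dimensions a file declares as unlimited and whether any variable is actually defined on them. The following is a minimal Python sketch that does this on the text of an `ncdump -h` style header. The header is a made-up illustration: only the `cohort = UNLIMITED ; // (0 currently)` line comes from this thread, the other dimensions and variable names are placeholders, and `unlimited_dim_usage` is a hypothetical helper name.

```python
import re

# Made-up header in `ncdump -h` style. Only the `cohort` line is taken from
# this thread; the other dimensions, sizes and variables are placeholders.
HEADER = """
netcdf clm_restart {
dimensions:
	column = 10 ;
	levgrnd = 25 ;
	cohort = UNLIMITED ; // (0 currently)
variables:
	double some_2d_var(column, levgrnd) ;
	double some_1d_var(column) ;
}
"""


def unlimited_dim_usage(header):
    """Map each unlimited dimension to the variables declared on it."""
    dims_part, _, vars_part = header.partition("variables:")
    unlimited = re.findall(r"^\s*(\w+)\s*=\s*UNLIMITED", dims_part, re.MULTILINE)
    usage = {name: [] for name in unlimited}
    # Variable declarations look like: <type> <name>(<dim>, <dim>, ...) ;
    declarations = re.findall(
        r"^\s*\w+\s+(\w+)\(([^)]*)\)\s*;", vars_part, re.MULTILINE
    )
    for var, dim_list in declarations:
        for dim in (d.strip() for d in dim_list.split(",")):
            if dim in usage:
                usage[dim].append(var)
    return usage


for dim, variables in unlimited_dim_usage(HEADER).items():
    status = "used by " + ", ".join(variables) if variables else "unused"
    print(f"unlimited dimension '{dim}': {status}")
```

An unlimited dimension that is reported as unused means the file as a whole has an unlimited dimension while none of its variables do, so any logic keyed on the file-level flag is worth a closer look.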

hkershaw-brown commented 11 months ago

Hi Brett, I think what is going on here:

Assumptions in direct_netcdf_mod.f90:

In the cesm2.3 CLM, there is an unlimited dimension, cohorts, which is not used by any of the state variables (and, somewhat strangely, is not used at all).

So this line:

https://github.com/NCAR/DART/blob/1b76f3afa5978469d6a119710bb638c1c077ae20/assimilation_code/modules/io/direct_netcdf_mod.f90#L865 num_dims - 1 = 0

Gustavo's compiler is picking up this 1:0 and erroring out.

intel 19.1.1.21 (running on Cheyenne) just merrily continues, but the counts are incorrect. The count is set to 1 rather than the length of the dimension.

For fill_inflation_restart it does not matter if the state is read in incorrectly, since we just fill the output with the mean and sd values (e.g. 1.0, 0.6). Side note: we don't need to be calling read_state in fill_inflation_restart.

For filter, which does need to read the state, I think the read does not happen correctly (the count is not the size of the variable).

On write, only 'time' (lower case) can be the unlimited dimension. This is another assumption in DART, and a weakness in how we deal with unlimited dimensions (https://github.com/NCAR/DART/issues/359#issuecomment-1187818595). For DART-created files, DART won't add an unlimited dimension unless its name is 'time'. So creating and writing a file with fill_inflation_restart gets you a file with the correct info.

All the assumptions above are a bit of a hack.

I put a fix on https://github.com/NCAR/DART/tree/fix-unlimited_dim-read which checks the variable for an unlimited dimension before adjusting the counts when reading and writing. It is a narrow solution (it really just fixes the case where there is an unused unlimited dimension).

I'm not sure if the cohorts dimension will be used in the state; if it is, then improving the state read/write to cope with various (and multiple) unlimited dimensions gets bumped up the priority list.
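
To make the failure mode and the narrow fix concrete, here is a minimal Python model of the count logic as described in this comment. It is a sketch under stated assumptions, not the DART source: the function names, the `strict` flag (standing in for a compiler that checks array shapes at run time) and the dimension length 5 are all hypothetical.

```python
def assign_section(target, stop, values, strict):
    """Mimic the Fortran assignment target(1:stop) = values."""
    if strict and stop != len(values):
        # A run-time shape check rejects e.g. a (1:0) section receiving one value.
        raise ValueError(
            f"shape mismatch: section has {stop} elements, source has {len(values)}"
        )
    # Without checks, this model copies only what fits into the section.
    for i in range(min(stop, len(values))):
        target[i] = values[i]


def counts_file_level(io_num_dims, state_dim_lengths, file_has_unlimited, strict=False):
    """Old behaviour in this model: trust the file-level unlimited flag."""
    counts = [1] * io_num_dims
    if file_has_unlimited:
        # Assumes the last dimension of every variable is the unlimited one.
        assign_section(counts, io_num_dims - 1, state_dim_lengths, strict)
    else:
        assign_section(counts, io_num_dims, state_dim_lengths, strict)
    return counts


def counts_per_variable(io_num_dims, state_dim_lengths, var_has_unlimited, strict=True):
    """Narrow fix in this model: check the variable itself, not the file."""
    counts = [1] * io_num_dims
    if var_has_unlimited:
        assign_section(counts, io_num_dims - 1, state_dim_lengths, strict)
    else:
        assign_section(counts, io_num_dims, state_dim_lengths, strict)
    return counts


# 1-D variable of length 5 in a file whose only unlimited dimension is unused.
print(counts_file_level(1, [5], file_has_unlimited=True))    # silent: one element
print(counts_per_variable(1, [5], var_has_unlimited=False))  # full variable
print(counts_per_variable(2, [5], var_has_unlimited=True))   # one record slice
try:
    counts_file_level(1, [5], file_has_unlimited=True, strict=True)
except ValueError as err:
    print("strict run-time check:", err)
```

In this model the unchecked path leaves the count at 1, matching the silent behaviour described for intel 19.1.1.21, while the strict path fails on the zero-length section, matching the error reported by CMCC.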

Let me know if this makes sense or not.

braczka commented 11 months ago

Thanks for looking at this @hkershaw-brown. I took another look and have a few responses and questions.

First, I was not able to recreate the dimension mismatch error during my initial tests because, as you suspected, the fill_inflation_restart step simply uses the restart files as a template to write the 1 and 0.6 values, and does not read anything -- so this works OK. When performing the filter update step, however, the code doesn't error out, but the posterior restart files contain 'garbage' values -- in this case the entire domain is filled with zeros.
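
Because this failure is silent, a cheap guard is to test posterior fields for being entirely zero or non-finite before handing them back to the model. The following is a minimal Python sketch, assuming the field has already been read into a flat list of floats; the function name and the sample values are hypothetical.

```python
import math


def looks_like_garbage(values):
    """Return a reason string if the field is suspicious, else None."""
    if not values:
        return "empty field"
    if any(not math.isfinite(v) for v in values):
        return "non-finite values present"
    if all(v == 0.0 for v in values):
        return "field is entirely zero"
    return None


# Hypothetical posterior fields for illustration.
print(looks_like_garbage([0.0, 0.0, 0.0]))
print(looks_like_garbage([1.2, 0.0, 3.4]))
```

Some fields can legitimately be all zero, so a hit is a prompt to inspect the file rather than proof of a bad read.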

I redid the test by merging in your updates for direct_netcdf_mod.f90 from branch https://github.com/NCAR/DART/tree/fix-unlimited_dim-read. I reran the CLM tutorial test with cesm2.3 output, and I got the expected behavior -- the posterior restart files gave realistic values, and the increments were localized around the synthetic observations. So this looks like it addresses the immediate issue.

A couple of questions -- should I have been able to detect this dimension mismatch issue with my intel compiler settings? It seems the ability to catch this may be compiler specific, and not necessarily something that can be captured by a more stringent set of compiler options/flags. Is that your understanding? Also, it would seem this fix is specific to the DART code, and the CMCC compiler will still fail unless the CLM dimensional values are changed, including the offending cohort dimension.

It's not exactly clear to me what the role of the cohort dimension is in CLM -- I need to take a closer look at the documentation. Perhaps it's only used in certain compset configurations (CLM-FATES?) or is currently being used as a placeholder for further code development. We may need to get feedback from the CGD SEs to get more perspective on this.

tavicoaz commented 11 months ago

Hi @braczka and @hkershaw-brown, thanks for the answers. I have also applied the updates to direct_netcdf_mod.f90 from branch https://github.com/NCAR/DART/tree/fix-unlimited_dim-read and was able to pass fill_inflation_restart without the counting issue. The code is now able to enter the filter; however, it crashes in another inflation-related routine. I am not sure whether the inflation files were written correctly or whether it is a completely new issue. Please find below an extract of the log. For simplicity I am assimilating LAI only. Cheers!

 PE   504: create_and_open_state_output  Creating output file preassim_priorinf_
 After  computing prior observation values TIME: 2023/07/27 13:17:35
 PE 0:  filter trace: After  computing prior observation values
 PE 0:  filter trace: Before preassim state space output
 Before preassim state space output TIME: 2023/07/27 13:17:35
 mean_d01.nc
 PE 0: create_and_open_state_output  Creating output file preassim_member_0001_d
 01.nc
 PE   288: create_and_open_state_output  Creating output file preassim_member_00
 05_d01.nc
 PE   144: create_and_open_state_output  Creating output file preassim_member_00
 PE   432: create_and_open_state_output  Creating output file preassim_sd_d01.nc
 03_d01.nc
 PE   576: create_and_open_state_output  Creating output file preassim_priorinf_
 PE    72: create_and_open_state_output  Creating output file preassim_member_00
 sd_d01.nc
 02_d01.nc
 PE   216: create_and_open_state_output  Creating output file preassim_member_00
 PE   360: create_and_open_state_output  Creating output file preassim_mean_d01.
 04_d01.nc
 nc
 PE 0: create_and_open_state_output  Creating output file preassim_member_0001_d
 02.nc
 PE    72: create_and_open_state_output  Creating output file preassim_member_00
 02_d02.nc
 PE   144: create_and_open_state_output  Creating output file preassim_member_00
 03_d02.nc
 PE   216: create_and_open_state_output  Creating output file preassim_member_00
 04_d02.nc
 PE   288: create_and_open_state_output  Creating output file preassim_member_00
 05_d02.nc
 PE   360: create_and_open_state_output  Creating output file preassim_mean_d02.
 nc
 PE   432: create_and_open_state_output  Creating output file preassim_sd_d02.nc
 PE   504: create_and_open_state_output  Creating output file preassim_priorinf_
 mean_d02.nc
 PE   576: create_and_open_state_output  Creating output file preassim_priorinf_
 sd_d02.nc
 PE 0: create_and_open_state_output  Creating output file preassim_member_0001_d
 03.nc
 PE    72: create_and_open_state_output  Creating output file preassim_member_00
 02_d03.nc
 PE   144: create_and_open_state_output  Creating output file preassim_member_00
 03_d03.nc
 PE   216: create_and_open_state_output  Creating output file preassim_member_00
 04_d03.nc
 PE   288: create_and_open_state_output  Creating output file preassim_member_00
 05_d03.nc
 PE   360: create_and_open_state_output  Creating output file preassim_mean_d03.
 nc
 PE   432: create_and_open_state_output  Creating output file preassim_sd_d03.nc
 PE   504: create_and_open_state_output  Creating output file preassim_priorinf_
 mean_d03.nc
 PE   576: create_and_open_state_output  Creating output file preassim_priorinf_
 sd_d03.nc
 After  preassim state space output TIME: 2023/07/27 13:17:43
 PE 0:  filter trace: After  preassim state space output
 PE 0:  filter trace: Before observation space diagnostics
 PE 0:  filter trace: After  observation space diagnostics
 PE 0: filter: Ready to assimilate up to  249723 observations
 PE 0:  filter trace: Before observation assimilation
 Before observation assimilation TIME: 2023/07/27 13:17:44
 PE 0: locations_mod Location module statistics:
 PE 0: locations_mod  Total boxes (nlon * nlat):      2556
 PE 0: locations_mod  Total items to put in boxes:     11215
 PE 0: locations_mod  Percent boxes with 1+ items:   42.18
 PE 0: locations_mod  Average #items per non-empty box:        10.40
 PE 0: locations_mod  Largest #items in one box:        63
 PE 0: locations_mod Location module statistics:
 PE 0: locations_mod  Total boxes (nlon * nlat):      2556
 PE 0: locations_mod  Total items to put in boxes:       347
 PE 0: locations_mod  Percent boxes with 1+ items:   11.42
 PE 0: locations_mod  Average #items per non-empty box:         1.19
 PE 0: locations_mod  Largest #items in one box:         3
 PE 0: comp_cov_factor: Standard Gaspari Cohn localization selected
Processing observation     1000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     2000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     3000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     4000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     5000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     6000 of   249723 TIME: 2023/07/27 13:17:45
Processing observation     7000 of   249723 TIME: 2023/07/27 13:17:46
Processing observation     8000 of   249723 TIME: 2023/07/27 13:17:46
Processing observation     9000 of   249723 TIME: 2023/07/27 13:17:46
Processing observation    10000 of   249723 TIME: 2023/07/27 13:17:47
Processing observation    11000 of   249723 TIME: 2023/07/27 13:17:47
Processing observation    12000 of   249723 TIME: 2023/07/27 13:17:47
Processing observation    13000 of   249723 TIME: 2023/07/27 13:17:47
Processing observation    14000 of   249723 TIME: 2023/07/27 13:17:48
Processing observation    15000 of   249723 TIME: 2023/07/27 13:17:48
Processing observation    16000 of   249723 TIME: 2023/07/27 13:17:48
Processing observation    17000 of   249723 TIME: 2023/07/27 13:17:48
Processing observation    18000 of   249723 TIME: 2023/07/27 13:17:49
Processing observation    19000 of   249723 TIME: 2023/07/27 13:17:49
Processing observation    20000 of   249723 TIME: 2023/07/27 13:17:49
Processing observation    21000 of   249723 TIME: 2023/07/27 13:17:49
Processing observation    22000 of   249723 TIME: 2023/07/27 13:17:50
Processing observation    23000 of   249723 TIME: 2023/07/27 13:17:50
Processing observation    24000 of   249723 TIME: 2023/07/27 13:17:50
Processing observation    25000 of   249723 TIME: 2023/07/27 13:17:50
Processing observation    26000 of   249723 TIME: 2023/07/27 13:17:50
Processing observation    27000 of   249723 TIME: 2023/07/27 13:17:51
Processing observation    28000 of   249723 TIME: 2023/07/27 13:17:51
Processing observation    29000 of   249723 TIME: 2023/07/27 13:17:51
Processing observation    30000 of   249723 TIME: 2023/07/27 13:17:51
Processing observation    31000 of   249723 TIME: 2023/07/27 13:17:51
Processing observation    32000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    33000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    34000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    35000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    36000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    37000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    38000 of   249723 TIME: 2023/07/27 13:17:52
Processing observation    39000 of   249723 TIME: 2023/07/27 13:17:53
Processing observation    40000 of   249723 TIME: 2023/07/27 13:17:53
Processing observation    41000 of   249723 TIME: 2023/07/27 13:17:53
Processing observation    42000 of   249723 TIME: 2023/07/27 13:17:53
Processing observation    43000 of   249723 TIME: 2023/07/27 13:17:53
Processing observation    44000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    45000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    46000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    47000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    48000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    49000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    50000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    51000 of   249723 TIME: 2023/07/27 13:17:54
Processing observation    52000 of   249723 TIME: 2023/07/27 13:17:55
Processing observation    53000 of   249723 TIME: 2023/07/27 13:17:55
Processing observation    54000 of   249723 TIME: 2023/07/27 13:17:56
Processing observation    55000 of   249723 TIME: 2023/07/27 13:17:56
Processing observation    56000 of   249723 TIME: 2023/07/27 13:17:57
Processing observation    57000 of   249723 TIME: 2023/07/27 13:17:57
Processing observation    58000 of   249723 TIME: 2023/07/27 13:17:58
Processing observation    59000 of   249723 TIME: 2023/07/27 13:17:58
Processing observation    60000 of   249723 TIME: 2023/07/27 13:17:59
Processing observation    61000 of   249723 TIME: 2023/07/27 13:18:00
Processing observation    62000 of   249723 TIME: 2023/07/27 13:18:01
Processing observation    63000 of   249723 TIME: 2023/07/27 13:18:02
Processing observation    64000 of   249723 TIME: 2023/07/27 13:18:02
Processing observation    65000 of   249723 TIME: 2023/07/27 13:18:03
Processing observation    66000 of   249723 TIME: 2023/07/27 13:18:04
Processing observation    67000 of   249723 TIME: 2023/07/27 13:18:05
Processing observation    68000 of   249723 TIME: 2023/07/27 13:18:06
Processing observation    69000 of   249723 TIME: 2023/07/27 13:18:07
Processing observation    70000 of   249723 TIME: 2023/07/27 13:18:08
Processing observation    71000 of   249723 TIME: 2023/07/27 13:18:09
Processing observation    72000 of   249723 TIME: 2023/07/27 13:18:10
Processing observation    73000 of   249723 TIME: 2023/07/27 13:18:11
Processing observation    74000 of   249723 TIME: 2023/07/27 13:18:12
Processing observation    75000 of   249723 TIME: 2023/07/27 13:18:12
Processing observation    76000 of   249723 TIME: 2023/07/27 13:18:13
Processing observation    77000 of   249723 TIME: 2023/07/27 13:18:14
Processing observation    78000 of   249723 TIME: 2023/07/27 13:18:15
Processing observation    79000 of   249723 TIME: 2023/07/27 13:18:16
Processing observation    80000 of   249723 TIME: 2023/07/27 13:18:16
Processing observation    81000 of   249723 TIME: 2023/07/27 13:18:17
Processing observation    82000 of   249723 TIME: 2023/07/27 13:18:18
Processing observation    83000 of   249723 TIME: 2023/07/27 13:18:19
Processing observation    84000 of   249723 TIME: 2023/07/27 13:18:19
Processing observation    85000 of   249723 TIME: 2023/07/27 13:18:20
Processing observation    86000 of   249723 TIME: 2023/07/27 13:18:21
Processing observation    87000 of   249723 TIME: 2023/07/27 13:18:22
Processing observation    88000 of   249723 TIME: 2023/07/27 13:18:23
Processing observation    89000 of   249723 TIME: 2023/07/27 13:18:23
Processing observation    90000 of   249723 TIME: 2023/07/27 13:18:24
Processing observation    91000 of   249723 TIME: 2023/07/27 13:18:25
Processing observation    92000 of   249723 TIME: 2023/07/27 13:18:26
Processing observation    93000 of   249723 TIME: 2023/07/27 13:18:26
Processing observation    94000 of   249723 TIME: 2023/07/27 13:18:27
Processing observation    95000 of   249723 TIME: 2023/07/27 13:18:28
Processing observation    96000 of   249723 TIME: 2023/07/27 13:18:29
Processing observation    97000 of   249723 TIME: 2023/07/27 13:18:29
Processing observation    98000 of   249723 TIME: 2023/07/27 13:18:30
Processing observation    99000 of   249723 TIME: 2023/07/27 13:18:31
Processing observation   100000 of   249723 TIME: 2023/07/27 13:18:32
Processing observation   101000 of   249723 TIME: 2023/07/27 13:18:33
Processing observation   102000 of   249723 TIME: 2023/07/27 13:18:34
Processing observation   103000 of   249723 TIME: 2023/07/27 13:18:35
Processing observation   104000 of   249723 TIME: 2023/07/27 13:18:35
Processing observation   105000 of   249723 TIME: 2023/07/27 13:18:36
Processing observation   106000 of   249723 TIME: 2023/07/27 13:18:37
Processing observation   107000 of   249723 TIME: 2023/07/27 13:18:38
Processing observation   108000 of   249723 TIME: 2023/07/27 13:18:39
Processing observation   109000 of   249723 TIME: 2023/07/27 13:18:40
Processing observation   110000 of   249723 TIME: 2023/07/27 13:18:41
Processing observation   111000 of   249723 TIME: 2023/07/27 13:18:41
Processing observation   112000 of   249723 TIME: 2023/07/27 13:18:42
Processing observation   113000 of   249723 TIME: 2023/07/27 13:18:43
Processing observation   114000 of   249723 TIME: 2023/07/27 13:18:44
Processing observation   115000 of   249723 TIME: 2023/07/27 13:18:45
Processing observation   116000 of   249723 TIME: 2023/07/27 13:18:46
Processing observation   117000 of   249723 TIME: 2023/07/27 13:18:47
Processing observation   118000 of   249723 TIME: 2023/07/27 13:18:48
Processing observation   119000 of   249723 TIME: 2023/07/27 13:18:49
Processing observation   120000 of   249723 TIME: 2023/07/27 13:18:50
Processing observation   121000 of   249723 TIME: 2023/07/27 13:18:51
Processing observation   122000 of   249723 TIME: 2023/07/27 13:18:52
Processing observation   123000 of   249723 TIME: 2023/07/27 13:18:53
Processing observation   124000 of   249723 TIME: 2023/07/27 13:18:54
Processing observation   125000 of   249723 TIME: 2023/07/27 13:18:55
Processing observation   126000 of   249723 TIME: 2023/07/27 13:18:56
Processing observation   127000 of   249723 TIME: 2023/07/27 13:18:57
Processing observation   128000 of   249723 TIME: 2023/07/27 13:18:58
Processing observation   129000 of   249723 TIME: 2023/07/27 13:18:59
Processing observation   130000 of   249723 TIME: 2023/07/27 13:19:00
Processing observation   131000 of   249723 TIME: 2023/07/27 13:19:01
Processing observation   132000 of   249723 TIME: 2023/07/27 13:19:01
Processing observation   133000 of   249723 TIME: 2023/07/27 13:19:02
Processing observation   134000 of   249723 TIME: 2023/07/27 13:19:02
[n109:1529319:0:1529319] Caught signal 8 (Floating point exception: floating-point overflow)
==== backtrace (tid:1529319) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x000000000041f49c __libm_pow_e7()  ???:0
 2 0x00000000005a4696 adaptive_inflate_mod_mp_enh_compute_new_density_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/adaptive_inflate_mod.f90:1051
 3 0x00000000005a3ca1 adaptive_inflate_mod_mp_bayes_cov_inflate_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/adaptive_inflate_mod.f90:932
 4 0x00000000005a2cfd adaptive_inflate_mod_mp_update_inflation_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/adaptive_inflate_mod.f90:637
 5 0x00000000005a342e adaptive_inflate_mod_mp_update_varying_state_space_inflation_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/adaptive_inflate_mod.f90:755
 6 0x00000000005474c5 assim_tools_mod_mp_filter_assim_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/assim_tools_mod.f90:716
 7 0x0000000000516826 filter_mod_mp_filter_main_()  /work/csp/lg07622/spreads/land/DART/assimilation_code/modules/assimilation/filter_mod.f90:885
 8 0x000000000050ee77 MAIN__()  /work/csp/lg07622/spreads/land/DART/assimilation_code/programs/filter/filter.f90:20
 9 0x000000000040e262 main()  ???:0
10 0x000000000003acf3 __libc_start_main()  ???:0
11 0x000000000040e16e _start()  ???:0
=================================
forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source             
filter             000000000080C64B  Unknown               Unknown  Unknown
libpthread-2.28.s  0000150399F82CE0  Unknown               Unknown  Unknown
libhdf5.so.200.1.  000015039D51E49C  Unknown               Unknown  Unknown
filter             00000000005A4696  adaptive_inflate_        1051  adaptive_inflate_mod.f90
filter             00000000005A3CA1  adaptive_inflate_         932  adaptive_inflate_mod.f90
filter             00000000005A2CFD  adaptive_inflate_         637  adaptive_inflate_mod.f90
filter             00000000005A342E  adaptive_inflate_         755  adaptive_inflate_mod.f90
filter             00000000005474C5  assim_tools_mod_m         716  assim_tools_mod.f90
filter             0000000000516826  filter_mod_mp_fil         885  filter_mod.f90
filter             000000000050EE77  MAIN__                     20  filter.f90
filter             000000000040E262  Unknown               Unknown  Unknown
libc-2.28.so       0000150399863CF3  __libc_start_main     Unknown  Unknown
filter             000000000040E16E  Unknown               Unknown  Unknown

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 1529319 RUNNING AT n109-ibj
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@n156.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:1@n156.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@n156.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:0@n147.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0@n147.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@n147.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:5@n082.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:6@n160.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:6@n160.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:6@n160.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:7@n080.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:7@n080.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:7@n080.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:9@n003.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:9@n003.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:9@n003.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:4@n101.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:4@n101.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:4@n101.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:2@n102.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:2@n102.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@n102.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:8@n158.cmn.juno.cmcc.scc] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:8@n158.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:8@n158.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[proxy:0:5@n082.cmn.juno.cmcc.scc] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:5@n082.cmn.juno.cmcc.scc] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[mpiexec@n147.cmn.juno.cmcc.scc] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec@n147.cmn.juno.cmcc.scc] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec@n147.cmn.juno.cmcc.scc] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec@n147.cmn.juno.cmcc.scc] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
Thu Jul 27 13:19:06 CEST 2023 -- END FILTER
unlink: cannot unlink 'dart_posterior.nc': No such file or directory
'dart_posterior.nc' -> 'clm2_0001.r.2011-01-09-00000.nc'
'clm_restart.nc' -> 'clm5_gswp.clm2_0001.r.2011-01-09-00000.nc'
ERROR: dart_to_clm failed for clm5_gswp.clm2_0001.r.2011-01-09-00000.nc

braczka commented 11 months ago

Hi @tavicoaz -- thank you for the feedback. It is not yet clear to me whether the error is related to the inflation file or is specific to the cesm2.3 formatting. In the interest of keeping this issue uncluttered, could you post the same question to the DART help email (DART(at)ucar.edu)? We can better address it there.

braczka commented 8 months ago

Update on this: after discussions with @tavicoaz, this is not an immediate issue. Troubleshooting has switched to cesm2.2 since the previous comment. We may revisit this later when cesm2.3 becomes a priority, so we should keep this issue open for now.