NGEET / fates

repository for the Functionally Assembled Terrestrial Ecosystem Simulator (FATES)

Single-site simulations are not starting. #754

Closed · mpaiao closed this issue 3 years ago

mpaiao commented 3 years ago

I am new to FATES, and I am trying to create some test simulations for a single site. I am able to create the case successfully, but the simulations crash right at the beginning, even before FATES is called.

For reference, I used this shell script to create the case (simple simulation for BCI), and I got the following error reported in the CESM log:

PIO: FATAL ERROR: Aborting... FATAL ERROR: NetCDF: Index exceeds dimension bound (<path_to>/E3SM/externals/scorpio/src/clib/pio_darray_int.c: 1545)
Obtained 1 stack frames.
0   e3sm.exe                            0x000000010e45548b print_trace + 36

A few additional attempts:

  1. I tried to generate the case for both CESM and E3SM, and both failed at the same place.
  2. I adapted my shell script to run the CTSM-FATES workshop examples (walkthrough example 2, 1x1_brazil), and that one ran fine.
  3. I also tried to run the model for a completely different site (using these R scripts to generate the drivers), but I got the same problem.

Does anyone have any ideas or suggestions? Thanks!

rgknox commented 3 years ago

You might get more information from the crash by setting the PIO debug level higher. I think it goes as high as 6; it defaults to 0.

./xmlchange PIO_DEBUG_LEVEL=6

Are you generating the same error using datasets that have worked in the past? For instance, the BCI run you mention, did that have any modified surface/domain/parameter/met-driver files?

Have you tried turning off FATES and had any luck with the big-leaf model?

mpaiao commented 3 years ago

It's my first time trying to run with BCI data, so I don't have a reference for a successful run. The only change I made to the data was to update the paths in bci_inv_file_list.txt; otherwise, I'm using the same files I got from you (bci_0.1x0.1_v4.0).

I ran the model again with PIO_DEBUG_LEVEL=6. It produced a lot more error messages, but it seems to be the same problem as in the previous attempt.

    0 PIOc_inq_varid ncid = 128 name = ZBOT
    0 PIOc_inq_var ncid = 128 varid = 1
        0 pio_get_file ncid = 128
        0 Calling the netCDF layer
            0 nc_inq_varndims called ndims = 2
            0 my_name = ZBOT my_xtype = 5 my_ndims = 2 my_natts = 3
    0 PIOc_setframe ncid = 128 varid = 1 frame = 743
        0 pio_get_file ncid = 128
    0 PIOc_read_darray ncid 128 varid 1 ioid 513 arraylen 1 
        0 pio_get_file ncid = 128
        0 pio_read_darray_nc_serial vid = 1
            0 fndims 2 ndims 2 vdesc->record 743 vdesc->ndims 2
Abort with message unexpected record in file /Users/marcoslongo/Dropbox/Home/Models/CTSM/cime/src/externals/pio2/src/clib/pio_darray_int.c at line 1446
Obtained 1 stack frames.
0   cesm.exe                            0x0000000103ba2a0c print_trace + 34

(The complete log file is here).

mpaiao commented 3 years ago

I am trying to set up the case with FATES turned off. Sorry for all the basic questions: is this done by setting the following xml change?

./xmlchange CLM_BLDNML_OPTS="-bgc cn -no-megan"

I tried -bgc none since this was just a test, but it was rejected as an invalid option. I submitted the simulation, but it is currently downloading a lot of very large files.
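For reference, one way to test without FATES is to build an otherwise identical case on a non-FATES compset, so the big-leaf model is used instead. A minimal sketch follows; the I2000Clm51Bgc compset alias matches the one used later in this thread, and the case name, resolution, and paths are placeholders.

    # Run from cime/scripts (or wherever create_newcase lives in your checkout).
    # With CLM_USRDAT you still need the usual single-point settings
    # (CLM_USRDAT_NAME, domain/surface files, DATM streams) as in the FATES case.
    ./create_newcase --case bci_nofates_test \
                     --res CLM_USRDAT \
                     --compset I2000Clm51Bgc \
                     --run-unsupported
    cd bci_nofates_test
    ./case.setup
    ./case.build
    ./case.submit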

rosiealice commented 3 years ago

Hey Marcos,

Have you tried any of the out-of-the-box configurations?

e.g. box 4 in my "help I've forgotten how everything works" cheat sheet ;) https://github.com/rosiealice/fates/wiki/Rosie's-developer-instructions

I suspect that this sort of PIO error would likely show up in whatever configuration you use, but it's good to check at least. My hunch is that it's something to do with Python versions...

mpaiao commented 3 years ago

Many thanks for the cheat sheet, Rosie, this is gold! Bookmarked here ;)

I was able to run the vanilla CLM5 (and the ELM equivalent), and also the 2019 Workshop walkthrough examples. I also created and successfully ran all these tests using this generic shell script.

The error is specific to the single-site simulations, so maybe Python is fine and I'm just messing up some of the configuration when running a single site.

mpaiao commented 3 years ago

A few updates:

  1. I tried to create the case with the same script Ryan shared with Lin, who in turn shared it with me (my only changes were to set the paths on my computer). The simulation failed in the same way as it does with my script.
  2. I tried to generate the surface data file for Paracou from the same reference surface data file that Walkthrough Run 2 automatically downloads. I ran this site without forest inventory initialisation (so it was similar to Run 2), but it didn't help; the crash is exactly the same as for BCI.
  3. I went through every difference in the namelists for these runs. Most of them look trivial, but some may be true differences (a sketch for diffing the resolved namelists follows this list).
    • datm_in. Many variables have 5 values in the walkthrough example but only 3 in the single-site case. Does this matter? For example (note that the mapping algorithm is also different):
      Walkthrough example:
        fillread = "NOT_SET", "NOT_SET", "NOT_SET", "NOT_SET", "NOT_SET"
        fillwrite = "NOT_SET", "NOT_SET", "NOT_SET", "NOT_SET", "NOT_SET"
        mapalgo = "bilinear", "bilinear", "bilinear", "bilinear", "bilinear"
      Single-site case:
        fillread = "NOT_SET", "NOT_SET", "NOT_SET"
        fillwrite = "NOT_SET", "NOT_SET", "NOT_SET"
        mapalgo = "nn", "nn", "nn"
    • drv_in. I see start_type="startup" in the walkthrough test, and start_type="continue" for the single-site. I left all the differences between namelists in this log file.
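A quick way to do this comparison systematically (a sketch; the case paths are placeholders, and CaseDocs/ is where CIME writes the resolved namelists after preview_namelists):

    # Regenerate and compare the resolved namelists of a working case and a failing case.
    cd /path/to/walkthrough_case && ./preview_namelists
    cd /path/to/single_site_case && ./preview_namelists
    diff -u /path/to/walkthrough_case/CaseDocs/datm_in \
            /path/to/single_site_case/CaseDocs/datm_in
    diff -r /path/to/walkthrough_case/CaseDocs \
            /path/to/single_site_case/CaseDocs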

Also for reference, these are the commits I am using:

adamhb commented 3 years ago

I am having the same problem, but was able to run the versions of ctsm/fates mentioned above when I reverted to an earlier version of cime (tag: cime5.8.32). Perhaps something in the most recent cime code is causing the issue?

mpaiao commented 3 years ago

@adamhb This is interesting. I tried the same cime version, but it did not work for me. I actually tried several versions between 5.8.16 and the current one, and none of them worked, though the error message varied between versions. Would you mind sending me the settings you used for the test that ran successfully? I'd like to compare my settings with yours. Thanks!

adamhb commented 3 years ago

@mpaiao No problem. Keep in mind that I'm very new to running FATES myself! I attached my bash script (had to attach as .txt) that I'm using to build the case that worked on Lobata.

Jessie pointed out a couple of key things that you might need to change if you're not doing them already:

  1. You'll need an up-to-date parameter file (I created a .nc from the .cdl that comes with the version of FATES you are using).
  2. You need to add parteh_mode = 1 in the namelist options to run the carbon-only model (see the section of the attached script titled "# MODIFY THE CLM NAMELIST (USERS MODIFY AS NEEDED)").
  3. Note that I'm also running with use_fates_ed_prescribed_phys = .false. in the namelist options (a hedged sketch of items 1-3 follows this list).
  4. Domain file: domain_bci_clm5.0.dev009_c180523.nc
  5. Surface data file: surfdata_bci_clm5.0.dev009_c180523.nc
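A minimal sketch of items 1-3, assuming the standard NetCDF ncgen utility and the namelist variable names used in recent CTSM-FATES tags (fates_paramfile, fates_parteh_mode, use_fates_ed_prescribed_phys); check these against the namelist definitions of the tag you are actually running, and treat all paths as placeholders.

    # 1. Regenerate the parameter file from the CDL shipped with your FATES checkout.
    ncgen -o fates_params.nc \
          /path/to/fates/parameter_files/fates_params_default.cdl

    # 2-3. Point the case at it and select carbon-only PARTEH in user_nl_clm.
    {
      echo "fates_paramfile = '/path/to/fates_params.nc'"
      echo "fates_parteh_mode = 1"
      echo "use_fates_ed_prescribed_phys = .false."
    } >> user_nl_clm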

cime/ctsm/fates tags: cime5.8.32; ctsm5.1.dev042; sci.1.46.0_api.16.0.0

I am using Python 3.7. Let me know if you need me to send any of the param or driver files I mentioned above, or if you need to know anything else about this run!

bci_case_build_forMarcos.txt

mpaiao commented 3 years ago

Thanks @adamhb. I tried and it still fails here, with the same error.

For reference, I am running the model on my local computer (macOS Big Sur 11.4). I am using gcc 11.1.0_1 (so gcc-11, gfortran-11, g++-11) and Python 3.9.5, and compiling the code with mpi-serial. The GNU compilers and Python are just the default Homebrew versions. The XML configuration files I am using are here.
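For anyone reproducing this locally, a quick sanity check of the toolchain and NetCDF installation (command names assume the default Homebrew packages; adjust if yours differ):

    gcc-11 --version
    gfortran-11 --version
    python3 --version
    nc-config --version      # netCDF-C
    nf-config --version      # netCDF-Fortran
    which mpif90 || echo "no MPI Fortran wrapper (fine for mpi-serial builds)"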

adamhb commented 3 years ago

@mpaiao OK, these details are very new to me, but I've attached the machine and compiler configuration files and a software environment file for the run; they might be helpful to you.

One other thought: for Lobata users, we followed [Greg's new user setup instructions](https://github.com/glemieux/fates-scratch/blob/master/Notes/lobata/NewUserSetup.md), which include making sure we have access to some specific programs (including some that seem to have NetCDF functionality). You may want to look at those setup instructions to make sure your local machine has all the programs it needs.

config_compilers.txt ahb_software_environment.txt config_machines.txt

Adam

mpaiao commented 3 years ago

Thanks for the files, Adam! I will compare the configurations and see if I can spot something promising.

rgknox commented 3 years ago

@mpaiao and @adamhb: is this crash occurring during a non-FATES run? If so, this issue might not be getting visibility with the folks who are best equipped to solve it. We could open a new issue on the CTSM repository and link back to this thread.

rgknox commented 3 years ago

We discussed this at the CTSM software meeting this morning. One point of conversation was the use of PIO2 versus PIO1 (the new and old versions of the parallel I/O software stack, respectively). For simulations on my Linux workstation, I override the default (PIO2) and set PIO1 via:

./xmlchange PIO_VERSION=1

When I switch back to PIO2, my simulations crash as well:

Abort with message unexpected record in file /raid1/rgknox/SyncLRC/ctsm/cime/src/externals/pio2/src/clib/pio_darray_int.c at line 1446
Obtained 10 stack frames.
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc75b82) [0x55cf569bcb82]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc75c4d) [0x55cf569bcc4d]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc75c7f) [0x55cf569bcc7f]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc9f1a5) [0x55cf569e61a5]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc9d826) [0x55cf569e4826]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc692f6) [0x55cf569b02f6]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc6e62a) [0x55cf569b562a]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xb86ba6) [0x55cf568cdba6]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xb8edcb) [0x55cf568d5dcb]
/raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/cesm.exe(+0xc05f77) [0x55cf5694cf77]

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f10052462ed in ???
#1  0x7f1005245503 in ???
#2  0x7f10048c303f in ???
#3  0x7f10048c2fb7 in ???
#4  0x7f10048c4920 in ???
#5  0x55cf569bcc51 in piodie
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/externals/pio2/src/clib/pioc_support.c:561
#6  0x55cf569bcc7e in pioassert
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/externals/pio2/src/clib/pioc_support.c:582
#7  0x55cf569e61a4 in pio_read_darray_nc_serial
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/externals/pio2/src/clib/pio_darray_int.c:1444
#8  0x55cf569e4825 in PIOc_read_darray
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/externals/pio2/src/clib/pio_darray.c:939
#9  0x55cf569b02f5 in read_darray_internal_real
    at /raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/gnu/mpi-serial/debug/nothreads/mct/pio/pio2/src/flib/piodarray.F90:349
#10  0x55cf569b5629 in __piodarray_MOD_read_darray_1d_real
    at /raid1/rgknox/Models/land_runs/bci-test-radloop-hifrq-v1.C88b613478-F8c8da995.2021-06-24/bld/gnu/mpi-serial/debug/nothreads/mct/pio/pio2/src/flib/piodarray.F90:331
#11  0x55cf568cdba5 in shr_dmodel_readstrm
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/share/streams/shr_dmodel_mod.F90:965
#12  0x55cf568d5dca in __shr_dmodel_mod_MOD_shr_dmodel_readlbub
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/share/streams/shr_dmodel_mod.F90:699
#13  0x55cf5694cf76 in __shr_strdata_mod_MOD_shr_strdata_advance
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/share/streams/shr_strdata_mod.F90:891
#14  0x55cf55e44c1c in __datm_comp_mod_MOD_datm_comp_run
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/components/data_comps_mct/datm/src/datm_comp_mod.F90:664
#15  0x55cf55e50cf5 in __datm_comp_mod_MOD_datm_comp_init
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/components/data_comps_mct/datm/src/datm_comp_mod.F90:568
#16  0x55cf55e4402c in __atm_comp_mct_MOD_atm_init_mct
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/components/data_comps_mct/datm/src/atm_comp_mct.F90:172
#17  0x55cf55d813ea in __component_mod_MOD_component_init_cc
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/drivers/mct/main/component_mod.F90:248
#18  0x55cf55d70164 in __cime_comp_mod_MOD_cime_init
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/drivers/mct/main/cime_comp_mod.F90:1415
#19  0x55cf55d7d8bd in cime_driver
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/drivers/mct/main/cime_driver.F90:122
#20  0x55cf55d7dad8 in main
    at /raid1/rgknox/SyncLRC/ctsm/cime/src/drivers/mct/main/cime_driver.F90:23
Aborted (core dumped)

Will dig into this some more.

adamhb commented 3 years ago

Hi @rgknox, I haven't tried a non-FATES run, but it does seem like a CTSM/CIME issue, because the problem went away for me when I switched to an earlier version of cime (tag: cime5.8.32). For now I'm just using this older version of cime for my simulations. When I went back to the most recent version of cime it crashed again, and the PIO option you mentioned didn't seem to make a difference.

mpaiao commented 3 years ago

@rgknox It depends on the settings. Below is a summary of my successes and failures so far (this was before the PIO_VERSION suggestion). Almost all cases were created successfully; failures only occurred when I submitted the cases. The scripts I used for each case are here.

| Script | Description | FATES | RES | COMPSET | Status |
| --- | --- | --- | --- | --- | --- |
| B0000-CLM | Vanilla CLM | off | f45_f45_mg37 | I2000Clm50BgcCrop | Success |
| B0000-ELM | Vanilla ELM | off | f45_f45_mg37 | IELMBGC | Success |
| F0001_SimpleCase_CLM | Walkthrough 1 | on | 1x1_brazil | I2000Clm51Fates | Success |
| F0001_SimpleCase_ELM | Walkthrough 1 | on | 1x1_brazil | IELMFATES | Success |
| B0001_SimpleCase-CLM | Walkthrough 1 | off | 1x1_brazil | I2000Clm51Bgc | Success |
| B0001_SimpleCase-ELM | Walkthrough 1 | off | 1x1_brazil | I2000ELMBGC | Success |
| F0002_BrazilTest_CLM | Walkthrough 2 | on | 1x1_brazil | I2000Clm51Fates | Success |
| F0002_BrazilTest_ELM | Walkthrough 2 | on | 1x1_brazil | IELMFATES | Success |
| B0002_BrazilTest-CLM | Walkthrough 2 | off | 1x1_brazil | I2000Clm51Bgc | Success (1) |
| B0002_BrazilTest-ELM | Walkthrough 2 | off | 1x1_brazil | I2000ELMBGC | Success (1) |
| F0003_BCINoInv-CLM | BCI test (no inventory) | on | CLM_USRDAT | I2000Clm51Fates | Failure (2) |
| F0003_BCINoInv-ELM | BCI test (no inventory) | on | ELM_USRDAT | IELMFATES | Failure (2) |
| B0003_BCINoInv-CLM | BCI test (no inventory) | off | CLM_USRDAT | I2000Clm51Bgc | Failure (2) |
| B0003_BCINoInv-ELM | BCI test (no inventory) | off | ELM_USRDAT | IELMFATES | Failure (3) |

(1) It worked when I excluded the variables by age and size class from hist_fincl1. I guess this makes sense.
(2) Runs failed with or without the ./xmlchange PIO_VERSION=1 setting.
(3) Compilation failed. I am attaching the log here; it looks related to the crashes.

mpaiao commented 3 years ago

I identified the issue. Based on the single-point driver for Mexico City (one of the standard cases that uses single-point meteorological data), the variable ZBOT must be a time series, not a single, time-invariant variable. I re-created the meteorological forcing for my test site and CLM-FATES is now running fine on my local computer. The current drivers for BCI (bci_0.1x0.1_v4.0i) do not seem to be compatible and may need to be updated.
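For anyone who wants to patch existing forcing files rather than regenerate them, a hedged NCO sketch of what the fix amounts to: broadcast the time-invariant ZBOT onto the time dimension. This assumes the dimensions are named time/lat/lon, that NCO (ncap2, ncks, ncrename) is installed, and that the forcing path is a placeholder.

    for f in /path/to/CLM1PT_data/*.nc; do
      ncap2    -O -s 'ZBOT_ts[$time,$lat,$lon]=ZBOT' "$f" "$f"   # broadcast over time
      ncks     -O -x -v ZBOT "$f" "$f"                           # drop the old time-invariant ZBOT
      ncrename -v ZBOT_ts,ZBOT "$f"                              # put the new one in its place
    done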

For reference, I updated the code I used to generate initial conditions compatible with the current code:

Note: the current default reference data used for the surface data file is not compatible with ELM (e.g., it doesn't have all the variables needed). I am still looking for a better reference. Update: this has since been fixed (see the next comment).

mpaiao commented 3 years ago

It seems ELM requires a variable T_BUILDING_MAX in the surface data, but CLM does not; it appears to be used by the urban land unit. For now, I updated make_fates_domain+surface.Rmd to generate the missing variable based on a simple relation with T_BUILDING_MIN. This is done only when T_BUILDING_MAX is missing from the reference surface data. The script now creates a netCDF surface data file compatible with both CLM and ELM.
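For illustration only, the same kind of patch expressed with NCO (the actual relation used in make_fates_domain+surface.Rmd is not reproduced here; the 10 K offset and the file names below are placeholders, and this should only be applied when T_BUILDING_MAX is absent):

    ncap2 -O -s 'T_BUILDING_MAX = T_BUILDING_MIN + 10.0' \
          surfdata_reference.nc surfdata_with_tmax.nc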

adamhb commented 3 years ago

Well done! And thanks for letting me know about the solution and for sharing your scripts for making the met driver, surface, and domain files. Do you now have updated versions of these files for a BCI single-site simulation with CTSM? Adam
