Dynamic generation of met_data_file names prevents check_input_data from downloading needed data

billsacks commented 4 years ago

This issue started in the forums:

https://xenforo.cgd.ucar.edu/cesm/threads/issue-downloading-merra-files-during-testing.4952/

The problem seems to be that CAM's namelist lists a single met_data_file, e.g.

met_data_file       = '2000/MERRA2_1.9x2.5_20000101.nc'

and then other file names (with other dates) are generated dynamically at run time.

However, this means that check_input_data does not know about these other files, so they are not automatically downloaded (even though they do exist in the input data repository).

This should be changed in some way to allow check_input_data to work properly, downloading all needed files for a case. Two options that come to mind are:

(1) All file names can be generated and listed at build-namelist time

(2) check_input_data can be enhanced to understand that, in some cases (like this one) it needs to get all files from a given directory, rather than just the specific listed file.

I have reproduced this error both in CESM2.0.1 (where the user reported the issue) and in a recent version of the cesm2.1 alphabranch. I reproduced it by running

./create_test --no-build SMS_Ld1.f19_f19_mg16.FXSD.bishorn_gnu.cam-outfrq1d (where bishorn is my local machine), then running ./preview_namelists followed by ./check_input_data, and noting that only one of the MERRA files is considered by check_input_data.

fvitt commented 4 years ago

How much of the met data do we want to place into the inputdata repo? 20+ years of MERRA data at more than 1 resolution is a lot of data to put into the repo.

billsacks commented 4 years ago

Per earlier CSEG discussions: all data needed for any supported cases should be in the inputdata repository. Note that (thanks to @jedwards4b) we now have multiple servers that can be used to host inputdata, so it may not be necessary to put this in the svn inputdata repo if a different place makes more sense. How much data volume does this involve?

fvitt commented 4 years ago

This seems very impractical when a user wants to run only 1 month and the scripts are required to download 20+ years of MERRA data.

billsacks commented 4 years ago

There seem to be a few separate issues here, so I think it's important to address them separately:

(1) What forcing data should be in the inputdata repository?

CSEG policy: all data needed to run any supported case.

(2) What data should be automatically downloaded by the check_input_data script?

CESM standard: all data needed to run the given case.

(3) Can a mechanism be put in place to prevent the need for downloading a full data set when only a subset of it is required?

@fvitt This is your last point. Sure, this is possible; it will just take a bit more work on your end to set this up, including some smarts in the script to generate the list of needed files. My main point, though, is that this is a separate question from (1) and (2).

jedwards4b commented 4 years ago

@fvitt the scripts do not need to download 20+ years of data - only what the buildnml requests and buildnml could use the date information from the case xml to determine what it needs to download.

billsacks commented 4 years ago

From discussion with @fvitt, @jedwards4b, Simone Tilmes and Chi-Fan Shih (I may be missing some points; @fvitt or @jedwards4b, please add any important details that I'm missing):

It probably isn't worth putting all of the data needed for SD in the inputdata repository. The current solution of using the RDA should be sufficient for this. (Note that the RDA requires user registration, so can't be accessed via CESM's inputdata scripts.)

However, we should at least ensure that the data needed for all tests in the test suite are available in inputdata and are obtained automatically via the check_input_data script. In particular, see https://xenforo.cgd.ucar.edu/cesm/threads/issue-downloading-merra-files-during-testing.4952/ - enough data are needed so that SMS_Ld1.f19_f19_mg16.FXSD.hydra_gnu.cam-outfrq1d can be run from any system.

CAM's build-namelist should be extended to be smart enough to know which data are needed for a given case, based on xml values it can query from the case (start date and run length). All necessary files should be added to the list of necessary files that is read by check_input_data.

In addition, if any files are not available in the inputdata repository, a message should be printed by CAM's build-namelist giving instructions on how to access these data.
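As a rough illustration of the "generate the list of needed files" idea, here is a minimal Python sketch. It is not CAM's build-namelist code. It assumes one met data file per calendar day, named after the pattern in the namelist example above (`YYYY/MERRA2_1.9x2.5_YYYYMMDD.nc`), and it assumes the start date and run length have already been read from the case xml. The function name `met_data_files` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def met_data_files(start, ndays, prefix="MERRA2_1.9x2.5"):
    """Return relative paths of the daily met files covering a run.

    Hypothetical helper: assumes one file per day, stored in a directory
    named after the year, as in '2000/MERRA2_1.9x2.5_20000101.nc'.
    Uses Python's Gregorian calendar, so leap days are included.
    """
    files = []
    for offset in range(ndays):
        day = start + timedelta(days=offset)
        files.append(f"{day.year:04d}/{prefix}_{day:%Y%m%d}.nc")
    return files


if __name__ == "__main__":
    # Example: a 3-day run starting 2000-12-30 spans two year directories.
    for path in met_data_files(date(2000, 12, 30), 3):
        print(path)
```

Each generated path could then be added to the list of files that check_input_data reads. If the model needs a file beyond the last run day (for example, for time interpolation), `ndays` would need to be padded accordingly; that detail is not settled in this thread.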

brian-eaton commented 4 years ago

I would suggest that the easiest solution to this issue (which occurs in several other CAM data input streams) is to update the test definition so that the necessary file for met_data_file is specified in the user_nl_cam file.

jedwards4b commented 4 years ago

@brian-eaton that will fix the test, but why not fix CAM's buildnml so that the files are listed in cam_in, so that it works for regular cases as well as tests?

billsacks commented 4 years ago

While @jedwards4b 's solution feels right long-term, it seems like @brian-eaton 's idea is fine for the main issue at hand, which is allowing tests in the test suite to run. As @fvitt and Simone said, users of SD compsets are accustomed to manually downloading the needed data, so the main problem right now is users of the test suite.

ESCOMP / CAM

Dynamic generation of met_data_file names prevents check_input_data from downloading needed data #53