ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

mksurfdat toolchain: Wrapper tool that handles all the steps needed to create a CTSM surface dataset #644

Closed slevis-lmwg closed 2 years ago

slevis-lmwg commented 5 years ago

Starting here with notes from Mike Barlage that explain how he generates CTSM surface data for WRF domains, because we're using WRF domains as the motivating application for redesigning the mksurfdata toolchain.

Mike's notes:

Creating setup based on WRF domain - CTSM Cheyenne/Geyser

  1. Create SCRIP file from WRF geo_em file

create_scrip_file.ncl

creates two files that are complements of each other only in the mask field

script and data reside: /glade/work/barlage/ctsm/nldas_grid/scrip
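For orientation, a minimal Python sketch of what such a script produces, assuming xarray and the standard geo_em variable names (XLAT_M, XLONG_M, LANDMASK). The NCL script remains the authoritative version; a real SCRIP file also carries grid corner arrays, which are omitted here:

```python
# Sketch: derive a SCRIP-style grid description from a WRF geo_em file.
# grid_corner_lat/lon (required in a real SCRIP file) are omitted.
import numpy as np
import xarray as xr

geo = xr.open_dataset("geo_em.d01.nc")
lat = geo["XLAT_M"].isel(Time=0).values
lon = geo["XLONG_M"].isel(Time=0).values
landmask = geo["LANDMASK"].isel(Time=0).values.astype(np.int32)

land = xr.Dataset(
    {
        "grid_dims": ("grid_rank", np.array(lat.shape[::-1], dtype=np.int32)),
        "grid_center_lat": ("grid_size", lat.ravel()),
        "grid_center_lon": ("grid_size", lon.ravel()),
        "grid_imask": ("grid_size", landmask.ravel()),  # 1 = land
    }
)
land["grid_center_lat"].attrs["units"] = "degrees"
land["grid_center_lon"].attrs["units"] = "degrees"
land.to_netcdf("wrf2clm_land.nc")

# The complementary ocean file differs only in the mask field.
ocean = land.copy(deep=True)
ocean["grid_imask"] = 1 - ocean["grid_imask"]
ocean.to_netcdf("wrf2clm_ocean.nc")
```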

  2. Create mapping file in tools/mkmapdata

Modify mkunitymap.ncl by commenting out the following lines:

```
+; if ( any(ncb->grid_imask .ne. 1.0d00) )then
+;   print( "ERROR: the mask of the second file isn't identically 1!" );
+;   print( "(second file should be land grid file)");
+;   exit
+; end if
```

Link scrip files

```
ln -sf /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_land_noneg.nc .
ln -sf /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_ocean_noneg.nc .
```

```
setenv GRIDFILE1 wrf2clm_ocean_noneg.nc
setenv GRIDFILE2 wrf2clm_land_noneg.nc
setenv MAPFILE wrf2clm_mapping_noneg.nc
setenv PRINT TRUE
```

ncl mkunitymap.ncl

This will throw some git errors if not run inside a git repository.

*** takes a few seconds

  3. Create ESMF mapping files in tools/mkmapdata

qsub regridbatch_barlage.sh

A copy is stored in ~/src/ctsm/regrid_scripts/

  4. Create domain files in cime/tools/mapping/gen_domain_files/

Build:

```
cd src/
../../../configure --macros-format Makefile --mpilib mpi-serial
(source ./.env_mach_specific.csh ; gmake)
cd ..
```

```
./gen_domain -m /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_mapping_noneg.nc -o wrf2clm_ocn_noneg -l wrf2clm_lnd_noneg
```

creates:

domain.lnd.wrf2clm_lnd_wrf2clm_ocn.180808.nc
domain.ocn.wrf2clm_lnd_wrf2clm_ocn.180808.nc
domain.ocn.wrf2clm_ocn.180808.nc

copy to /glade/work/barlage/ctsm/nldas_grid/gen_domain_files

*** takes a few seconds

  5. Create surface datasets in tools/mksurfdata_map [ run on cheyenne; takes ~3 minutes ]
billsacks commented 4 years ago

Eventually, I think we want a single python-based tool for this (though we'll probably keep mksurfdata_map in Fortran). In order to break this down into more manageable chunks, I see two possible approaches:

(1) Top down: Start by making a small wrapper script that calls the existing tools (a mix of shell scripts and perl, I think). This will involve working out a reasonable user interface for the high-level wrapper. Then we can start converting the individual tools to python one-by-one; as we do, we can call them directly rather than using subprocesses.

(2) Bottom up: Start by converting each of the individual tools to python one-by-one; create the wrapper after that is all done.

In talking with @mvertens about this, we think that (1) is probably the best approach.
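As a rough sketch of approach (1), the wrapper could start by shelling out to the existing tools and later replace each subprocess call with a direct function call as that tool is ported to python. The script names and arguments below are placeholders, not a proposed interface:

```python
# Sketch of a top-down wrapper: call the existing tools via subprocess,
# then swap each call for a direct python call as that tool is ported.
# Script names and arguments are placeholders only.
import subprocess

def run(cmd):
    """Run one toolchain step, echoing the command and failing fast."""
    print("+ " + " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    run(["./mkmapdata.sh", "--res", "4x5"])        # mapping files
    run(["perl", "mksurfdata.pl", "-res", "4x5"])  # namelist + fortran run

if __name__ == "__main__":
    main()
```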

slevis-lmwg commented 4 years ago


The pre-processing step of generating a namelist is discussed here: https://github.com/ESCOMP/CTSM/issues/86

slevis-lmwg commented 4 years ago

I am transferring the latest proposed approach from the google doc to this discussion. Comments welcome:

A) Using the new wrapper script, create_surface_data

1) Create DST mesh file from user’s land grid. E.g., for WRF use create_mesh_file_wrf.ncl. We leave this UP TO THE USER for now because each modeling group's land grid file will likely look different.

TODO: What script does Ufuk have available for the CESM/CTSM case?

2) Generate namelist (control file) using a modified mksurfdata.pl script.

NB. What we have been referring to as namelist here will be a control file (in a format such as namelist, yaml, config-file format, json, xml). There will also be an internal namelist file read by the mksurfdata_map fortran code that will not involve user modification.

Critical options in mksurfdata.pl:
-res ... a supported resolution. No need for a user-specified option because the user will replace the default DST mesh filename in the namelist (control file) with their custom DST mesh filename (this can also be a comma-delimited list)
-years ... time-slice or range (this can also be a comma-delimited list)
-glc_nec ... glacier elevation classes
-ssp_rcp ... future scenario (this can also be a comma-delimited list)
-dinlc [or -l] ... /path/of/root/of/input/data; can be removed if build_clm will be done first
--rundir ... /path/of/output/data

Important options in mksurfdata.pl:
-vic ... Add the fields required for the VIC model
-glc ... Add the optional 3D glacier fields for verification of the glacier model
-hirespft ... If you want to use the high resolution PFT dataset
-no-surfdata ... Don't output the surface dataset (when you just want the landuse.timeseries file)
-no-crop ... Create datasets without the extensive list of prognostic crop types (important for some grids)
-help ... Print help

Critical options for PTCLM, but also for our production of several of our grids for testing. @slevisconsulting adding here that we need to decide how to implement these overrides in the tool-chain. I picture the user deciding to insert these in their generated control file in place of the corresponding mksrf files. Or even the mksrf file name strings themselves could be used as comma-delimited lists when so desired:
-pft_frc "list of fractions" ... Comma-delimited list of percentages for veg types
-pft_idx "list of veg index" ... Comma-delimited veg index for each fraction
-soil_cly "% of clay" ... % of soil that is clay
-soil_col "soil color" ... Soil color (1 [light] to 20 [dark])
-soil_fmx "soil fmax" ... Soil maximum saturated fraction (0-1)
-soil_snd "% of sand" ... % of soil that is sand
-dynpft "filename" ... Dynamic PFT/harvesting file to use if you have a manual list you want to use
-urban_skip_abort_on_invalid_data_check ... Workaround for a bug (needed for the urbanc_alpha resolution)

Options that are probably still useful, but should be looked into and maybe could be done a different way:
-exedir ... The location of the mksurfdata_map executable
-inlandwet ... If you want to allow inland wetlands
-merge_gis ... If you want to use the glacier dataset that merges in the Greenland Ice Sheet data
-fast_maps ... Doesn't run the high resolution 1km mapping file
-debug ... Don't actually run, just show what would happen
-usr_mapdir "mapdirectory" ... Directory where the user-supplied mapping files are

Options from mksurfdata.pl that can be removed:
-allownofile ... Allow the script to run even if one of the input files does NOT exist
-usrname "clm_usrdat_name" ... CLM user data name to find grid file with
-usr_gname "user_gname" ... User resolution name to find grid file with
-usr_gdate "user_gdate" ... User map date to find mapping files with

mksurfdata.pl already has options and flags for all the currently known cases. We will likely keep most of them.

NOTE: Two requirements. One is that you can still use the Makefile.data makefile to build the standard resolutions needed for CTSM. For this to work, the process must be drivable entirely with command-line options, without requiring edits to the namelist (control file). The other requirement is that it work with PTCLMmkdata. The easiest way to do that would be with command-line options as well.

Another requirement is that there is testing in place for all of this. There could be both unit as well as functional and system testing for the entire system.

Envisioned changes from mksurfdata.pl to gen_mksurf_control:
a) The new namelist (control file) will NOT include mapping files. The wrapper script will "know" which raw data correspond to which mapping files based on (a) the SRC grid and landmask (new metadata in the raw datasets) and (b) the DST grid.
b) The new namelist (control file) will now point to a default DST mesh file that users may change.
c) The new namelist (control file) will continue to list raw datasets. All raw datasets will now include metadata documenting the SRC mesh file, its grid, and its landmask.

TODO2 @slevisconsulting will come up with file naming conventions with @ekluzek and @negin513 and will add the necessary metadata to all raw datasets.

TODO3 @negin513 and @slevisconsulting will replace mksurfdata.pl with gen_mksurf_control, which generates the new wrapper script's control file. Cime namelist-reading python utilities may help in the development of a python version of mksurfdata.pl.

3) Run the wrapper script create_surface_data pointing to the new control file from step 2.
a) Default cases use the control file unchanged. The user will modify the control file directly when changing the DST mesh file and/or the raw datasets.
b) Raw datasets will now know which "nomask" SRC mesh file and respective mapping file correspond to them via metadata information. Users will create new SRC mesh files for raw datasets at new grids for which we do not have SRC mesh files. Users will document those in the metadata of their new raw datasets.
c) All mesh files should ultimately be in UNSTRUCT format instead of SCRIP, as discussed in #648 . The DST mesh file may still need to be in SCRIP format to run gen_cesm_maps.sh and gen_domain, though see (B) below. The wrapper script could handle the conversion of SCRIP SRC mesh files to UNSTRUCT, though we may convert the files preemptively and eliminate the use of SCRIP files.
d) create_surface_data will perform the functions of both mkmapdata.sh and mksurfdata_map.sh by default and will not repeat work when mapping files and/or the surface dataset already exist (see the sketch after this list). We can give users the flexibility to request running the first step only.
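A minimal sketch of the skip-work behavior described in (d), with hypothetical file names:

```python
# Sketch: regenerate an output only when it does not already exist.
import os

def make_if_missing(path, builder):
    """Run builder() only when the target file is absent."""
    if os.path.exists(path):
        print(f"{path} exists, skipping")
        return
    builder()

make_if_missing("map_src_to_dst.nc", lambda: print("generating mapping file ..."))
make_if_missing("surfdata.nc", lambda: print("running mksurfdata_map ..."))
```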

TODO4 @negin513 and @slevisconsulting will assemble the functions of mkmapdata.sh and mksurfdata_map.sh in a single script create_surface_data.

B) Generating the domain file TODO @mvertens will replace the need for a domain file with new code in the nuopc/lilac caps that executes during run-time initialization. This will make domain files obsolete!

@mvertens keep in mind when working on this that there are two alternate scripts for step 1 in generating a domain file: gen_cesm_maps.sh (a cime script), which fails when there is no ocean, so there is a separate clm tool for generating a regional no-ocean domain file, /tools/mkmapdata/mknoocnmap.pl, which calls an ncl script.

slevis-lmwg commented 4 years ago

@ekluzek completed TODO1 above by listing important mksurfdata.pl options and options that may be obsolete.

Regarding TODO2 "metadata needed in raw datasets" we decided as follows:

srcmeshfile_scrip_w_mask .../relative/path/as/in/namelist_defaults_ctsm_tools.xml/SCRIPgrid_<grid_name>_<landmask_name>_c<date>.nc
srcmeshfile_scrip_nomask .../relative/path/as/in/namelist_defaults_ctsm_tools.xml/SCRIPgrid_<grid_name>_nomask_c<date>.nc
srcmeshfile_unstruct_w_mask .../relative/path/as/in/namelist_defaults_ctsm_tools.xml/UNSTRUCTgrid_<grid_name>_<landmask_name>_c<date>.nc
srcmeshfile_unstruct_nomask .../relative/path/as/in/namelist_defaults_ctsm_tools.xml/UNSTRUCTgrid_<grid_name>_nomask_c<date>.nc

where grid_name is e.g. 0.5x0.5, ... and landmask_name is e.g. AVHRR, MODIS, ...

We decided to include four mesh file paths (SCRIP and UNSTRUCT SRC files with and without masks), so as to be prepared for all these options, regardless of whether we complete #823 and #648 .

We decided to omit mapping file names from the metadata of the raw datasets. The script will "know" which raw data corresponds to which mapping file based on the SRC grid_name and landmask_name.

I will create the new raw dataset files with new date-stamps in the file names.

I will add some other clarifications to the proposal text (preceding post).

billsacks commented 4 years ago

When you say you're going to list 4 different mesh file paths, do you just mean in the short-term? That seems okay, but I feel like it could add (significantly) more confusion than value in the long-term, so I hope that by the time we release this new method, we'll have gotten this down to a single mesh file name.

slevis-lmwg commented 4 years ago

You're right @billsacks , ideally this would be for the short term. Alternatively, I could add a single mesh file path src_mesh_file and modify it later as needed, e.g. once if we go to nomask and once if we go to UNSTRUCT files.

The more I think about it, the more I prefer the latter option now. It seemed inefficient, but with my notes the second and third times will be quicker.

ekluzek commented 4 years ago

The thing is, each time you change it you make a new raw data file for each of the datasets needed, so you end up making a ton more datasets.


billsacks commented 4 years ago

I like @slevisconsulting 's suggestion for now. I assume there will be a moderately long period in which this is under development, and these new files are just being used on the development branch, not on master. So they don't need to be added to the inputdata repository. Then we can revisit this question if this is getting close to being ready to come to master and #823 and/or #648 are still unresolved.

slevis-lmwg commented 4 years ago

I have added the metadata to the mksrf_ files in a copy that I am keeping here for now: /glade/campaign/cgd/tss/slevis/rawdata (accessible from casper) Feel free to ncdump -h one or two or a few of them to approve my work.
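For reference, a sketch of how those attributes can be stamped onto a raw dataset with the netCDF4 library; the file name and attribute values below are illustrative placeholders built from the naming convention above (cYYMMDD stands in for a real date stamp):

```python
# Sketch: add the agreed mesh-file metadata as netCDF global attributes.
# File name and values are placeholders following the convention above.
from netCDF4 import Dataset

attrs = {
    "srcmeshfile_scrip_w_mask":    "SCRIPgrid_0.5x0.5_AVHRR_cYYMMDD.nc",
    "srcmeshfile_scrip_nomask":    "SCRIPgrid_0.5x0.5_nomask_cYYMMDD.nc",
    "srcmeshfile_unstruct_w_mask": "UNSTRUCTgrid_0.5x0.5_AVHRR_cYYMMDD.nc",
    "srcmeshfile_unstruct_nomask": "UNSTRUCTgrid_0.5x0.5_nomask_cYYMMDD.nc",
}

with Dataset("mksrf_example_cYYMMDD.nc", "a") as nc:  # open for in-place edit
    for name, value in attrs.items():
        nc.setncattr(name, value)
```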

Next is TODO3 that replaces mksurfdata.pl with gen_mksurf_control that generates the new wrapper script’s control file.

slevis-lmwg commented 4 years ago

I will discuss TODO3 in greater detail in #86 to prevent clutter here.

ekluzek commented 3 years ago

I'm thinking about the option "-fast_maps" in current mksurfdata.pl. This is basically covered in issue #450. I think this would be useful to have even in the early versions, because it will make testing so much faster.

ekluzek commented 3 years ago

@slevisconsulting , @negin513 , and I had a good discussion about this today, looking at @negin513 's script that creates the control file, which is then used to create both the mapping files and the namelist file for mksurfdata_map.

We looked at the list of current mksurfdata.pl options, and decided -debug can be removed for sure. Along the same lines I'd say to remove "-inlandwet" and "-merge_gis" as the namelist could be easily modified to turn those things on.

We had decided to get rid of most of the single-point options. But, in order to create all the datasets that we currently support keeping the following options seems prudent...

-pft_frc "list of fractions"...Comma delimited list of percentages for veg types -pft_idx "list of veg index" ...Comma delimited veg index for each fraction -dynpft "filename" ...Dynamic PFT/harvesting file to use if you have a manual list you want to use -urban_skip_abort_on_invalid_data_check ..Work around for a bug (needed for urbanc_alpha resolution)

I don't see an easy way to create some of the datasets without supporting the above. I don't think supporting the above is too onerous either. And that does get rid of the four single point soil options.

Another good option to keep is -no-surfdata, as that's used in normal dataset creation. You use it, for example, when you just want to create landuse.timeseries files for all the different SSP options but already have the surface dataset needed, so you don't want to recreate it.

slevis-lmwg commented 3 years ago

Notes from today's meeting with @negin513 and @slevisconsulting based on our notes in the corresponding google doc:

1) Some details pending in gen_mksurf_namelist.py
Transient cases:
- Specify the default path for each ssp/rcp case (transient = .true.)
- Identify the corresponding file names based on the year range (first and last only; a long .txt file is unnecessary)
- Give an error when any file name in the year range does not exist (as done by the .pl script); a sketch of this check appears below

Include default values in help page.
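A minimal sketch of that existence check, assuming an illustrative file-name pattern (the real pattern comes from the namelist defaults):

```python
# Sketch of the transient-case check: build the expected landuse file name
# for every year in the range and error out if any is missing. The name
# pattern and years here are illustrative assumptions.
import os

def check_years(template, start, end):
    names = [template.format(year=y) for y in range(start, end + 1)]
    missing = [n for n in names if not os.path.exists(n)]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} landuse files missing, e.g. {missing[0]}")

check_years("mksrf_landuse_SSP2-4.5_{year}.nc", 1850, 2000)
```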

Negin, I misspoke when I said that -fast_maps was not needed:

Holding off on single point options for now, but keeping in mind @ekluzek 's comments on this topic.

2) Begin wrapper script in python
The user will enter a command at the prompt: mksurf_wrapper.py <same options as they would enter if running gen_mksurf...py>
The wrapper script can point to the same function as gen_mksurf...py to confirm options and defaults.
Wrapper script steps:
- Generate the control/namelist file: call gen_mksurf_namelist.py
- Generate temp mapping files: call mkmapdata (.sh for now, .py soon)
- Generate surface.nc and landuse.nc files: call the fortran executable mksurfdata_map

slevis-lmwg commented 3 years ago

@negin513 presented our progress to date in today's CTSM Software meeting. Thanks everyone for the feedback.

I updated the big picture and moving parts of the wrapper script in the schematic according to today's conversation.

The group agreed:

All, please add or correct anything I may have missed.

billsacks commented 3 years ago

Thanks, @slevisconsulting and @negin513 .

One question about this schematic that I didn't get a chance to raise today: The way you have drawn this seems to imply that create_surface_data.py would run the whole thing at once, including gen_user_namelist.py. But my understanding from our earlier discussion was that there would be a two-step process: you would first run gen_user_namelist.py, then modify the default namelist as you wish, then have a tool that wraps all of the rest of the steps, given that (modified) namelist as input. Is that still the plan?

slevis-lmwg commented 3 years ago

@billsacks I was picturing a wrapper script that runs all the steps at once when generating surface datasets for default resolutions using default raw datasets. The wrapper script would permit the user to stop at step 1 to modify the namelist and/or step 2 to verify that they are satisfied with the mapping files.

I am not attached to this view if the group prefers to separate the first step out of the wrapper script.
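For illustration only, a wrapper that can stop after a given step might look like this (the --stop-after flag and the step names are assumptions, not an agreed interface):

```python
# Sketch of a wrapper that runs the pipeline up to a requested step.
import argparse

STEPS = ["namelist", "mapping", "surfdata"]

parser = argparse.ArgumentParser(description="create_surface_data sketch")
parser.add_argument("--stop-after", choices=STEPS, default="surfdata",
                    help="stop the pipeline after this step")
args = parser.parse_args()

for step in STEPS:
    print(f"running step: {step}")  # placeholder for the real work
    if step == args.stop_after:
        break
```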

billsacks commented 3 years ago

I haven't (yet) developed strong feelings on how this should work. I'm just thinking that it sounds like the most common workflow for users (not CTSM maintainers) would be:

  1. Run something to generate a default namelist
  2. Modify that namelist
  3. Run the rest of the tool chain, taking that namelist as input

For me, if I were doing that workflow, I think it would be most intuitive and least error-prone if (1) and (3) were different tools. I think what you're saying (though I may be misunderstanding) is that there would be one tool (create_surface_data.py) that would operate differently and have different command-line usages depending on what steps you want it to do. My gut feeling is that having a single tool that is smart enough to run (or not run) different steps is good when the most common thing is to want to run all of those steps at once, but that if the most common thing is to want to run certain steps separately, then there should be separate tools for those different steps. Of course, others may feel differently than I do on that.

dlawrenncar commented 3 years ago

I'd agree with Bill on this, but I don't have strong feelings. The main point is that the directions for a user are clear and that they can easily follow them to do the more common thing, which is as Bill described.


negin513 commented 3 years ago

I agree with both arguments, and I think they are not necessarily mutually exclusive. For example, for default cases we can have the user run only the wrapper:

```
./mksurf_wrapper.py --res 4x5 --start_year 1850 --end_year 2000
```

which runs all the steps including namelist generation, since the user does not need to change anything in the namelist. This is similar to what @slevisconsulting is proposing.

However, for other cases where the user wants to modify the namelist, we can have it like this:

```
./gen_mksurf_namelist.py -s 2000 -e 2015

# Creates a namelist (for example, xyz.namelist).
# The user then modifies this namelist for their preferred dst mesh file.
# Next, the user runs the wrapper code:

./mksurf_wrapper.py --namelist xyz.namelist
```

This is similar to what @dlawrenncar and @billsacks are proposing. When the --namelist option is given, the wrapper skips the namelist-creation step and uses the namelist it receives via the command-line argument.

This is just a suggestion that incorporates both ideas, but I am not attached to it, and we can go with whatever the group deems appropriate.

Overall, the workflow can be changed easily to accommodate either of these opinions. Therefore, I suggest not worrying about it too much for now and getting back to this particular issue during later stages of development.

negin513 commented 3 years ago


Another important point, which I 100% agree with @dlawrenncar on, is that the main thing is to have clear instructions for the users. That is why I created the Jupyter notebook, and I hope we can use it in the future as a tutorial for how to create the surface dataset, with different examples.

billsacks commented 3 years ago

Quoting @negin513 's proposal above: run everything through mksurf_wrapper.py for default cases; for custom cases, run gen_mksurf_namelist.py first, modify the resulting namelist, then pass it to the wrapper with --namelist.

This seems reasonable as long as:

  1. it's possible to document these two different usages clearly, both in command-line help and in the user's guide - i.e., as long as adding this alternative usage doesn't make it hard for users to understand the more common usage
  2. it can be done in a way that doesn't involve duplication between the gen_mksurf_namelist code and the mksurf_wrapper code: if you need to add a new argument to mksurf_wrapper, it should be possible to do it in one place in the code, even though the argument will need to be available to both mksurf_wrapper and gen_mksurf_namelist.
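Point 2 can be handled with argparse parent parsers, so each shared argument is defined exactly once; a sketch, with illustrative option names:

```python
# Sketch of defining shared arguments once and reusing them in both tools.
import argparse

def shared_parser():
    """Arguments common to gen_mksurf_namelist and mksurf_wrapper."""
    p = argparse.ArgumentParser(add_help=False)
    p.add_argument("--res", default="4x5", help="output resolution")
    p.add_argument("--start-year", type=int, default=2000)
    p.add_argument("--end-year", type=int, default=2000)
    return p

gen_parser = argparse.ArgumentParser(prog="gen_mksurf_namelist.py",
                                     parents=[shared_parser()])
wrap_parser = argparse.ArgumentParser(prog="mksurf_wrapper.py",
                                      parents=[shared_parser()])
wrap_parser.add_argument("--namelist",
                         help="skip namelist generation and use this file")
```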
negin513 commented 3 years ago

@billsacks : I agree with your points.

slevis-lmwg commented 3 years ago

New topic... Negin and I thought some more about modifying the fortran to accept the new namelist and realized that it introduces an element of risk by requiring us to reconstitute the mapping filenames in two places:
1) in mkmapdata.py, which generates the mapping files, and
2) in the fortran, which reads and uses the mapping files.

Specifically, mkmapdata.py and the fortran would each be reconstituting names such as map_3x3min_nomask_to_0.9x1.25_nomask_aave_da.nc by combining the SRC grid info found in the raw datasets with the DST grid info found in the new namelist. If anything changed in our naming conventions, we would need to modify code in both locations. We could minimize this risk by simplifying the file names to not include prefixes and suffixes, but there would still be assumptions about putting the SRC in front of the DST grid info, whether to use a connector such as "to" between them, etc.
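For concreteness, the reconstitution in question is roughly the following, and this logic would have to be mirrored in the fortran (a sketch of the convention, not actual toolchain code):

```python
# Sketch of the mapping-file naming convention that would live in two
# places; any change to the pieces or their order means edits in both.
def mapping_filename(src_grid, src_mask, dst_grid, dst_mask,
                     method="aave", suffix="da"):
    return f"map_{src_grid}_{src_mask}_to_{dst_grid}_{dst_mask}_{method}_{suffix}.nc"

# Reproduces the example above:
assert (mapping_filename("3x3min", "nomask", "0.9x1.25", "nomask")
        == "map_3x3min_nomask_to_0.9x1.25_nomask_aave_da.nc")
```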

Unless there's another way around this or people do not consider it an issue, @negin513 and I propose that we return to @ekluzek 's suggestion of recreating the old namelist under the covers before starting the fortran executable.

billsacks commented 3 years ago

Yes, I see your point. However, in thinking about your latest comment, I'm wondering if there may be a bigger issue here that we've been overlooking - or if I'm thinking about things wrong.

The issue I see is: How does mkmapdata.py know which source grid files to use? Your new schematic doesn't seem to address this question, but I think it needs to come from the namelist generated by gen_user_namelist.py, which the user may modify to point to different files.

So I think what we really need is for the first step, gen_user_namelist.py, to NOT generate a Fortran namelist, but instead to generate something that can be read by the next step in the python toolchain. I'd suggest a config (cfg) file because they are easy to work with by hand and in python and are part of the python standard library; we also use config files elsewhere in the LILAC toolchain.

So I think we might need something like the following:

(1) User invokes gen_user_namelist.py (which should be renamed to not have "namelist" in its name) to generate SOMETHING.cfg. This file contains at least two sections: One section gives the raw data file paths, and one or more sections give other user-modifiable inputs to mksurfdata_map.

(2) The user modifies anything they want in SOMETHING.cfg

(3) User invokes create_surface_data.py. This reads SOMETHING.cfg, and:

So in summary: Yes, I am coming to agree that there needs to be this translation step, but I'm also coming to realize that the user-modifiable file needs to be in a python-friendly format, not a Fortran-friendly format (which itself is a driver for needing a translation step).
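For step (3), reading such a file with the standard-library configparser could look like this (a sketch; the section and option names are made up for illustration):

```python
# Sketch of create_surface_data.py reading SOMETHING.cfg; section and
# option names are illustrative, not decided.
import configparser

cfg = configparser.ConfigParser()
cfg.read("SOMETHING.cfg")

raw_files = dict(cfg["rawdata"])             # raw data file paths
dst_mesh = cfg.get("grid", "dst_mesh_file")  # user-modifiable DST mesh
print(raw_files, dst_mesh)
```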

Does that make sense? Or am I thinking about this the wrong way?

ekluzek commented 3 years ago

@billsacks 's summary makes the most sense to me as well. It's also a nicely designed process that can easily be customized or simply run out of the box for standard production cases.

In one of the discussions I had with @negin513 and @slevisconsulting , we also talked about the fact that the SOMETHING.cfg file could be in a different format such as cfg, YAML, or JSON. I like @billsacks 's suggestion of cfg because it's used in LILAC and supported by the standard python library.

slevis-lmwg commented 3 years ago

I have updated the schematic (see slide 5) to reflect these preferences.

ekluzek commented 3 years ago

@slevisconsulting can you give us access to your slides?

slevis-lmwg commented 3 years ago

@billsacks 's comment above, about revisiting the metadata question before this comes to master, reminds me that I need to go back and modify the contents of the raw datasets to point to the nomask SRC files.

slevis-lmwg commented 3 years ago

Once I modify the raw datasets to point to the nomask SRC files, I will still NOT copy the new datasets to /inputdata until I hear otherwise.

slevis-lmwg commented 3 years ago

Done: the raw datasets now point to the nomask SRC files. The files are in /glade/campaign/cgd/tss/slevis/rawdata as before, but now organized in two subdirectories:

ekluzek commented 2 years ago

With #1663 this becomes obsolete. So closing this as a wontfix.