ESCOMP / CAM

Community Atmosphere Model

Make CAM output UGRID Compatible: Workflows for Unstructured Grids #468

Open andrewgettelman opened 3 years ago

andrewgettelman commented 3 years ago

Project Raijin is developing a Python xarray extension called 'uxarray' (https://github.com/UXARRAY/uxarray). It works with the UGRID standard for unstructured-grid metadata (https://ugrid-conventions.github.io/ugrid-conventions/).

It would facilitate the development of uxarray and improve our ability to use other common tools if we can get CAM (SE and MPAS) output compliant with the UGRID specifications.

I will also open a UXARRAY issue: https://github.com/UXARRAY/uxarray/issues/11

Initial discussion from @gold2718

Are you talking about adding full grid topology to every history file or the production of separate UGRID files for each SE and MPAS grid (referenced in each history file as a global attribute)? We could save both time and space if we collected UGRID files and used existing ones as an input to CAM (for documenting the history files). New UGRID files could be created either by the model or perhaps by Patrick's mesh tool (which might be a lot less total work). I suppose one advantage of having the history file be an official UGRID file is the ability to document (via the 'location' attribute) whether an MPAS variable is defined on a node or on the face. Of course, there is nothing stopping the uxarray project from following a global attribute pointing to the required UGRID file and using that.

From @erogluorhan :

Paul Ullrich was wondering whether the fact that ESMF can support UGRID conventions will make it pretty easy to have CAM grids UGRID compliant (https://earthsystemmodeling.org/docs/release/ESMF_8_0_1/ESMF_refdoc/node3.html#SECTION03028400000000000000). He also mentioned that he’ll look into making TempestRemap provide UGRID-compliant files.

andrewgettelman commented 2 years ago

Dear @brianpm, @gold2718 and @PeterHjortLauritzen ,

I ran into John Clyne at AGU, and we were discussing that it seems like it's time to get CAM to start outputting the information for the UGRID specification for unstructured meshes (e.g. SE and MPAS).  We discussed putting this information out on a separate file that would move with the history files. I know the file/information might already exist in mapping files: maybe it could just be copied. But it would be wise to have the file end up with the history output, since it is to be used for analysis.

Comments? I'd like to get this issue moving forward if we can.

Thanks!

Andrew

swrneale commented 2 years ago

I would say that it is most desirable to include grid information on every history file. However, this would be burdensome (especially from MPAS which has a lot of grid information) if you have a 3-hourly 2D field. Not sure what you mean about 'moving' with the history file?

andrewgettelman commented 2 years ago

Good question Rich. You are right that ideally this would be on every file, but that for high resolution meshes this would be burdensome. What I mean about moving is that it would be copied to the archive directory with the cam.h* files. A standard place that analysis codes could find the file without user intervention.

swrneale commented 2 years ago

That's great. We may want to start a (wait for it) repo where people can upload their grids and UGRID description. Maybe this exists already?

andrewgettelman commented 2 years ago

Fine if we want a repo to collect grids. But, given that people will have arbitrary grids that they create themselves, I think it would be wise if the model workflow copies whatever file is used as input into the location of the history files and it gets moved to an 'archive' directory.

IMHO the model workflow should make this transparent to the user so when an unstructured grid is run, the user will automatically get the right grid information in some file like .ugrid.nc or similar. At least that is what I was thinking. Happy to hear alternatives.

zarzycki commented 2 years ago

Unless the burden is overwhelming (and I'll admit to not being 100% up-to-speed on the UGRID conventions or thinking too hard about the storage burden), I would argue that the optimal solution is to have the grid information stored on each history file if at all possible.

While I agree that a special file (e.g., .ugrid.nc) could also work (and if it "moves" with the history files, then automated tools like diagnostic packages should have no problem finding it), for some boutique cases it is easy to envision someone (e.g., my students) messing up and only archiving some of an output deck and not including that file (this is somewhat analogous to how it can be difficult to track down PHIS fields for VR-CESM if a user doesn't output it in the history but also fails to archive bnd_topo).

I agree with @swrneale's concern about the full 3-D grid being attached to a file with a 2-D history stream, although would it be against UGRID conventions to only define the horizontal grid layout if no variables on that stream required 3-D? That would seem to be much more palatable from a redundancy (or lack thereof) perspective. If I am reading this correctly, perhaps something like:

if this h* stream contains only 2-D fields
    write Mesh2 topology
elseif any 3-D fields 
    write Mesh3D topology
else
    ???
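The stream-by-stream choice sketched above could be written as a small runnable function. This is only an illustration: `choose_topology` and the `field_ranks` argument are hypothetical names, the labels "Mesh2" and "Mesh3D" are taken from the pseudocode, and returning `None` for the open "???" branch (a stream with no gridded fields) is an assumption, not something the thread or the UGRID conventions settle.

```python
from typing import Iterable, Optional


def choose_topology(field_ranks: Iterable[int]) -> Optional[str]:
    """Pick which mesh topology to write for one history stream.

    field_ranks holds the spatial rank of each field on the stream:
    2 for horizontal-only fields, 3 for fields with a vertical dimension.
    """
    ranks = set(field_ranks)
    if not ranks:
        # The "???" branch: assumed here to mean no topology is written.
        return None
    if 3 in ranks:
        # Any 3-D field forces the full 3-D description.
        return "Mesh3D"
    # Only 2-D fields: the horizontal layout is enough.
    return "Mesh2"
```

Checking for 3-D fields first means a mixed stream always gets the fuller description, so only purely 2-D streams take the lighter option.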
andrewgettelman commented 2 years ago

How big is the UGRID information needed for a global 3.75km MPAS mesh with ~40M columns?

Fine (preferred) to put it on every file as Colin suggests if it's not too big.....my impression was it was potentially going to be big. But that might be wrong. Maybe one of the SE's can figure that out.
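A back-of-envelope estimate is possible without running the model. The sketch below is not a measurement; it assumes a mostly hexagonal Voronoi mesh (about two nodes per face, at most six nodes per face), 4-byte connectivity indices, 8-byte node longitudes and latitudes, and no compression. The function name and all parameters are hypothetical, and real MPAS files may pad the connectivity to a larger maximum or carry additional arrays.

```python
def ugrid_size_bytes(n_face, max_nodes_per_face=6, nodes_per_face_ratio=2.0,
                     index_bytes=4, coord_bytes=8):
    """Rough uncompressed size of the minimal 2-D UGRID arrays."""
    n_node = int(n_face * nodes_per_face_ratio)
    # face_node_connectivity: one row of node indices per face
    connectivity = n_face * max_nodes_per_face * index_bytes
    # node longitude and latitude
    node_coords = n_node * 2 * coord_bytes
    return connectivity + node_coords


n_face = 40_000_000  # ~40M columns, as in the question above
size = ugrid_size_bytes(n_face)
one_2d_field = n_face * 4  # a single float32 2-D field snapshot

print(f"grid arrays: {size / 1e9:.2f} GB")                    # grid arrays: 2.24 GB
print(f"ratio to one 2-D field: {size / one_2d_field:.0f}x")  # ratio to one 2-D field: 14x
```

Under these assumptions the grid description costs about as much as 14 snapshots of a single 2-D field, which matters most for high-frequency streams that carry only a few 2-D fields.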

erogluorhan commented 2 years ago

Hi all, just checking in: Was there any update on this?

andrewgettelman commented 2 years ago

From @gold2718

Does this apply to all CAM output or just the physics grid? I think this question really only applies to the SE PG (physgrid) runs since that is currently the only time the dycore is on a different grid.

Has anyone priced out the additional disk-space requirements? If the UXARRAY folks insist on the UGRID information being present in every file, we have to deal with the fact that the UGRID format is very inefficient for unstructured grids (as opposed to the ESMF format which eliminates unused vertices). This problem gets much smaller if we can reference a UGRID file in a global attribute. It would also save a lot of time writing history files.

andrewgettelman commented 2 years ago

@gold2718 , @mvertens and @brianpm

We did not map out the disk space requirements for this.

@erogluorhan : do you have an estimate based on the number of columns of how big the UGRID information is? We need to decide if we can put it in every file, or have a global attribute that specifies a file with grid information. We will need to sort that out first with you about what the specification is intending.

I'm assuming you have a solution already proposed in the API? Thanks!

andrewgettelman commented 2 years ago

Here is a link to the UXARRAY API: https://github.com/UXARRAY/uxarray/blob/main/docs/user_api/uxarray_api.md

swrneale commented 2 years ago

In a disk limited world it makes sense not to put it on every output file, as it seems it might be somewhat large. As Steve says maybe just point to a permanent source (zenodo? not NCAR) that can be served through python etc. and also produce the information once on a grid file.

brianpm commented 2 years ago

Having a separate, static file makes sense to me from a data volume perspective. To help users, I wonder whether an option to write fixed fields to a file in a case's output (maybe ".hf." files?) should be added. That'd be useful for other fixed fields, too, e.g., PHIS, LANDFRAC.

gold2718 commented 2 years ago

Having a separate, static file makes sense to me from a data volume perspective.

There is also a huge difference in the amount of SE effort it would take:

The second one is at least 50 times as much effort and will also slow down history output (which already takes too long).

andrewgettelman commented 2 years ago

You make a strong case Steve that we should use an external file for the information that is referred to in an attribute.

I think we will want CESM to output this file during initialization for every run: because grids are going to vary a lot from run to run. It can then get moved around with the history files.

Also: this means that anything on the atmospheric grid (e.g. land, land ice, etc) can easily use the same file.

zarzycki commented 2 years ago

I agree with Andrew that it would be really helpful for scientific reproducibility for that grid file to come at run time and get archived alongside the model output if we don't want to put the grid info on the history files themselves. Or at the very least have st_archive or whatever try to archive such a file with a simple cp/checkout/etc. step (versus having it generated online at runtime).

IMO, this is particularly important moving into an era of user-generated grids (it's probably not reasonable for NCAR to try and curate a repo of such grids and allowing users to just dump grids into a common repo opens another can of worms).

I think it would be much less of a burden on the user to archive output from a CESM case and know they'll have the UGRID file there, vs. go back two years later and realize the link in the attributes to an external repo is dead for whatever reason. It will also make life less of a headache when publishing given that journals are asking more and more for data to be available beyond just "contact the lead author for access"
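For illustration, the "simple cp" archiving step could look like the sketch below. It is hypothetical: `archive_grid_file` is not part of st_archive or any CESM tool, and the directory layout in the usage comment is made up.

```python
import shutil
from pathlib import Path


def archive_grid_file(grid_file, archive_dir):
    """Copy the grid file next to the archived history files.

    Returns the path of the archived copy. An existing copy is left
    untouched so that repeated archiving passes stay cheap.
    """
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / Path(grid_file).name
    if not target.exists():
        # copy2 also preserves timestamps, which helps provenance checks
        shutil.copy2(grid_file, target)
    return target


# Hypothetical usage:
# archive_grid_file("/path/to/grid.nc", "/path/to/archive/atm/hist")
```

Keeping the original file name means a reference stored in the history files can still be matched against the archived copy by name.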

erogluorhan commented 2 years ago

@gold2718 , @mvertens and @brianpm

We did not map out the disk space requirements for this.

@erogluorhan : do you have an estimate based on the number of columns of how big the UGRID information is? We need to decide if we can put it in every file, or have a global attribute that specifies a file with grid information. We will need to sort that out first with you about what the specification is intending.

I'm assuming you have a solution already proposed in the API? Thanks!

Andrew, we had already discussed this in our UXarray meetings and ended up expecting the grid file and history files to be separate from each other. Embedding grid info in history files would be fine for small datasets, but one of the main motivations behind Project Raijin is scalability, and I think we should prioritize much larger datasets, where we shouldn't embed grid info in every history file.

I am cc'ing @UXARRAY/uxarray-dev to let them know of this, too (apparently GitHub doesn't allow mentioning teams from outside the collaborators, though).

clyne commented 2 years ago

I just wanted to echo @erogluorhan 's comments that for large data a static file with the grid topology information would certainly be preferable. I also want to make one clarification: the heavyweight connectivity can certainly go in a separate file, but there is UGRID metadata information that should (must?) reside in the same file with the data. I believe that this can be limited in most cases to a single attribute per variable that provides the association between that variable and a grid (in the case that there might be more than one grid defined).
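To make the per-variable association concrete: in the UGRID conventions a data variable names its mesh topology variable through a `mesh` attribute and says where on the mesh it lives through a `location` attribute. The sketch below mimics that with plain dictionaries standing in for NetCDF variable attributes. The mesh name `physgrid_mesh` and the helper `group_by_mesh` are hypothetical; `T` and `PS` are just example history fields.

```python
def group_by_mesh(var_attrs):
    """Group data variables by the mesh named in their 'mesh' attribute."""
    groups = {}
    for var_name, attrs in var_attrs.items():
        mesh = attrs.get("mesh")
        if mesh is None:
            continue  # coordinate or non-gridded variable
        groups.setdefault(mesh, []).append((var_name, attrs.get("location")))
    return groups


# Stand-in for the variable attributes of one history file
history_vars = {
    "T": {"mesh": "physgrid_mesh", "location": "face"},
    "PS": {"mesh": "physgrid_mesh", "location": "face"},
    "time": {},
}

print(group_by_mesh(history_vars))
# {'physgrid_mesh': [('T', 'face'), ('PS', 'face')]}
```

With only this lightweight metadata in the history file, the heavy connectivity arrays can stay in the separate static file.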

FWIW, NOAA recently wrote a utility for converting GeoFLOW outputs to the UGRID format here.

If it would be helpful I can post an ncdump -h output of one of their data and static grid files.

andrewgettelman commented 2 years ago

Thanks John. Seems like we are all in agreement.

That's a useful point that the attribute is per variable so there can be multiple grids per file. I think that is going to be on the easier side to implement, but is still going to be a component + CESM initialization collaboration to get the right output and links to the files.

gold2718 commented 2 years ago

so there can be multiple grids per file.

Just a note, I'm not sure we can really handle this. The only time we have multiple grids per file is if we are running SE with a physgrid. In this case, we can use the standard UGRID file info for the physics grid but if you want to put dynamics output in the same file, I have no idea how we would get that UGRID info because CESM has no idea about the CAM dynamics grid. The SE dycore also has no current functionality for outputting UGRID format. The same is true of the CSLAM grid (which differs from physgrid for pg2 or pg4).

That said, for science runs, I think that all our history output comes from the physics grid.

clyne commented 2 years ago

I'm not saying that you should have multiple grids per file, only that the UGRID convention allows for this, hence the need for the grid identifier attribute. Sorry if I wasn't clear, @gold2718.

gold2718 commented 2 years ago

@clyne, I was clear on the issues, my concerns are internal to CAM. CAM currently has 5 dycores, each with a different internal grid and data structure. To output UGRID for all of them would be a lot of work. Fortunately, most of the time, the "grid" used by physics is the same as the dycore so we can write once there. The edge case is the Spectral Element fixed-resolution dycore where there may be two or three different grids operating during a run. Writing out UGRID information for all of them would entail a lot more software development time.

philipc2 commented 6 months ago

Following up on this.

The minimum required variables needed to work with UXarray would be the following:

I'd like to also mention that with UXarray, the data doesn't need to be in the UGRID conventions. We can convert MPAS, SCRIP, EXODUS, and ESMF grids to the UGRID conventions at the data-loading step. Some more information here.

There are plans this summer to support user-generated grids from point streams as well, such as Delaunay triangulation or Voronoi tessellations, which could be used to generate grids using just the model output.

zarzycki commented 6 months ago

Just spitballing...

We could tie a grid file (UGRID, Exodus, SCRIP, whatever) to the output stream global nc attributes. Perhaps "ux_grid_file: XXXXX" or something. Then have some small logic in uxarray that peeks at the attributes and if it sees that file, point to it. We have all of these files already for various tasks (mainly the coupler, but also creating map files, topo, etc.)

You wouldn't be adding the information directly to the file, but if you were analyzing the output on an NCAR machine (which I assume 98% of CAM users would be), uxarray could just automatically load the grid file from the repo. If you aren't on an NCAR machine (or if a file-existence check fails) and the user doesn't go and get this file, it could just throw an error that says "you need this grid file."
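A minimal sketch of that lookup logic, assuming the attribute is called `ux_grid_file` as floated above. Nothing here is existing uxarray behavior: `resolve_grid_file` and `search_dirs` are hypothetical, and `global_attrs` is any plain dictionary of global attributes, such as the one xarray exposes as `Dataset.attrs`.

```python
import os


def resolve_grid_file(global_attrs, search_dirs=()):
    """Return a usable path to the grid file named in the global attributes.

    The recorded path is tried first, then each directory in search_dirs
    (for example a local copy of the archive) using the same file name.
    """
    name = global_attrs.get("ux_grid_file")
    if name is None:
        raise KeyError("history file has no 'ux_grid_file' global attribute")
    candidates = [name]
    candidates += [os.path.join(d, os.path.basename(name)) for d in search_dirs]
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"you need this grid file: {name}")
```

The returned path could then be passed as the grid argument when opening the history files, and the search directories let an archived copy stand in when the original location is not reachable.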

gold2718 commented 6 months ago

@zarzycki, A couple of years back (scroll up), you suggested that the grid file be archived along with the history files. I like that idea and it would be more robust than uxarray having to go looking for the file. This could be part of the required input files to be staged before the run and added to the short-term archiver list of files to archive. There is already a CAM namelist item (cam_physics_mesh) for the physics grid info file, we could add others if necessary (e.g., for when dycore variables are output on history files).

The general issue of requiring grid information included in each history file has been discussed here since before the issue was created. Do we have a resolution? @philipc2, would this be an acceptable solution?

philipc2 commented 6 months ago

The general issue of requiring grid information included in each history file has been discussed here since before the issue was created

Yeah, having every history file contain the grid information would be less than ideal, since we only need one instance of the grid definition.

There is already a CAM namelist item (cam_physics_mesh) for the physics grid info file

Does this contain the grid information? If so, is there a way that these could be exported or have a path that refers to them? If you know the path (or have the file somewhere) of the grid file, you can load it into UXarray as follows, without needing every history file to contain grid information.

import uxarray as ux

grid_filepath = "/path/to/grid.nc"

# for multiple history files, can be a list of all history files
data_filepaths = ['a.nc', 'b.nc', 'c.nc']

uxds = ux.open_mfdataset(grid_filepath, data_filepaths)

Is there any way I could get a hold of the mesh files to do some testing?

adamrher commented 6 months ago

The cam_physics_mesh hasn't been generalized for all ATM grids (see namelist_defaults.xml), and it's also redundant since we can grab this from ccs_config, where all grids are defined (https://github.com/ESCOMP/CAM/issues/787).

As @gold2718 mentioned we have discussed adding history file metadata containing the inputdata path to the ESMF mesh file corresponding to the ATM grid you are running. I personally have always loved this idea; if you work with unstructured meshes, you do a lot of regridding, and are constantly referencing the grid files in the inputdata repo to generate weight files on the fly. CESM3 will be the first version of our model where the default ATM grid is an unstructured mesh. I'm anticipating that the CESM forums are going to blow up with questions like "where is the grid file for the case I'm running?" I get emails like this all the time from colleagues trying to get used to working with unstructured output.

philipc2 commented 6 months ago

As @gold2718 mentioned we have discussed adding history file metadata containing the inputdata path to the ESMF mesh file corresponding to the ATM grid you are running.

This is definitely something I could see being useful, especially if you are working on the same system that your history file was generated.

The history files are typically stored as NetCDF files, correct? If so, this could be as simple as adding an attribute that references the source grid and a path to the grid on the system.

gold2718 commented 6 months ago

it's also redundant since we can grab this from ccs_config, where all grids are defined (#787).

I'm not sure this should go here or in the other issue but while I agree that we can replace cam_physics_mesh with the required NUOPC mesh (cam_physics_mesh was developed during the MCT era), it only covers the atmosphere coupling mesh (the so-called 'physics' mesh that is used by CAM to communicate with the surface models through the NUOPC mediator). The SE dycore creates between 2 and 4 grids depending on the details of the simulation and any of them can be used to output fields to CAM history files. Since these grids are internal to CAM, they are not necessarily included in the ccs_config grid files.

One question is that since the bulk of history is written from the physics mesh and most (if not all) of the GLL grids are also represented in ccs_config, do we need to worry about others (e.g., the internal FVM grid) here? In any case, only the physics mesh will be available through the cap so we might still need another mechanism to get the correct mesh filenames into the history file and make sure they are available for post-processing.

gold2718 commented 6 months ago

@philipc2, yes CAM writes NetCDF history files and we would include any necessary grid file pointers as global attributes.

adamrher commented 6 months ago

The SE dycore creates between 2 and 4 grids depending on the details of the simulation and any of them can be used to output fields to CAM history files. Since these grids are internal to CAM, they are not necessarily included in the ccs_config grid files.

I think 3 grids is the max? ne30pg2 uses the GLL grid, the CSLAM grid (FVM 3x3) and the physics grid (FVM 2x2). Oh I guess if you ran WACCM-X you would have that ionosphere field line-based grid. And then HEMCO has its own "intermediate grid" which I really hope they do away with in the future.

I am comfortable with only providing the physics grid mesh file in the NetCDF attributes. For the most part only alpha users would ever look at output on the CSLAM or GLL grid. An exception might be if users need to generate a custom ncdata file, since the se dycore requires that to be on the GLL grid for all varieties (ne30np4, ne30pg2, ne30pg3). But I think we are covering 99% of all cases if we just provide the physics grid mesh.

I guess a WACCM-X person could empty_htapes and then populate the tapes with only ionosphere grid vars, in which case having the physics grid in the attribute would be misleading. A similar issue would occur if you set interpolate_output=T when running the se dycore. Perhaps we could just add an explicit statement in the attributes specifying that this is the computational physics grid, and is not guaranteed to match the grid in the history output.

gold2718 commented 6 months ago

I think 3 grids is the max?

Is the INI grid no longer needed (edge case for input from ncol dimension)? If so, we should remove that. However, this is getting into technical CAM territory so I'm moving most of my response to #787.

My vote would be to try to provide the correct mesh file for every variable on a history file and I think this is technically possible. If a simple interface is required (say a fixed global attribute name specifying a single mesh per file), we can restrict files to one-grid-per-file for production runs.

adamrher commented 6 months ago

Agreed to moving the cam_physics_mesh discussion over to https://github.com/ESCOMP/CAM/issues/787 since @gold2718 has identified that some of the meshes may not be in ccs_config.

We should probably open this grid attribute concept in a separate issue. Users need the mesh file for remapping, but in the meantime the interpolate_output namelist option maps history fields to a user-specified target lat-lon grid, and that is adequate for most. I would also like the grid attribute to be implemented in a correct and methodical way as @gold2718 describes. It realistically won't get the attention it needs until after the CESM3 release.

@philipc2 to your earlier question:

Is there any way I could get a hold of the mesh files to do some testing?

These are stored on the cesm inputdata server, on glade /glade/campaign/cesm/cesmdata/inputdata/share/meshes/