ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Update soils data used for surface dataset #1303

Closed wwieder closed 1 year ago

wwieder commented 3 years ago

It would be nice to update the soils data we're using to generate the surface dataset to something from this century. This will introduce a number of answer changes to the code, but it seems worth having a discussion about what we need here.

@dlawrenncar suggested using SoilGrids data, which just released a version 2.0 of their dataset https://doi.org/10.5194/soil-2020-65. SoilGrids2.0 contains information on soil texture, OC content, pH, bulk density, coarse fragments, CEC, and soil N at 250 m resolution for 6 soil layers (0-200 cm). This high resolution data also includes uncertainty estimates! According to the data providers, v2.0 has changed significantly from previous releases of the dataset, but is currently only available at 250m resolution.

Laura Poggio and Niels Batjes at ISRIC are interested in and willing to provide a coarser resolution data product for our purposes and wondered what we wanted. I've basically told them we'd like the whole dataset, but to prioritize texture and soil C information. Is a 5 km data product adequate for NWP applications while not being too unwieldy for climate simulations? Do we need 1 km resolution mapping files?

I also wondered if we should think about how to generate soil properties for the hillslope model. Does this happen in our own tool chain, or could it be generated in the mapping files from ISRIC? This is likely of secondary concern, but may be worth discussing.

mvertens commented 2 years ago

@ekluzek - I agree it would be good to have you there.

wwieder commented 2 years ago

Another option here: Niels Batjes responded to a query about regridding soils data and suggested the following:


...the last global dataset that we have generated using the spatial data of the HWSD linked to a set of derived soil properties per mapping unit (grid cell) is called WiSE30sec. This dataset was developed following the Breckenridge meeting in 2014.

See: Batjes NH 2016. Harmonised soil property values for broad-scale modelling (WISE30sec) with estimates of global soil carbon stocks. Geoderma 269, 61-68. http://dx.doi.org/10.1016/j.geoderma.2016.01.034

The dataset itself is available at the [ISRIC website](https://data.isric.org/geonetwork/srv/eng/catalog.search#/metadata/dc7b283a-8f19-45e1-aaed-e9bd515119bc)


I've downloaded these data to scratch and generated a .nc file from a .tif file that has information on mapping units. I'm somewhat inclined to use this data instead of the SoilGrids data, but we can discuss next week.
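
For reference, a rough sketch of the .tif-to-.nc step described above (the rioxarray dependency and the file names are illustrative, not the exact commands that were run):

```python
# Rough sketch of converting the mapping-unit GeoTIFF to NetCDF.
# Assumes rioxarray is installed; file names are illustrative.
import rioxarray

mu = rioxarray.open_rasterio("wise_30sec_mapunits.tif")   # dims: (band, y, x)
mu = mu.squeeze("band", drop=True)                        # single-band raster
mu = mu.rename({"y": "lat", "x": "lon"}).rename("MU")     # lat/lon naming convention
mu.to_netcdf("wise_30sec_v1.nc")
```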

All data can be found here: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/

If we're going to use this:

uturuncoglu commented 2 years ago

@wwieder Sorry for the late response. I am trying to create the SCRIP file now; it takes time. Does this file have a higher resolution than the previous one? Anyway, I'll let you know when I have something to update.

wwieder commented 2 years ago

Awesome. Thanks for the update


mvertens commented 2 years ago

@uturuncoglu - thanks so much for your help with this!


uturuncoglu commented 2 years ago

@mvertens sure. The SCRIP generation has been running for 5 hours and is still going. I hope it will finish within 12 hours. Unfortunately, there is no way to make it parallel since it uses NCL.

wwieder commented 2 years ago

I've combined the lookup table with soil properties and the gridded mapunits (MU) into a single .nc file: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1.nc

The soil properties lookup table has been modified to the 10 levels expected for CLM.

Note the WISE dataset has an additional dimension SCID in the lookup table for different soil types that are within each mapping unit. The first index of SCID corresponds to the dominant, or most common, soil profile within each mapping unit, so I suggest we take this one. Hopefully this doesn't add too much more complexity to @mvertens's code.
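
As a hedged illustration of the point above, selecting the dominant soil profile could look like this (the variable and dimension names follow the descriptions in this thread and may not match the file exactly):

```python
# Sketch: keep only the dominant (first) SCID index from the combined lookup table.
# Names (SCID, MapUnit, soil_layer, PCT_SAND, PCT_CLAY) are assumptions based on
# the thread's description of the file contents.
import xarray as xr

ds = xr.open_dataset("wise_30sec_v1.nc")

dominant = ds.isel(SCID=0)      # SCID index 0 = most common soil profile per map unit
sand = dominant["PCT_SAND"]     # dims: (MapUnit, soil_layer)
clay = dominant["PCT_CLAY"]
```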

I'm not sure if coastlines are being handled correctly when I ncview the file, but it seems OK when plotting in Python, so hopefully this isn't much of an issue.

image

Once @uturuncoglu has a mesh file, hopefully this works for @mvertens and @olyson .

uturuncoglu commented 2 years ago

@wwieder I am not sure why, but I could not generate the SCRIP file in 12 hours with the NCL approach. I'll try another way and see what happens.

wwieder commented 2 years ago

The updated notebook is here. @ekluzek and @dlawrenncar, is this something we want to keep around? If so, where should it go?

uturuncoglu commented 2 years ago

@wwieder I was able to create the SCRIP grid definition with the Python approach. Now I am trying to convert it to an ESMF mesh. I'll update you once it is done.
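
For anyone following along, a minimal sketch of what a Python-side SCRIP grid file for a regular lat/lon grid might look like (this is not the actual esmf_mesh.py or NCL workflow, just an illustration of the SCRIP variable layout):

```python
# Sketch: write a SCRIP-format grid file for a regular lat/lon grid.
# Illustrative only; not the workflow actually used in this thread.
import numpy as np
import xarray as xr

def scrip_from_regular_grid(lat, lon, out_file):
    """lat, lon: 1D cell-center coordinates of a regular, evenly spaced grid."""
    nlat, nlon = lat.size, lon.size
    dlat, dlon = np.diff(lat).mean(), np.diff(lon).mean()

    lon2d, lat2d = np.meshgrid(lon, lat)                 # shape (nlat, nlon)
    center_lat, center_lon = lat2d.ravel(), lon2d.ravel()

    # four corners per cell, counter-clockwise, from half-cell offsets
    corner_lat = np.stack([center_lat - dlat / 2, center_lat - dlat / 2,
                           center_lat + dlat / 2, center_lat + dlat / 2], axis=1)
    corner_lon = np.stack([center_lon - dlon / 2, center_lon + dlon / 2,
                           center_lon + dlon / 2, center_lon - dlon / 2], axis=1)

    xr.Dataset({
        "grid_dims": (("grid_rank",), np.array([nlon, nlat], dtype="i4")),
        "grid_center_lat": (("grid_size",), center_lat, {"units": "degrees"}),
        "grid_center_lon": (("grid_size",), center_lon, {"units": "degrees"}),
        "grid_imask": (("grid_size",), np.ones(nlat * nlon, dtype="i4")),
        "grid_corner_lat": (("grid_size", "grid_corners"), corner_lat, {"units": "degrees"}),
        "grid_corner_lon": (("grid_size", "grid_corners"), corner_lon, {"units": "degrees"}),
    }).to_netcdf(out_file)
```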

uturuncoglu commented 2 years ago

@wwieder Mesh generation is failing. I am not sure why, since there is no useful information in the output; probably it is due to memory. I tried to run it on 32 fat nodes with 1 MPI process on each. This was working before with the old file SoilGrids_mean_5000_merged, but not this time. Is this file higher resolution than the previous one? ESMF_Scrip2Unstruct is not good at handling memory (at this point it requires reading the entire SCRIP file in each MPI process). I'll open a ticket on the ESMF side, but this is a limitation we currently have.

mvertens commented 2 years ago

@uturuncoglu - thanks so much for the progress so far. As we move to higher and higher resolution grids and require mesh input, it does seem like having a parallel Scrip2Unstruct will be critical. We are stuck at this point.

uturuncoglu commented 2 years ago

@mvertens yes, that is a limitation at this point. The work on bringing PIO2 to ESMF might solve the issue, but it is in the initial stage and it could take time to have that capability in Scrip2Unstruct.

mvertens commented 2 years ago

@uturuncoglu - I'm going to try to set up a meeting today to see what the path forward would be. I believe we would have the same problem creating mapping files offline at this resolution, since there is a conversion to a mesh under the covers. Do you agree?

mvertens commented 2 years ago

@jedwards4b - it would be good to get your input on this as well - given the work you did on PIO2 in ESMF.

uturuncoglu commented 2 years ago

@mvertens sure. Could you set it up after 2pm, if that also works for you? I am also testing esmf_mesh.py to create the ESMF mesh file directly, so I'll let you know how it goes. Since it does not work for all cases, it might fail.

wwieder commented 2 years ago

Thanks all for your work on this. Don't we have a mesh file for another dataset of similar resolution already (I'm thinking of the DEM used for topography)? If so, how was this mesh file created, or is the DEM at a lower resolution?

mvertens commented 2 years ago

@wwieder - the ESMF group has proposed the following potential path forwards:

  1. Convert your soil texture data file to the CF Single Tile metadata convention, using this example from the ESMF reference manual. The reason is that this format can be read in parallel with any recent version of ESMF.
  2. Add an option to mksurfdata_esmf to ingest a CF file using this ESMF_GridCreate() API. This grid object can be used in the regridding just as the mesh is currently used. You should be able to omit the fileformat flag, and it should automatically detect the format.

This approach should allow you to make progress right away and can be implemented. If 1. is acceptable, I can do 2. I'll be meeting with Rocky and Jim at 3 today. Let me know if you want to join.

mvertens commented 2 years ago

@wwieder - I just spoke to @rsdunlapiv to get more clarification on their proposed idea. What we need is not for the whole file to be converted, but just a new file that has the lats, lons, lat_bnds, and lon_bnds in the format given by http://earthsystemmodeling.org/docs/nightly/develop/ESMF_refdoc/node3.html#SECTION03028300000000000000. If you could provide me with that new file, I would add the capability in mksurfdata_esmf to read in that file instead of the corresponding mesh file and generate the route handle, etc. using this new file. I think this would be the best approach moving forward. Please let me and @rsdunlapiv know if this sounds okay to you.

wwieder commented 2 years ago

Sure, I think this should be easy enough to do. One comment and then a few questions:

  1. Where do you want the associated lookup table to go?
  2. What are lat_bnds and lon_bnds? From the example it looks like they are just lat (or lon) + a bound dimension (which is 2). Do you need any data in that dimension? If so, is it the midpoint between the lat and lon values? Is there a simple way to generate this using ESMF (or other) tools?
rsdunlapiv commented 2 years ago

@wwieder on question 2, the lat_bnds and lon_bnds will be used to populate the corner coordinates surrounding each grid point in the ESMF Grid. The corners are required in order to do conservative regridding because you need to be able to compute cell areas. @oehmke do you have any suggestions for computing these? If not, perhaps computing a mid point is a reasonable approach.
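
A minimal sketch of the midpoint approach suggested above (assuming a 1D, regularly ordered coordinate; the first and last bounds are extrapolated by half a grid spacing):

```python
# Sketch: compute [left, right] bounds for 1D cell centers by placing bounds
# halfway between adjacent centers and extrapolating at the two ends.
import numpy as np

def midpoint_bounds(centers):
    edges = np.empty(centers.size + 1)
    edges[1:-1] = 0.5 * (centers[:-1] + centers[1:])           # interior midpoints
    edges[0] = centers[0] - 0.5 * (centers[1] - centers[0])    # extrapolated first edge
    edges[-1] = centers[-1] + 0.5 * (centers[-1] - centers[-2])
    return np.column_stack([edges[:-1], edges[1:]])            # shape (n, 2)

# e.g. lat_bnds = midpoint_bounds(ds["lat"].values)
```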

wwieder commented 2 years ago

@rsdunlapiv and @mvertens see if this is what you need: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_grid.nc

The lookup table is here: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_lookup.nc

wwieder commented 2 years ago

One more note: the way I did this, lon_bnds has values > 180 and < -180. Is this OK, or do I need to adjust these edge values?

mvertens commented 2 years ago

@wwieder - thanks for this. I don't think we need the separate lookup table file. What I am planning to do is use your original data file, but use the new grid file you generated instead of the mesh file. @oehmke and @rsdunlapiv - is it okay to have lon_bnds > 180 and < -180?

oehmke commented 2 years ago

It is OK in the sense that the Grid and Mesh can represent something with lon_bnds > 180 or < -180. However, if you have something with bounds > 180 and < -180 and it overlaps itself, then you may get strange regridding results, e.g. a frac > 1.0 in a conservative regrid for that area.


mvertens commented 2 years ago

@oehmke - thanks for clarifying. @wwieder - it sounds like we don't want this to be the case. I'll start with using the grid file you generated just to see if I can get this working - but I think we will need a new grid file without the lon_bnds > 180 and < -180. Do you agree?

wwieder commented 2 years ago

OK, so instead of a value of lon_bnds = 180.1, I should modify this to lon_bnds = -179.9? Will this end up with a correct grid, mesh, and weighting, @oehmke?

wwieder commented 2 years ago

Assuming the suggestion above is what's needed, I've modified this file, @mvertens, so that all lon_bnds satisfy -180 <= lon_bnds <= 180.

updated file is here: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_grid.nc
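
The wrap described above amounts to something like the following (a sketch; values exactly at +180 are left alone so the east edge is not flipped to -180):

```python
# Sketch: map any longitude bound outside [-180, 180] back into range,
# e.g. 180.1 -> -179.9, while leaving an exact +180 untouched.
import numpy as np

def wrap_lon(values):
    v = np.asarray(values, dtype=float)
    wrapped = ((v + 180.0) % 360.0) - 180.0
    return np.where(np.isclose(v, 180.0), 180.0, wrapped)
```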

oehmke commented 2 years ago

Is this a global grid? If so, you want the east points to be exactly the same as the west points along the edge, so there isn’t a gap.


mvertens commented 2 years ago

@wwieder - there does seem to be a gap, in that lon(1) = -179.9958333 and lon(43201) = 180.00416522. If you could regenerate this dataset with no gap between the east and west points, I'll give it a try.

wwieder commented 2 years ago

@mvertens I think these are nearly identical points, at least to the 5th decimal place: 180.00416522 - 360 = -179.99583477334994

Is this value >180 E causing issues? I'll try keeping all lon values (not just their bounds) < 180 and >-180.

I also realized that the previous bounds I was providing were not the midpoints between each grid cell. This has also been corrected.

Now the first and last values for lon and lon_bnds are as follows: lon[0] = -179.99583333335, lon[-1] = -179.99583477334994

lon_bnds[0,0] = 180.0, lon_bnds[-1,0] = 179.99999856000005

These obviously aren't identical, but are they close enough now?


wwieder commented 2 years ago

One potential issue here is that now the lon array is not monotonically increasing any more. Is this going to be a problem? Based on the spacing between lons, there should be 43200 values in the longitude array, so the last one seems to be unnecessary or a duplicate (currently len(lon) = 43201).

All mapping unit values for the last lon coordinate in the original dataset are NaNs, making me think we should remove the last element of this coordinate @mvertens?

Thus, I removed what seems to be an extra value in the lon array. Now the left bound of lon[0] is (nearly) the same as the right bound of lon[-1], as @oehmke suggested.

lon[0] = -179.99583333335, lon[-1] = 179.99583189335004

lon_bnds[0,0] = 180.0, lon_bnds[-1,1] = 179.99999856000005

spacing between lons = 0.00833333330001551
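
For reference, dropping the redundant last longitude column could be done along these lines (a sketch; the output file name is illustrative):

```python
# Sketch: remove the final longitude column, whose center duplicates the first
# center modulo 360 and whose mapping-unit values are all NaN.
import numpy as np
import xarray as xr

ds = xr.open_dataset("wise_30sec_v1_grid.nc")
lon = ds["lon"].values

# wrap-aware check that the last center coincides with the first (mod 360)
diff = (lon[-1] - lon[0] + 180.0) % 360.0 - 180.0
if np.isclose(diff, 0.0, atol=1e-4):
    ds = ds.isel(lon=slice(0, -1))              # drop the duplicate column

ds.to_netcdf("wise_30sec_v1_grid_trimmed.nc")   # illustrative output name
```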

wwieder commented 2 years ago

@mvertens let me know if this is what you need.

rsdunlapiv commented 2 years ago

@rokuingh please review this thread and see if you can help with defining the lat/lon bounds. We could also use some detail in the reference manual around how to define these bounds to ensure periodic dimensions are closed (no gap), including how many total points are expected. Please take a look and see if you can add a small section to the ref doc on how to define the lat/lon bounds. This could include info on tolerances as well -- when are two points considered the same?

rokuingh commented 2 years ago

@rsdunlapiv, will do. @wwieder I believe the tolerance for point matching is 1e-8, but I will inspect the code more closely to verify that.

mvertens commented 2 years ago

@wwieder - thanks. I'll give this a try today.

mvertens commented 2 years ago

@wwieder - there is a problem ingesting /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_grid.nc since it is missing required attributes. The required format is the following:

netcdf single_tile_grid {
dimensions:
    time = 1 ;
    bound = 2 ;
    lat = 181 ;
    lon = 360 ;
variables:
    double lat(lat) ;
        lat:bounds = "lat_bnds" ;
        lat:units = "degrees_north" ;
        lat:long_name = "latitude" ;
        lat:standard_name = "latitude" ;
    double lat_bnds(lat, bound) ;
    double lon(lon) ;
        lon:bounds = "lon_bnds" ;
        lon:long_name = "longitude" ;
        lon:standard_name = "longitude" ;
        lon:units = "degrees_east" ;
    double lon_bnds(lon, bound) ;
    float so(time, lat, lon) ;
        so:standard_name = "sea_water_salinity" ;
        so:units = "psu" ;
        so:missing_value = 1.e+20f ;
}

The key point is that you need to set the 'units' attribute: "The cell center coordinate variables are determined by the value of its attribute units. The longitude variable has the attribute value set to either degrees_east, degree_east, degrees_E, degree_E, degreesE or degreeE. The latitude variable has the attribute value set to degrees_north, degree_north, degrees_N, degree_N, degreesN or degreeN."
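
Adding those attributes is a one-liner per coordinate, e.g. (a sketch with xarray; the output file name is illustrative):

```python
# Sketch: attach the coordinate attributes ESMF requires, then rewrite the file.
import xarray as xr

ds = xr.open_dataset("wise_30sec_v1_grid.nc")
ds["lat"].attrs.update({"units": "degrees_north", "long_name": "latitude",
                        "standard_name": "latitude", "bounds": "lat_bnds"})
ds["lon"].attrs.update({"units": "degrees_east", "long_name": "longitude",
                        "standard_name": "longitude", "bounds": "lon_bnds"})
ds.to_netcdf("wise_30sec_v1_grid_attrs.nc")   # illustrative output name
```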

wwieder commented 2 years ago

Sorry @mvertens, the attributes are now in the metadata for lat and lon.

Let me know if lat_bnds and lon_bnds need metadata too.

mvertens commented 2 years ago

@wwieder - thanks. Yes, lat and lon need to have metadata also. I added them in my working copy - but I'd like to use only your version if I commit something. The following seems to get me past the read:

variables:
    double MU(lat, lon) ;
        MU:_FillValue = -32768. ;
        MU:scale_factor = 1. ;
        MU:add_offset = 0. ;
        MU:long_name = "WISE30sec MapUnit" ;
    double lon(lon) ;
        lon:_FillValue = NaN ;
        lon:units = "degrees_east" ;
        lon:bounds = "lon_bnds" ;
    double lat(lat) ;
        lat:_FillValue = NaN ;
        lat:units = "degrees_north" ;
        lat:bounds = "lat_bnds" ;
    int64 bound(bound) ;
    double lon_bnds(lon, bound) ;
        lon_bnds:_FillValue = NaN ;
    double lat_bnds(lat, bound) ;
        lat_bnds:_FillValue = NaN ;

mvertens commented 2 years ago

@wwieder - I am making progress on reading in the mapunit grid file and generating a route handle, but it's very computationally expensive. I've spoken to @dlawrenncar about a plan to have a coarser mapunit dataset (say at 5 minutes) that we can use for coarser resolution output - and I am trying to generate that now. I had to make several modifications to the grid dataset (including making MU an integer field) in order to get the mask set correctly. I have two questions:

  1. I am confused by the lookup table, /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_lookup.nc. It specifies the maximum number of mapunits to be 16413, but in the MU dataset the mapunits go from 79 -> 32056. So once I generate a new interpolated mapunit file, how can I reconcile values when I try to access the lookup table for mapunits that are greater than 16413?
  2. In the previous soil texture data, PCT_SAND and PCT_CLAY were two-dimensional variables (number_of_layers, max_value_mapunit), but in the new lookup table these are three-dimensional variables (MapUnit, SCID, soil_layer), where SCID is the number of the soil unit within the given map unit. I'm not sure what output should be generated from this in terms of the surface dataset. Will PCT_SAND and PCT_CLAY now be 4D variables in the generated surface dataset? Currently they are 3D variables (lsmlon, lsmlat, nlevsoi).
mvertens commented 2 years ago

@wwieder, @dlawren, @olyson, @slevis, @ekluzek - I think the mapping of mapunits for the new ultra-high-resolution dataset from @wwieder is working correctly.

Please have a look at the following files:

mvertens commented 2 years ago

@wwieder - to take the next step in generating PCT_SAND and PCT_CLAY I'll need the answers to my questions above clarified.

wwieder commented 2 years ago

@mvertens thanks for working on this, I'm glad you were able to generate the mesh.

For your questions:

  1. I understood the maximum number of mapunits to be the length of the MU array, not the maximum value of the integers within that array. From what I can tell, both the gridded distribution of map units and the corresponding lookup table have the same length and max/min values. It also looks like the mapunit values are not necessarily continuous, so the gridded MU values need to be matched to lookup-table entries by ID rather than used directly as indices (see the sketch below).
  2. As noted above (there's lots noted above in this issue), the WISE dataset has an additional dimension, SCID, in the lookup table for the different soil types within each mapping unit. The first index of SCID corresponds to the dominant, or most common, soil profile within each mapping unit, so I suggest we take that one. Hopefully this doesn't add too much more complexity to @mvertens's code.

Does this help you move forward?
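
A sketch of the ID-based lookup mentioned in point 1 (the name "MU" for the ID variable in the lookup file is an assumption):

```python
# Sketch: look up soil properties by map-unit ID rather than array index,
# since the IDs run from 79 to 32056 but there are only ~16413 entries.
import xarray as xr

lookup = xr.open_dataset("wise_30sec_v1_lookup.nc").isel(SCID=0)  # dominant profile

mu_ids = lookup["MU"].values                          # assumed name of the ID variable
index_of = {int(m): i for i, m in enumerate(mu_ids)}  # map-unit ID -> row index

def sand_profile(mu_value):
    """PCT_SAND per soil layer for one gridded map-unit value (looked up by ID)."""
    return lookup["PCT_SAND"].isel(MapUnit=index_of[int(mu_value)]).values
```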

mvertens commented 2 years ago

@wwieder - thanks, that's very helpful. @slevisconsulting and I talked about this today and I think we have a plan for how to move forward with this.

wwieder commented 2 years ago

@mvertens new files for you to look at are here: /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_lookup.nc and /glade/scratch/wwieder/wise_30sec_v1/WISE30sec/Interchangeable_format/wise_30sec_v1_grid.nc

In these new files:

The documentation on the raw data states that in the lookup table a number of values may indicate no soil (text copied below). This results in a bunch of missing values in the new lookup table.

Hopefully this is OK, but the tricky part is still how we handle this in the vertical. Specifically, in the raw data, when lower soil horizons hit regolith (rocky subsoils), these layers get -7, now _FillValues in the lookup table. I need to work on an efficient way of handling this on my side, OR is it simple enough to extend the bottom 'good' soil property into the missing values? From a data provenance perspective it seems slightly better to handle this in the regridding script, but if this is overly complicated I'll deal with it on my side.

Besides this outstanding issue about how to handle missing data in the vertical, I wondered if you can efficiently assess if we're handling everything else correctly?

Here's the note from Appendix 7 of the dataset documentation: "A -8 indicates that no meaningful substitution was possible for the specified ‘soil unit/climate’ cluster and attribute using TTR based on the present selection of soil profiles, -1 is used for Oceans and inland waters, -2 for Glaciers and snow caps, -3 for rock outcrops (resp. -7 for ‘rocky’ subsoils as for Leptosols), -4 for Dunes/Shifting sands, -5 for Salt flats, and -9 for all remaining miscellaneous units mainly to facilitate visualization using GIS."

mvertens commented 2 years ago

@wwieder - thanks for the new file. I think the simplest thing to do in the new surface dataset generation is, wherever a negative value is encountered, to extend the last 'good' soil value down to the bottom, as you mentioned. I don't think that is difficult to do.

wwieder commented 2 years ago

Sounds good @mvertens. To clarify, are you suggesting that the surface dataset code will handle the 'fill missing values at depth' issue?

mvertens commented 2 years ago

@wwieder - yes. What I am proposing is that for all negative values the first good soil layer (i.e. non-zero) value is used for all subsequent layers in that column. Does that make sense?

wwieder commented 2 years ago

Yes, I would say that all negative values inherit the last good (non-zero) value.

e.g. 26,23,21, 5,-9,-9,-9 becomes 26,23,21, 5, 5, 5, 5.
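
That rule is simple to implement; a sketch of the fill for one column:

```python
# Sketch: any negative (missing/regolith) value inherits the last good value above it.
import numpy as np

def fill_negatives_down(profile):
    """profile: 1D soil-property column ordered top -> bottom."""
    out = np.array(profile, dtype=float)
    for k in range(1, out.size):
        if out[k] < 0:
            out[k] = out[k - 1]      # carry the last good value downward
    return out

print(fill_negatives_down([26, 23, 21, 5, -9, -9, -9]))
# -> [26. 23. 21.  5.  5.  5.  5.]
```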

There may be other cases where the most common soil mapping unit is all missing values. In this case, I suggest we select the next most common mapping unit that contains non-missing values.

mvertens commented 2 years ago

@wwieder - thanks. I have not put in the capability to determine the next most common mapping unit. That would take some work. We can add that as an issue.