DOI-USGS / lake-temperature-model-prep


GCM pipeline - NetCDF formatting of output #252

Closed lindsayplatt closed 2 years ago

lindsayplatt commented 2 years ago

An initial implementation of NetCDF output was created, but there are a few concerns that are causing us to rethink the approach.

  1. It was built from scratch, so we are deviating from established standards and would need to rework the format when we go to release. Instead, we should use functions from ncdfgeom to write out the NetCDF file for timeseries spatial data (example use of these functions here in lakesurf-data-release; a rough sketch of what that could look like for our data is below this list).
  2. Looping through each variable and each cell to populate the data in the NetCDF file (see this commit, only on Lindsay's fork here) will be slow and isn't necessary.
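To make point 1 concrete, here's a rough, untested sketch of what writing with ncdfgeom could look like: one write_timeseries_dsg() call per driver variable, appending to the same file. gcm_data and cell_centroids are placeholder names for whatever the real targets end up being, and units are left blank.

# Untested sketch -- not pipeline code. Assumes one munged data frame per GCM
# (gcm_data: time, cell, and one column per driver) plus a table of cell
# centroids (cell_centroids: cell, lat, lon).
library(dplyr)
library(tidyr)
library(ncdfgeom)

drivers <- c("Shortwave", "Longwave", "AirTemp", "RelHum", "Rain", "Snow", "WindSpeed")
times <- as.POSIXct(sort(unique(gcm_data$time)), tz = "UTC")
cells <- cell_centroids$cell

nc_file <- "7_drivers_munge/out/GCM_ACCESS.nc"

for (i in seq_along(drivers)) {
  # reshape to one column per cell (in cell order), one row per timestep
  var_wide <- gcm_data %>%
    select(time, cell, value = all_of(drivers[i])) %>%
    pivot_wider(names_from = cell, values_from = value) %>%
    arrange(time)
  var_wide <- as.data.frame(var_wide[, as.character(cells)])

  write_timeseries_dsg(
    nc_file = nc_file,
    instance_names = as.character(cells),
    lats = cell_centroids$lat,
    lons = cell_centroids$lon,
    times = times,
    data = var_wide,
    data_unit = "",                 # real units per variable
    data_prec = "double",
    data_metadata = list(name = drivers[i], long_name = drivers[i]),
    add_to_existing = (i > 1)       # first call creates the file
  )
}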

We also need to consider how to extract the information, since we want to create individual files for each GCM, time period, and cell to use in our lake-temperature-process-models repo (this splitting will happen over in that pipeline).

Going into this process, we have feather files for each tile (groups of cells) that look like this:

> read_feather("7_drivers_munge/tmp/7_GCM_ACCESS_1980_1999_tile37_daily.feather")
# A tibble: 14 x 9
   time        cell Shortwave Longwave AirTemp  RelHum       Rain       Snow WindSpeed
   <date>     <dbl>     <dbl>    <dbl>   <dbl>   <dbl>      <dbl>      <dbl>     <dbl>
 1 1980-01-02  3996      57.1     57.1  -5.24  0.00231 0          0               3.57
 2 1980-01-03  3996      48.1     48.1  -6.73  0.00182 0          0               2.98
 3 1980-01-04  3996      45.6     45.6  -2.45  0.00296 0          0               3.64
 4 1980-01-05  3996      40.3     40.3   0.437 0.00375 0.00000625 0               2.64
 5 1980-01-06  3996      68.6     68.6  -5.09  0.00258 0.0000675  0.000454        4.53
 6 1980-01-07  3996      91.9     91.9 -12.9   0.00110 0          0.00000417      2.58
 7 1980-01-08  3996      88.7     88.7  -7.65  0.00174 0          0               4.65
 8 1980-01-09  3996      92.2     92.2  -2.84  0.00240 0          0               3.24
 9 1980-01-10  3996      80.4     80.4  -1.87  0.00245 0          0               2.28
10 1980-01-11  3996      89.6     89.6  -1.23  0.00267 0          0               2.47
11 1980-01-12  3996      41.4     41.4   0.613 0.00393 0.00000542 0.00176         3.32
12 1980-01-13  3996      44.1     44.1  -0.805 0.00354 0          0.0000458       3.60
13 1980-01-14  3996      54.2     54.2  -2.90  0.00291 0          0               2.45
14 1980-01-15  3996      28.9     28.9  -2.52  0.00310 0.00000917 0.000925        3.00

We want a NetCDF file with the cell id, lat/lon, time, and variables (Shortwave, Longwave, AirTemp, RelHum, Rain, Snow, WindSpeed) represented in the dimensions. We will want 6 total NetCDF files - one for each of the GCMs. I really struggle with visualizing NetCDF files, but I think we need sets of cubes for each variable? The trick is that we want to be able to subset the NetCDF by cell and by time, but never by variable because we need them all. So separating them in that way doesn't seem as helpful.

Here is how I am picturing the NetCDF being structured, given how I currently know how to write it.

image
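In code terms, that picture is roughly the following (a sketch of the from-scratch ncdf4 approach, not pipeline code; the counts are placeholders):

# Sketch only: a cell dimension, a time dimension, and one variable per driver
library(ncdf4)

n_cells <- 20     # placeholder
n_days  <- 7305   # placeholder (1980-1999)

cell_dim <- ncdim_def("cell", units = "", vals = seq_len(n_cells))
time_dim <- ncdim_def("time", units = "days since 1980-01-01", vals = seq_len(n_days) - 1)

driver_vars <- lapply(
  c("Shortwave", "Longwave", "AirTemp", "RelHum", "Rain", "Snow", "WindSpeed"),
  function(v) ncvar_def(v, units = "", dim = list(cell_dim, time_dim), missval = -999)
)

nc <- nc_create("structure_sketch.nc", driver_vars)
nc_close(nc)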

And here is how I want to use the data. Seeing it like this makes me think that maybe the format in which it is stored should change to match how we want to access it? Not sure.

image

lindsayplatt commented 2 years ago

I explored the NetCDF file that was created by the example code shared here and tried to mock up an example of how I want to extract data. For every cell, I need to write out a single file with a subset in the time dimension (since we will have 3 non-continuous time periods represented). If every call to ncvar_get() is a bit expensive, I am concerned about this approach. So, is there a different way to explore the data, or a different way we should pull it out and then split it?

# Explore Jared & Jordan's nc file, 02_weather_N40-53_W67-98.nc
# From https://www.sciencebase.gov/catalog/item/60341c3ed34eb12031172aa6

library(ncdf4)
library(dplyr)   # for mutate()/select() used below

nc_file <- '02_weather_N40-53_W67-98.nc'
nc <- nc_open(nc_file)
all_vars <- names(nc$var)
# Remove the lat/lon variables so only the weather variables remain
all_vars <- all_vars[!grepl("lat|lon", all_vars)]

weather_id_info <- ncatt_get(nc, "weather_id")
weather_id <- ncvar_get(nc, "weather_id")

# I don't understand this dimension
# Plus ncvar_get() fails on it, so wrap it in try() for now
weather_id_char_info <- ncatt_get(nc, "weather_id_char")
weather_id_char <- try(ncvar_get(nc, "weather_id_char"))

time_info <- ncatt_get(nc, "time")
time <- ncvar_get(nc, "time")
date <- as.Date(gsub("days since ", "", time_info$units)) + time

# For one cell, pull all the variables and a subset of times
pick_out_one_cell <- "nldas_x398_y158"
cell_i <- which(weather_id == pick_out_one_cell)
current_time_range_i <- which(date >= as.Date('1990-01-01') & date < as.Date('2000-01-01'))

data_one_cell <- purrr::map_dfc(all_vars, function(var) {
  ncvar_get(nc, var, 
            start = c(cell_i, current_time_range_i[1]),
            count = c(1, length(current_time_range_i)))
  }) %>% setNames(all_vars) %>% 
  mutate(time = date[current_time_range_i]) %>% 
  select(time, everything())

nc_close(nc)

head(data_one_cell)
# A tibble: 6 x 6
  time       dlwrfsfc dswrfsfc tmp2m ugrd10m vgrd10m
  <date>        <dbl>    <dbl> <dbl>   <dbl>   <dbl>
1 1990-01-01     308.    102.   275. -0.502   -0.403
2 1990-01-02     309.     99.3  275. -0.407   -0.481
3 1990-01-03     312.     95.6  276. -0.386   -0.555
4 1990-01-04     315.     84.8  276. -0.244   -0.567
5 1990-01-05     309.     86.2  275. -0.140   -0.490
6 1990-01-06     306.     86.8  274. -0.0762  -0.554
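One alternative I'm considering (untested sketch, reusing nc_file, all_vars, weather_id, date, and current_time_range_i from the snippet above): do a single ncvar_get() per variable for the full time window and split by cell in memory, instead of one call per cell.

# Untested sketch: one ncvar_get() per variable (count = -1 reads every cell),
# then split by cell in memory
library(dplyr)

nc <- nc_open(nc_file)
all_data <- purrr::map(all_vars, function(var) {
  ncvar_get(nc, var,
            start = c(1, current_time_range_i[1]),
            count = c(-1, length(current_time_range_i)))
}) %>% setNames(all_vars)
nc_close(nc)

# pull any one cell out of memory without touching the file again
get_cell <- function(cell_id) {
  i <- which(weather_id == cell_id)
  purrr::map_dfc(all_data, ~ .x[i, ]) %>%
    mutate(time = date[current_time_range_i]) %>%
    select(time, everything())
}

head(get_cell("nldas_x398_y158"))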
jordansread commented 2 years ago

Two thoughts:

For point 1: see this dataset: image vs NLDAS (from here, and see the image above to compare to a full copy): image

For point 2: The "discrete sampling geometry" (NetCDF DSG) layout that we've used elsewhere (the EA-LSTM release and the forecasting project) is meant to cover situations where the locations of the data are irregular (e.g., reach sites, lake sites, stations) or are a sparse subset of a uniform grid (e.g., the EA-LSTM "weather" data, which only has data for cells that contain lakes - a < 40%(?) subset of the total dataset).

It makes sense to use this approach if we want a standard way of presenting the data regardless of grid source - for example, having the same format for the NLDAS data as for the downscaled GCM data. It also makes sense because I think it makes it easier to apply to a different GDP-available dataset in the future.

But I think you'll want to look carefully at the formats here. If you use the NetCDF DSG format, you may not end up reducing the size that much compared to a cube subset of the native dataset itself, and DSG perhaps makes the format a tiny bit harder to use. And if you don't use the DSG format (e.g., you use the cube-like layout in the comment above), you may not want to alter the host dataset much beyond downloading the spatial subset you want (OPeNDAP subset algorithm), creating new variable(s) from maths on existing variable(s), and perhaps combining time periods so each GCM has one file. I wanted to point this out because if a lot of the current challenge comes from taking the native dataset to text and then using that text to rebuild a NetCDF file that may closely resemble the original, you could consider alternatives.

lindsayplatt commented 2 years ago

This could suggest that we're doing something close to copying and re-assembling the dataset when we pull it down

I think I am a bit stuck after reading this, because we did reconstruct the GCM grid exactly and then matched lakes to the grid cells to use as our query polygons. So there may not be as many holes within the states, but it would only be pulling data from cells inside of our footprint. I don't see much of an advantage to making the grid cells smaller than the GCM grid cells themselves - wouldn't the smaller cells just contain multiple copies of the same data anyway?


I had never heard of "discrete sampling geometry," but that does sound like the format we should use. Right now, we are populating the entire cube-like grid with NAs for cells that weren't part of our query. I think it makes more sense to only include the cells that did have data returned.

jordansread commented 2 years ago

I think I wasn't clear on the cell topic - I'm not advocating for resampling the dataset to a smaller cell size (agreed with you that this would replicate values and not provide new information). Instead, I was making the observation that if the final dataset is, say, a high percentage of the cells from the original NetCDF hosted on THREDDS, you may want to question rebuilding it locally as opposed to downloading the cube from THREDDS and using that as your NetCDF file.

I'm going to take a shot at illustrating this with a hypothetical but unrealistic project footprint, using some graphics we have on hand (vs cooking up something more accurate). If the red box is our final project boundary (it isn't - our real boundary is much larger) and the blue cells represent cells with our lakes, the two cases below cover the situation where re-building the dataset may not make sense (top) and the case where it does make sense (bottom).

image

In the top case, we're only saving ~5% of the cells, and it is a lot easier to download the subset of the original file and use that instead (a la the OPeNDAP subset, which allows us to subset the much larger dataset in space/time). In the bottom case, the original format doesn't make sense because it would mostly contain information we're not using (grey cells) that we'd either populate with NAs or leave in as the original data. In that case, using DSG would save a ton on file size and make the data release a bit easier to use.
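For reference, the "just subset the hosted dataset" route would look something like this - the URL, variable name, and index window below are placeholders, not the actual GDP/THREDDS endpoint:

# Rough sketch only -- placeholder OPeNDAP URL and variable name
library(ncdf4)

opendap_url <- "https://example-thredds.usgs.gov/thredds/dodsC/downscaled_gcm_access.nc"
nc <- nc_open(opendap_url)

# example index window (placeholders); start/count index into the remote
# array, so only this x/y/time hyperslab comes over the wire
x_start <- 40; x_count <- 25
y_start <- 60; y_count <- 30
t_start <- 1;  t_count <- 7305

airtemp_subset <- ncvar_get(nc, "airtemp",
                            start = c(x_start, y_start, t_start),
                            count = c(x_count, y_count, t_count))
nc_close(nc)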

lindsayplatt commented 2 years ago

Talked with Jordan on the phone.

Outcome: we will move forward with the GCM pipeline in this repo to get raw downscaled GCM data from GDP and munge it into useable GLM variables, saved as NetCDF DSG files for each of the GCMs.

Approach: continue with the current download approach and focus attention on getting the munged, GLM-ready values into the NetCDF DSG file to be picked up by the lake-temperature-process-models pipeline. Accept the potential slowness for now, but revisit speeding this process up when we scale up beyond our 5 test sites (ideas captured in #258).

hcorson-dosch-usgs commented 2 years ago

Okay I've been digging into this issue on Friday and today, and I'm running into a snag. I've looped Lindsay in, and she and I both keep getting the same error that we can't seem to get around.

I referenced your code, @jread-usgs - here, where you prep the data/info for and then call ncdfgeom::write_timeseries_dsg()

Our situation is a little bit simpler in that we're reading the data to be prepped from a .feather file rather than a netCDF, but nonetheless I've very carefully reviewed the function parameter descriptions for write_timeseries_dsg(), to make sure that all the prepared info/data is the right length/dimensions.

If you try to build the gcm_nc target with tar_make(gcm_nc), you get this error:

Error : length(count) == ndims is not TRUE

To get a bit more detail, you can drop a browser call here, build the target with tar_make(gcm_nc, callr_function=NULL), and then manually make the call to write_timeseries_dsg() specified in the code within the browser call, which gets the following error:

Error in var.put.nc(nc, pkg.env$dsg_timeseries_id, instance_names) : 
  length(count) == ndims is not TRUE

That seems to point to this line, but it's in reference to the global variable dsg_timeseries_id, which is set here, and isn't set by us. Note that the current version of the write_timeseries_dsg() function (and possibly all previous versions? Lindsay went digging and found this commit where it was added, but it was never an argument) doesn't accept instance_dim_name or dsg_timeseries_id as arguments, even though you specified them as arguments here.

Any idea what might be the issue, @jread-usgs ?

_Note: if you want to try to run this yourself, you'll need to drop these files in the 7_drivers_munge/out_skipnc folder. I've got a hacky workaround going at the moment to reconstruct the gcm_data_daily_feather file targets, since I didn't want to repeat Lindsay's work of downloading that data from GDP in order to build the files myself._

jordansread commented 2 years ago

Sorry! My fork of that pkg includes those args for flexibility: https://github.com/jread-usgs/ncdfgeom

jordansread commented 2 years ago

I will pull this down, use that breakpoint and see if I can help.

hcorson-dosch-usgs commented 2 years ago

Oh, quick update -- I just got a file to write by changing instance_names to a character vector of names, even though the docs say a numeric or character vector is fine. I went back to where the dsg_timeseries_id is added as a variable. There, its type is hard-coded to "NC_CHAR", so I thought I'd try passing instance_names as a character vector.
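In other words (gcm_cell_ids is just a stand-in for however we're storing the cell ids):

# passing the numeric cell ids straight through as instance_names errors with
# "length(count) == ndims is not TRUE"; coercing them first writes fine
instance_names <- as.character(gcm_cell_ids)   # e.g. "3996" instead of 3996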

hcorson-dosch-usgs commented 2 years ago

Will check the file now

hcorson-dosch-usgs commented 2 years ago

image image

jordansread commented 2 years ago

Lovely file!

hcorson-dosch-usgs commented 2 years ago

Thanks! Any ideas @jread-usgs about how we can get the grid crs information and parameters into the final netCDF while using this write_timeseries_dsg() function? Based on the function code, you can set the long_name for the lats and lons vectors, but the standard names and units for those dimensions are fixed (to 'latitude' and 'degrees north'/'degrees east'). What those values really represent are the x and y coordinates of the grid cell centroids in the projected coordinate space, and the units are meters, not degrees. We could maybe add all of the grid_params information as global attributes? That wouldn't fix the units and naming of the lon and lat coordinates, but at least the info would be there. I'll try that now.
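Something like this is what I'm about to try (a sketch - the values below are placeholders, not our real grid parameters):

# Sketch: tack the grid parameters onto the file written by
# write_timeseries_dsg() as global attributes (placeholder values)
library(RNetCDF)

nc <- open.nc("7_GCM_ACCESS.nc", write = TRUE)
att.put.nc(nc, "NC_GLOBAL", "crs",      "NC_CHAR",   "+proj=lcc ... (placeholder)")
att.put.nc(nc, "NC_GLOBAL", "cellsize", "NC_DOUBLE", 25000)
att.put.nc(nc, "NC_GLOBAL", "xmin",     "NC_DOUBLE", -1234500)
att.put.nc(nc, "NC_GLOBAL", "ymin",     "NC_DOUBLE", -987650)
att.put.nc(nc, "NC_GLOBAL", "nx",       "NC_INT",    135L)
att.put.nc(nc, "NC_GLOBAL", "ny",       "NC_INT",    121L)
close.nc(nc)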

hcorson-dosch-usgs commented 2 years ago

That would look like so: image

jordansread commented 2 years ago

See ncdfgeom for the recent merge of the more flexible arguments we're using here.

I'd also suggest raising an issue there since the docs imply numbers can be used for instance name.

hcorson-dosch-usgs commented 2 years ago

Okay, great. That definitely improves the file. Still not sure what would be best for the spatial info, though. I've added the crs info and grid params to the global attributes section, but the lat and lon vectors (populated with the projected coordinates, in meters, of the grid cell centroids) still have attribute info that isn't correct. I can control what's populated for long_name (circled in red), but the units and standard_name values are hard-coded, so they couldn't be changed without a change to the package itself. image

I'll get started on that issue now -- I want to see if I can track down exactly which line triggers the error if the instance_names vector isn't a character vector.

lindsayplatt commented 2 years ago

the units and standard_name values are hard coded

Do we need to convert our lat/long to decimal degrees instead, to match the standard?
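If so, the conversion itself seems simple enough - something like this sketch, where centroids_proj and grid_crs are made-up placeholders, not our real grid:

# Sketch: convert projected grid-cell centroids to WGS84 decimal degrees
library(sf)

centroids_proj <- data.frame(cell = c(3996, 3997),
                             x = c(150000, 175000),   # meters, placeholder
                             y = c(950000, 950000))
grid_crs <- 5070   # placeholder EPSG code for a projected CRS (CONUS Albers)

centroids_sf <- st_as_sf(centroids_proj, coords = c("x", "y"), crs = grid_crs)
centroids_dd <- st_transform(centroids_sf, crs = 4326)

lonlat <- st_coordinates(centroids_dd)
lons <- lonlat[, "X"]
lats <- lonlat[, "Y"]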

jordansread commented 2 years ago

We may want to ask Blodgett for input on this. I think some of the hard-coded elements are to make this conform to the NetCDF CF (climate and forecasting) standard. Perhaps the lat/lon is part of this, perhaps not. In my recent PR to that package, for example, I exposed a couple of hard-coded variables so we could set other values for them.

hcorson-dosch-usgs commented 2 years ago

Okay issue submitted

lindsayplatt commented 2 years ago

I think we can push on for now since, ultimately, we want to get GLM modeled projections into the hands of our collaborators, and this NetCDF file is an intermediate file. Perhaps the issue will be resolved before we are packaging up our GLM results.

Is there anything else not working with this NetCDF? Or are those hard-coded elements the only sticking point?

lindsayplatt commented 2 years ago

@hcorson-dosch I see your issue to ncdfgeom about the docs and how instance_names only works when you input a character vector :tada: I think we also wanted to submit a second issue about the hard-coded units and standard_name elements, to understand what we can do so that those match our data.

jordansread commented 2 years ago

But note that the units of lat/lon may be hardcoded for a reason - they are defined in the standard/convention: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#latitude-coordinate

hcorson-dosch-usgs commented 2 years ago

Nope, everything's working with the NetCDF. I did look a bit into the hard-coded latitude and longitude attributes (units and standard_name). Those are in line with NetCDF CF conventions. However, there are additional NetCDF CF convention variable standard names that we could use, namely projection_x_coordinate and projection_y_coordinate, which were used in the original Notaro netCDFs that we were working with initially, and which have canonical units of meters, like our projected coordinates.

The write_timeseries_dsg() function doesn't currently have a mechanism for adding variables with these standard names, but I did some testing yesterday evening, and I can add them to the netCDF created by write_timeseries_dsg() after the fact, as iy and jx variables. Currently those are showing up in the finished NetCDF as variables rather than as coordinates (which is how the lat and lon variables appear), so I was going to do a bit more testing this morning to see if I can figure out why.

Of course, with this approach, we'd either have duplicate values (with incorrect attribute info) in the lat and lon variables, or we'd need to convert the lat and lon values to decimal degrees. But that might make for a confusing file if we're also providing the crs and grid params as global attributes.

To only use jx and iy instead of lat and lon we'd need to modify the write_timeseries_dsg() function. And I'm not certain if that would conform to the standard -- I'll do some more digging.

Here's what that modified file looks like at the moment: image

hcorson-dosch-usgs commented 2 years ago

From the NetCDF CF conventions (see section 5):

To geo-reference data horizontally with respect to the Earth, a grid mapping variable may be provided by the data variable, using the grid_mapping attribute. If the coordinate variables for a horizontal grid are not longitude and latitude, then a grid_mapping variable provides the information required to derive longitude and latitude values for each grid location. If no grid mapping variable is referenced by a data variable, then longitude and latitude coordinate values shall be supplied in addition to the required coordinates. For example, the Cartesian coordinates of a map projection may be supplied as coordinate variables and, in addition, two-dimensional latitude and longitude variables may be supplied via the coordinates attribute on a data variable. The use of the axis attribute with values X and Y is recommended for the coordinate variables (see Chapter 4, Coordinate Types).

lindsayplatt commented 2 years ago

OK, trying to interpret what that means. It doesn't sound like it is specifying a unit, just that you need to be able to gather the information necessary to use the coordinates? So the units should be able to differ from the current hard-coded value? Is that what you are gleaning from that statement, @hcorson-dosch?

hcorson-dosch-usgs commented 2 years ago

Not exactly - we can't change the attributes of the latitude and longitude coordinate variables, as that would violate conventions. It sounds like we need to provide a separate set of coordinates, as separate variables. Working on adding a grid_mapping variable at the moment, and reading more about whether we'll need to provide DD lat/lon values in addition to the projected coordinate values. The conventions are a bit complex to parse.

lindsayplatt commented 2 years ago

OK, two things before we go too much further:

  1. How much do we think these attributes will change how we pull data out of the file in the lake-temperature-process-models pipeline? Wondering if we can move ahead on that piece while we continue to work out the final attributes for this file.
  2. Can we ping Blodgett soon to get some of these questions answered? He has probably explored these docs more than we have, and it may be faster to just reach out.
hcorson-dosch-usgs commented 2 years ago

Agreed, but I think it makes sense to merge #276 before wrapping up this work enough to get started on pulling data out in lake-temperature-process-models, since the data I'm working with at the moment is incorrect and is for the wrong set of cells, which would mean implementing some workarounds when trying to pull it from the netCDFs in lake-temperature-process-models. Seems cleaner to just get this code working with the new data and targets first.

hcorson-dosch-usgs commented 2 years ago

Okay, so here's where I've landed for now:

Given that the 'lon' and 'lat' variables we're adding to the NetCDF via write_timeseries_dsg() are not actually decimal degree longitude/latitude values, but projected coordinates (and therefore have incorrect attributes), I'm taking the approach of providing a grid mapping variable that specifies the crs, and then using RNetCDF::var.rename.nc() to rename those lon/lat variables to jx/iy, with the correct attributes for projected coordinates. I think this is in line with NetCDF CF conventions, per the final sentence of this snippet from section 5.6, though the conventions are a bit convoluted/hard to parse, so I'm not yet 100% certain:

If no grid mapping variable is provided and the coordinate variables for a horizontal grid are not longitude and latitude, then it is required that the latitude and longitude coordinates are supplied via the coordinates attribute. Such coordinates may be provided in addition to the provision of a grid mapping variable, but that is not required.

So instead of having lon and lat as coordinates, with iy and jx added as variables: image

We'd just have iy and jx as variables, and have the grid mapping specified for the variables: image

I think that's an okay placeholder for now, and we can get Dave Blodgett's input for how we might want to refine this, particularly for when we construct the final NetCDF files for the data release.
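For reference, here's a rough sketch of that post-processing step (grid parameter values are placeholders, not the real Notaro values; I'm assuming the coordinate variables come out of write_timeseries_dsg() named lat/lon as in the screenshots, and that RNetCDF handles the define-mode switching):

# Sketch: rename the mis-labeled coordinate variables and attach a grid
# mapping variable (placeholder grid parameters)
library(RNetCDF)

nc <- open.nc("7_GCM_ACCESS.nc", write = TRUE)

# rename the coordinate variables and fix their attributes
var.rename.nc(nc, "lon", "jx")
var.rename.nc(nc, "lat", "iy")
att.put.nc(nc, "jx", "standard_name", "NC_CHAR", "projection_x_coordinate")
att.put.nc(nc, "jx", "units",         "NC_CHAR", "m")
att.put.nc(nc, "iy", "standard_name", "NC_CHAR", "projection_y_coordinate")
att.put.nc(nc, "iy", "units",         "NC_CHAR", "m")

# scalar grid mapping variable that carries the crs definition
var.def.nc(nc, "rcm_map", "NC_CHAR", NA)
att.put.nc(nc, "rcm_map", "grid_mapping_name",             "NC_CHAR",   "lambert_conformal_conic")
att.put.nc(nc, "rcm_map", "longitude_of_central_meridian", "NC_DOUBLE", -97)
att.put.nc(nc, "rcm_map", "latitude_of_projection_origin", "NC_DOUBLE", 45)

# point each driver variable at the grid mapping
for (v in c("Shortwave", "Longwave", "AirTemp", "RelHum", "Rain", "Snow", "WindSpeed")) {
  att.put.nc(nc, v, "grid_mapping", "NC_CHAR", "rcm_map")
}
close.nc(nc)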

lindsayplatt commented 2 years ago

Just had a discussion w Jordan & Hayley. Rather than go against current projections:

hcorson-dosch-usgs commented 2 years ago

Alright I'm making these updates to the draft netCDF code in #281.

One question, @jread-usgs and @lindsayplatt -- do we want to include some or all of the GCM grid parameters (crs, cellsize, xmin, ymin, nx, ny) in the global attributes of the netCDF, or should we only provide that information in the SB metadata?

image

jordansread commented 2 years ago

I'd say you could keep them in right now, since they aren't dims or values, but it's probably a question for Blodgett before finalizing in a data release. My concern is that this may confuse users of the dataset and imply these data are in that CRS (wait, is that the right CRS for the meter grid, looks like it is lat/lon?) when they aren't.

But I'm thinking this decision isn't a blocker and could be tagged as a "come back to later".

hcorson-dosch-usgs commented 2 years ago

(wait, is that the right CRS for the meter grid, looks like it is lat/lon?)

Yup, that's the right crs, as defined by the grid_mapping variable in the biased GCM data. We're still waiting on the grid mapping attribute for the unbiased GCM data (#273). We want to confirm they are the same.

Okay I'll leave these attributes in the global attributes for now. I'll make an issue to finalize the netCDF format before the data release is completed, and note this there.

hcorson-dosch-usgs commented 2 years ago

Here's the grid_mapping attribute from the biased Notaro netCDF files, rcm_map: image

hcorson-dosch-usgs commented 2 years ago

Note - for now, to reduce confusion, I did modify the long_name attribute (the non-standardized attribute) to note that the lat/lon coordinates we are providing are in WGS84: image

hcorson-dosch-usgs commented 2 years ago

Closed by #281. Format will be finalized before the data release, per #291. Content may change as a result of #273. Compression code may change per #294.