I explored the NetCDF file that was created by the example code shared here and tried to mock up an example of how I want to extract data. For every cell, I need to write out a single file with a subset in the time dimension (since we will have 3 non-continuous time periods represented). If every call to ncvar_get() is a bit expensive, I am concerned about this approach. So, is there a different way to explore the data or a different way we should pull it out and then split?
# Explore Jared & Jordan's nc file, 02_weather_N40-53_W67-98.nc
# From https://www.sciencebase.gov/catalog/item/60341c3ed34eb12031172aa6
library(ncdf4)
library(dplyr) # for %>%, mutate(), select() used below
nc_file <- '02_weather_N40-53_W67-98.nc'
nc <- nc_open(nc_file)
all_vars <- names(nc$var)
# Remove lat/lon
all_vars <- all_vars[!grepl("lat|lon", all_vars)]
weather_id_info <- ncatt_get(nc, "weather_id")
weather_id <- ncvar_get(nc, "weather_id")
# I don't understand this dimension
# Plus it fails
weather_id_char_info <- ncatt_get(nc, "weather_id_char")
weather_id_char <- ncvar_get(nc, "weather_id_char")
time_info <- ncatt_get(nc, "time")
time <- ncvar_get(nc, "time")
date <- as.Date(gsub("days since ", "", time_info$units)) + time
# For one cell, pull all the variables and a subset of times
pick_out_one_cell <- "nldas_x398_y158"
cell_i <- which(weather_id == pick_out_one_cell)
current_time_range_i <- which(date >= as.Date('1990-01-01') & date < as.Date('2000-01-01'))
data_one_cell <- purrr::map_dfc(all_vars, function(var) {
ncvar_get(nc, var,
start = c(cell_i, current_time_range_i[1]),
count = c(1, length(current_time_range_i)))
}) %>% setNames(all_vars) %>%
mutate(time = date[current_time_range_i]) %>%
select(time, everything())
nc_close(nc)
head(data_one_cell)
# A tibble: 6 x 6
time dlwrfsfc dswrfsfc tmp2m ugrd10m vgrd10m
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990-01-01 308. 102. 275. -0.502 -0.403
2 1990-01-02 309. 99.3 275. -0.407 -0.481
3 1990-01-03 312. 95.6 276. -0.386 -0.555
4 1990-01-04 315. 84.8 276. -0.244 -0.567
5 1990-01-05 309. 86.2 275. -0.140 -0.490
6 1990-01-06 306. 86.8 274. -0.0762 -0.554
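(For comparison — a rough sketch, not tested against this exact file: since the example above indexes variables as [cell, time], one option is to read each variable once for the whole time window and all cells, then split by cell in memory, so ncvar_get() is called once per variable rather than once per cell. The trade-off is holding the full cell-by-time block in memory at once.)

```r
library(ncdf4)
library(dplyr)

nc <- nc_open(nc_file)
n_cells <- length(weather_id)
time_i <- current_time_range_i

# One matrix per variable: rows = cells, columns = dates in the window
all_data <- purrr::map(all_vars, function(var) {
  ncvar_get(nc, var,
            start = c(1, time_i[1]),
            count = c(n_cells, length(time_i)))
}) %>% setNames(all_vars)
nc_close(nc)

# Split into one tibble per cell, e.g. to then write one file per cell
data_by_cell <- purrr::map(seq_len(n_cells), function(cell_i) {
  purrr::map_dfc(all_data, ~ .x[cell_i, ]) %>%
    mutate(time = date[time_i]) %>%
    select(time, everything())
}) %>% setNames(weather_id)
```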
Two thoughts:

centroids_sf, which is all lakes (not just all lakes with depth), so the coverage of lakes may be more sparse, and will definitely be more sparse for other states in the footprint, like Iowa. See (this dataset) vs NLDAS (from here, and see the image above to compare to a full copy).
For point 2: The "discrete sampling geometry" (NetCDF DSG) layout of the NetCDF file that we've used elsewhere (the EA-LSTM release and the forecasting project) is meant to cover situations where the locations of the data are irregular (e.g., reach sites, lake sites, stations) or are a sparse subset of a uniform grid (e.g., the EA-LSTM's "weather" data, which only has data for cells that contain lakes, which is a < 40%(?) subset of the total dataset). It makes sense to use this approach if we want a standard way of presenting the data regardless of grid source - for example, having the same format for the NLDAS data as for the downscaled GCM data. It also makes sense because I think it makes it easier to apply to a different GDP-available dataset in the future. But I think you'll want to look carefully at the formats here. If you use the NetCDF DSG format, you may not end up reducing the size that much compared to the cube subset of the native dataset itself, and DSG perhaps makes the format a tiny bit harder to use. And if you don't use the DSG format (e.g., you use the cube-like layout in the comment above), you may not want to alter the host dataset much beyond downloading the spatial subset you want (OPeNDAP subset algorithm), creating new variable(s) from maths on existing variable(s), and perhaps combining time periods so each GCM has one file. I wanted to point this out because if a lot of the current challenge comes from converting the native dataset to text and then using that text to build a NetCDF file that may closely resemble the original file, you could consider alternatives.
This could suggest that we're doing something close to copying and re-assembling the dataset when we pull it down.
I think I am a bit stuck after reading this because we did reconstruct the GCM grid exactly and then match lakes to the gridcells to use as our query polygons. So there may not be as many holes within the states but it would only be pulling data from cells inside of our footprint. I don't think I see much of an advantage to making the grid cells smaller than the GCM gridcells themselves - wouldn't the smaller cells just contain multiple copies of that same data anyways?
I had never heard of "discrete sampling geometry" but that does sound like the format we should use. Right now, we are populating the entire cube-like grid with NAs for cells that weren't part of our query. I think it makes more sense to only include the ones that did have data returned.
I think I wasn't as clear on the cell topic - I'm not advocating for resampling the dataset to have a smaller cell size (agreed with you this would replicate values and not provide new information). Instead, I was asking the question or making the observation that if the final dataset is, say, a high percentage of the cells from the original NetCDF that is hosted on THREDDS, you may want to question rebuilding it locally as opposed to downloading the cube from THREDDS and using that as your NetCDF file.
I'm going to take a shot at illustrating this with a hypothetical but unrealistic project footprint using some graphics we have on hand (vs cooking up something more accurate). If the red box is our final project boundary (it isn't, our real boundary is much larger) and the blue cells represent cells with our lakes, these two cases below cover the situation when re-building the dataset may not make sense (top one) and the other case where it does make sense (bottom one).
In the top case, we're only saving ~5% of cells and it is a lot easier to download the subset of the original file and use that instead (a la OPeNDAP subset, which allows us to subset the much larger dataset in space/time). In the bottom case, the original format doesn't make sense because it would mostly contain information we're not using (grey cells) that we'd either populate with NAs or leave in the original data. In that case, using DSG would save a ton on file size and make the data release a bit easier to use.
Talked with Jordan on the phone.
Outcome: we will move forward with the GCM pipeline in this repo to get raw downscaled GCM data from GDP and munge it into useable GLM variables, saved as NetCDF DSG files for each of the GCMs.
Approach: continue on with the current download approach and focus attention on getting the munged, GLM-ready values into the NetCDF DSG file to be picked up by the lake-temperature-process-models pipeline. Accept the potential slowness for now but revisit speeding this process up when we go to scale up beyond our 5 test sites (ideas captured in #258).
Okay I've been digging into this issue on Friday and today, and I'm running into a snag. I've looped Lindsay in, and she and I both keep getting the same error that we can't seem to get around.
I referenced your code, @jread-usgs - here, where you prep the data/info for and then call ncdfgeom::write_timeseries_dsg(). Our situation is a little bit simpler in that we're reading the data to be prepped from a .feather file rather than a netCDF, but nonetheless I've very carefully reviewed the function parameter descriptions for write_timeseries_dsg() to make sure that all the prepared info/data is the right length/dimensions.
If you try to build the gcm_nc target with tar_make(gcm_nc), you get this error:
Error : length(count) == ndims is not TRUE
To get a bit more detail, you can drop a browser call here, build the target with tar_make(gcm_nc, callr_function=NULL), and then manually make the call to write_timeseries_dsg() specified in the code within the browser call, which gets the following error:
Error in var.put.nc(nc, pkg.env$dsg_timeseries_id, instance_names) :
length(count) == ndims is not TRUE
That seems to point to this line, but it's in reference to the global variable dsg_timeseries_id, which is set here, and isn't set by us. Note that the current version of the write_timeseries_dsg() function (and possibly all previous versions? Lindsay went digging and found this commit where it was added, but it was never an argument) doesn't accept instance_dim_name or dsg_timeseries_id as arguments, even though you specified them as arguments here.
Any idea what might be the issue, @jread-usgs ?
_Note: if you wanted to try to run this yourself, you'll need to drop these files in the 7_drivers_munge/out_skipnc folder. I've got a hacky workaround going at the moment to reconstruct the gcm_data_daily_feather file targets, since I didn't want to repeat Lindsay's work to download that data from GDP in order to build the files myself._
Sorry! My fork of that pkg includes those args for flexibility https://github.com/jread-usgs/ncdfgeom
I will pull this down, use that breakpoint and see if I can help.
Oh quick update -- I just got a file to write by changing the type of instance_names to a character vector of names, even though it says a numeric or character vector is fine. I went back to where the dsg_timeseries_id is added as a variable. There, its type is hard-coded to "NC_CHAR", so I thought I'd try passing instance_names as a character vector.
Will check the file now
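(For reference, a minimal sketch of the kind of call that ended up working — argument names follow the ncdfgeom documentation, and the inputs below are hypothetical stand-ins for the real pipeline targets.)

```r
library(ncdfgeom)

# Hypothetical prepped inputs (stand-ins for the real pipeline targets):
#   cell_ids             - ids of the lake-containing grid cells
#   cell_lats, cell_lons - cell centroid coordinates
#   cell_times           - vector of time steps
#   var_data             - data.frame with one column per cell, one row per time step
write_timeseries_dsg(
  nc_file        = "7_drivers_munge/out/gcm_test.nc",   # hypothetical path
  # coercing to character is what got the file to write; the timeseries id
  # variable is stored as NC_CHAR
  instance_names = as.character(cell_ids),
  lats           = cell_lats,
  lons           = cell_lons,
  times          = cell_times,
  data           = var_data,
  data_unit      = rep("W m-2", ncol(var_data)),        # hypothetical unit
  data_prec      = "double",
  data_metadata  = list(name = "Shortwave", long_name = "Shortwave radiation")
)
```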
Lovely file!
Thanks! Any ideas @jread-usgs about how we can get the grid crs information and parameters into the final netCDF while using this write_timeseries_dsg() function? Based on the function code, you can set the long_name for the lats and lons vectors, but the standard names and units for those dimensions are fixed to 'latitude' and 'degrees north'/'degrees east'. What those values really represent are the x and y coordinates in the projected coordinate space of the grid cell centroids. And the units are meters, not degrees. We could maybe add all of the grid_params information as global attributes? That wouldn't fix the units and naming of the lon and lat coordinates, but at least the info would be there. I'll try that now.
That would look like so:
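(As an illustrative sketch only — adding the grid parameters as global attributes with ncdf4 could look something like this; the file path and values below are placeholders.)

```r
library(ncdf4)

nc <- nc_open("7_drivers_munge/out/gcm_test.nc", write = TRUE)  # hypothetical path

# varid = 0 writes a global attribute; values below are placeholders
ncatt_put(nc, 0, "crs", "+proj=lcc +lat_1=33 +lat_2=45 +units=m ...")
ncatt_put(nc, 0, "cellsize", 25000)
ncatt_put(nc, 0, "xmin", -1000000)
ncatt_put(nc, 0, "ymin", -500000)
ncatt_put(nc, 0, "nx", 100)
ncatt_put(nc, 0, "ny", 80)

nc_close(nc)
```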
See ncdfgeom for the recent merge of the more flexible arguments we're using here. I'd also suggest raising an issue there, since the docs imply numbers can be used for instance names.
Okay, great. That definitely improves the file. Still not sure what would be best for the spatial info, though. I've added the crs info and grid params to the general attributes section, but the lat and lon vectors (populated with the projected coordinates, in meters, for the grid cell centroids) still have attribute info that isn't correct. I can control what's populated for long_name (circled in red), but the units and standard_name values are hard-coded, so they couldn't be changed w/o a change to the package itself.
I'll get started on that issue now -- I want to see if I can track down exactly which line triggers the error if the instance_names vector isn't a character vector.
the units and standard_name values are hard coded
Do we need to convert our lat/long to decimal lat/long instead? To match the standard?
We may want to ask Blodgett for input on this. I think some of the hard-coded elements are to make this conform to the NetCDF CF (climate and forecasting) standard. Perhaps the lat/lon is part of this, perhaps not. In my recent PR to that package, for example, I exposed a couple of hard-coded variables so we could set other values for them.
Okay issue submitted
I think we can push on for now since, ultimately, we want to get GLM modeled projections into the hands of our collaborators and this NetCDF file is an intermediate file. Perhaps the issue will be resolved before we are packaging up our GLM results.
Is there anything else not working with this NetCDF? Or are those hard-coded elements the only sticking point?
@hcorson-dosch I see your issue to ncdfgeom about the docs and how instance_names only works when you input a character vector :tada: I think we wanted to submit a second issue about the hard-coded elements units and standard_name to understand what we can do so that those match our data.
But note that the units of lat/lon may be hardcoded for a reason - they are defined in the standard/convention: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#latitude-coordinate
Nope, everything's working with the NetCDF. I did look a bit into the hard-coded latitude and longitude attributes (units and standard_name). Those are in line with NetCDF CF conventions. However, there are additional NetCDF CF convention variable standard names that we could use, namely projection_x_coordinate and projection_y_coordinate, which were used in the original Notaro netCDFs that we were working with initially, and which have canonical units of meters, like our projected coordinates. The write_timeseries_dsg() function doesn't currently have a mechanism for adding variables with these standard names, but I did some testing yesterday evening, and I can add them to the netCDF created by write_timeseries_dsg() after the fact, as iy and jx variables. Currently those are showing up in the finished NetCDF as variables, rather than as coordinates, which is how the lat and lon variables appear, so I was going to do a bit more testing this morning to see if I can figure out why. Of course, with this approach, we'd either have duplicate values (with incorrect attribute info) in the lat and lon variables, or we'd need to convert the lat and lon values to decimal degrees. But that might make for a confusing file if we're also providing the crs and grid params as global attributes.
To only use jx and iy instead of lat and lon, we'd need to modify the write_timeseries_dsg() function. And I'm not certain if that would conform to the standard -- I'll do some more digging.
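(A rough sketch of that "add after the fact" step with ncdf4 — the file path, coordinate vectors, and the assumption that the instance dimension is named 'instance' are all placeholders.)

```r
library(ncdf4)

nc <- nc_open("7_drivers_munge/out/gcm_test.nc", write = TRUE)  # hypothetical path

# Assuming the DSG instance dimension is named 'instance'
instance_dim <- nc$dim[["instance"]]

jx_var <- ncvar_def("jx", units = "m", dim = instance_dim, prec = "double",
                    longname = "x coordinate of grid cell centroid")
iy_var <- ncvar_def("iy", units = "m", dim = instance_dim, prec = "double",
                    longname = "y coordinate of grid cell centroid")
nc <- ncvar_add(nc, jx_var)
nc <- ncvar_add(nc, iy_var)

# cell_x_coords / cell_y_coords: projected centroid coordinates (meters)
ncvar_put(nc, "jx", cell_x_coords)
ncvar_put(nc, "iy", cell_y_coords)
ncatt_put(nc, "jx", "standard_name", "projection_x_coordinate")
ncatt_put(nc, "iy", "standard_name", "projection_y_coordinate")

nc_close(nc)
```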
Here's what that modified file looks like at the moment:
From the NetCDF CF conventions (see section 5):
To geo-reference data horizontally with respect to the Earth, a grid mapping variable may be provided by the data variable, using the grid_mapping attribute. If the coordinate variables for a horizontal grid are not longitude and latitude, then a grid_mapping variable provides the information required to derive longitude and latitude values for each grid location. If no grid mapping variable is referenced by a data variable, then longitude and latitude coordinate values shall be supplied in addition to the required coordinates. For example, the Cartesian coordinates of a map projection may be supplied as coordinate variables and, in addition, two-dimensional latitude and longitude variables may be supplied via the coordinates attribute on a data variable. The use of the axis attribute with values X and Y is recommended for the coordinate variables (see Chapter 4, Coordinate Types).
OK, trying to interpret what that means. It doesn't sound like it is specifying a unit, just specifying that you need to be able to gather the information necessary in order to use the coordinates? So, units should be able to be different than the current hard-coded value? Is that what you are gleaning from that statement, @hcorson-dosch?
Not exactly - we can't change the attributes of the latitude and longitude coordinate variables, as that would violate conventions. It sounds like we need to provide a separate set of coordinates, as separate variables. Working on adding a grid_mapping variable at the moment, and reading more about whether we'll need to provide DD lat/lon values in addition to the projected coordinate values. The conventions are a bit complex to parse.
OK, two things before we go too much further:

lake-temperature-process-models pipeline? Wondering if we can move ahead on that piece while we continue to work out the final attributes for this file.

Agreed, but I think it makes sense to merge #276 before wrapping up this work enough to get started on pulling data out in lake-temperature-process-models, since the data I'm working with at the moment is incorrect and is for the wrong set of cells, which would mean implementing some workarounds when trying to pull it from the netCDFs in lake-temperature-process-models. Seems cleaner to just get this code working with the new data and targets first.
Okay, so here's where I've landed for now:
Given that the 'lon' and 'lat' variables we're adding to the NetCDF via write_timeseries_dsg() are not actually decimal degree longitude/latitude values, but projected coordinates (and therefore have incorrect attributes), I'm taking the approach of providing a grid mapping variable that specifies the crs, and then using RNetCDF::var.rename.nc() to rename those lon/lat variables to jx/iy, with the correct attributes for projected coordinates. I think this is in line with NetCDF CF conventions, per the final sentence of this snippet from section 5.6, though the conventions are a bit convoluted/hard to parse, so I'm not yet 100% certain:
If no grid mapping variable is provided and the coordinate variables for a horizontal grid are not longitude and latitude, then it is required that the latitude and longitude coordinates are supplied via the coordinates attribute. Such coordinates may be provided in addition to the provision of a grid mapping variable, but that is not required.
So instead of having lon and lat as coordinates, with iy and jx added as variables:
We'd just have iy and jx as variables, and have the grid mapping specified for the variables:
I think that's an okay placeholder for now, and we can get Dave Blodgett's input for how we might want to refine this, particularly for when we construct the final NetCDF files for the data release.
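(A sketch of that rename-plus-grid-mapping approach with RNetCDF — the projection name and parameter values below are placeholders rather than the real Notaro grid definition, and the data variable name is hypothetical.)

```r
library(RNetCDF)

nc <- open.nc("7_drivers_munge/out/gcm_test.nc", write = TRUE)  # hypothetical path

# Rename the mislabeled coordinate variables and fix their attributes
var.rename.nc(nc, "lon", "jx")
var.rename.nc(nc, "lat", "iy")
att.put.nc(nc, "jx", "standard_name", "NC_CHAR", "projection_x_coordinate")
att.put.nc(nc, "jx", "units", "NC_CHAR", "m")
att.put.nc(nc, "iy", "standard_name", "NC_CHAR", "projection_y_coordinate")
att.put.nc(nc, "iy", "units", "NC_CHAR", "m")

# Scalar grid mapping variable describing the projection (placeholder values)
var.def.nc(nc, "rcm_map", "NC_INT", NA)
att.put.nc(nc, "rcm_map", "grid_mapping_name", "NC_CHAR", "lambert_conformal_conic")
att.put.nc(nc, "rcm_map", "standard_parallel", "NC_DOUBLE", c(33, 45))
att.put.nc(nc, "rcm_map", "longitude_of_central_meridian", "NC_DOUBLE", -97)
att.put.nc(nc, "rcm_map", "latitude_of_projection_origin", "NC_DOUBLE", 40)

# Point a data variable at the grid mapping (variable name is hypothetical)
att.put.nc(nc, "Shortwave", "grid_mapping", "NC_CHAR", "rcm_map")

close.nc(nc)
```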
Just had a discussion with Jordan & Hayley. Rather than go against current projections:
Alright I'm making these updates to the draft netCDF code in #281.
One question, @jread-usgs and @lindsayplatt -- do we want to include some or all of the GCM grid parameters (crs, cellsize, xmin, ymin, nx, ny) in the global attributes of the netCDF, or should we only provide that information in the SB metadata?
I'd say you could keep them in right now, since they aren't dims or values, but that's probably a question for Blodgett before finalizing the data release, since my concern is that this may confuse users of the dataset and imply these data are in that CRS (wait, is that the right CRS for the meter grid? it looks like it is lat/lon) when they aren't.
But I'm thinking this decision isn't a blocker and could be tagged as a "come back to later".
(wait, is that the right CRS for the meter grid, looks like it is lat/lon?)
Yup, that's the right crs, as defined by the grid_mapping variable in the biased GCM data. We're still waiting on the grid mapping attribute for the unbiased GCM data (#273). We want to confirm they are the same.
Okay I'll leave these attributes in the global attributes for now. I'll make an issue to finalize the netCDF format before the data release is completed, and note this there.
Here's the grid_mapping attribute from the biased Notaro netCDF files, rcm_map:
Note - for now, to reduce confusion, I did modify the long_name attribute (the non-standardized attribute) to note that the lat/lon coordinates we are providing are in WGS84:
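(For example, something along these lines with RNetCDF — the file path and the exact long_name wording are placeholders.)

```r
library(RNetCDF)

nc <- open.nc("7_drivers_munge/out/gcm_test.nc", write = TRUE)  # hypothetical path
att.put.nc(nc, "lat", "long_name", "NC_CHAR",
           "latitude of cell centroid (WGS84)")   # placeholder wording
att.put.nc(nc, "lon", "long_name", "NC_CHAR",
           "longitude of cell centroid (WGS84)")  # placeholder wording
close.nc(nc)
```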
Closed by #281 . Format will be finalized before data release, per #291. Content may change as a result of #273. Compression code may change per #294.
An initial implementation of NetCDF output was created, but there are a few concerns that are causing us to rethink the approach.
We also need to consider how to extract the information, since we want to create individual files for each GCM, time period, and cell to use in our lake-temperature-process-models repo (this splitting will happen over in that pipeline).
Going into this process, we have feather files for each tile (groups of cells) that look like this:

We want a NetCDF file with the cell id, lat/lon, time, and variables (Shortwave, Longwave, AirTemp, RelHum, Rain, Snow, WindSpeed) represented in the dimensions. We will want 6 total NetCDF files - one for each of the GCMs. I really struggle with visualizing NetCDF files, but I think we need sets of cubes for each variable? The trick is that we want to be able to subset the NetCDF by cell and by time, but never by variable because we need them all. So separating them in that way doesn't seem as helpful.
Here is how I am picturing the NetCDF being structured with how I know how to write it now.
And here is how I want to use the data. Seeing it like this is making me think that maybe the format of how it is stored should change to match how we want to access it? Not sure.
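(A hypothetical sketch of that downstream use — reading one GCM's DSG NetCDF once and writing one file per cell and time period in lake-temperature-process-models; the variable names, file names, and time periods below are placeholders, and the dimension order is assumed to be [cell, time].)

```r
library(ncdf4)
library(dplyr)

nc <- nc_open("GCM_A_driver_data.nc")                 # hypothetical per-GCM DSG file
cell_ids <- ncvar_get(nc, "instance_name")            # assumed timeseries id variable
time_info <- ncatt_get(nc, "time")
dates <- as.Date(gsub("days since ", "", time_info$units)) + ncvar_get(nc, "time")
drivers <- c("Shortwave", "Longwave", "AirTemp", "RelHum", "Rain", "Snow", "WindSpeed")

# Placeholder time periods (stand-ins for the 3 non-continuous GCM periods)
time_periods <- list(p1 = c("1981-01-01", "2000-12-31"),
                     p2 = c("2040-01-01", "2059-12-31"),
                     p3 = c("2080-01-01", "2099-12-31"))

for (cell in cell_ids) {
  cell_i <- which(cell_ids == cell)
  # Read all driver variables for this cell across the full time dimension
  cell_data <- purrr::map_dfc(drivers, function(var) {
    ncvar_get(nc, var, start = c(cell_i, 1), count = c(1, -1))
  }) %>%
    setNames(drivers) %>%
    mutate(time = dates) %>%
    select(time, everything())
  # Write one file per cell and time period
  for (period in names(time_periods)) {
    bounds <- as.Date(time_periods[[period]])
    out <- filter(cell_data, time >= bounds[1], time <= bounds[2])
    arrow::write_feather(out, sprintf("GCM_A_%s_%s.feather", cell, period))
  }
}

nc_close(nc)
```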