Closed: hcorson-dosch-usgs closed this 2 years ago
Current plan is to go ahead and get this reviewed and merged into the `gcm_driver_data_munge_pipeline` branch of the canonical repo, despite the unresolved issues noted in #273, so that I can get a start on pulling the data from the resulting netCDF files for use in `lake-temperature-process-models`.
@jread-usgs -- as discussed, I'm leaving in my work-around that uses manually provided feather files, rather than requiring the munged feather files generated using the raw downloaded feather files, which are only on Lindsay's local machine. You'll need to unzip this archive into '7_drivers_munge/tmp', which should then allow you to build the `glm_ready_gcm_data_feather` targets that you'll need to build the `gcm_nc` target.
Also you'll see I edited the map_query code to produce the map used by @lindsayplatt here, since it seemed like a useful one.
Cool. I'm able to run it, but ran into this error:
• built branch query_cells_centroids_list_by_tile_93a603de
• built pattern query_cells_centroids_list_by_tile
• start target query_map_png
Joining, by = "cell_no"
x error target query_map_png
• end pipeline
Error : object 'grid_tiles_sf' not found
This came up when starting to build the .png. Is there a place where that target needs to be defined?
Oh just a typo - didn't catch it b/c it was in my environment. Sorry. Will commit the fix now
One comment I wanted to add here that could be discussed in the future is what form the final data take with respect to the different GCMs that are debiased/downscaled.
Currently, this PR has a separate file for each GCM, which is consistent with what we've done in the past (e.g., GCM-specific zip files in the Winslow release). But one advantage of NetCDF is that it would allow us to use additional dimensions that could cover things like different GCMs. So an alternative would be to have one .nc file that contains all the data, with a GCM dimension. This would make it easier to get data for one variable from all GCMs at the same time.
I'm asking this because I think the files aren't going to be brutally huge and often users of this (like us!) will want to access all downscaled data for a given cell vs go to different files for that. Perhaps something to discuss later down the road when we're making decisions about the data release structure.
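To make the single-file alternative a bit more concrete, here is a minimal base R sketch of what access could look like with a GCM dimension. The array is only an in-memory stand-in for one netCDF variable. The dimension sizes, the GCM labels, and the cell names are made up for illustration and are not taken from this pipeline.

```r
# Hypothetical stand-in for one variable stored with a GCM dimension.
# Dimension sizes and labels are illustrative only.
n_time <- 5
cells <- c("cell_1", "cell_2", "cell_3")
gcms <- c("gcm_a", "gcm_b")

set.seed(1)
air_temp <- array(
  rnorm(n_time * length(cells) * length(gcms)),
  dim = c(n_time, length(cells), length(gcms)),
  dimnames = list(time = NULL, cell = cells, gcm = gcms)
)

# One slice returns the full time series for a cell from every GCM at once,
# instead of opening one file per GCM.
cell_2_all_gcms <- air_temp[, "cell_2", ]
dim(cell_2_all_gcms)  # time x gcm
```

With one file per GCM, the same request would mean one read per file followed by a bind step.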
Okay, added a `stop()` call if `compression == TRUE` to note that the option is not yet fully supported.
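For readers following along, a guard of this kind might look roughly like the sketch below. The function name, its arguments, and the error message are hypothetical. This is not the code committed in the PR.

```r
# Hypothetical wrapper; the function name and arguments are illustrative only.
write_gcm_nc_sketch <- function(nc_file, compression = FALSE) {
  if (isTRUE(compression)) {
    stop("compression = TRUE is not yet fully supported", call. = FALSE)
  }
  # ... the netCDF write would happen here ...
  invisible(nc_file)
}

# The default path runs; requesting compression fails fast with a clear message.
write_gcm_nc_sketch("out.nc")
result <- tryCatch(
  write_gcm_nc_sketch("out.nc", compression = TRUE),
  error = function(e) conditionMessage(e)
)
```

Failing early keeps a half-supported option from silently producing files whose values differ from the inputs.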
Okay @lindsayplatt can you just confirm that this builds as is for you with the actual `glm_ready_gcm_data_feather` files associated with that original target?
Thanks, Lindsay. That missing value is hardcoded in the `write_timeseries_dsg()` function.
And that is odd about the ranges of those variables -- I can do a check of the munged feathers.
Huh, that is just a really interesting way to denote missing values. Never mind then :) If the ranges in the feathers are the same, I'll log that issue to #273 and we can move this one along.
Okay working through this with the munged feather files.
I do see NA values for all variables, but it looks like maybe some of the data variables use `NA` to denote missing values, while some use `NaN`? I wonder if that's not picked up in the conversion to netCDF? For example, if I leave out `na.rm=TRUE` in my calls within `summarize()`, which I did by accident the first time round, I see this:
Lindsay is digging into this now.
The number of NA values (across all GCMs) for each variable is 52560. For each variable, for each GCM, there are 87600 NA values. 87600 days / 365 days per year = 240 years of NA values. We know we have 60 years of data in total, so that means those 240 years of NA values could be attributed to the 4 cells with NA values. If I exclude those four cells, `c(17909, 18126, 18127, 18993)`, then I see 0 NA values for all variables, for all GCMs, so that tracks.
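The per-GCM arithmetic above can be checked in a few lines of R. The numbers are the ones quoted in the comment, and the 365-day year is the same simplification used there.

```r
# Numbers quoted in the comment above.
na_per_variable_per_gcm <- 87600
days_per_year <- 365      # assumes no leap days, as in the comment
years_of_data <- 60

na_years <- na_per_variable_per_gcm / days_per_year
na_cells <- na_years / years_of_data

na_years  # 240
na_cells  # 4
```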
If I summarize the results across all gcms, and exclude NA values, I get this:
Here's the whole tibble, broken down by GCM
Comparing the munged feathers (Hayley's values above) for just GFDL with those from Winslow 2017 confirms that all the values in the munged feather files are within reason.
WINSLOW GFDL:
This pipeline GFDL:
Now to just figure out why the values in the NetCDF are different.
If you are using ppc compression for this, some change to the values is expected. I lost track of all the transformations you are making.
Alright, I think Lindsay and I got to the bottom of these discrepancies. We confirmed that in the munged feathers, all variables except Rain and Snow coded missing values as `NaN`, but Rain and Snow coded them as `NA`.
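One way to spot this kind of mix is to count `NA` and `NaN` separately per column. The sketch below uses made-up values; only the column names Rain and Snow come from this thread, and `AirTemp` is a hypothetical stand-in for the other variables. It is not the check we actually ran.

```r
# Made-up values; only the column names Rain and Snow come from the thread,
# and AirTemp is a hypothetical stand-in for the other variables.
drivers <- data.frame(
  AirTemp = c(1.5, NaN, 3.2),
  Rain = c(0.1, NA, 0.0),
  Snow = c(NA, 0.4, 0.0)
)

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
count_missing <- function(x) {
  c(n_NA = sum(is.na(x) & !is.nan(x)), n_NaN = sum(is.nan(x)))
}
missing_by_column <- sapply(drivers, count_missing)
missing_by_column
```

Because `is.na()` treats both codes as missing, summaries with `na.rm = TRUE` look fine in R, and the difference only shows up once the values leave R.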
Once that data was written to a netCDF and then pulled, we were seeing weird max values for Rain and Snow, as Lindsay noted above:
If I replace all `NA` values with `NaN` before writing to netCDF, then when I use Lindsay's code to pull those same values back out, the ranges are reasonable:
And match the summary derived from the munged feathers:
So it looks like `RNetCDF` (used to write the data to netCDF within `write_timeseries_dsg()`) expects `NaN` values.
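A minimal base R sketch of that replacement step is below, assuming the data sit in a plain data frame. The helper name `na_to_nan()` and the example values are hypothetical, and the actual pipeline code may do this differently.

```r
# Hypothetical helper: convert every missing value in numeric columns to NaN.
na_to_nan <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      # is.na() matches both NA and NaN, so everything missing becomes NaN.
      df[[col]][is.na(df[[col]])] <- NaN
    }
  }
  df
}

# Made-up values for illustration.
drivers <- data.frame(Rain = c(0.1, NA, 0.0), Snow = c(NA, 0.4, 0.0))
drivers_nan <- na_to_nan(drivers)

# The cleaned data frame would then be passed to the netCDF writer.
```

Doing the conversion right before the write leaves the feather files untouched and keeps the missing-value coding consistent across all variables in the netCDF.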
This code writes the munged GCM data for all tiles (and all cells associated with those tiles) into a single NetCDF per GCM. The NetCDF files are written using `write_timeseries_dsg`.

Marking this PR as draft for three reasons:
1) @lindsayplatt needs to test this to confirm it runs with the actual targets, as I had to use a placeholder target for the `glm_ready_gcm_data_feather` files and provide those files manually
2) We're not 100% certain that the current netCDF format conforms to NetCDF CF conventions - see discussion in #252
3) We are still investigating some issues with the grid projection, dates, and NA values -- see #273