DOI-USGS / lake-temperature-model-prep

Pipeline #1

Add code to write munged GCM data to NetCDF files #281

Closed hcorson-dosch-usgs closed 2 years ago

hcorson-dosch-usgs commented 2 years ago

This code writes the munged GCM data for all tiles (and all cells associated with those tiles) into a single NetCDF file per GCM. The NetCDF files are written using write_timeseries_dsg.

Marking this PR as draft for three reasons:
1) @lindsayplatt needs to test this to confirm it runs with the actual targets, as I had to use a placeholder target for the glm_ready_gcm_data_feather files and provide those files manually.
2) We're not 100% certain that the current NetCDF format conforms to NetCDF CF conventions; see the discussion in #252.
3) We are still investigating some issues with the grid projection, dates, and NA values; see #273.

hcorson-dosch-usgs commented 2 years ago

Current plan is to go ahead and get this reviewed and merged into the gcm_driver_data_munge_pipeline branch of the canonical repo, despite the unresolved issues noted in #273, so that I can get a start on pulling the data from the resulting NetCDF files for use in lake-temperature-process-models.

@jread-usgs -- as discussed I'm leaving in my work-around that uses manually provided feather files, rather than requiring the munged feather files generated using the raw downloaded feather files, which are only on Lindsay's local machine. You'll need to unzip this archive into '7_drivers_munge/tmp', which should then allow you to build the glm_ready_gcm_data_feather targets that you'll need to build the gcm_nc target.

hcorson-dosch-usgs commented 2 years ago

Also you'll see I edited the map_query code to produce the map used by @lindsayplatt here, since it seemed like a useful one.

jordansread commented 2 years ago

Cool. I'm able to run, but ran into this error:

• built branch query_cells_centroids_list_by_tile_93a603de
• built pattern query_cells_centroids_list_by_tile
• start target query_map_png
Joining, by = "cell_no"
x error target query_map_png
• end pipeline
Error : object 'grid_tiles_sf' not found

This happened when starting to build the .png. Is there a place where that target needs to be defined?

hcorson-dosch-usgs commented 2 years ago

Oh just a typo - didn't catch it b/c it was in my environment. Sorry. Will commit the fix now

jordansread commented 2 years ago

One comment I wanted to add here that could be discussed in the future is what form the final data take with respect to the different GCMs that are debiased/downscaled.

Currently, this PR has a separate file for each GCM, which is consistent with what we've done in the past (e.g., GCM-specific zip files in the Winslow release). But one advantage of NetCDF is that it would allow us to use additional dimensions that could cover things like different GCMs. So an alternative would be to have one .nc file that contains all the data, with a GCM dimension. This would make it easier to get data for one variable from all GCMs at the same time.

I'm asking this because I think the files aren't going to be brutally huge and often users of this (like us!) will want to access all downscaled data for a given cell vs go to different files for that. Perhaps something to discuss later down the road when we're making decisions about the data release structure.
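To make the "extra dimension" idea concrete, here is a minimal base-R sketch rather than anything from this PR. The GCM labels, cell labels, sizes, and values are all hypothetical, and an in-memory array stands in for a variable in a single .nc file.

```r
# Hypothetical sizes and labels, for illustration only.
gcms   <- c("gcm_a", "gcm_b", "gcm_c")
cells  <- c("cell_1", "cell_2")
n_days <- 4

# One variable stored with an extra GCM dimension: time x cell x GCM.
rain <- array(
  seq_len(n_days * length(cells) * length(gcms)),
  dim = c(n_days, length(cells), length(gcms)),
  dimnames = list(time = NULL, cell = cells, gcm = gcms)
)

# All GCMs for one cell come out of a single object in one indexing step,
# instead of opening one file per GCM.
rain_cell_1 <- rain[, "cell_1", ]
dim(rain_cell_1)       # n_days rows, one column per GCM
colnames(rain_cell_1)  # the GCM labels
```

With one file per GCM, the same pull would mean opening each file and binding the results together.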

hcorson-dosch-usgs commented 2 years ago

Okay, added a stop() call that fires if compression == TRUE, to note that the option is not yet fully supported.
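For readers who have not seen the diff, a guard of this kind might look roughly like the sketch below. The function name write_gcm_nc, its arguments, and the message text are assumptions for illustration, not the code in this PR.

```r
# Hypothetical wrapper; name, arguments, and message are assumptions.
write_gcm_nc <- function(nc_file, compression = FALSE) {
  if (isTRUE(compression)) {
    # Fail early instead of writing a file in a format that is not yet vetted.
    stop("compression = TRUE is not yet fully supported", call. = FALSE)
  }
  # ... the call to write_timeseries_dsg() would go here ...
  invisible(nc_file)
}

# The default path runs; asking for compression raises an error.
write_gcm_nc("out.nc")
msg <- tryCatch(
  write_gcm_nc("out.nc", compression = TRUE),
  error = function(e) conditionMessage(e)
)
msg
```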

hcorson-dosch-usgs commented 2 years ago

Okay @lindsayplatt can you just confirm that this builds as is for you with the actual glm_ready_gcm_data_feather files associated with that original target?

hcorson-dosch-usgs commented 2 years ago

Thanks, Lindsay. That missing value is hardcoded in the write_timeseries_dsg() function.

And that is odd about the ranges of those variables -- I can do a check of the munged feathers.

lindsayplatt commented 2 years ago

Huh, that is just a really interesting way to denote missing values. Never mind then :) If the ranges in the feathers are the same, I'll log that issue to #273 and we can move this one along.

hcorson-dosch-usgs commented 2 years ago

Okay working through this with the munged feather files.

I do see NA values for all variables, but it looks like maybe some of the data variables use NA to denote missing values, while some use NaN? I wonder if that's not picked up in the conversion to NetCDF? For example, if I leave out na.rm=TRUE in my calls within summarize(), which I did by accident the first time round, I see this:

image

Lindsay is digging into this now.
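As a small illustration of why dropping na.rm=TRUE exposes the two codings, here is a base-R sketch with made-up values. It uses max() directly rather than the summarize() calls mentioned above.

```r
# Made-up values; only the way the missing entry is coded differs.
coded_with_na  <- c(0.5, NA_real_, 2.0)
coded_with_nan <- c(0.5, NaN, 2.0)

# Without na.rm = TRUE the missing value propagates into the summary,
# and it keeps its coding: NA stays NA, NaN stays NaN.
max_na  <- max(coded_with_na)
max_nan <- max(coded_with_nan)
is.nan(max_na)   # FALSE
is.nan(max_nan)  # TRUE

# With na.rm = TRUE both codings are dropped and the summaries agree.
max(coded_with_na, na.rm = TRUE) == max(coded_with_nan, na.rm = TRUE)  # TRUE
```

Note that is.na() is TRUE for both results, so is.nan() is what tells them apart.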

The number of NA values (across all GCMs) for each variable is 52560. For each variable, for each GCM, there are 87600 NA values. 87600 daily values / 365 days per year = 240 years of NA values. We know we have 60 years of data in total for each cell, so those 240 years of NA values can be attributed to the 4 cells with NA values (240 / 60 = 4). If I exclude those four cells, c(17909, 18126, 18127, 18993), then I see 0 NA values for all variables, for all GCMs, so that tracks.
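The per-GCM arithmetic above can be checked in a few lines of R. This only restates the figures quoted in the comment, and it assumes 365-day years and 60 years of data per cell.

```r
# Figures quoted in the comment above.
na_per_gcm_per_variable <- 87600  # NA count for one variable in one GCM
days_per_year <- 365              # assumes no leap days
years_per_cell <- 60              # years of data for each cell

na_years <- na_per_gcm_per_variable / days_per_year  # years of NA values
na_cells <- na_years / years_per_cell                # cells that are all NA

# The cells that were excluded to get down to 0 NA values.
excluded_cells <- c(17909, 18126, 18127, 18993)
na_cells == length(excluded_cells)
```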

If I summarize the results across all GCMs, and exclude NA values, I get this: image

Here's the whole tibble, broken down by GCM image

lindsayplatt commented 2 years ago

Comparing the munged feathers (Hayley's values above) for just GFDL with those from Winslow 2017 confirms that all the values in the munged feather files are within reason.

WINSLOW GFDL: image

This pipeline GFDL: image

Now to just figure out why the values in the NetCDF are different.

dblodgett-usgs commented 2 years ago

If you are using ppc compression for this, some change to the values is expected. I lost track of what all transformations you are making.

hcorson-dosch-usgs commented 2 years ago

Alright, I think Lindsay and I got to the bottom of these discrepancies. We confirmed that in the munged feathers, all variables except Rain and Snow coded missing values as NaN, but Rain and Snow coded them as NA: image

Once that data was written to a netCDF and then pulled, we were seeing weird max values for Rain and Snow, as Lindsay noted above: image

If I replace all NA values with NaN before writing to netCDF, then when I use Lindsay's code to pull those same values back out, the ranges are reasonable: image

And match the summary derived from the munged feathers: image

So it looks like RNetCDF (used to write the data to NetCDF within write_timeseries_dsg()) expects missing values to be coded as NaN.
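A minimal base-R sketch of the "replace all NA values with NaN before writing" step is below. The data frame is a made-up stand-in for one munged feather table, other_var is a hypothetical column name, and na_to_nan() is a hypothetical helper, not the code used in this PR.

```r
# Made-up stand-in for a munged feather table.
drivers <- data.frame(
  other_var = c(12.1, NaN, 14.3),    # hypothetical column, missing coded as NaN
  Rain      = c(0.002, NA_real_, 0)  # missing coded as NA
)

# TRUE only for plain NA, since is.na() is also TRUE for NaN.
is_plain_na <- function(x) is.na(x) & !is.nan(x)

# Recode every missing value in the numeric columns as NaN.
na_to_nan <- function(df) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- lapply(df[numeric_cols], function(x) {
    x[is.na(x)] <- NaN
    x
  })
  df
}

drivers_nan <- na_to_nan(drivers)
sapply(drivers, function(x) sum(is_plain_na(x)))      # plain NA count before
sapply(drivers_nan, function(x) sum(is_plain_na(x)))  # plain NA count after
```

After the recode, every column uses the same missing-value coding, which is the state the data were in when the ranges pulled back from the NetCDF matched the feathers.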