Hayley kicked this off with #223
@jread-usgs some questions for you now that @hcorson-dosch and I are really in the guts of implementing the pipeline.
My current working query (here is the target and here is the function) returns a messy text file that requires munging; it appears to be multiple csv-formatted tables, each with its own header, all in one text file. Does that seem right, or is that triggering alarm bells? I have started an initial function for parsing but want to make sure before I go much further. It just seems really funky.
Here is a snippet of what is loaded when I do `readLines()`:
[1] "# evspsbl"
[2] ",ID1,ID10,ID100,ID11,ID12,ID13,ID14,ID15,ID16,ID17,ID18,ID19,ID2,ID20,ID21,ID22,ID23,ID24,ID25,ID26,ID27,ID28,ID29,ID3,ID30,ID31,ID32,ID33,ID34,ID35,ID36,ID37,ID38,ID39,ID4,ID40,ID41,ID42,ID43,ID44,ID45,ID46,ID47,ID48,ID49,ID5,ID50,ID51,ID52,ID53,ID54,ID55,ID56,ID57,ID58,ID59,ID6,ID60,ID61,ID62,ID63,ID64,ID65,ID66,ID67,ID68,ID69,ID7,ID70,ID71,ID72,ID73,ID74,ID75,ID76,ID77,ID78,ID79,ID8,ID80,ID81,ID82,ID83,ID84,ID85,ID86,ID87,ID88,ID89,ID9,ID90,ID91,ID92,ID93,ID94,ID95,ID96,ID97,ID98,ID99"
[3] "TIMESTEP,MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1),MEAN(kg m-2 s-1)"
[4] "1999-01-01T00:00:00Z,3.0479403E-6,6.2411355E-6,7.818796E-7,6.894544E-6,6.6264047E-6,9.675139E-6,1.0771746E-5,1.06765565E-5,9.926552E-6,9.447711E-6,8.660588E-6,7.797644E-6,3.4016837E-6,6.448473E-6,1.0036815E-5,9.920187E-6,1.1409653E-5,1.2325739E-5,1.1967138E-5,1.1022169E-5,1.0329134E-5,9.073887E-6,7.557237E-6,6.5496706E-6,6.2436175E-6,1.1667247E-5,7.667272E-6,1.1420567E-5,1.2252027E-5,9.922599E-6,8.150401E-6,1.0508522E-5,8.838302E-6,6.8397594E-6,9.156757E-6,5.3106673E-6,1.280317E-5,1.1962649E-5,1.3006744E-5,1.210488E-5,1.2293051E-5,1.2494677E-5,1.0339701E-5,7.629874E-6,6.2304475E-6,9.326908E-6,5.3670647E-6,1.1614002E-5,7.0229394E-6,8.721521E-6,1.1697765E-5,1.0465159E-5,1.0967345E-5,8.876884E-6,4.7448602E-6,3.0857123E-6,8.758055E-6,6.093673E-6,9.520644E-6,6.7338487E-6,7.648563E-6,8.90384E-6,1.0210867E-5,9.0875365E-6,6.840079E-6,4.2024503E-6,4.628265E-6,8.725951E-6,4.5186694E-6,7.3864794E-6,6.5283275E-6,5.5463806E-6,2.5749143E-6,5.3497924E-6,6.1883534E-6,4.4203553E-6,-1.3556412E-7,3.2933548E-7,8.003123E-6,5.900954E-7,4.469657E-6,5.2579603E-6,4.909238E-6,4.802666E-6,6.03942E-6,4.63963E-6,1.0305251E-6,-2.449164E-6,-1.5818581E-6,7.0171145E-6,6.270129E-7,7.2543207E-6,3.3003553E-6,5.93133E-6,4.379594E-6,4.930987E-6,2.7132191E-6,-8.9022785E-7,-2.9605308E-6,-1.6005098E-6"
[5] "1999-01-01T01:00:00Z,2.5578656E-6,4.4968933E-6,-1.7777161E-6,8.026662E-6,5.8394744E-6,4.2759384E-6,5.3513254E-6,6.5573186E-6,8.520235E-6,9.267275E-6,8.664139E-6,7.1855493E-6,3.6754116E-7,4.7927642E-6,9.465936E-6,9.900281E-6,8.269337E-6,7.657092E-6,8.669214E-6,9.780522E-6,9.8940345E-6,9.253336E-6,7.82955E-6,1.6866378E-6,5.8803425E-6,8.67844E-6,6.2116706E-6,1.0424121E-5,7.599634E-6,7.2886546E-6,7.539618E-6,1.0400472E-5,8.60406E-6,6.959848E-6,4.164287E-6,5.3128806E-6,1.1033225E-5,9.527728E-6,1.2241214E-5,1.1091293E-5,1.1527273E-5,1.1818796E-5,1.1724405E-5,8.783328E-6,6.129934E-6,6.1801793E-6,4.030096E-6,8.766548E-6,3.168272E-6,7.775944E-6,6.9698535E-6,7.530194E-6,1.09866405E-5,1.1252795E-5,8.776212E-6,6.0230163E-6,8.2925835E-6,3.6850608E-6,4.8326674E-6,2.9843206E-6,5.1575917E-6,2.3220541E-6,6.0555444E-6,7.6092315E-6,7.044763E-6,7.388853E-6,4.9958044E-6,8.974322E-6,2.988957E-6,6.127966E-6,4.2845613E-6,3.2826106E-6,4.6146974E-6,-2.6444893E-6,-1.0700569E-6,6.735114E-6,4.568504E-6,3.0298354E-6,7.652875E-6,7.706925E-7,7.4221134E-6,4.630271E-6,3.698081E-7,3.1633144E-6,1.8592492E-6,3.999646E-6,4.6821456E-6,3.5261576E-6,1.5989144E-7,5.867402E-6,-1.777977E-6,6.601029E-6,2.5276915E-6,5.216548E-6,2.499017E-6,1.9578079E-6,3.2954124E-6,3.814628E-6,3.4222915E-6,1.0094535E-6"
Setting aside for the moment the fact that we’ll have to consider how best to chunk the geoknife queries, we started discussing how we want to pass in the spatial information for a single geoknife query. Each geoknife query requires that the geometries be polygons, not points, so we can’t use lake centroids. We talked through one approach for creating the polygon geometries to use in the query that would 1) take advantage of GDP processing power, while 2) avoiding downloading duplicate data. This would be a hybrid approach between using actual lake geometries for the query (too detailed) and using a bounding box that encapsulates a set of lakes (too coarse). What are your thoughts on this approach?
Divide bounding boxes into two groups:
An alternative, simpler approach that would be akin to using lake centroids would be to map the lake centroids to our constructed GCM grid, and then use the grid polygons that contain lake centroids in our geoknife query. This would be less accurate for lakes that span multiple grid cells, but that may be an acceptable tradeoff, particularly if the area-weighting computation is expensive and/or doesn’t have much of an impact on the final computed set of drivers (not sure how much variation there is between a cell and its immediate neighbors).
Is the format of the data being returned correct?
Yes, that is the correct format coming from the GDP. But you can use the built-in file parsers in `geoknife` to read that dataset into a data.frame. Instead of your code `download(gcm_job, destination = out_file, overwrite = TRUE)`, you should be able to do:
my_data <- result(gcm_job)
# then write to file like this:
arrow::write_feather(my_data, out_file)
If using `download` and `readLines`, the data look like you've shown:
my_data <- geoknife(data.frame('point1'=c(-89, 46), 'point2'=c(-88.6, 45.2)), fabric = 'prism', wait = TRUE) %>% download(destination = tempfile()) %>% readLines(n = 5)
my_data
[1] "# ppt" ",point1,point2" "TIMESTEP,MEAN(mm/month),MEAN(mm/month)"
[4] "1895-01-01T00:00:00Z,55.8425,47.74" "1895-02-01T00:00:00Z,11.78,11.98"
`result` gives you something more immediately usable:
my_data <- geoknife(data.frame('point1'=c(-89, 46), 'point2'=c(-88.6, 45.2)), fabric = 'prism', wait = TRUE) %>% result()
my_data
DateTime point1 point2 variable statistic
1 1895-01-01 55.8425 47.74 ppt MEAN
2 1895-02-01 11.7800 11.98 ppt MEAN
3 1895-03-01 9.6625 10.08 ppt MEAN
4 1895-04-01 39.6325 39.17 ppt MEAN
5 1895-05-01 127.2650 118.65 ppt MEAN
6 1895-06-01 82.4375 67.78 ppt MEAN
7 1895-07-01 70.5825 63.30 ppt MEAN
8 1895-08-01 58.0800 64.37 ppt MEAN
9 1895-09-01 126.1900 86.83 ppt MEAN
10 1895-10-01 16.9300 18.05 ppt MEAN
11 1895-11-01 34.3000 34.58 ppt MEAN
12 1895-12-01 46.1300 39.02 ppt MEAN
13 1896-01-01 27.1700 25.38 ppt MEAN
☝️ `geoknife`'s parsers return wide-format data.frames, which are going to be a lot smaller than tall data.frames when there are lots of features. But you can pivot as needed.
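For example, a minimal sketch of pivoting to a tall format, using the `my_data` data.frame from the `result()` call above (`tidyr` here is just one option, not something the pipeline requires):

```r
library(tidyr)

# Pivot the wide result() output (one column per feature) into a tall
# data.frame with one row per DateTime-feature combination.
my_data_tall <- pivot_longer(
  my_data,
  cols = c(point1, point2),   # the feature columns from the example above
  names_to = "feature",
  values_to = "value"
)
```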
What to use as query polygons?
Each geoknife query requires that the geometries be polygons, not points, so we can’t use lake centroids.
I'm not sure where this limit comes from, and you'll see a lot of point-based `geoknife` examples in the documentation (e.g., `?geoknife`, `?simplegeom`) and the vignettes. Lat/lon pairs can be used, although there isn't a clean way to accept them as `sf` objects or something else right now, so you'd need to use the `data.frame` method if you want to use points.
Because the lake polygons are much more complex than the coarser grid cells (I think these climate data are somewhere around 0.25° and 0.35° lat/lon cells), we stopped trying to sample the climate data to the lake polygon, and instead simply started using the value of the grid cell that the lake centroid landed in. This is what is currently done with NLDAS. Using the centroid simplifies a lot of things and allows us to re-use the data from one cell for the multiple lakes that fall within it, as opposed to calculating each lake polygon's specific driver values in case there is a partial overlap with an adjacent cell.
For NLDAS, some of this is discussed and shown here.
With that in mind, your comment:
An alternative, simpler approach that would be akin to using lake centroids would be to map the lake centroids to our constructed GCM grid, and then use the grid polygons that contain lake centroids in our geoknife query. This would be less accurate for lakes that span multiple grid cells, but that may be an acceptable tradeoff, particularly if the area-weighting computation is expensive and/or doesn’t have much of an impact on the final computed set of drivers (not sure how much variation there is between a cell and its immediate neighbors).
I'd say "yes" to this, the tradeoff is acceptable and also matches what we do elsewhere.
For your steps above, I'd recommend
1. Construct the GCM grid (everything below this assumes we have this info). I think this is going to be trickier than NLDAS because the grid isn't a lat/lon grid already, but it should be possible; you might want to talk to David (who reformatted this original dataset for the GDP hosting) or Blodgett.
2. Construct a large "job" or "task" grid that contains many GCM cells. The "job" grid will be the size of a geoknife query, and I'd suggest creating this as a number of simple tiles across the whole US (so some tiles will be empty and have no lakes, but these tile targets will therefore not have to change in the future if the footprint expands). These job tiles will also be the unit of bulk data that gets added to the final netcdf file after the geoknife jobs are complete, so their arrangement should support bulk writing of netcdf, but we can talk about that later and you should ignore this part of the comment for the early prototype tasks.
3. ~Make bounding boxes for every single lake~ use centroids for every lake.
4. Filter the GCM grid to only cells that intersect with ~lake bboxes~ lake centroids.
5. Use geoknife to call for all centroid cells, breaking the jobs up into the "tiles" in 2 above.
6. Create a combiner target that is a single NetCDF file where all of these data get written.
☝️ The idea with the "tiles" is that even when we expand the number of lakes we're modeling, you may not be adding a centroid that covers a new GCM cell, so the final .nc file would not rebuild in that case. But, if you are expanding the footprint or hit new cells within existing tiles, you'd only rebuild the tile-specific geoknife targets, followed by the final NetCDF file (which would need to be rebuilt anytime one of the tile targets changes), avoiding excessively greedy rebuilds in this case.
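To make the fixed-tile idea concrete, here's a rough sketch with `sf`; the tile size, the extent, and the `gcm_grid_sf`/`centroids_sf` objects are all placeholders/assumptions, not the actual pipeline code:

```r
library(sf)

# Fixed job-tile layout over a constant extent (rough CONUS bounding box in WGS84),
# so adding lakes later never changes the tile geometry itself.
tile_extent <- st_as_sfc(st_bbox(c(xmin = -125, ymin = 24, xmax = -66, ymax = 50),
                                 crs = st_crs(4326)))
tile_geoms <- st_make_grid(tile_extent, cellsize = 5)   # 5-degree tiles (placeholder size)
job_tiles <- st_sf(tile_id = seq_along(tile_geoms), geometry = tile_geoms)

# Keep only GCM cells that contain at least one lake centroid
# (assumes `gcm_grid_sf` and `centroids_sf` exist and share a CRS).
cells_with_lakes <- gcm_grid_sf[lengths(st_intersects(gcm_grid_sf, centroids_sf)) > 0, ]

# Tag each retained cell with its job tile so geoknife calls can be batched per tile.
cells_by_tile <- st_join(cells_with_lakes,
                         st_transform(job_tiles, st_crs(cells_with_lakes)),
                         join = st_intersects)
```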
That issue with points vs polygons is something that I was struggling with when getting the query to work. It must have been a different part of my command. Although, I just investigated this more and I think it is more the conversion of my `sf` object to `sp` and then to `simplegeom()`. It appears that `simplegeom()` doesn't like SpatialPoints? I think maybe I need to just make it into a data.frame as in the sample `simplegeom(data.frame('point1'=c(-89, 46), 'point2'=c(-88.6, 45.2)))`.
library(sf)
library(geoknife)
centroids_sf <- structure(list(site_id = c("nhdhr_121623196", "nhdhr_143247099",
"nhdhr_151949134", "nhdhr_157359786", "nhdhr_74926939"), Shape = structure(list(
structure(c(-88.4467147238926, 42.7996962344843), class = c("XY",
"POINT", "sfg")), structure(c(-88.6432114598417, 43.6283635061587
), class = c("XY", "POINT", "sfg")), structure(c(-91.7772962678006,
45.6126497556779), class = c("XY", "POINT", "sfg")), structure(c(-90.290648760424,
44.2691685292945), class = c("XY", "POINT", "sfg")), structure(c(-91.5175899000678,
45.7450307015114), class = c("XY", "POINT", "sfg"))), class = c("sfc_POINT",
"sfc"), precision = 0, bbox = structure(c(xmin = -91.7772962678006,
ymin = 42.7996962344843, xmax = -88.4467147238926, ymax = 45.7450307015114
), class = "bbox"), crs = structure(list(input = "EPSG:4326",
wkt = "GEOGCRS[\"WGS 84\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"), class = "crs"), n_empty = 0L)), row.names = c(NA,
-5L), sf_column = "Shape", agr = structure(c(site_id = NA_integer_), .Label = c("constant",
"aggregate", "identity"), class = "factor"), class = c("sf",
"tbl_df", "tbl", "data.frame"))
centroids_geoknifegeom <- simplegeom(as_Spatial(centroids_sf))
Error:
Error in as(.Object, "simplegeom") :
no method or default for coercing “SpatialPointsDataFrame” to “simplegeom”
Thanks for the advice on direction! I do have questions on the very last part. I am not understanding how "these tile targets will therefore not have to change in the future if the footprint expands" and "if you are expanding the footprint or hit new cells within existing tiles, you'd only rebuild the tile-specific geoknife targets" can both be true. Wouldn't the tile data already have been built? Or is it that the target is prepped for doing the job, but we don't actually try to download if there is not a centroid in the tile?
Replying separately to these two comments:
It appears that simplegeom() doesn't like SpatialPoints
Correct. `geoknife` does not have a method for SpatialPoints. You need to use the data.frame method currently. Behind the scenes, GDP doesn't actually support points as the "feature" for queries, but `geoknife` is getting around that by creating a tiny diamond-shaped polygon around the point. See here.
If we do future dev on `geoknife`, we should probably support `sf::st_sf` classes, including multi-point collections with named features. But it is unlikely that will happen soon.
For now I think you know the way around this by creating lat/lon data.frames with names.
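For example, a minimal sketch of that workaround using the `centroids_sf` reprex object above (using the `site_id` values as the feature column names is an assumption about how you'd want the features named):

```r
library(sf)
library(geoknife)

# simplegeom()'s data.frame method wants one named column per feature,
# with two rows: longitude first, then latitude.
coords <- st_coordinates(centroids_sf)              # X = lon, Y = lat
centroid_df <- as.data.frame(t(coords[, c("X", "Y")]))
names(centroid_df) <- centroids_sf$site_id

centroids_geoknifegeom <- simplegeom(centroid_df)
```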
> and "if you are expanding the footprint or hit new cells within existing tiles, you'd only rebuild the tile-specific geoknife targets" can both be true. Wouldn't the tile data already have been built? Or is it that the target is prepped for doing the job, but we don't actually try to download if there is not a centroid in the tile?
Yes, I think you probably wouldn't branch a target for a given tile if it is empty. But perhaps my underlying point is that you don't want the tiles to adjust size or domain when you add a new lake. If they did rely on the domain (e.g., you laid out the tiles based on the bounding box of all lakes used, instead of fixed according to the extent of CONUS), you'd rebuild all tiles every time the footprint of all lakes changed, even if by a tiny amount or because the `sf` object's metadata itself changed.
@jread-usgs, re: laying out the tiles according to the extent of CONUS, the Dynamical Downscaling for the Midwest and Great Lakes Basin GCM data covers only the northeastern U.S. and a portion of the midwest. Here's our current working reconstructed grid of the GCM data:
Do we still want our 'tile' layout to cover all of CONUS, in case we switch driver data sources at some point? Or should we only be sure to cover the full extent of this GCM dataset?
Nice. Yes, I see that we really wouldn't need to have the tiles extend beyond the footprint of the dataset. I think I had a more general tile concept in mind that could apply to different underlying datasets while being able to use the target branching concepts. But that is probably unnecessary in this case.
For this grid, was it simpler than I thought to create the grid? I thought there was some funkiness because there is both a `jx` dimension to the data and a longitude value. It made me think the lat/lon spacing wasn't uniform.
@jread-usgs - no, there was/is some funkiness to puzzle out. The `iy` and `jx` dimensions (both 1D arrays) represent the projected y and x coordinates, respectively, in kilometers. Those values (if multiplied by 1000 to get to units of meters) map out a square 25km x 25km cell-size grid in a custom lambert conformal conic projection (per the projection parameters included in the netcdf metadata). This matches the documentation, which indicates the grid has a 25km cell size. Because the grid was created to be square in this projected coordinate system (a square grid in a conical projection), the latitude and longitude vary along the lines of the drawn grid:
This means that if you reproject and map this same grid in an unprojected geographic coordinate system (e.g. WGS84), the center of the grid reaches higher latitude than the edges, and the grid appears curved:
With this sleuthing, I'm pretty sure that the `xlat` and `xlon` variables in the netcdf (both 2D arrays, with dimensions equal to the lengths of `iy` and `jx`: 86 x 111) are meant to be the unprojected decimal degree equivalent of the cross points in projected coordinate space, since they are in decimal degrees and do vary across a given row/column. The only remaining odd point is that if I build the grid cells using the `xlat` and `xlon` values (NOT a square grid, but curved) and define the coordinate system as WGS84, the resulting grid (in green) matches the shape and curvature of the online preview given for the dataset (the default for OpenLayers is web mercator) but not the grid I built with the `iy` and `jx` values and the projection information, reprojected to WGS84 (in blue):
And if I reproject that WGS84 grid built with the `xlat` and `xlon` values to the custom lambert conformal conic projection defined in the metadata, it (in green) doesn't line up with the square grid constructed using the `iy` and `jx` values (in blue):
Therefore, I'm a bit unsure as to how they pulled those `xlat` and `xlon` values, and will see if David has any insight.
So for now, to reconstruct the grid, I am using the `iy` and `jx` values to build cell polygons, with the crs defined per the metadata in the netcdf. I think this is the most robust approach, and it results in a square grid with the correct cell size. I've got this in a jupyter notebook for now, and will run it by David to confirm it all looks consistent with his understanding of how the data is stored. If it does, I'll put it in script format, and we can call it from our targets pipeline with reticulate. If not, I'll rework it with David's advice.
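For anyone following along, a rough sketch of that same logic in R with `sf` (the file path, variable names, and the projection parameters in the proj4 string are placeholders; the real values come from the netcdf metadata):

```r
library(ncdf4)
library(sf)

# Read the 1D projected coordinate vectors (in km) from the netcdf
# (file name and variable names are placeholders).
nc <- nc_open("gcm_driver_subset.nc")
iy <- ncvar_get(nc, "iy")
jx <- ncvar_get(nc, "jx")
nc_close(nc)

# Lambert conformal conic CRS; the parameter values below are placeholders,
# and the real ones come from the grid mapping attributes in the netcdf metadata.
lcc_proj4 <- "+proj=lcc +lat_1=33 +lat_2=45 +lat_0=40 +lon_0=-97 +x_0=0 +y_0=0 +datum=WGS84 +units=m"

cell_size <- 25000           # 25 km cells, in meters
x_centers <- jx * 1000       # km -> m
y_centers <- iy * 1000

# Build a square grid spanning the cell centers, one 25 km x 25 km cell each.
grid_extent <- st_as_sfc(st_bbox(c(
  xmin = min(x_centers) - cell_size / 2, xmax = max(x_centers) + cell_size / 2,
  ymin = min(y_centers) - cell_size / 2, ymax = max(y_centers) + cell_size / 2),
  crs = st_crs(lcc_proj4)))
gcm_grid_sf <- st_sf(geometry = st_make_grid(grid_extent, cellsize = cell_size))

# Reprojecting to WGS84 shows the curvature described above.
gcm_grid_wgs84 <- st_transform(gcm_grid_sf, 4326)
```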
oh, wow. Thanks @hcorson-dosch - I figured there was something a little different about this dataset. Seems having the grid in the native dataset CRS is best, and then projecting the lake centroids into that to calculate the match-ups. Verifying with David/Dave on this is :+1:
Should we set this up to munge to 6 NetCDF files: one for each GCM that includes the 3 time periods (1980-1999, 2040-2059, 2080-2099)? Curious what sort of organization would be most helpful for https://github.com/USGS-R/lake-temperature-process-models
Lindsay and I chatted, and we decided to go with 1 netCDF per GCM, for now. Lindsay's plan is to save the timeseries data associated with each time period as 'chunks' within the netCDF, which will make it easier to pull out the full timeseries for each variable for each time period in lake-temperature-process-models
Moving further NetCDF formatting discussions over to #252
Related issues that stemmed from this one: #236, #237, #244, #252, #254, #258, #264, #265, #273. Will close this one once we join the current branch `gcm_driver_data_munge_pipeline` with `main`.
Just revisiting this since we've decided to store only the WGS84 decimal degree coordinates as lat/lon values in the intermediary GCM netCDF files, rather than including the projected coordinates and a grid mapping variable (see #252).
I wanted to check back in on some of the earlier funkiness I'd noticed. In later iterations of this work (before we switched to rebuilding the grid in R), I realized that the package I had been using to define the projection of the grid in Python (`pyproj`) generated a proj4 string that wasn't recognized correctly by R when I read in the exported shapefile, which led me to switch to using the `cartopy` package's proj4 string instead. When I made that switch, it also fixed the two issues I'd seen above re: alignment of the grids built either from the projected coordinates or the DD coordinates.
Previously:
> The only remaining odd point is that if I build the grid cells using the `xlat` and `xlon` values (NOT a square grid, but curved) and define the coordinate system as WGS84, the resulting grid (in green) matches the shape and curvature of the online preview given for the dataset (the default for OpenLayers is web mercator) but not the grid I built with the `iy` and `jx` values and the projection information, reprojected to WGS84 (in blue):
But with the use of the `cartopy` proj4 string when defining the grid built with the projected coordinates, that grid, if reprojected (in blue), does align (not perfectly, but well) with the grid built using the WGS84 `xlat` and `xlon` values (in green):
And again, previously:
> And if I reproject that WGS84 grid built with the `xlat` and `xlon` values to the custom lambert conformal conic projection defined in the metadata, it (in green) doesn't line up with the square grid constructed using the `iy` and `jx` values (in blue):
But now, when I used the same `cartopy` proj4 string to define the projection of the grid built from the projected `iy` and `jx` coordinates (in blue), the grid built using the DD `xlat` and `xlon` values, if reprojected from WGS84 to the grid crs (in green), aligns well, if not perfectly:
So I think we can now be confident that the `xlat` and `xlon` values are the unprojected decimal degree equivalents of the cross points in projected coordinate space, `iy` and `jx`.
Create targets pipeline to munge GCM driver data from GDP using `geoknife` to a single (or few) NetCDF file(s). It should work with `2_crosswalk_munge/out/centroid_lakes_sf.rds.ind` as the input for the spatial geom & download data for all 6 of the GCMs (map over these using a data.frame of URLs & time periods). We will create a wrapper function that builds this `targets` pipeline and creates the single NetCDF file. The function will be called from the big `scipiper` pipeline. Plan is to pilot this workflow with 5 lakes, then scale to all of MN, then to the whole footprint.
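A very rough skeleton of how that wrapper-built pipeline could be laid out with `targets` (the GCM table, target names, and the `get_gcm_data()`/`write_gcm_nc()` helpers are hypothetical placeholders, not the actual pipeline code):

```r
library(targets)

# Hypothetical table of the 6 GCMs; names and URLs here are placeholders.
gcm_info <- data.frame(
  gcm_name = c("GCM_A", "GCM_B"),
  gcm_url = c("placeholder_url_a", "placeholder_url_b"),
  stringsAsFactors = FALSE
)

list(
  # Lake centroids produced upstream by the scipiper pipeline.
  tar_target(lake_centroids_sf,
             readRDS("2_crosswalk_munge/out/centroid_lakes_sf.rds")),
  tar_target(gcm_table, gcm_info),
  # One dynamic branch per GCM (and, per the discussion above, per job tile within a GCM).
  tar_target(gcm_raw_data,
             get_gcm_data(gcm_table, lake_centroids_sf),   # hypothetical geoknife wrapper
             pattern = map(gcm_table)),
  # Combiner target that writes the NetCDF file(s).
  tar_target(gcm_nc_file,
             write_gcm_nc(gcm_raw_data),                   # hypothetical NetCDF writer
             format = "file")
)
```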
Considerations: