DOI-USGS / ds-pipelines-targets-example-wqp

An example targets pipeline for pulling data from the Water Quality Portal (WQP)

Unexpected branches in `p1_wqp_data_aoi` rebuilding with change in AOI #63

Closed lekoenig closed 2 years ago

lekoenig commented 2 years ago

If I expand the area of interest to include new grids in p1_global_grid_aoi, I expect that those unchanged grids will be skipped when building p1_wqp_data_aoi since those inventory data have not changed. Instead, all branches seem to be re-building!

For this example, I've used the wqp args below and modified p1_AOI_sf to represent a rectangular polygon rather than a triangle. I then ran tar_make().

# Define which parameter groups (and CharacteristicNames) to return from WQP 
# options for parameter groups are represented in first level of 1_fetch/cfg/wqp_codes.yml
param_groups_select <- c('conductivity')

# Specify coordinates that define the spatial area of interest
# lat/lon are referenced to WGS84
coords_lon <- c(-77.063, -75.333, -75.437)
coords_lat <- c(40.547, 41.029, 39.880)

  # Create a spatial (sf) object representing the area of interest
  tar_target(
    p1_AOI_sf,
    {
      aoi <- sf::st_as_sf(p1_AOI, coords = c("lon","lat"), crs = 4326) %>%
        summarize(geometry = st_combine(geometry)) %>%
        sf::st_cast("POLYGON")
      sf::st_bbox(aoi) %>%
        st_as_sfc()
    }
  ),

Now, say I want to expand the area of interest to capture sites further westward, so I adjust the longitude coordinates in _targets.R:

coords_lon <- c(-77.63, -75.333, -75.437)

Here's a visual of the expanded AOI, where the new AOI is in black on top of the old AOI in blue. Therefore, the only data that should change are the sites in grids 11572 and 11752.

[Image: Rplot]

But when I re-run tar_make(), data is re-downloaded for all of the 1639 sites 😱

...
* start branch p1_wqp_data_aoi_f5119172
Retrieving WQP data for sites 1:83
* built branch p1_wqp_data_aoi_f5119172
* start branch p1_wqp_data_aoi_923ae52f
Retrieving WQP data for sites 84:190
* built branch p1_wqp_data_aoi_923ae52f
* start branch p1_wqp_data_aoi_20d55077
Retrieving WQP data for sites 191:690
* built branch p1_wqp_data_aoi_20d55077
* start branch p1_wqp_data_aoi_03c2aacd
Retrieving WQP data for sites 691:1190
* built branch p1_wqp_data_aoi_03c2aacd
* start branch p1_wqp_data_aoi_75169c97
Retrieving WQP data for sites 1191:1396
* built branch p1_wqp_data_aoi_75169c97
* start branch p1_wqp_data_aoi_67aa4733
Retrieving WQP data for sites 1397:1639
...

lekoenig commented 2 years ago

To me, there are two likely culprits here:

  1. We currently include an attribute called site_n in p1_site_counts_grouped. The site number was meant to be useful to inform the user about the progress of the data download step. However, if we expand the AOI, I could see these site numbers being re-tabulated, thus triggering a full download.
  2. The lat/lon fields in p1_site_counts_grouped are of class numeric; however, they appear to have site or row numbers attached, for example:
> tar_load(p1_site_counts_grouped)
> class(p1_site_counts_grouped$lat)
[1] "numeric"
> p1_site_counts_grouped$lat[1]
     67 
39.9389 
> 

If those values ("67" above) change, I wonder if that triggers a rebuild of all the data. In either case, it would probably be more robust to coerce lat and lon to plain, unnamed numeric vectors (e.g. with as.numeric()) in transform_site_locations().

jordansread commented 2 years ago

That's a tricky gotcha - helpful info for inspection causes a greedy rebuild.

One way to handle that and retain the intent of the site_n information would be to use group_site_n instead, which would be site_n, but restarts at 1 for each of the groups. For example, you'd have two group_site_n values of "1" if you had two groups.
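A minimal sketch of the group_site_n idea, using base R only. The data frame, the `grp` and `site_id` columns, and the `add_counters()` helper are all hypothetical stand-ins, not the actual structure of p1_site_counts_grouped. It assumes site_n is a running index over all rows, so adding sites ahead of an existing group shifts that group's site_n values, while a per-group counter stays the same.

```r
# Hypothetical site table: two download groups, rows in download order.
sites_old <- data.frame(
  grp     = c(10, 10, 20, 20, 20),
  site_id = c("A", "B", "C", "D", "E")
)

# Expanding the AOI adds a new group (5) that sorts ahead of the others.
sites_new <- rbind(
  data.frame(grp = c(5, 5), site_id = c("Y", "Z")),
  sites_old
)

# Hypothetical helper: add a global counter and a per-group counter.
add_counters <- function(sites) {
  idx <- seq_len(nrow(sites))
  sites$site_n <- idx                                         # running index over all rows
  sites$group_site_n <- ave(idx, sites$grp, FUN = seq_along)  # restarts at 1 in each group
  sites
}

old <- add_counters(sites_old)
new <- add_counters(sites_new)

old_grp20 <- old[old$grp == 20, ]
new_grp20 <- new[new$grp == 20, ]

# The global counter for the unchanged group shifts when sites are added...
site_n_unchanged <- identical(old_grp20$site_n, new_grp20$site_n)

# ...but the per-group counter does not.
group_site_n_unchanged <- identical(old_grp20$group_site_n, new_grp20$group_site_n)
```

Here `site_n_unchanged` is FALSE and `group_site_n_unchanged` is TRUE, so a branch that carries only group_site_n would see identical input for group 20 before and after the AOI change.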

Looks like p1_site_counts_grouped$lat is perhaps a named vector? Like c("67" = 39.9389)
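To illustrate the named-vector hypothesis, here is a small base R sketch. The values are copied from the console output earlier in the thread; the variable names are made up for illustration. It shows that a named numeric vector is not identical to the same value without names, and that unname() or as.numeric() strips the names. Because the names are part of the object, a change in names alone would plausibly be enough to change the object's hash.

```r
# A named numeric vector, as suspected for p1_site_counts_grouped$lat
lat_named <- c("67" = 39.9389)
lat_plain <- 39.9389

class(lat_named)   # still "numeric"
names(lat_named)   # "67"

# Same value, but the names make the objects differ
same_with_names <- identical(lat_named, lat_plain)

# Dropping the names restores equality
same_after_unname     <- identical(unname(lat_named), lat_plain)
same_after_as_numeric <- identical(as.numeric(lat_named), lat_plain)

# Two vectors that differ only in their names are also not identical
same_diff_names <- identical(c("67" = 39.9389), c("68" = 39.9389))
```

Only the two comparisons made after dropping the names evaluate to TRUE.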