DOI-USGS / ds-pipelines-targets-example-wqp

An example targets pipeline for pulling data from the Water Quality Portal (WQP)

Update regex used to find bad identifiers #91

Closed · lekoenig closed this 2 years ago

lekoenig commented 2 years ago

This PR updates the regex used to find bad site identifiers to reflect recent changes to the WQP web service.

Using the previous version of identify_bad_ids(), we would have found the following "bad" identifiers in the test set below:

> bad_sites_test <- tibble(MonitoringLocationIdentifier = c("USGS-01234", 
"COE/ISU-27630001","ALABAMACOUSHATTATRIBE.TX_WQX-TL-007","NALMS-C41.59831,-93.60861"))
> identify_bad_ids(bad_sites_test)
# A tibble: 2 x 1
  site_id                            
  <chr>                              
1 COE/ISU-27630001                   
2 ALABAMACOUSHATTATRIBE.TX_WQX-TL-007

With the changes in this PR, the same function returns no bad identifiers, because the identifiers above are now accepted by the WQP web service.

> identify_bad_ids(bad_sites_test)
# A tibble: 0 x 1
# ... with 1 variable: site_id <chr>
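For readers without the pipeline source handy, here is a minimal base-R sketch that reproduces the previous behavior on this test set. The prefix rule is an assumption on my part (the real identify_bad_ids() operates on the inventory tibble, and its exact regex differs):

```r
# Hypothetical sketch of the pre-change check: flag identifiers whose
# organization prefix (the text before the first hyphen) contains
# characters other than letters, digits, or underscores.
identify_bad_ids_sketch <- function(site_ids) {
  org_prefix <- sub("-.*", "", site_ids)
  site_ids[grepl("[^A-Za-z0-9_]", org_prefix)]
}

bad_sites_test <- c("USGS-01234", "COE/ISU-27630001",
                    "ALABAMACOUSHATTATRIBE.TX_WQX-TL-007",
                    "NALMS-C41.59831,-93.60861")
identify_bad_ids_sketch(bad_sites_test)
# returns "COE/ISU-27630001" and "ALABAMACOUSHATTATRIBE.TX_WQX-TL-007"
```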

I also tried this out using the following inputs in _targets.R:

# Specify coordinates that define the spatial area of interest
# lat/lon are referenced to WGS84
coords_lon <- c(-96.333, -87.8, -89)
coords_lat <- c(42.547, 45.029, 35)

# Specify arguments to WQP queries
# see https://www.waterqualitydata.us/webservices_documentation for more information 
wqp_args <- list(sampleMedia = c("Water","water"),
                 siteType = "Lake, Reservoir, Impoundment",
                 # return sites with at least one data record
                 minresults = 1, 
                 startDateLo = start_date,
                 startDateHi = end_date)
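To make the fragment above self-contained, the sketch below fills in placeholder dates and a bounding box; in the pipeline, start_date and end_date are defined elsewhere in _targets.R, and the bounding box comes from the AOI grid cells. The do.call line shows how such an argument list could be spliced into a dataRetrieval inventory request (commented out here because it makes a live web request):

```r
# Placeholder values (assumptions); the WQP web service expects
# dates formatted as MM-DD-YYYY
start_date <- "01-01-2000"
end_date <- "12-31-2020"

wqp_args <- list(sampleMedia = c("Water", "water"),
                 siteType = "Lake, Reservoir, Impoundment",
                 minresults = 1,  # return sites with at least one record
                 startDateLo = start_date,
                 startDateHi = end_date)

# Combine the shared arguments with a per-request bounding box
# (minLon, minLat, maxLon, maxLat)
query <- c(list(bBox = c(-96.333, 35, -87.8, 45.029)), wqp_args)
# inventory <- do.call(dataRetrieval::whatWQPdata, query)  # live request
```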

Before incorporating these changes, the resulting inventory contained 12 sites with "/" in their site identifiers, which the pipeline flagged:

> tar_make(p2_site_counts_grouped)
Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1; sf_use_s2() is TRUE
v skip target p1_global_grid
v skip target p1_wqp_params_yml
* start target p1_AOI
* built target p1_AOI
v skip target p1_wqp_params
* start target p1_AOI_sf
* built target p1_AOI_sf
v skip target p1_char_names_crosswalk
* start target p1_global_grid_aoi
* built target p1_global_grid_aoi
v skip target p1_char_names
...
* start target p1_wqp_inventory_aoi
Attempting to harmonize different site CRS...
Returned 4982 sites within area of interest.
* built target p1_wqp_inventory_aoi
* start target p2_site_counts
* built target p2_site_counts
* start target p2_site_counts_grouped
Some site identifiers contain undesired characters and cannot be parsed by WQP. Assigning 12 sites and 364 records with bad identifiers to their own download groups so that they can be queried separately using a different method.
* built target p2_site_counts_grouped
* end pipeline: 14.449 minutes
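The grouping that the log message describes can be sketched in base R roughly as follows. This is a simplification written for illustration, not the pipeline's actual code; max_sites and the column names are assumptions:

```r
# Simplified sketch: sites with parseable identifiers are batched into
# download groups of up to max_sites each (for comma-separated siteid
# queries); flagged sites get pull_by_id = FALSE and are later fetched
# with the bounding-box fallback.
group_site_counts_sketch <- function(site_ids, bad_ids, max_sites = 300) {
  pull_by_id <- !site_ids %in% bad_ids
  download_grp <- ave(seq_along(site_ids), pull_by_id,
                      FUN = function(i) ceiling(seq_along(i) / max_sites))
  data.frame(site_id = site_ids, pull_by_id = pull_by_id,
             download_grp = download_grp)
}
```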

After incorporating the changes in this PR and re-running tar_make(p2_site_counts_grouped), no sites are flagged as "bad", so we can pull all data by site id rather than falling back to the bounding-box approach 🎉

> tar_load(p2_site_counts_grouped)
> unique(p2_site_counts_grouped$pull_by_id)
[1] TRUE
>

Nice work getting this change incorporated into the web service, @jordansread!

Closes #89