DOI-USGS / ds-pipelines-targets-example-wqp

An example targets pipeline for pulling data from the Water Quality Portal (WQP)

Save "siteInfo" attributes in a new target #102

Closed · lekoenig closed this 1 year ago

lekoenig commented 1 year ago

This PR grabs the siteInfo attribute from each branch of p2_wqp_data_aoi to create a new target called p2_wqp_site_info. The site metadata attribute does not get retained when p2_wqp_data_aoi is built by binding the individual branches, and so the new site info target contains a table that users can reference or join with the data table if they wish.
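
For illustration, here is a minimal sketch of the idea (not the pipeline's actual code): each downloaded branch carries a siteInfo attribute that is dropped when the branches are bound together, so the metadata gets collected into its own table. `wqp_data_branches` is a hypothetical list of per-branch data frames standing in for the dynamic branches of p2_wqp_data_aoi.

# Sketch only: `wqp_data_branches` is a hypothetical stand-in for the
# per-branch data frames returned by dataRetrieval, each of which carries
# a "siteInfo" attribute.
wqp_data_aoi <- dplyr::bind_rows(wqp_data_branches)   # the siteInfo attribute is lost here

wqp_site_info <- wqp_data_branches |>
  lapply(attr, which = "siteInfo") |>   # pull the site metadata from each branch
  dplyr::bind_rows() |>
  dplyr::distinct()                     # one row per site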

As of the last time I built p2_wqp_data_aoi, we downloaded data from 552 sites within our example "watershed" and so p2_wqp_site_info contains site metadata for 552 sites:

> tar_load(p2_wqp_data_aoi)
> length(unique(p2_wqp_data_aoi$MonitoringLocationIdentifier))
[1] 552
> tar_load(p2_wqp_site_info)
> dim(p2_wqp_site_info)
[1] 552  43
> names(p2_wqp_site_info)
 [1] "station_nm"                                      "agency_cd"                                       "site_no"                                        
 [4] "dec_lat_va"                                      "dec_lon_va"                                      "hucCd"                                          
 [7] "OrganizationIdentifier"                          "OrganizationFormalName"                          "MonitoringLocationIdentifier"                   
[10] "MonitoringLocationName"                          "MonitoringLocationTypeName"                      "MonitoringLocationDescriptionText"              
[13] "HUCEightDigitCode"                               "DrainageAreaMeasure.MeasureValue"                "DrainageAreaMeasure.MeasureUnitCode"            
[16] "ContributingDrainageAreaMeasure.MeasureValue"    "ContributingDrainageAreaMeasure.MeasureUnitCode" "LatitudeMeasure"                                
[19] "LongitudeMeasure"                                "SourceMapScaleNumeric"                           "HorizontalAccuracyMeasure.MeasureValue"         
[22] "HorizontalAccuracyMeasure.MeasureUnitCode"       "HorizontalCollectionMethodName"                  "HorizontalCoordinateReferenceSystemDatumName"   
[25] "VerticalMeasure.MeasureValue"                    "VerticalMeasure.MeasureUnitCode"                 "VerticalAccuracyMeasure.MeasureValue"           
[28] "VerticalAccuracyMeasure.MeasureUnitCode"         "VerticalCollectionMethodName"                    "VerticalCoordinateReferenceSystemDatumName"     
[31] "CountryCode"                                     "StateCode"                                       "CountyCode"                                     
[34] "AquiferName"                                     "LocalAqfrName"                                   "FormationTypeText"                              
[37] "AquiferTypeName"                                 "ConstructionDateText"                            "WellDepthMeasure.MeasureValue"                  
[40] "WellDepthMeasure.MeasureUnitCode"                "WellHoleDepthMeasure.MeasureValue"               "WellHoleDepthMeasure.MeasureUnitCode"           
[43] "ProviderName"                                   
>
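
Since the site metadata now lives in its own target, a user who wants it alongside the data can join the two tables on MonitoringLocationIdentifier. A minimal sketch:

targets::tar_load(p2_wqp_data_aoi)
targets::tar_load(p2_wqp_site_info)

# Attach the site metadata to each record by monitoring location
wqp_data_with_sites <- dplyr::left_join(
  p2_wqp_data_aoi,
  p2_wqp_site_info,
  by = "MonitoringLocationIdentifier"
)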

@lindsayplatt, there's no rush to merge this PR; I'm just trying to wrap up a couple of lingering issues before some outreach that @padilla410 is planning for this spring. I'd suggest a ~3-week turnaround, but let me know if you're swamped. This is the last set of code changes I'll make before releasing a new tag (see #103).

Closes #95

lindsayplatt commented 1 year ago

I did just notice that the actual data frame returned appears to be missing 6 of the columns that are contained in the siteInfo attribute. So, we could grab siteInfo from a separate "Station" query instead:

# Query the WQP "Station" service directly for a single site
station_data <- dataRetrieval::readWQPdata(
  siteid = "USGS-04024315",
  service = "Station"
)

# Site metadata is attached to the returned data frame as an attribute
siteInfo <- attr(station_data, 'siteInfo')

# How many columns of the returned data frame are absent from siteInfo?
sum(!names(station_data) %in% names(siteInfo))
# How many siteInfo columns are absent from the returned data frame?
sum(!names(siteInfo) %in% names(station_data))

# Which siteInfo columns does the returned data frame lack?
names(siteInfo)[!names(siteInfo) %in% names(station_data)]

[1] "station_nm" "agency_cd"  "site_no"    "dec_lat_va" "dec_lon_va" "hucCd"

lekoenig commented 1 year ago

Great point about scalability, @lindsayplatt. I considered that, but I hadn't fully appreciated the extent of I/O required to get the attributes at the national scale (for qs and rds files), or that this approach may not work for all file types. I think making a call to the "Station" service is a good idea. And maybe we should set ignore_attributes = TRUE in our actual WQP call if we do.

> I did just notice that the actual data frame returned appears to be missing 6 of the columns that are contained in the siteInfo attribute

I think these are just repeated columns that get created within readWQPdata (here), so we could even recreate those. I assume they're just meant to be more general column names and/or to match NWIS format.
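
For illustration, a sketch of how those six NWIS-style columns could be recreated from the WQP-named ones. The mappings below are my assumptions about what readWQPdata does internally, not code taken from dataRetrieval, and `site_info` stands in for the table returned by the "Station" service.

# Sketch only: assumed column mappings, not verified against dataRetrieval's source
site_info_nwis_style <- dplyr::mutate(
  site_info,
  station_nm = MonitoringLocationName,
  agency_cd  = OrganizationIdentifier,
  site_no    = MonitoringLocationIdentifier,
  dec_lat_va = LatitudeMeasure,
  dec_lon_va = LongitudeMeasure,
  hucCd      = HUCEightDigitCode
)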

lindsayplatt commented 1 year ago

I like the idea of using ignore_attributes = TRUE in the WQP data call. It may cut down on some of the file sizes for those big pulls.
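
A minimal sketch of what that could look like, assuming a dataRetrieval version that supports the ignore_attributes argument (the query parameters below are placeholders, not the pipeline's actual inputs):

# Skip the extra attribute-building requests so the stored object stays smaller
wqp_data <- dataRetrieval::readWQPdata(
  siteid = "USGS-04024315",
  characteristicName = "Temperature, water",
  ignore_attributes = TRUE
)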

lekoenig commented 1 year ago

OK, in the latest commits I've reverted to calling the WQP "Station" service rather than trying to subset the attributes from the downloaded data frames. I agree that this approach is more robust for pipelines of varying scales.
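
For context, a hypothetical sketch of what the new target could look like in _targets.R; `p1_site_ids` is a made-up placeholder for whatever upstream target holds the site identifiers, not a real target name from this repo.

library(targets)

# Sketch only: `p1_site_ids` is a hypothetical upstream target holding the
# site identifiers to query against the WQP "Station" service
tar_target(
  p2_wqp_site_info,
  dataRetrieval::readWQPdata(
    siteid = p1_site_ids,
    service = "Station"
  )
)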

Specific edits include: