DOI-USGS / nhdplusTools

See official repository at: https://code.usgs.gov/water/nhdplusTools
https://doi-usgs.github.io/nhdplusTools/
Creative Commons Zero v1.0 Universal

Need to be able to pull subsets of NHDPlusV2 attributes in a flexible way. #304

Closed: dblodgett-usgs closed this issue 1 year ago

dblodgett-usgs commented 1 year ago

Will flesh out details here later, but we need to be able to pull out subsets of attributes from one or more data sources. A good starting point is the base attributes that are in the NLDI and in sciencebase. How to work with time series of such attributes is something to consider in design.

dblodgett-usgs commented 1 year ago

Chatting with @mikejohnson51 just now...

#' @param varname character vector of desired variables retrieved from \link{discover_catchment_characteristics}
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param reference_fabric (not used) will be used to allow future specification of alternate reference fabrics
get_catchment_characteristics(varname, ids = NULL, reference_fabric = "nhdplusv2")
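
For illustration, a call under this proposed signature might look like the following (a sketch only; the COMIDs here are hypothetical examples, and the variable name is one used later in this thread):

# hypothetical usage of the proposed signature:
# pull 1990 population density for three example COMIDs
get_catchment_characteristics("TOT_POPDENS90",
                              ids = c(13293392, 13293750, 13294030))
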
dblodgett-usgs commented 1 year ago

OK -- I stood up some sample data and as of now we can do:

library(arrow)
library(dplyr)

# read the characteristic metadata table from ScienceBase
metadata <- readr::read_csv("https://www.sciencebase.gov/catalog/file/get/636a7053d34ed907bf6a6874?name=metadata_table.csv")

# look up the row for 1990 population density
i <- metadata[metadata$ID == "TOT_POPDENS90",]

# derive the ScienceBase item id from the dataset URL
id <- gsub("https://www.sciencebase.gov/catalog/item/", "", i$datasetURL)

url <- paste0("s3://prod-is-usgs-sb-prod-publish/636a7053d34ed907bf6a6874/", id, "_tot.parquet?region=us-west-2")

# either read the whole file eagerly ...
p <- arrow::read_parquet(url)

# ... or open it lazily as a dataset so filters can be applied before download
ds <- open_dataset(url)

library(nhdplusTools)
library(sf)

start_point <- st_sfc(st_point(c(-89.362239, 43.090266)), crs = 4269)
start_comid <- discover_nhdplus_id(start_point)

flowline <- navigate_nldi(list(featureSource = "comid", 
                               featureID = start_comid), 
                          mode = "upstreamTributaries", 
                          distance_km = 1000)

# subset the dataset to the upstream COMIDs and pull it into memory
sub <- filter(ds, COMID %in% flowline$UT_flowlines$nhdplus_comid)

collect(sub)
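
Because open_dataset() is lazy, the filter() is not evaluated until collect(), at which point arrow can push the COMID predicate down into the parquet scan, so only the matching rows are materialized rather than the whole file.
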
dblodgett-usgs commented 1 year ago

We will need a characteristic discovery function...

#' @param search_term character term to search the metadata for
#' @param reference_fabric character (unused per above)
#' @return data.frame containing metadata for available characteristics
discover_catchment_characteristics(search_term = NULL, reference_fabric = "nhdplusv2")
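
Again purely for illustration (the search term here is a hypothetical example; the output columns follow the metadata table used above):

# hypothetical usage of the proposed discovery function
chars <- discover_catchment_characteristics(search_term = "population")

chars[c("ID", "description", "units")]
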
cheginit commented 1 year ago

In HyRiver, I created a "metadata" table of all these characteristics by using ScienceBase's API and iteratively pulling the metadata for "children" datasets. It was a bit tricky since their metadata didn't all have the same structure, so there were some edge cases that I needed to take care of. You can find the function here.

If interested, I can save the generated dataframe as a parquet file and share it with you.
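
A minimal R sketch of the same idea, assuming the sbtools package and using the ScienceBase root item id that appears elsewhere in this thread:

library(sbtools)

# recursively collect ids and titles for all descendants of a
# ScienceBase item, since some "child" datasets have children of
# their own and need one more level of traversal
list_descendants <- function(sb_id) {
  children <- item_list_children(sb_id, limit = 1000)

  do.call(rbind, lapply(children, function(child) {
    rbind(data.frame(id = child$id, title = child$title,
                     stringsAsFactors = FALSE),
          list_descendants(child$id))
  }))
}

all_items <- list_descendants("5669a79ee4b08895842a1d47")
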

dblodgett-usgs commented 1 year ago

Thanks! I actually have a whole workflow to clean up and unify this into a metadata table, a subset of which is here. https://www.sciencebase.gov/catalog/file/get/636a7053d34ed907bf6a6874?name=metadata_table.csv

Someone else within the Survey took a similar approach as well -- I would imagine he will take part in this thread at some point.

How did you deal with the annual time varying characteristics? Years treated as unique characteristics?

cheginit commented 1 year ago

Right -- I was actually inspired by the NLDI's metadata, but since it didn't have all the characteristics, I decided to pull them from the source!

Regarding the time-varying chars, yes, each year is considered a unique char. There are 609 rows, and the columns are "name", "type", "theme", "description", "url" (data url), and "meta" (metadata url).

dblodgett-usgs commented 1 year ago

OK -- I would be very interested in learning from / stealing your logic for the yearly ones so I can build them into this set of parquet files. Sounds like you have the heavy lifting done.

cheginit commented 1 year ago

I didn't do anything specific for the years; some chars are just named by year, so there are several of them. The only tricky part was that some "child" datasets had their own "children", so for those I had to go one level deeper. Maybe some of those were time varying, I don't remember exactly.

mjcashman commented 1 year ago

Yes, I think the NLCD data ended up having an additional layer, specifically for the 2016 (and 2019) NLCD refreshes, which also contained refreshes for all previous years. For my handling of time-varying characteristics, I pulled all the years directly, then had a separate function that matched the year characters in the variable names to the observation dates of my paired data points, returning a single static (non-year-varying) value for all variables of that data type. It does sound like these new functions might want to approach things in a slightly different way. I could see us recoding the time-varying variables to have a separate year or date column, to make filtering and delivery more streamlined, but that would mean altering Mike's original data structure and metadata.
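
As a rough illustration of that recoding idea (a sketch only; the wide table, the two-digit year convention, and the hard-coded century cutoff are all hypothetical stand-ins):

library(dplyr)
library(tidyr)

# hypothetical wide table with year-suffixed characteristic names
wide <- data.frame(COMID = c(13293392, 13293750),
                   CAT_POPDENS90 = c(12.1, 3.4),
                   CAT_POPDENS00 = c(14.7, 3.9))

# pivot long and split the trailing two-digit year out of the name
long <- wide |>
  pivot_longer(-COMID, names_to = "characteristic", values_to = "value") |>
  mutate(yy = as.integer(sub(".*(\\d{2})$", "\\1", characteristic)),
         year = ifelse(yy > 30, 1900L, 2000L) + yy,
         characteristic = sub("\\d{2}$", "", characteristic)) |>
  select(-yy)
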

mewieczo commented 1 year ago

Over the years, the two most common types of request I have gotten have been to provide (mostly TOT) values for these data by HUC (2, 4, 8, 12, etc.) or for a given list of USGS stream gages. I do have my own crosswalks to do this sort of data retrieval, as all the data resides on my office computer.

mjcashman commented 1 year ago

For the initial pull of the time-varying characteristics, it does seem like it would be useful to pull either the full range of time, a range of time, or just a specific time, so including date_range as a separate parameter would be useful. NLCD is broken down by years; I think others are at a monthly timestep? Would need to check.

dblodgett-usgs commented 1 year ago

OK -- yeah, I like these ideas. @mikejohnson51 had suggested adding an AOI input, which aligns, to a degree, with the use case of using HUCs or stream gages. I like the idea of being able to specify an AOI from a controlled list that we already have the reference fabric ids for.

I do want to add a field for the time period each characteristic represents, which would allow a person to work over the time-varying variables in a flexible way as well. I think we can work that into the metadata table and just work up some nice examples showing how to form "time series" out of the data.

I'll put some more thought into these use cases and work up an updated function signature. @mikejohnson51 do you have anything to add to the pile?
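
Pulling the thread's suggestions together, an updated signature might look something like the sketch below (the aoi and date_range parameters are hypothetical design ideas from this discussion, not a committed API):

#' @param varname character vector of desired variables
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param aoi (hypothetical) area of interest from a controlled list,
#'   e.g. HUCs or stream gage ids, resolved to reference-fabric ids
#' @param date_range (hypothetical) length-two vector bounding the time
#'   period that time-varying characteristics represent
#' @param reference_fabric (not used) placeholder for alternate fabrics
get_catchment_characteristics(varname, ids = NULL, aoi = NULL,
                              date_range = NULL,
                              reference_fabric = "nhdplusv2")
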

mewieczo commented 1 year ago

@mjcashman-usgs , yes there are; for example, water balance model inputs and outputs that are monthly but are parsed out into yearly files with the numeric year embedded in the file name.

dblodgett-usgs commented 1 year ago

OK -- we are getting there.

library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(nhdplusTools)
library(sf)
#> Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

metadata <- readr::read_delim("https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/5669a79ee4b08895842a1d47/metadata_table.tsv")
#> Rows: 4281 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (8): ID, description, units, datasetLabel, datasetURL, themeLabel, theme...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

i <- metadata[metadata$ID == "TOT_IEOF",]

id <- gsub("https://www.sciencebase.gov/catalog/item/", "", i$datasetURL)

# access the bucket anonymously through the arrow S3 filesystem interface
bucket <- arrow::s3_bucket("s3://prod-is-usgs-sb-prod-publish", anonymous = TRUE, region = "us-west-2")

p <- arrow::read_parquet(bucket$path(paste0(id, "/", id, "_tot.parquet")))

# or open the same file lazily with an anonymous-access S3 URI
ds <- open_dataset(paste0("s3://anonymous@prod-is-usgs-sb-prod-publish/", 
                          id, "/", id, "_tot.parquet?region=us-west-2"))

start_point <- st_sfc(st_point(c(-89.362239, 43.090266)), crs = 4269)
start_comid <- discover_nhdplus_id(start_point)

flowline <- navigate_nldi(list(featureSource = "comid", 
                               featureID = start_comid), 
                          mode = "upstreamTributaries", 
                          distance_km = 1000)

cat <- nhdplusTools::subset_nhdplus(as.integer(flowline$UT_flowlines$nhdplus_comid), 
                                    nhdplus_data = "download", flowline_only = FALSE)
#> All intersections performed in latitude/longitude.
#> Reading NHDFlowline_Network
#> Spherical geometry (s2) switched off
#> Spherical geometry (s2) switched on
#> Writing NHDFlowline_Network
#> Reading CatchmentSP
#> Spherical geometry (s2) switched off
#> Spherical geometry (s2) switched on
#> Writing CatchmentSP
#> Spherical geometry (s2) switched off
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar
#> Spherical geometry (s2) switched on
#> Spherical geometry (s2) switched off
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar
#> Spherical geometry (s2) switched on
#> Spherical geometry (s2) switched off
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar
#> Spherical geometry (s2) switched on

sub <- filter(ds, COMID %in% flowline$UT_flowlines$nhdplus_comid)

att <- collect(sub)

plot_cat <- left_join(cat$CatchmentSP, att, c("featureid" = "COMID"))

plot(plot_cat["TOT_IEOF"])

Created on 2022-11-29 with reprex v2.0.2

dblodgett-usgs commented 1 year ago

We now have the following on the hydroloom branch:

library(nhdplusTools)
library(sf)
#> Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

sf::sf_use_s2(FALSE)
#> Spherical geometry (s2) switched off

metadata <- get_characteristics_metadata(search = "SATOF")

metadata[c("ID", "description", "units")]
#>           ID                                    description   units
#> 19 CAT_SATOF Percentage of Dunne overland flow as a percent percent
#> 20 ACC_SATOF Percentage of Dunne overland flow as a percent percent
#> 21 TOT_SATOF Percentage of Dunne overland flow as a percent percent

var <- "CAT_SATOF"

start_point <- st_sfc(st_point(c(-89.362239, 43.090266)), crs = 4269)
start_comid <- discover_nhdplus_id(start_point)

flowline <- navigate_nldi(list(featureSource = "comid", 
                               featureID = start_comid), 
                          mode = "upstreamTributaries", 
                          distance_km = 1000)

cat <- nhdplusTools::subset_nhdplus(as.integer(flowline$UT_flowlines$nhdplus_comid), 
                                    nhdplus_data = "download", flowline_only = FALSE)
#> All intersections performed in latitude/longitude.
#> Reading NHDFlowline_Network
#> Writing NHDFlowline_Network
#> Reading CatchmentSP
#> Writing CatchmentSP
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar
#> although coordinates are longitude/latitude, st_intersects assumes that they
#> are planar

sub <- nhdplusTools::get_catchment_characteristics(var, cat$CatchmentSP$featureid)

plot_cat <- left_join(cat$CatchmentSP, sub, c("featureid" = "comid"))

plot(plot_cat["characteristic_value"])

Created on 2023-01-31 with reprex v2.0.2
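
As a follow-on sketch, varname is documented as a character vector, so several characteristics can be requested in one call (the return shape with comid and characteristic_value columns follows the reprex above):

# request the catchment-scale and total-upstream variants together
vars <- c("CAT_SATOF", "TOT_SATOF")

sub <- nhdplusTools::get_catchment_characteristics(vars, cat$CatchmentSP$featureid)
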

dblodgett-usgs commented 1 year ago

See https://doi-usgs.github.io/nhdplusTools/news/index.html -- this is now on main and on its way to CRAN.