How are we accounting for tidal sites?

USGS-R / drb-inland-salinity-ml

Code repo for Delaware River Basin machine learning models that predict inland salinity.

Creative Commons Zero v1.0 Universal

3 stars 4 forks source link

How are we accounting for tidal sites? #241

Closed lekoenig closed 1 year ago

lekoenig commented 1 year ago

We currently use a few different approaches to try to restrict our analysis to inland (freshwater) streams in the DRB, including excluding sites along the mainstem Delaware River below the head-of-tide in ~Trenton and only requesting certain NWIS site types. Based on initial analysis of the model residuals it seems like some tidal sites remain in the dataset and are potentially skewing model error.

I want to revisit, how are we accounting for tidal sites in the discrete WQP data? Is there another filtering step we should use when creating the input data?

lekoenig commented 1 year ago

Looking at the WQP metadata, there are a few columns that may contain information or metadata related to likelihood of tidal influence:

MonitoringLocationTypeName (e.g. MonitoringLocationTypeName == "Stream: Tidal stream")
MonitoringLocationDescriptionText (only available from the "Site" service)
ActivityMediaName and ActivityMediaSubdivisionName (e.g. ActivityMediaSubdivisionName == "Estuary")
HydrologicEvent (e.g. HydrologicEvent == "Tidal action")

In the original DRB harmonization code, there are no steps that explicitly aim to filter out tidally-influenced sites. However, the code does filter records so that ActivityMediaSubdivisionName only includes a handful of accepted values ("Surface Water", "Suspended", "Water", "Bottom material") that would omit any "Estuary" samples.

This repository (what we are doing already):

We currently take the following steps to modify the discrete data produced by the DRB harmonization code mentioned above when creating the input dataset used for modeling:

ActivityMediaName does not equal "Sediment" (ActivityMediaName equals "Water")
HydrologicEvent is not equal to "Spill" or "Volcanic action"
Data passes all QC flags introduced in the original DRB harmonization code
ActivityStartDate is within the requested date range (10/1/1979 - 12/31/2021)
MonitoringLocationTypeName is not equal to "Stream: Ditch" and does not contain the string "tidal"
We grouped records by collecting organization (OrganizationIdentifier), site id (MonitoringLocationIdentifier), lat/lon (LongitudeMeasure and LatitudeMeasure), and date-time and if there were duplicates, kept one record within the set. We preferentially retained lab records, then records labeled as "Specific conductance", then field values, then field values that represent the mean value of a set of samples (param == "Specific conductance, field, mean").
We filter out discrete water quality sites from mainstem reaches along the Delaware River below the head-of-tide (when building the target p2_wqp_SC_filtered).

This repository (what further steps could we take):

Based on the above, I unfortunately think we've used all of the low-hanging fruit/metadata for identifying tidal sites. There's a couple other steps I think we could add:

Filter out records where HydrologicEvent == "Tidal action". However, there are only 7 total records that match this criteria and they are not proximal to the estuary mouth where we've seen high model residuals, so I'm not sure how helpful this will be.

For sites that have multiple observations (giving us confidence 1 storm event isn't driving the summarized values, for example), we could omit sites where the mean value is greater than some maximum cutoff like 1/2 the specific conductance of seawater.
We could omit any sites that are matched to segments that drain directly to one of the mainstem reaches below the head of tide. I don't really have any evidence that this would omit tidally-influenced sites, but seems like a reasonable proxy assumption.

jds485 commented 1 year ago

Thanks for summarizing these steps! I think most of this information should be included in our manuscript text or SI.

Do you know how tidal streams were identified for the WQP? If it's not from NWIS, we could use the NWIS codes / flags for tidal reaches as a way to identify PRMS segments to omit. That would be a more time-consuming but possibly more accurate way to implement your 3rd suggestion. The 3rd suggestion would be my preference and I think straightforward to do using the distance matrix.

lekoenig commented 1 year ago

Do you know how tidal streams were identified for the WQP?

My understanding is that this information is supplied by the user. I queried all of the DRB HUC8 boundaries for WQP sites with MonitoringLocationTypeName == "Stream: Tidal stream" (using sampleMedia = c("Water", "water", "Sediment") as the only other search criteria). This search returned 21 sites, all of which were managed by the USGS New Jersey WSC. So this makes me believe that the site type is entered this way to match the information in NWIS, and that other water quality monitoring organizations don't seem to be using this field to identify tidal streams.

we could use the NWIS codes / flags for tidal reaches as a way to identify PRMS segments to omit. That would be a more time-consuming but possibly more accurate way to implement your 3rd suggestion.

Agreed that using NWIS sites flagged as tidal provides useful information for which segments to omit. The risk is that the 36 NWIS tidal sites in the DRB (that get matched to 16 unique river segments) aren't comprehensive, so we might retain tidally-influenced WQP sites if they fall on a reach that doesn't overlap one of those highlighted in yellow below.

# code to ID which NWIS sites are flagged as tidal
library(dataRetrieval)
library(sf)

source("2_process/src/match_sites_reaches.R")
targets::tar_load(p2_sites_w_segs)
targets::tar_load(p1_reaches_sf)

drb_huc8s <- c("02040101","02040102","02040103","02040104","02040105",
               "02040106","02040201","02040202","02040203","02040204",
               "02040205","02040206","02040207")

tidal_sites <- list()

for(i in seq_along(drb_huc8s)){
  dat <- whatNWISsites(huc = drb_huc8s[i])
  dat_sub <- filter(dat, site_tp_cd == "ST-TS")
  tidal_sites[[i]] <- dat_sub
}

tidal_sites_df <- bind_rows(tidal_sites) |>
  mutate(site_id = paste0("USGS-", site_no)) |>
  rename(lon = dec_long_va,
         lat = dec_lat_va)

# these NWIS tidal sites do not contain SC data
all(tidal_sites_df$site_id %in% p2_sites_w_segs$site_id)

# match NWIS tidal sites to NHGF/PRMS segments
tidal_sites_w_segs <- get_site_flowlines(reach_sf = p1_reaches_sf, 
                                         sites = tidal_sites_df, 
                                         sites_crs = 4269, 
                                         max_matches = 1)

# plot tidal reaches
tidal_reaches <- filter(p1_reaches_sf, segidnat %in% tidal_sites_w_segs$segidnat)

The 3rd suggestion would be my preference and I think straightforward to do using the distance matrix.

Below is some code I've used to identify reaches that we assume might be tidal due to their direct connectedness to the mainstem Delaware River below the head-of-tide. This schema identifies and flags 61 segments compared to 16 above:

# Assume all reaches that flow *into* one of the mainstem reaches below the 
# head-of-tide are also tidally-influenced
mainstem_reaches_tidal <- c("2771_1","2769_1","2768_1","2767_1","2764_1",
                            "2762_1","2763_1","2759_1","2757_1","2752_1",
                            "2753_1","2755_1","2772_1","388_1","389_1",
                            "385_1","386_1","383_1","382_1","377_1","378_1",
                            "376_1","351_1","344_1","346_1","333_1")
segs_mainstem_tidal <- filter(p1_reaches_sf, subsegid %in% mainstem_reaches_tidal)
fromsegs <- stringr::str_split(segs_mainstem_tidal$fromsegs, ";") 
fromsegs <- do.call("c", fromsegs)
fromsegs_vctr <- fromsegs[c(1:length(fromsegs))]
segs_tribs_tidal <- p1_reaches_sf |>
  mutate(subsegseg = as.character(subsegseg)) |>
  filter(subsegseg %in% fromsegs_vctr | is.na(toseg))

To capture the tributaries that flow directly into the Delaware Bay Estuary I also assume that segments with toseg == NA are also tidally-influenced. Note that this leads to three "false positives," one in the eastern part of the watershed in NJ and two others that are "split segments" in the northern part of the watershed (not shown). We could omit these false positive reaches from the set of segments that we flag and exclude from the training data, the evaluation metrics, or both.

@jds485 Depending on how this fits into your modeling workflow, is it helpful for me to submit a PR? I think we discussed using this information for model evaluation right now (and not re-training the model), and so I would just add a target that contains all of the subsegid's that we assume are tidal. If we want to actually filter out data, I think we could do that by modifying an existing target, p2_wqp_SC_filtered.

jds485 commented 1 year ago

Just so I'm understanding, the 1st map shows reaches that have at least 1 NWIS SC site that is marked as tidal. It is not based on the NWIS reach codes in nhdplusTools (which would hopefully look more like the 2nd map).

We could omit these false positive reaches

Yes, they should be omitted.

is it helpful for me to submit a PR

Yes, and it can be "a target that contains all of the subsegid's that we assume are tidal."

jds485 commented 1 year ago

Do you know how many observations are within the assumed tidal reaches?

lekoenig commented 1 year ago

the 1st map shows reaches that have at least 1 NWIS SC site that is marked as tidal. It is not based on the NWIS reach codes in nhdplusTools (which would hopefully look more like the 2nd map).

The first map in my comment above shows reaches that have at least 1 NWIS site labeled with site_tp_cd == "ST-TS" (not exclusive to just specific conductance sites). I'm not sure I'm following the nhdplusTools question. Does NHDPlus contain an attribute related to tidal extent? If so, I wasn't aware of it.

Do you know how many observations are within the assumed tidal reaches?

Using p2_site_list_nontidal_csv as an indication of data density, it looks like there are 75 unique sites with 31,686 observation days of data 😮

readr::read_csv(p2_site_list_nontidal_csv, show_col_types = FALSE) |>
   filter(subsegid %in% segs_tribs_tidal$subsegid & !subsegid %in% c("357_1", "8_1", "3_1")) |>
   pull(count_days_total) |>
   sum()
#> [1] 31686

~74% of those observation days come from 4 sites that lie along mainstem_reaches_tidal and have NWIS continuous data (01464600, 014670261, 01467200, and 01477050). This suggests to me that mainstem_reaches_tidal was used to filter discrete samples thought to be tidally-influenced but that a similar filter step was not applied to the continuous NWIS data.

Either way, assuming tidal reaches does leave out a decent chunk of the data. I guess the hope would be that doing so improves the quality of the data used for the target application.

It's also possible this assumption paints too broad a brush. Based on p2_SC_observations, I'm guessing the model residuals are pretty high for segment 1263_1, because the mean specific conductance exceeds the 99% quantile (which is 3,701 uS/cm) of mean SC across all reaches in the network. I believe that segment is the Lewes And Rehoboth Canal near the estuary mouth.

p2_SC_observations |> 
   group_by(subsegid) |> 
   summarize(mean_SC = mean(mean_value, na.rm = TRUE), n = n()) |>
   arrange(desc(mean_SC))
#>  A tibble: 311 x 3
#>    subsegid mean_SC     n
#>    <chr>      <dbl> <int>
#>  1 1263_1    39367.    10
#>  2 2140_1     7463.     1
#>  3 2758_1     4384.    28
#>  4 2749_1     3874.    15
#>  5 2137_1     2148.    16
#>  6 870_1      1080.    23
#>  7 1257_1      914.     2
#>  8 865_1       831.  1316
#>  9 864_1       797.   137
#> 10 367_1       789.    52
#>  ... with 301 more rows

jds485 commented 1 year ago

Does NHDPlus contain an attribute related to tidal extent?

Yes, there is a column named Tidal that is binary indicating if the reach is considered Tidal? 0(no) or 1(yes). Based on the documentation, I'm not sure it's that useful. It seems like this indicator was determined from an elevation filter (where min and max elevation are less than 6 m and the mean annual flow is greater than 0). See section "Identifying Tidal Flowlines" here.

jds485 commented 1 year ago

it looks like there are 75 unique sites with 31,686 observation days of data

That is a lot of data! Those 4 NWIS sites are almost 15% of the data we currently use. Good discovery that the mainstem filter was not applied to the NWIS data, but it also makes me wonder if our filtering is too broad. The historical upstream extent of the salt front is near Philadelphia, so sites upstream of that are likely not affected much.

I believe that segment is the Lewes And Rehoboth Canal near the estuary mouth.

I think we would be justified dropping data from that segment because canals can have 2-way flow, especially this close to the estuary mouth. I might reach out to Salme and see if COAWST could be used to support reaches to drop based on likely migration of salinity up the tributaries from tidal fluctuation or other reasons.

lekoenig commented 1 year ago

Does NHDPlus contain an attribute related to tidal extent?

Yes, there is a column named Tidal that is binary indicating if the reach is considered Tidal? 0(no) or 1(yes). Based on the documentation, I'm not sure it's that useful. It seems like this indicator was determined from an elevation filter (where min and max elevation are less than 6 m and the mean annual flow is greater than 0). See section "Identifying Tidal Flowlines" here.

Woah, I was not aware of this attribute, thanks for the tip! If we go with this general approach for flagging tidal reaches, I think using the NHD attribute is an improvement over my assumption that "all tributaries flowing into below-head-of-tide-mainstem-reaches are tidal" and that we should just use that (same caveats as above r.e. lots of data lies along these reaches).

jds485 commented 1 year ago

Thanks for doing this quick analysis! We would lose a lot of data for urban areas using this filter. I sent a chat message to Liv and Salme and sent a link to this to get their opinion on it.

lekoenig commented 1 year ago

We got some feedback from Liv Herdman and Salme Cook (NY WSC) and came up with the following options for estimating which river segments might be receiving some salts from the estuary vs. the watershed (this is what I mean when I say "tidally-influenced" for the sake of this project).

Liv and Salme confirmed that we'd be justified in excluding all reaches defined as "tidal" in NHDPlus because at some point during the year these reaches will be within the salt front (they point out that these dynamics can be dependent on season due to fluctuations in river discharge). This is probably the easiest - but most conservative - option.
In the NHDPlus dataset, 6 m elevation is used to benchmark the "tidal" reaches, but is this really appropriate for the DRB? Liv pointed out that we might be able to refine this elevation criteria using a few gages like 01474500, which Salme uses for setting boundary conditions in the COAWST model. This seems doable, could be applied to any reaches in the DRB, and might allow us to retain more data than option 1.
We could look for NWIS gages (not just those with specific conductance) and inspect the gage height time series to infer tidal influence for specific reaches. We could do this programmatically, or potentially use NWIS mapper to inspect reaches along the estuary, zoom into the gage, and check whether there appears to be a tidal signal.
We could filter the data to account for seasonality in tidal influence. For example, we could omit data between Chester and Trenton during the summer months when the head-of-tide is likely higher in the watershed, and keep data during the winter when the salt front is lower down. This is pretty interesting, but probably the most complex option because we'd have to identify what a meaningful threshold is before applying any filtering steps (e.g. what months to use Chester vs. Trenton as head-of-tide).

Jared and I discussed these options and are going to start with option 1. @jds485, feel free to edit this comment or add any take-aways you have as well.

lekoenig commented 1 year ago

Does NHDPlus contain an attribute related to tidal extent?

Yes, there is a column named Tidal that is binary indicating if the reach is considered Tidal? 0(no) or 1(yes). Based on the documentation, I'm not sure it's that useful. It seems like this indicator was determined from an elevation filter (where min and max elevation are less than 6 m and the mean annual flow is greater than 0). See section "Identifying Tidal Flowlines" here.

~~I think this cutoff is actually 600 m? From the documentation linked above:~~ Ah, nevermind! The units of maxelevsmo and minelevsmo are in centimeters, so the cutoff is 6 m.

The tidal screen was defined as follows:

Select NHDFlowline features where PlusFlowlineVAA->MAXELEVSMO < 600 AND PlusFlowlineVAA->MINELEVSMO < 600 AND EROM_MA0001->Q0001A > 0.

On each flow path included in step 1, find the most upstream NHDFlowline feature and navigate downstream with divergences.

Add NHDFlowline features found in the step 2 navigations to the selected set from step

Set PlusFlowlineVAA.Tidal attribute to 1 for each feature in the selected set.

lekoenig commented 1 year ago

I'm playing around with option 2 in the comment above. I identified active NWIS sites that met the following criteria:

marked with site_tp_cd == "Estuary" or site_tp_cd == "ST-TS"
the time series for the past week (5/14/23 - 5/19/23) showed fairly obvious sub-daily fluctuations in gage height, ~~and/~~ or
the gage "remarks" field on the NWIS page indicated that flow is sometimes influenced by tides

This isn't necessarily comprehensive, but hopefully the population of 57 sites can give us some indication of the range in elevation that is susceptible to tidal influence and that we could use to refine the 6 m criteria used in the NHD "tidal" attribute.

The NWIS site metadata sometimes (but not always) contains information about altitude

gages_tidal
#> [1] "01482100"        "01477120"        "01474500"        "01467200"        "01467081"        "01465798"       
#> [7] "01467024"        "01483700"        "01484080"        "01484084"        "01413038"        "01412130"       
#> [13] "01412150"        "01484235"        "01484085"        "01412350"        "01482800"        "01483050"       
#> [19] "01475001"        "01482320"        "01480065"        "01480120"        "01482695"        "01482170"       
#> [25] "01475553"        "01467059"        "400659074495102" "400659074495101" "01464040"        "400854074432102"
#> [31] "400854074432101" "01476550"        "01467194"        "394836075242102" "394836075242101" "395055075174602"
#> [37] "395055075174601" "395252075113902" "395252075113901" "400209074593602" "400209074593601" "393721075344702"
#> [43] "393721075344701" "393436075331702" "393436075331701" "393109075330402" "393109075330401" "394044075311902"
#> [49] "394044075311901" "394536075285102" "394536075285101" "01482650"        "01411415"        "392711075332702"
#> [55] "392711075332701" "01483511"        "01484360"   

gages_tidal_metadata <- dataRetrieval::readNWISsite(siteNumbers = c(gages_tidal))

# altitude accuracy
unique(gages_tidal_metadata$alt_acy_va)
#>  [1]    NA  0.01  1.60  0.10  2.00  0.12  1.00  0.04 10.00  5.40

# altitude of gage/land surface
unique(gages_tidal_metadata$alt_va)
#>  [1]    NA  0.00  8.00  2.68 -4.00  8.12  5.74 55.66 -0.77  5.00

# altitude datum
unique(gages_tidal_metadata$alt_datum_cd)
#> [1] NA       "NAVD88" "NGVD29"

# method altitude determined; see https://help.waterdata.usgs.gov/code/alt_meth_cd_query?fmt=html
unique(gages_tidal_metadata$alt_meth_cd)
#> [1] NA  "L" "N" "D" "Y" "M"

I'm not sure if I can easily harmonize and compare these given altitudes with the elevations used in the NHDPlus VAA table(?). Given that, I tried instead to find maxelevsmo and minelevsmo for the COMIDs that are associated with these gages.

gages_tidal_nldi <- lapply(gages_tidal, function(x){
     dataRetrieval::findNLDI(nwis = x)
 }) %>% 
     bind_rows()
 gages_tidal_nldi <- gages_tidal_nldi$origin

 gages_tidal_flines <- nhdplusTools::get_nhdplus(comid = gages_tidal_nldi$comid[!is.na(gages_tidal_nldi$comid)],
                                                 realization = "flowline")
quantile(gages_tidal_flines$maxelevsmo/100, 
          c(0, 0.25, 0.5, 0.75, 0.9, 0.95, 1))
#>    0%   25%   50%   75%   90%   95%  100% 
#>  0.00  0.00  0.00  1.30  3.42  5.48 21.32

Using the COMIDs and taking the 95th percentile of maxelevsmo we'd end up with a similar cutoff of 5.5 meters. So I don't think this helps us refine the tidal reaches beyond what's included in the NHDPlus tidal attribute (option 1 above). The last thing I could try is to extract the elevation of each gage using a DEM and see what the range of values is.

jds485 commented 1 year ago

Thanks!

the time series for the past week (5/14/23 - 5/19/23) showed fairly obvious sub-daily fluctuations in gage height, and/o

Did you look at the timeseries for each of these sites? Only asking because of the and/or. I'm wondering if some of them don't have meaningful tidal changes (or any at all)

lekoenig commented 1 year ago

Did you look at the timeseries for each of these sites?

In the case of the site types, I took their word for it and did not look at the time series (just for this past week, so a limited data point). Otherwise, I did look at the time series and the gage remarks field.

jds485 commented 1 year ago

Okay, it sounds like we can safely use the 6 m NHD Tidal cutoff, then. I'll do a model run with those reaches removed for performance metric calculations

lekoenig commented 1 year ago

The last thing I could try is to extract the elevation of each gage using a DEM and see what the range of values is.

One more note - I wasn't able to post this on Friday because ScienceBase was temporarily unavailable, but I did grab the 3-m DEM's used by Hopkins et al. for their FACET model runs and extracted the elevation of gages_tidal in my comment above. I calculated the mean of the extracted raster values for a 30-m buffer around each of the tidal gages which returned the following distribution of gage elevations:

quantile(sites_elev$site_elev_m, na.rm = TRUE, c(0,0.25,0.5,0.75,0.85,0.95,1))
#>         0%        25%        50%        75%        85%        95%       100% 
#> -1.2991510 -0.9600000  0.5897652  1.1975898  2.1769435  3.3658545 19.3504091

The only elevation > 4 m is this site which has daily fluctuations in gage height, but these are <0.1 ft for the week starting May 16, 2023. So I'd feel comfortable using 4 m as a refined tidal cutoff if we chose to do that. Here's a plot of the "NHD tidal reaches" compared to these "refined tidal reaches:"

It looks like we'd gain a decent number of observation-days by using this refined elevation, but let me know what you think is best for the modeling workflow you have in mind:

targets::tar_load(p2_drb_comids_seg)
targets::tar_load(p1_reaches_sf)

# download flowlines that intersect the PRMS network
flines <- nhdplusTools::get_nhdplus(comid = p2_drb_comids_seg$comid, realization = "flowline")

# filter PRMS segments by two different tidal criteria
comids_tidal_refined <- filter(flines, maxelevsmo < 400, minelevsmo < 400, qe_ma > 0 ) %>%
  pull(comid)
comids_tidal_nhd <- flines$comid[flines$tidal == 1]

# identify NHM segments that overlap a "tidal" NHDPlus reach
segs_with_tidal_comids_nhd <- p2_drb_comids_seg |>
  filter(comid %in% comids_tidal_nhd) |>
  pull(PRMS_segid) |>
  unique()

segs_with_tidal_comids_refined <- p2_drb_comids_seg |>
  filter(comid %in% comids_tidal_refined) |>
  pull(PRMS_segid) |>
  unique()

# count number of observation-days
# 1) using NHDPlus definition of tidal reaches
targets::tar_load(p2_site_list_nontidal_csv)
readr::read_csv(p2_site_list_nontidal_csv, show_col_types = FALSE) |>
  filter(subsegid %in% segs_with_tidal_comids_nhd) |>
  pull(count_days_total) |>
  sum()
#> 63383

# 2) using refined tidal reaches
readr::read_csv(p2_site_list_nontidal_csv, show_col_types = FALSE) |>
  filter(subsegid %in% segs_with_tidal_comids_refined) |>
  pull(count_days_total) |>
  sum()
#> 41370

jds485 commented 1 year ago

Thanks for doing this! Let's use the refined 4 m elevation

jds485 commented 1 year ago

Closed from #242