LimnoDataScience / plume_bloom_drivers

Using classified raster images and meteo drivers to try to better understand what is causing sediment plumes and blooms in Lake Superior

Explore rainfall & sediment plume relationship #11

Closed: lindsayplatt closed this issue 1 year ago

lindsayplatt commented 1 year ago

Setting this up as an exploration using the initial Yugo model output from July 6, which does not have shoreline contamination filtered out. I think we should wait and see how these look with @steeleb's improved model output, but at first glance the trend is maybe not what I would expect: sediment decreases from 1980 to the early 2000s, but there is an increase starting around 2010, which is when bloom sightings started to pick up.

I'm not sure how much of the sediment signal in the 80s and 90s is influenced by the lower resolution and quality of the older satellites - you can see a difference in how LS5 and LS7 report sediment during their overlapping years in the 00s.
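For a quick numeric check on that LS5/LS7 gap, a per-mission annual summary could sit alongside the boxplots (a sketch reusing the `yugo` data frame built in the first code block below; the overlap-year range here is approximate, not taken from the data):

```r
library(tidyverse)

# Mean daily % sediment per year and mission, spread wide so the LS5 and
# LS7 columns can be compared side by side during their overlap years
yugo %>%
  filter(year %in% 1999:2011) %>%  # approximate LS5/LS7 overlap window
  group_by(year, mission) %>%
  summarize(mean_sed_pct = mean(sediment_total_pct), .groups = 'drop') %>%
  pivot_wider(names_from = mission, values_from = mean_sed_pct)
```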

Below are some plots summarizing the percent sediment (% of the classifiable area that was sediment) and the sediment area in hectares, broken out by year and by mission. There is not much difference between using the % of classifiable area and the actual sediment area.

image

Code to create first set of plots:

```r
library(tidyverse)

yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         light_nearshore_pct = lightNSSed_perc,
         dark_nearshore_pct = darkNSSed_perc,
         offshore_pct = offShoreSed_perc) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct,
         year = year(date),
         year_mission = sprintf('%s_%s', year, mission))

p1 <- ggplot(yugo, aes(x = date, y = sediment_total_pct)) +
  geom_smooth(color = 'grey') +
  geom_point(alpha = 0.5, aes(color = mission)) +
  theme_bw() +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment')

p2 <- ggplot(yugo, aes(y = sediment_total_pct, x = year, group = year_mission, fill = mission)) +
  geom_boxplot(position = position_dodge(preserve = "single")) +
  theme_bw() +
  ylab('Daily % sediment') + xlab('Year') +
  ggtitle('Boxplots per year per mission for daily % sediment')

p3 <- ggplot(yugo, aes(y = sediment_total_pct, x = year, group = year)) +
  geom_boxplot(fill = 'grey') +
  theme_bw() +
  ylab('Daily % sediment') + xlab('Year') +
  ggtitle('Boxplots per year for daily % sediment')

p4 <- yugo %>%
  group_by(year) %>%
  summarize(avg_sediment = mean(sediment_total_pct)) %>%
  ggplot(aes(x = year, y = avg_sediment)) +
  geom_smooth(color = 'grey') +
  geom_point() +
  theme_bw() +
  ylab('Average annual % sediment') + xlab('Year') +
  ggtitle('Annual timeseries of % sediment')

cowplot::plot_grid(p1, p4, p2, p3, nrow = 2, ncol = 2)
```

image

Code to create second set of plots:

```r
library(tidyverse)

# What if we did the same but with the `_ha` columns instead of %?
yugo2 <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
  # Keep only the areas for sediment
  select(mission, date,
         light_nearshore_ha = lightNSSed_ha,
         dark_nearshore_ha = darkNSSed_ha,
         offshore_ha = offShoreSed_ha) %>%
  mutate(sediment_total_ha = light_nearshore_ha + dark_nearshore_ha + offshore_ha,
         year = year(date),
         year_mission = sprintf('%s_%s', year, mission))

p1 <- ggplot(yugo2, aes(x = date, y = sediment_total_ha)) +
  geom_smooth(color = 'grey') +
  geom_point(alpha = 0.5, aes(color = mission)) +
  theme_bw() +
  ylab('Daily sediment, ha') + xlab('Date') +
  ggtitle('Daily timeseries of sediment area')

p2 <- ggplot(yugo2, aes(y = sediment_total_ha, x = year, group = year_mission, fill = mission)) +
  geom_boxplot(position = position_dodge(preserve = "single")) +
  theme_bw() +
  ylab('Daily sediment, ha') + xlab('Year') +
  ggtitle('Boxplots per year per mission for daily sediment area')

p3 <- ggplot(yugo2, aes(y = sediment_total_ha, x = year, group = year)) +
  geom_boxplot(fill = 'grey') +
  theme_bw() +
  ylab('Daily sediment, ha') + xlab('Year') +
  ggtitle('Boxplots per year for daily sediment area')

p4 <- yugo2 %>%
  group_by(year) %>%
  summarize(avg_sediment = mean(sediment_total_ha)) %>%
  ggplot(aes(x = year, y = avg_sediment)) +
  geom_smooth(color = 'grey') +
  geom_point() +
  theme_bw() +
  ylab('Average annual sediment, ha') + xlab('Year') +
  ggtitle('Annual timeseries of sediment area')

cowplot::plot_grid(p1, p4, p2, p3, nrow = 2, ncol = 2)
```
lindsayplatt commented 1 year ago

Plotting the timeseries of sediment percent, using cloud percent for alpha to dull the points with high cloud coverage.

image

Code to recreate the plot:

```r
library(tidyverse)

yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
  mutate(sediment_total_pct = lightNSSed_perc + darkNSSed_perc + offShoreSed_perc,
         sediment_total_ha = lightNSSed_ha + darkNSSed_ha + offShoreSed_ha) %>%
  select(mission, date, cloud_pct = cloud_perc, cloud_ha,
         sediment_total_pct, sediment_total_ha) %>%
  mutate(year = year(date),
         year_mission = sprintf('%s_%s', year, mission))

ggplot(yugo, aes(x = date, y = sediment_total_pct)) +
  geom_point(aes(fill = mission, alpha = cloud_pct),
             shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment')
```

Removing days with greater than 50% clouds did not alter the trends we were seeing before:

image

Code to recreate plot without clouds > 50%:

```r
library(tidyverse)

yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
  mutate(sediment_total_pct = lightNSSed_perc + darkNSSed_perc + offShoreSed_perc,
         sediment_total_ha = lightNSSed_ha + darkNSSed_ha + offShoreSed_ha) %>%
  select(mission, date, cloud_pct = cloud_perc, cloud_ha,
         sediment_total_pct, sediment_total_ha) %>%
  mutate(year = year(date),
         year_mission = sprintf('%s_%s', year, mission))

yugo %>%
  filter(cloud_pct < 50) %>%
  ggplot(aes(x = date, y = sediment_total_pct)) +
  geom_smooth(color = 'grey') +
  geom_point(aes(fill = mission), shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment, only clouds below 50%')
```
steeleb commented 1 year ago

COOL!

I think this will get much, much better with the new models. At the very least, LS 5/7, LS 8/9, and then Sentinel will be in separate models, in hopes of accounting for the radiometric resolution (and slight band wavelength) differences.

lindsayplatt commented 1 year ago

Terrific news - that's exactly what I was hoping to hear. I wanted to get rolling on some code and exploration to have a start on our end!

steeleb commented 1 year ago

Dumb question that I can probably answer with google...

HOW DO YOU MAKE THOSE FANCY 'code to recreate plot' THINGS?????

lindsayplatt commented 1 year ago

Surround your code with these:

```html
<details>
<summary>THE TEXT YOU SEE</summary>

THE STUFF YOU WANT TO HIDE HERE (an empty line between the HTML is important if you have other markdown syntax)

</details>
```

That looks like this when rendered (the summary line shows, and the rest stays hidden until clicked):

THE TEXT YOU SEE

THE STUFF YOU WANT TO HIDE HERE (an empty line between the HTML is important if you have other markdown syntax)
lindsayplatt commented 1 year ago

Wondering if monthly precip has a similar dip in the 00s ... it is not as obvious.

image

Le plot code:

```r
library(targets)
library(tidyverse)

tar_read(p2_prism_data_huc) %>%
  filter(variable == 'ppt') %>%
  mutate(year = year(date),
         month = month(date),
         year.month = as.numeric(sprintf('%s.%s', year, month))) %>%
  group_by(huc, year.month) %>%
  summarize(precip_total = sum(value_huc),
            precip_avg = mean(value_huc)) %>%
  # pivot_longer(cols = c('precip_avg', 'precip_total')) %>%
  ggplot(aes(x = year.month, y = precip_total)) +
  geom_point(aes(color = huc), shape = 1) +
  geom_smooth(color = 'black') +
  theme_bw() +
  facet_grid(huc ~ ., scales = 'free') +
  ylab('Monthly precip, mm') + xlab('Date') +
  ggtitle('Timeseries of total monthly precipitation')
```
steeleb commented 1 year ago

Not obvious, but kinda looks like it's there??? Like, if you squint?!

lindsayplatt commented 1 year ago

Trying to see if there are trends for individual storm events. Using the 95th percentile daily rainfall to represent an intense rainfall event for now (inspired by EPA's technical guidance for stormwater control).

If you look at trends by season and HUC for the "intense rainfall" events, I see a dip in the 00s during summer, which may help explain that dip in sediment trends ~since we have more summer dates than other times~ (EDIT: this is not true! They are all fairly equal). I will explore sediment trends by season next.

image

Code for plots and analysis!

```r
library(targets)
library(tidyverse)

add_storm_ppt <- function(all_ppt, storm_percentile = 0.95) {
  quantile(all_ppt, probs = storm_percentile) %>% unname()
}

# Quantifying "95th percentile rainfall event" as severe storm
# https://www.epa.gov/sites/default/files/2015-08/documents/epa_swm_guidance.pdf
storm_perc <- 0.95

# Filter data to only days with storm level rainfall
ppt_storm <- tar_read(p2_prism_data_huc) %>%
  filter(variable == 'ppt') %>%
  group_by(huc) %>%
  mutate(ppt_huc_storm = add_storm_ppt(value_huc, storm_perc)) %>%
  ungroup() %>%
  filter(value_huc >= ppt_huc_storm) %>%
  mutate(year.month = as.numeric(sprintf('%s.%s', year, month)),
         season = ordered(season, levels = c('Winter', 'Spring', 'Summer', 'Fall')),
         season_num = as.numeric(season),
         year.season = as.numeric(sprintf('%s.%s', year, season_num)))

p1 <- ggplot(ppt_storm, aes(x = date, y = value_huc)) +
  geom_point(aes(color = season), shape = 1) +
  geom_smooth(color = 'black') +
  theme_bw() +
  facet_grid(huc ~ season, scales = 'free') +
  ylab('Daily precip, mm') + xlab('Date') +
  ggtitle(sprintf('Timeseries of storm event rainfall totals (>= %sth percentile)', storm_perc*100)) +
  theme(legend.position = 'none')

p2 <- ppt_storm %>%
  group_by(huc, year, season) %>%
  tally() %>%
  ggplot(aes(x = year, y = n, fill = season)) +
  geom_bar(stat = 'identity', position = position_dodge()) +
  geom_smooth(color = 'black') +
  theme_bw() +
  facet_grid(huc ~ season, scales = 'free') +
  ylab('Number of storm events') + xlab('Date') +
  ggtitle(sprintf('Timeseries of storm event occurrence (>= %sth percentile)', storm_perc*100))

cowplot::plot_grid(p1, p2, nrow = 1, ncol = 2)
```
lindsayplatt commented 1 year ago

Lining up the sediment trends by season:

image

THE CODE!

```r
library(tidyverse)

# Pass in a vector of month numbers (1 thru 12) and get a
# vector of the same size with the season back.
month_to_season <- function(month_num) {
  # Create vector with season as the value and month number as the name
  month_season <- setNames(rep(c("Winter", "Spring", "Summer", "Fall"), each = 3),
                           c(12, 1:11))
  # Arrange so that DEC is in the 12th spot
  month_season <- month_season[order(as.numeric(names(month_season)))]
  # Pull out the season based on month number as the indices
  season_out <- month_season[month_num]
  names(season_out) <- NULL # Drop the names attribute
  return(season_out)
}

yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
  mutate(sediment_total_pct = lightNSSed_perc + darkNSSed_perc + offShoreSed_perc,
         sediment_total_ha = lightNSSed_ha + darkNSSed_ha + offShoreSed_ha) %>%
  select(mission, date, cloud_pct = cloud_perc, cloud_ha,
         sediment_total_pct, sediment_total_ha) %>%
  mutate(year = year(date),
         year_mission = sprintf('%s_%s', year, mission),
         season = ordered(month_to_season(month(date)),
                          levels = c('Winter', 'Spring', 'Summer', 'Fall')))

p1 <- ggplot(yugo, aes(x = date, y = sediment_total_pct)) +
  geom_point(aes(alpha = cloud_pct, color = season, fill = season),
             shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  facet_grid(. ~ season) +
  geom_smooth(color = 'black') +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment')

p2 <- yugo %>%
  filter(cloud_pct < 50) %>%
  ggplot(aes(x = date, y = sediment_total_pct, color = season, fill = season)) +
  geom_smooth(color = 'black') +
  geom_point(shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  facet_grid(. ~ season) +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment, only clouds below 50%')

p3 <- yugo %>%
  group_by(year, season) %>%
  summarize(avg_sediment = mean(sediment_total_pct)) %>%
  ggplot(aes(x = year, y = avg_sediment, color = season, fill = season)) +
  geom_smooth(color = 'black') +
  geom_point() +
  theme_bw() +
  facet_grid(. ~ season) +
  ylab('Average annual % sediment') + xlab('Year') +
  ggtitle('Annual timeseries of % sediment')

cowplot::plot_grid(p1, p2, p3, nrow = 3, ncol = 1)
```
lindsayplatt commented 1 year ago

Initial correlation between sediment and precip for the known bloom dates.

There are ~1100 dates from the Landsat data between 1984 and 2023. When we match up the 33 known bloom dates (using a 3-day window for each bloom date), we only get 13 matches from the Landsat data. The correlation between precip and sediment area on those 13 dates shows a negative relationship. The more recent updates from https://github.com/rossyndicate/Superior-Plume-Bloom/pull/64 result in a slightly less negative relationship (right figure).

Left fig = earlier model from quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv file. Right fig = most recent model from GTB_by_LS_mission_pass_v2023-07-25.csv.

image

And the code, of course!

```r
library(targets)
library(tidyverse)
library(ggpubr)

# Load observed bloom dates & climate data
tar_load(names = c('p2_obs_blooms_details', 'p2_prism_data_huc'))

# Load classified plume presence data
# yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
yugo <- read_csv('GTB_by_LS_mission_pass_v2023-07-25.csv') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         light_nearshore_pct = lightNSSed_perc,
         dark_nearshore_pct = darkNSSed_perc,
         offshore_pct = offShoreSed_perc) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct)

# Identify exact bloom dates
exact_bloom_dates <- unique(p2_obs_blooms_details$`Start Date`)

# Create n-day window leading up to each bloom
days_ahead <- 3
window_bloom_dates <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date - days_ahead, date, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# For now, total rainfall for all hucs
ppt <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  group_by(date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  inner_join(window_bloom_dates, by = 'date') %>%
  # For each bloom event, total up the precipitation
  group_by(bloom_date) %>%
  summarize(ppt_bloom_total = sum(ppt_huc_total))

# Average the sediment percentage if more than one mission is present
sediment <- yugo %>%
  group_by(date) %>%
  summarize(sediment_total_pct_missionavg = mean(sediment_total_pct)) %>%
  inner_join(window_bloom_dates, by = 'date')

# Join the sediment percentage per bloom date to the precip value per bloom date
ppt_sediment <- ppt %>%
  full_join(sediment, by = 'bloom_date')

# Correlation plot
ggpubr::ggscatter(
  ppt_sediment,
  x = "ppt_bloom_total", y = "sediment_total_pct_missionavg",
  add = "reg.line", conf.int = TRUE,
  cor.coef = TRUE, cor.method = "kendall",
  xlab = sprintf("%s-day cumulative precipitation leading up to bloom date", days_ahead),
  ylab = "Percent of AOI classified as sediment",
  ylim = c(-50, 100)
)
```
lindsayplatt commented 1 year ago

I adjusted the approach above slightly, so that the sediment values for each bloom date included 3 days before AND 3 days after. Barely any correlation, but NOT NEGATIVE! The sediment values plotted along the y-axis are the average sediment percentages across the 7 days surrounding a bloom (averaged only when we had multiple sediment values that fell into the window).

Still waiting to really dig in and scrutinize these results until after the shoreline contamination pieces are removed (https://github.com/rossyndicate/Superior-Plume-Bloom/issues/63), and looking forward to the Sentinel data getting added, not just Landsat (https://github.com/rossyndicate/Superior-Plume-Bloom/issues/61).

image

Code in here:

```r
library(targets)
library(tidyverse)
library(ggpubr)

# Load observed bloom dates & climate data
tar_load(names = c('p2_obs_blooms_details', 'p2_prism_data_huc'))

# Load classified plume presence data
# yugo <- read_csv('quick_gradientTreeBoost_landsat_stack_v2023-07-06.csv') %>%
yugo <- read_csv('GTB_by_LS_mission_pass_v2023-07-25.csv') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         light_nearshore_pct = lightNSSed_perc,
         dark_nearshore_pct = darkNSSed_perc,
         offshore_pct = offShoreSed_perc) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct)

# Identify exact bloom dates
exact_bloom_dates <- unique(p2_obs_blooms_details$`Start Date`)

# Create n-day window leading up to each bloom to capture rainfall
days_ahead <- 3
window_bloom_dates_before <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date - days_ahead, date, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Create n-day window after each bloom to capture sediment plumes
days_after <- 3
window_bloom_dates_after <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date, date + days_after, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Combine all the dates
window_bloom_dates_all <- window_bloom_dates_before %>%
  bind_rows(window_bloom_dates_after) %>%
  distinct()

# For now, total rainfall for all hucs
ppt <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  group_by(date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  inner_join(window_bloom_dates_before, by = 'date') %>%
  # For each bloom event, total up the precipitation
  group_by(bloom_date) %>%
  summarize(ppt_bloom_total = sum(ppt_huc_total))

# Average the sediment percentage if more than one mission is present
sediment <- yugo %>%
  group_by(date) %>%
  summarize(sediment_total_pct_missionavg = mean(sediment_total_pct)) %>%
  inner_join(window_bloom_dates_all, by = 'date') %>%
  # For each bloom event, average the sediment across the window
  group_by(bloom_date) %>%
  summarize(sediment_pct_max = mean(sediment_total_pct_missionavg))

# Join the sediment percentage per bloom date to the precip value per bloom date
ppt_sediment <- ppt %>%
  full_join(sediment, by = 'bloom_date')

# Correlation plot
ggpubr::ggscatter(
  ppt_sediment,
  x = "ppt_bloom_total", y = "sediment_pct_max",
  add = "reg.line", conf.int = TRUE,
  cor.coef = TRUE, cor.method = "kendall",
  xlab = sprintf("%s-day cumulative precipitation leading up to bloom date", days_ahead),
  ylab = "Percent of AOI classified as sediment",
  ylim = c(-50, 100)
)
```
steeleb commented 1 year ago

Suhweet.

Just so you have an idea of timeline, I'm working on the Sentinel right now. I have some output, but even though the accuracy was very high for Sen2, the applied data look absolutely terrible (labeled as mostly clouds). I'm hoping it was just a fat finger somewhere along the way and to get it un-stuck by EOD. I'll include the shoreline contamination summaries after I get the Sen2 stuff figured out.

Something to keep in mind (and I haven't dug into your code to see if you've dealt with this at all): not all of the bloom observations are within our AOI. Some are in strange bays on the North Shore, some are in the harbor, etc. Just something to think about as you're playing with the obs data.

Another couple of thoughts: I'm not super familiar with flow patterns, but maybe also look at 7-day prior cumulative precip? And/or only include 'storm events' in the cumulative data, which I think is >2cm over 24h?
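A minimal sketch of that fixed-threshold filter, as an alternative to the 95th-percentile storm definition used a few comments up (assumes the same `p2_prism_data_huc` target with daily precip in mm; the 20 mm cutoff is just the '>2cm over 24h' rule of thumb, not a vetted threshold):

```r
library(targets)
library(tidyverse)

# Keep only days meeting a fixed ">2 cm over 24 h" storm definition,
# instead of the per-HUC 95th-percentile filter used earlier
storm_threshold_mm <- 20  # hypothetical cutoff: 2 cm in 24 hours

ppt_storm_fixed <- tar_read(p2_prism_data_huc) %>%
  filter(variable == 'ppt') %>%
  filter(value_huc >= storm_threshold_mm)
```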

lindsayplatt commented 1 year ago

> not all of the bloom observations are within our AOI

Ahhhh I had not taken that into account. Thanks for mentioning it!

> also look at 7-day prior cumulative precip

I had done this, and 5 days prior, without any noticeable difference.

> only include 'storm events' in the cumulative data, which I think is >2cm over 24h

Hmm, now this is interesting. I used this sort of approach when exploring precip trends a few comments above (see https://github.com/LimnoDataScience/plume_bloom_drivers/issues/11#issuecomment-1650062482), so I could pretty easily apply it here! Lemme see what I can do in this vein.

lindsayplatt commented 1 year ago

It doesn't seem like including "storm only" values changes things much, but I did try it with 1, 3, 5 and 7 days ahead. I think I am going to explore individual HUC contributions next.

image

Code for what changed to make this "storm only" view:

```r
add_storm_ppt <- function(all_ppt, storm_percentile = 0.95) {
  quantile(all_ppt, probs = storm_percentile) %>% unname()
}

# For now, total only storm rainfall for all hucs
ppt <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  # GROUP BY HUC AND CALC STORM VALUE
  group_by(huc) %>%
  mutate(ppt_huc_storm = add_storm_ppt(value_huc, 0.95)) %>%
  # Filter to only storms
  filter(value_huc >= ppt_huc_storm) %>%
  # Now sum the HUC values into one
  group_by(date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  inner_join(window_bloom_dates_before, by = 'date') %>%
  # For each bloom event, total up the precipitation
  group_by(bloom_date) %>%
  summarize(ppt_bloom_total = sum(ppt_huc_total))
```
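As an aside, one way to try several window widths without hand-editing `days_ahead` each time would be to wrap the window construction in a function and map over the widths (a sketch reusing `exact_bloom_dates` and the window-building pattern from the comments above):

```r
library(tidyverse)

# Build the pre-bloom date windows for a given width, following the same
# pattern as above (overlapping windows keep only the earliest bloom id)
build_bloom_windows <- function(bloom_dates, days_ahead) {
  map(bloom_dates, function(date) {
    tibble(bloom_date = date,
           date = seq(date - days_ahead, date, by = 1))
  }) %>%
    bind_rows() %>%
    distinct() %>%
    arrange(date) %>%
    group_by(date) %>%
    filter(bloom_date == min(bloom_date)) %>%
    ungroup()
}

# One set of windows per width tried (1, 3, 5, and 7 days ahead)
windows_by_width <- set_names(c(1, 3, 5, 7)) %>%
  map(~build_bloom_windows(exact_bloom_dates, days_ahead = .x))
```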
lindsayplatt commented 1 year ago

Here I am using the new data (see https://github.com/rossyndicate/Superior-Plume-Bloom/pull/67) and the `*_perc_extent_area` columns where `extent == 'aoi no sc'`, meaning the sediment values used are percentages of the AOI without any shoreline contamination.

Results using 3-, 5-, and 7-day ahead cumulative storm precipitation values.

3-day ahead cumulative storm precip

image

5-day ahead cumulative storm precip

image

7-day ahead cumulative storm precip

image

Code for this plot:

```r
library(targets)
library(tidyverse)
library(ggpubr)

# Load observed bloom dates & climate data
tar_load(names = c('p2_obs_blooms_details', 'p2_prism_data_huc'))

# Load classified plume presence data
yugo <- read_csv('GTB_by_mission_v2023-08-03.csv') %>%
  filter(extent == 'aoi no sc') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         light_nearshore_pct = lightNSSed_perc_extent_area,
         dark_nearshore_pct = darkNSSed_perc_extent_area,
         offshore_pct = offShoreSed_perc_extent_area) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct)

# Identify exact bloom dates
exact_bloom_dates <- unique(p2_obs_blooms_details$`Start Date`)

# Create n-day window leading up to each bloom to capture rainfall
days_ahead <- 5
window_bloom_dates_before <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date - days_ahead, date, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Create n-day window after each bloom to capture sediment plumes
days_after <- 3
window_bloom_dates_after <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date, date + days_after, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Combine all the dates
window_bloom_dates_all <- window_bloom_dates_before %>%
  bind_rows(window_bloom_dates_after) %>%
  distinct()

add_storm_ppt <- function(all_ppt, storm_percentile = 0.95) {
  quantile(all_ppt, probs = storm_percentile) %>% unname()
}

# For now, total only storm rainfall per huc
ppt <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  # GROUP BY HUC AND CALC STORM VALUE
  group_by(huc) %>%
  mutate(ppt_huc_storm = add_storm_ppt(value_huc, 0.95)) %>%
  # Filter to only storms
  filter(value_huc >= ppt_huc_storm) %>%
  # Sum the values per HUC and date (hucs kept separate this time)
  group_by(huc, date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  inner_join(window_bloom_dates_before, by = 'date') %>%
  # For each bloom event, total up the precipitation
  group_by(huc, bloom_date) %>%
  summarize(ppt_bloom_total = sum(ppt_huc_total))

# Average the sediment percentage if more than one mission is present
sediment <- yugo %>%
  group_by(date) %>%
  summarize(sediment_total_pct_missionavg = mean(sediment_total_pct)) %>%
  inner_join(window_bloom_dates_all, by = 'date') %>%
  # For each bloom event, average the sediment across the window
  group_by(bloom_date) %>%
  summarize(sediment_pct_max = mean(sediment_total_pct_missionavg))

# Join the sediment percentage per bloom date to the precip value per bloom date
ppt_sediment <- ppt %>%
  full_join(sediment, by = 'bloom_date')

# Correlation plot as fxn
correlate_sed_ppt <- function(in_data, in_title) {
  ggpubr::ggscatter(
    in_data,
    x = "ppt_bloom_total", y = "sediment_pct_max",
    add = "reg.line", conf.int = TRUE,
    cor.coef = TRUE, cor.method = "kendall",
    xlab = sprintf("%s-day cumulative storm precip leading up to bloom date", days_ahead),
    ylab = "% sediment (average % of AOI without shoreline\ncontamination across missions that is sediment)",
    ylim = c(-50, 100),
    title = in_title
  )
}

huc_plot_list <- ppt_sediment %>%
  split(.$huc) %>%
  purrr::map(~correlate_sed_ppt(.x, sprintf('HUC %s', unique(.x$huc))))

cowplot::plot_grid(plotlist = huc_plot_list)
```
lindsayplatt commented 1 year ago

Exploring sediment trends using percent extent without shoreline contamination!

image

Data-processing code for these plots:

Only showing the data-loading code chunk, since this is the main thing that differs from the code in https://github.com/LimnoDataScience/plume_bloom_drivers/issues/11#issuecomment-1650089988. Note that the previous plots had the alpha scale backwards for clouds: points should have been getting _more_ transparent with higher `cloud_pct` values, so in the plots here I just passed in a negative cloud percent value as a quick way to switch it.

```r
yugo <- read_csv('GTB_by_mission_v2023-08-03.csv') %>%
  filter(extent == 'aoi no sc') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         cloud_pct = cloud_perc_extent_area,
         light_nearshore_pct = lightNSSed_perc_extent_area,
         dark_nearshore_pct = darkNSSed_perc_extent_area,
         offshore_pct = offShoreSed_perc_extent_area) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct) %>%
  mutate(year = year(date),
         year_mission = sprintf('%s_%s', year, mission),
         season = ordered(month_to_season(month(date)),
                          levels = c('Winter', 'Spring', 'Summer', 'Fall')))
```
lindsayplatt commented 1 year ago

Precip-Q relationship check. Not looking great :( But maybe using flow instead of precip in the sediment correlations will help. EDIT: added scales="free_y" and redid the plots. Also, there will be a lag between precip and flow, especially in bigger watersheds, and snowmelt might be throwing these relationships off. Will explore that next.

image

image

The code:

```r
library(targets)
library(tidyverse)
library(dataRetrieval)

tar_load(p1_nwis_sites)
tar_load(p2_prism_data_huc)

# Manually match HUC10s to the NWIS outlets
# tar_read(p1_huc10_nwis_sites) %>% select(huc10, name) %>% sf::st_drop_geometry()
site_huc_xwalk <- p1_nwis_sites %>%
  mutate(HUC10 = c("0401030105", "0401030107", "0401030109", "0401020115"))
site_huc_xwalk

add_storm_ppt <- function(all_ppt, storm_percentile = 0.95) {
  quantile(all_ppt, probs = storm_percentile) %>% unname()
}

precip <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  # GROUP BY HUC AND CALC STORM VALUE
  group_by(huc) %>%
  mutate(ppt_huc_storm = add_storm_ppt(value_huc, 0.95)) %>%
  # Filter to only storms
  # filter(value_huc >= ppt_huc_storm) %>%
  # Now sum the HUC values into one
  group_by(huc, date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  select(HUC10 = huc, date, ppt = ppt_huc_total)

nwis_q_data <- readNWISdv(siteNumber = site_huc_xwalk$nwis_site,
                          startDate = min(precip$date),
                          endDate = max(precip$date),
                          parameterCd = '00060') %>%
  renameNWISColumns() %>%
  select(nwis_site = site_no, date = Date, flow = Flow)

ppt_q_comparison_data <- precip %>%
  left_join(site_huc_xwalk, by = "HUC10") %>%
  left_join(nwis_q_data, by = c("nwis_site", "date")) %>%
  mutate(facet_nm = sprintf('%s (NWIS site: %s)', river, nwis_site))

sum(is.na(ppt_q_comparison_data$flow)) # 44,672 NAs

ggplot(ppt_q_comparison_data, aes(x = ppt, y = flow, color = date)) +
  geom_point(alpha = 0.70) +
  facet_wrap(vars(facet_nm), nrow = 2, scales = "free_y") +
  theme_bw() #+
  # ggtitle('Storm only PPT matched to Q')
```
lindsayplatt commented 1 year ago

Checking out the sediment occurrence vs discharge relationship, since the precip-Q relationship is not as we expected. The plots display the river name for the HUC10 and NWIS site, and they match the arrangement of the previous correlation plots that display the HUC10 names.

3-day max discharge ahead of a bloom

image

5-day max discharge ahead of a bloom

image

7-day max discharge ahead of a bloom

image

Code:

```r
library(targets)
library(tidyverse)
library(ggpubr)

# Load observed bloom dates & discharge data
tar_load(names = c('p2_obs_blooms_details', 'p1_nwis_Q', 'p1_nwis_sites'))

# Load classified plume presence data
yugo <- read_csv('GTB_by_mission_v2023-08-03.csv') %>%
  filter(extent == 'aoi no sc') %>%
  # Keep only the percentages for sediment
  select(mission, date,
         light_nearshore_pct = lightNSSed_perc_extent_area,
         dark_nearshore_pct = darkNSSed_perc_extent_area,
         offshore_pct = offShoreSed_perc_extent_area) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct)

# Identify exact bloom dates
exact_bloom_dates <- unique(p2_obs_blooms_details$`Start Date`)

# Create n-day window leading up to each bloom to capture discharge
days_ahead <- 7
window_bloom_dates_before <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date - days_ahead, date, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Create n-day window after each bloom to capture sediment plumes
days_after <- 3
window_bloom_dates_after <- map(exact_bloom_dates, function(date) {
  tibble(bloom_date = date,
         date = seq(date, date + days_after, by = 1))
}) %>%
  bind_rows() %>%
  distinct() %>%
  arrange(bloom_date) %>%
  # For bloom dates that may overlap, only keep the early bloom id
  arrange(date) %>%
  group_by(date) %>%
  filter(bloom_date == min(bloom_date))

# Combine all the dates
window_bloom_dates_all <- window_bloom_dates_before %>%
  bind_rows(window_bloom_dates_after) %>%
  distinct()

# For each bloom event, take the maximum discharge ahead of the bloom
discharge <- p1_nwis_Q %>%
  inner_join(window_bloom_dates_before, by = 'date') %>%
  group_by(nwis_site, bloom_date) %>%
  summarize(bloom_Q = max(Q))

# Average the sediment percentage if more than one mission is present
sediment <- yugo %>%
  group_by(date) %>%
  summarize(sediment_total_pct_missionavg = mean(sediment_total_pct)) %>%
  inner_join(window_bloom_dates_all, by = 'date') %>%
  # For each bloom event, average the sediment across the window
  group_by(bloom_date) %>%
  summarize(sediment_pct_max = mean(sediment_total_pct_missionavg))

# Join the sediment percentage per bloom date to the discharge value per bloom date
discharge_sediment <- discharge %>%
  full_join(sediment, by = 'bloom_date')

# Correlation plot as fxn
correlate_sed_discharge <- function(in_data, in_title) {
  ggpubr::ggscatter(
    in_data,
    x = "bloom_Q", y = "sediment_pct_max",
    add = "reg.line", conf.int = TRUE,
    cor.coef = TRUE, cor.method = "kendall",
    xlab = sprintf("%s-day max discharge leading up to bloom date, cfs", days_ahead),
    ylab = "% sediment (average % of AOI without shoreline\ncontamination across missions that is sediment)",
    ylim = c(-50, 100),
    title = in_title
  )
}

huc_plot_list <- discharge_sediment %>%
  left_join(p1_nwis_sites, by = "nwis_site") %>%
  split(.$nwis_site) %>%
  purrr::map(~correlate_sed_discharge(.x, sprintf('%s', unique(.x$river))))

cowplot::plot_grid(plotlist = huc_plot_list)
```
steeleb commented 1 year ago

All the rasters have exported, so I'm going to make those annual stacks that we talked about before and get them up on HydroShare (hopefully today). I'm guessing some sort of localized summary (of the area around the mouths of these rivers) will be more representative than the gigantic AOI we are currently using. We could probably start with just a buffer of 1000m or so around the point where each drains into Superior. I'd be surprised if we didn't see any patterns with a more localized approach.
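A first-pass sketch of that localized summary (everything here is invented for illustration: the river-mouth coordinates, the raster file name, and the sediment class codes are placeholders, not values from this thread):

```r
library(sf)
library(terra)

# Hypothetical river-mouth points (placeholder lon/lat, not real locations)
mouths <- st_as_sf(
  data.frame(river = c('river_a', 'river_b'),
             lon = c(-91.5, -91.0),
             lat = c(46.8, 46.9)),
  coords = c('lon', 'lat'), crs = 4326
)

# One of the classified annual stacks (invented file name), with the points
# reprojected to match the raster's CRS
classified <- rast('classified_stack_hypothetical.tif')
mouths_proj <- st_transform(mouths, crs(classified))

# ~1000 m buffer around each mouth, then the fraction of buffer cells that
# fall in any sediment class (class codes are made up here)
sediment_classes <- c(2, 3, 5)
buffers <- st_buffer(mouths_proj, dist = 1000)
frac_sediment <- extract(classified, vect(buffers),
                         fun = function(x) mean(x %in% sediment_classes, na.rm = TRUE))
```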

steeleb commented 1 year ago

> Exploring sediment trends using percent extent without shoreline contamination!
>
> image
>
> Data-processing code for these plots

Suggestion: consider also limiting your plots to some percentage of the AOI that was classified.

steeleb commented 1 year ago

For precip:Q -

Could you break these out by season to see if there is a better/different relationship that could account for varying hydrologic processes?

lindsayplatt commented 1 year ago

Ha - you're jumping ahead :) Up in this comment (https://github.com/LimnoDataScience/plume_bloom_drivers/issues/11#issuecomment-1671709417) I've got a note about snowmelt potentially causing the funny relationship. I am jumping onto that soon :)

As for limiting to some percentage of the AOI that was classified in those precip plots: I understood the `*_perc_extent_area` columns to mean the classifiable AOI? For example, I am using `extent == 'aoi no sc'` and assumed my percentage is now based on an area equal to the classifiable AOI minus shoreline contamination. Is that right?

lindsayplatt commented 1 year ago

Excited to bring the rasters back into the workflow. So cool you figured something out with the export to make this reasonable! I agree that having some sort of region at the mouth of each river will be better. I wonder if we can actually statistically link cells to each river based on some kind of clustering algorithm using the sediment timeseries and the discharge?
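For what it's worth, a very rough sketch of that cell-to-river linkage idea (entirely hypothetical: `cell_sediment` would be a cells-by-dates 0/1 sediment-presence matrix pulled from the rasters, and `q_by_river` a rivers-by-dates discharge matrix with river names as rownames; neither object exists yet):

```r
library(tidyverse)

# Assign each raster cell to the river whose discharge timeseries its
# sediment timeseries correlates with most strongly
assign_cells_to_rivers <- function(cell_sediment, q_by_river) {
  # cor(X, Y) on the transposed matrices gives an n_cells x n_rivers matrix
  cor_mat <- cor(t(cell_sediment), t(q_by_river),
                 use = 'pairwise.complete.obs')
  tibble(cell = seq_len(nrow(cor_mat)),
         river = colnames(cor_mat)[max.col(cor_mat, ties.method = 'first')],
         max_cor = apply(cor_mat, 1, max, na.rm = TRUE))
}
```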

steeleb commented 1 year ago

> Ha - you're jumping ahead :) Up in this comment (#11 (comment)) I've got a note about snowmelt potentially causing the funny relationship. I am jumping onto that soon :)

Your comment is exactly why I suggested it - I'm so curious.

> As for limiting to some percentage of the AOI that was classified in those precip plots: I understood the `*_perc_extent_area` columns to mean the classifiable AOI? For example, I am using `extent == 'aoi no sc'` and assumed my percentage is now based on an area equal to the classifiable AOI minus shoreline contamination. Is that right?

This is right. I'm mostly thinking about limiting the input summaries to those that have 50% of the total area classified - that would use the column 'perc_area_classified'. There are many images that only have a small portion of the AOI classified at all; each image date doesn't cover the same area.

steeleb commented 1 year ago

> Excited to bring the rasters back into the workflow. So cool you figured something out with the export to make this reasonable! I agree that having some sort of region at the mouth of each river will be better. I wonder if we can actually statistically link cells to each river based on some kind of clustering algorithm using the sediment timeseries and the discharge?

I honestly have no idea how I made it so much more efficient. Maybe it has something to do with the .geo column, because that's gone in the previous PR update too, and I have no idea why.

I would think about this even more simply, since the sediment all kind of becomes a single feature (except the dark nearshore sediment, which only happens consistently in 1 of these 4 streams, I think), and clustering wouldn't necessarily work. I'd literally, for a first pass, just use a single point with a 1000m buffer and summarize the data in that area. We can get more nuanced if we start to actually see patterns emerge.

lindsayplatt commented 1 year ago

Ahhhh yes, OK. I see what you're saying now. 758 rows were dropped when I added `filter(perc_extent_classified >= 50)`.

image

Le code:

```r
library(tidyverse)

# Pass in a vector of month numbers (1 thru 12) and get a
# vector of the same size with the season back.
month_to_season <- function(month_num) {
  # Create vector with season as the value and month number as the name
  month_season <- setNames(rep(c("Winter", "Spring", "Summer", "Fall"), each = 3),
                           c(12, 1:11))
  # Arrange so that DEC is in the 12th spot
  month_season <- month_season[order(as.numeric(names(month_season)))]
  # Pull out the season based on month number as the indices
  season_out <- month_season[month_num]
  names(season_out) <- NULL # Drop the names attribute
  return(season_out)
}

yugo <- read_csv('GTB_by_mission_v2023-08-03.csv') %>%
  filter(extent == 'aoi no sc') %>%
  # Remove dates and missions where <50% of the AOI
  # scene was able to be classified
  filter(perc_extent_classified >= 50) %>%
  # Keep only the percentages for sediment
  select(mission, date,
         cloud_pct = cloud_perc_extent_area,
         light_nearshore_pct = lightNSSed_perc_extent_area,
         dark_nearshore_pct = darkNSSed_perc_extent_area,
         offshore_pct = offShoreSed_perc_extent_area) %>%
  mutate(sediment_total_pct = light_nearshore_pct + dark_nearshore_pct + offshore_pct) %>%
  mutate(year = year(date),
         year_mission = sprintf('%s_%s', year, mission),
         season = ordered(month_to_season(month(date)),
                          levels = c('Winter', 'Spring', 'Summer', 'Fall')))

p1 <- ggplot(yugo, aes(x = date, y = sediment_total_pct)) +
  geom_point(aes(alpha = cloud_pct, color = season, fill = season),
             shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  facet_grid(. ~ season) +
  geom_smooth(color = 'black') +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment')

p2 <- yugo %>%
  filter(cloud_pct < 50) %>%
  ggplot(aes(x = date, y = sediment_total_pct, color = season, fill = season)) +
  geom_smooth(color = 'black') +
  geom_point(shape = 22, color = "transparent", size = 2) +
  theme_bw() +
  facet_grid(. ~ season) +
  ylab('Daily % sediment') + xlab('Date') +
  ggtitle('Daily timeseries of % sediment, only clouds below 50%')

p3 <- yugo %>%
  group_by(year, season) %>%
  summarize(avg_sediment = mean(sediment_total_pct)) %>%
  ggplot(aes(x = year, y = avg_sediment, color = season, fill = season)) +
  geom_smooth(color = 'black') +
  geom_point() +
  theme_bw() +
  facet_grid(. ~ season) +
  ylab('Average annual % sediment') + xlab('Year') +
  ggtitle('Annual timeseries of % sediment')

cowplot::plot_grid(p1, p2, p3, nrow = 3, ncol = 1)
```
steeleb commented 1 year ago

Takes out a smidge of noise!

And this is my fault for not providing good metadata for you. This is what happens when I'm rushing. ;)

lindsayplatt commented 1 year ago

Oh, no worries! We're getting there. I also haven't been doing code reviews like you just did, which is where you would probably have suggested something like that. I wanted to get a handle on where we were going first, but probably should have transitioned by now.

lindsayplatt commented 1 year ago

Alrighty, let's look at precip-Q by season. This makes more sense to me.

All precip-Q dates

image

Only storm precip-Q dates

image

Code for these figs:

```r
# TODO: pull accompanying NWIS discharge and gut PPT trends/relationship per HUC.
library(targets)
library(tidyverse)
library(dataRetrieval)
library(sf)

tar_load(p1_huc10_nwis_sites)
tar_load(p1_nwis_sites)
tar_load(p1_nwis_Q)
tar_load(p2_prism_data_huc)

site_huc_xwalk <- p1_huc10_nwis_sites %>%
  st_drop_geometry() %>%
  select(nwis_site, HUC10 = huc10) %>%
  left_join(p1_nwis_sites, by = "nwis_site")

add_storm_ppt <- function(all_ppt, storm_percentile = 0.95) {
  quantile(all_ppt, probs = storm_percentile) %>% unname()
}

# Pass in a vector of month numbers (1 thru 12) and get a
# vector of the same size with the season back.
month_to_season <- function(month_num) {
  # Create vector with season as the value and month number as the name
  month_season <- setNames(rep(c("Winter", "Spring", "Summer", "Fall"), each = 3),
                           c(12, 1:11))
  # Arrange so that DEC is in the 12th spot
  month_season <- month_season[order(as.numeric(names(month_season)))]
  # Pull out the season based on month number as the indices
  season_out <- month_season[month_num]
  names(season_out) <- NULL # Drop the names attribute
  return(season_out)
}

precip <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  # GROUP BY HUC AND CALC STORM VALUE
  group_by(huc) %>%
  mutate(ppt_huc_storm = add_storm_ppt(value_huc, 0.95)) %>%
  # Filter to only storms
  # filter(value_huc >= ppt_huc_storm) %>%
  # Now sum the HUC values into one
  group_by(huc, date) %>%
  summarize(ppt_huc_total = sum(value_huc)) %>%
  select(HUC10 = huc, date, ppt = ppt_huc_total) %>%
  mutate(season = ordered(month_to_season(month(date)),
                          levels = c('Winter', 'Spring', 'Summer', 'Fall')))

ppt_q_comparison_data <- precip %>%
  left_join(site_huc_xwalk, by = "HUC10") %>%
  left_join(p1_nwis_Q, by = c("nwis_site", "date")) %>%
  mutate(facet_nm = sprintf('%s (NWIS site: %s)', river, nwis_site))

sum(is.na(ppt_q_comparison_data$Q)) # 44,672 NAs

ggplot(ppt_q_comparison_data, aes(x = ppt, y = Q, color = season)) +
  geom_point(alpha = 0.70) +
  facet_grid(river ~ season, scales = "free_y") +
  geom_smooth(color = 'black') +
  ylab('Daily mean discharge, ft^3/sec') + xlab('Total daily precip, mm') +
  theme_bw() #+
  # ggtitle('Storm only PPT matched to Q')
```
steeleb commented 1 year ago

A million more times sensical!

And no worries about code review - when doing a lot of basic exploratory work, this is almost better on my end.

lindsayplatt commented 1 year ago

I've been exploring the lags between discharge and precip for each site using cross-correlation (plots and code below).

Based on this analysis, I am next going to adjust my previous precip-Q relationships to account for those lags.

image

Code hiding here per usual:

```r
library(targets)
library(tidyverse)
library(sf)

tar_load(p1_nwis_sites)
tar_load(p1_huc10_nwis_sites)
tar_load(p1_nwis_Q)
tar_load(p2_prism_data_huc)

site_huc_xwalk <- p1_huc10_nwis_sites %>%
  st_drop_geometry() %>%
  select(nwis_site, HUC10 = huc10) %>%
  left_join(p1_nwis_sites, by = "nwis_site")

Q <- p1_nwis_Q %>%
  left_join(site_huc_xwalk, by = "nwis_site")

precip <- p2_prism_data_huc %>%
  filter(variable == 'ppt') %>%
  rename(HUC10 = huc, ppt = value_huc) %>%
  left_join(site_huc_xwalk, by = "HUC10")

precip_Q <- precip %>%
  left_join(Q, by = c("river", "nwis_site", "HUC10", "date"))

# Run cross correlation per site/river
ccf_ppt_q_byriver <- precip_Q %>%
  filter(!is.na(Q), !is.na(ppt)) %>%
  split(.$river) %>%
  # purrr::map(~ccf(.x$Q, .x$ppt))
  purrr::map(~ccf(.x$ppt, .x$Q, plot = F, ci = 0.99))

par(mfrow = c(2, 2))
for(i in 1:length(ccf_ppt_q_byriver)) {
  plot(ccf_ppt_q_byriver[[i]], main = names(ccf_ppt_q_byriver)[i])
  arrows(x0 = 0, y0 = -0.1, y1 = 0, col = 'red', lwd = 2, length = 0.1)
}
par(mfrow = c(1, 1))

# Extract the maximum lag above a correlation value of 0.20
statsig_lags <- ccf_ppt_q_byriver %>%
  purrr::map(function(x) {
    # 99% CI threshold (the blue line on the ccf plot), computed for reference
    blue_line_val <- qnorm((1 + 0.99)/2) / sqrt(x$n.used)
    lags_above <- x$lag[x$acf > 0.20]
    # Choose the biggest lag (negative or positive)
    lag_out <- lags_above[which.max(abs(lags_above))]
    return(lag_out)
  })
statsig_lags
```
lindsayplatt commented 1 year ago

It doesn't seem like adding the lags really made a huge difference; the storm filter made a bigger impact. If we were summing, maybe it would. I think using the lags in our analysis as we continue is still a good idea, though, because the lag is a characteristic of basin shape/size and also stream gage placement.

What I did was shift the Q to match the precip from {lag} days earlier and then plot Q vs PPT, filtered to only precip storm days. Note that the Siskiwit plots should not change when the lag is added because its lag = 0.

image

Code to add on to previous comment's code:

```r
# Apply the lag to the precip-Q relationship part.
ppt_q_comparison_data <- precip_Q %>%
  # GROUP BY HUC AND CALC STORM VALUE
  group_by(HUC10) %>%
  mutate(ppt_huc_storm = add_storm_ppt(ppt, 0.95)) %>%
  ungroup() %>%
  # Need to add a column that represents the precip if you include the lag per river
  # This method assumes that `split(.$river)` and `statsig_lags` have rivers in the same order
  split(.$river) %>%
  purrr::map2(statsig_lags, ~mutate(.x, ppt_lag = lag(ppt, abs(.y)))) %>%
  bind_rows() %>%
  # Change season to factor for plotting
  mutate(season = ordered(month_to_season(month(date)),
                          levels = c('Winter', 'Spring', 'Summer', 'Fall')))

p <- ggplot(ppt_q_comparison_data, aes(x = ppt, y = Q, color = season)) +
  geom_point(alpha = 0.70) +
  facet_grid(river ~ season, scales = "free_y") +
  geom_smooth(color = 'black') +
  ylab('Daily mean discharge, ft^3/sec') + xlab('Total daily precip, mm') +
  theme_bw() +
  ggtitle('Q-PPT relationship')

p_storm <- ppt_q_comparison_data %>%
  # Filter to only storms
  filter(ppt >= ppt_huc_storm) %>%
  ggplot(aes(x = ppt, y = Q, color = season)) +
  geom_point(alpha = 0.70) +
  facet_grid(river ~ season, scales = "free_y") +
  geom_smooth(color = 'black') +
  ylab('Daily mean discharge, ft^3/sec') + xlab('Total daily precip, mm') +
  theme_bw() +
  ggtitle('Q-PPT relationship, only storms')

p_lag <- ggplot(ppt_q_comparison_data, aes(x = ppt_lag, y = Q, color = season)) +
  geom_point(alpha = 0.70) +
  facet_grid(river ~ season, scales = "free_y") +
  geom_smooth(color = 'black') +
  ylab('Daily mean discharge, ft^3/sec') + xlab('Total daily precip, mm') +
  theme_bw() +
  ggtitle('Q-PPT relationship with lag')

p_lag_storm <- ppt_q_comparison_data %>%
  # Filter to only storms
  filter(ppt_lag >= ppt_huc_storm) %>%
  ggplot(aes(x = ppt_lag, y = Q, color = season)) +
  geom_point(alpha = 0.70) +
  facet_grid(river ~ season, scales = "free_y") +
  geom_smooth(color = 'black') +
  ylab('Daily mean discharge, ft^3/sec') + xlab('Total daily precip, mm') +
  theme_bw() +
  ggtitle('Q-PPT relationship with lag, only storms')

cowplot::plot_grid(p, p_lag, p_storm, p_lag_storm)
```
lindsayplatt commented 1 year ago

Closed by #12