DOI-USGS / lake-temperature-model-prep


Verify MN Sentinel lakes program lakes appear in the "well-studied"/PGDL tiers #192

Closed (jordansread closed this issue 3 years ago)

jordansread commented 3 years ago

See list of DOWs, which can be checked w/ the crosswalk https://www.dnr.state.mn.us/fisheries/slice/sentinel-lakes.html

jordansread commented 3 years ago

The 2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds file will provide a crosswalk between the DOWs (MN state identifiers) and the national hydrography dataset identifiers that are used throughout this project.
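For example, a quick way to inspect the crosswalk (it has MNDOW_ID and site_id columns, which are used in the code below):

xwalk <- readRDS("2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds")
head(xwalk)  # columns include MNDOW_ID (state id) and site_id (nhdhr id)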

This issue is meant to capture information regarding these important lakes, verifying they are indeed included in the project and likely creating a spin-off issue if we discover we are missing some of these lakes.

getting set up

1) checking whether lakes are included in the initial project lake shapefile

using a dplyr::filter or a join, check whether all sentinel lakes are included in mndow_nhdhr_xwalk.rds. Show code on this issue and comment on any missing lakes and whether any of these lakes share the same site_id (the id prefixed with "nhdhr_" in the crosswalk).
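A minimal sketch of the join version (the sentinel_dows vector here is a placeholder for the DOW list from the MN DNR page, formatted like the crosswalk's MNDOW_ID values):

library(dplyr)
xwalk <- readRDS("2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds")
# sentinel_dows: the sentinel lake DOWs, e.g. "mndow_06000200"
missing_lakes <- tibble(MNDOW_ID = sentinel_dows) %>%
  anti_join(xwalk, by = "MNDOW_ID")  # any rows left over are sentinel lakes missing from the crosswalk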

2) checking whether lakes have all modeling parameters

using a similar method as above, verify the sentinel lakes are included in the nml_list.rds, this time using their site_id from the crosswalk file. Show code and comment on any missing lake ids.

3) how much temperature data do these lakes have?

create a plot of the number of observations per sentinel lake found in temp_data_with_sources.feather. I'd recommend a bar plot with the observation count on the y-axis and the lake name at the base of each bar (a single bar per lake), but something fancier to show the data volume per lake (or even through time) would be useful. Show code.

next steps

If we run into situations where some lakes are missing from the mndow_nhdhr_xwalk.rds, we will want to figure out why. Likewise, if lakes are in the mndow_nhdhr_xwalk.rds but not in the nml_list.rds, we will want to discover which missing data are limiting inclusion (the individual data files that go into that file are here). Also, if some of these lakes only have limited temperature data, it may mean we are missing key data sources. Each of these additional questions will require more detail to turn into useful issues, but I'm providing them here for context.

lindsayplatt commented 3 years ago

Just so we have a log for next time we bring someone in: 1) I gave @RAtshan access to the GD folder, and 2) those two data download targets are in getters.yml, which is not linked to remake.yml, so we had to add that to the scmake call (I directly edited the jread comment above to reflect this change). All good now and she is off and running!
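For the record, the call presumably looked something like this (the target name below is just a placeholder for the two download targets; the point is passing getters.yml explicitly since it isn't referenced by remake.yml):

# build a target that is defined in getters.yml instead of remake.yml
scipiper::scmake("name_of_download_target", remake_file = "getters.yml")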

RAtshan commented 3 years ago

1) checking whether lakes are included in the initial project lake shapefile

All sentinel lakes are included in the mndow_nhdhr_xwalk.rds. Also, none of the lakes in the crosswalk data share the same site_id.

The R code:

library(dplyr)

### building the sentinel lake DOW ids vector
sentinel_lake_ids <- c("mndow_06000200", "mndow_69025400", 
                       "mndow_47004901", "mndow_21005700", 
                       "mndow_34003200", "mndow_49014000", 
                       "mndow_69061500", "mndow_69081000", 
                       "mndow_15001000", "mndow_16007700", 
                       "mndow_01014200", "mndow_07004400",
                       "mndow_73003700", "mndow_02000400", 
                       "mndow_29025000", "mndow_18038600", 
                       "mndow_41008900", "mndow_13002700", 
                       "mndow_44001400", "mndow_83004300", 
                       "mndow_81000300", "mndow_16038400", 
                       "mndow_11041300", "mndow_16004900",
                       "mndow_69000400")

## reading in the crosswalk data as an R object
mndow_xwalk_data <- readRDS("2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds")

# using filter to keep only the sentinel lakes declared above
# that appear in mndow_xwalk_data
xwalk_sentinel_lake_check <- mndow_xwalk_data %>%
  filter(MNDOW_ID %in% sentinel_lake_ids)

# check if all of the sentinel lakes are in the xwalk dataset
# by checking whether the number of matched rows is the
# same as the number of lakes in sentinel_lake_ids
nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids)

# summarize the lakes associated with each site_id
count_xwalk_lake_siteID <- xwalk_sentinel_lake_check %>%
  group_by(site_id) %>%
  summarize(unique_lakes_sitID = list(unique(MNDOW_ID)),
            num_unique_lakes = length(unique(MNDOW_ID))) %>%
  ungroup()

# select any site_id associated with multiple lakes
# by filtering for num_unique_lakes greater than one
not_uniques_sentinel_lake_siteID <- count_xwalk_lake_siteID %>%
  filter(num_unique_lakes > 1)

R output:

The results:

Since the number of sentinel lakes in the xwalk dataset is the same as the number of sentinel lakes in the DOW ids vector, we can conclude that all sentinel lakes are included in the mndow_nhdhr_xwalk.rds.

Similarly, the not_uniques_sentinel_lake_siteID check returned a dataframe with 0 rows, because none of the site_ids are associated with more than one MNDOW_ID (DOW id). We can conclude that none of the lakes in the crosswalk data share the same site_id.

jordansread commented 3 years ago

Thanks Rasha - I think it would be good to show how you verify (with code) that none of the sentinel lakes are missing. Do you assume that nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids) means none are missing?

Likewise, I'm not seeing how the list() of unique MNDOW_IDs is confirming none are repeated. Do you need to check the length of that list or perhaps use a different way of verifying there aren't repeats, such as

xwalk_sentinel_lake_check %>%
    group_by(site_id) %>%
    summarize(unique_site_id = length(unique(MNDOW_ID))) %>% filter(unique_site_id != 1)
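And for the missing-lakes question, something as simple as the following would list any DOWs absent from the crosswalk (an empty result means none are missing):

setdiff(sentinel_lake_ids, mndow_xwalk_data$MNDOW_ID)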
RAtshan commented 3 years ago

Thanks Jordan for your feedback ~

Yes, I checked the equality nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids) to verify that none of the lakes are missing.

Similarly, I checked the length of the unique(MNDOW_ID) list to verify that none of the site_ids are repeated.

I updated my comment for part 1 to include R code and R output that verify and support the results.

jordansread commented 3 years ago

Thanks. I assume you've moved on to number 2, which is good.

RAtshan commented 3 years ago

2) checking whether lakes have all modeling parameters

All sentinel lakes are included in the nml_list.rds. Also, each sentinel lake's site_id appears twice in the nml_list.rds.

The R code (note that I joined the sentinel lake MNDOW_IDs after filtering the nml_list site_ids):

## reading in the nml_list data as an R object
nml_list <- readRDS('7_config_merge/out/nml_list.rds')
# extracting the site_id using the names of nml_list
# (site ids are the same as the list element names)
nml_site_id <- tibble(site_id = names(nml_list))

## using filter to check if the sentinel lakes are included in the nml_list.rds,
## using the sentinel lakes' site_id from the xwalk_sentinel_lake_check data
  # 1) extracting the site_id column from xwalk_sentinel_lake_check
xwalk_site_id <- xwalk_sentinel_lake_check$site_id

  # 2) filtering nml_site_id to the sentinel lake site_ids,
  # then left joining the sentinel lake MNDOW_IDs
nml_list_sentinel_lake_check <- nml_site_id %>%
  filter(site_id %in% xwalk_site_id) %>%
  left_join(xwalk_sentinel_lake_check)

# check if the number of lakes in xwalk_sentinel_lake_check is the
# same as the number of unique site_ids filtered from nml_list
nrow(xwalk_sentinel_lake_check) == length(unique(nml_list_sentinel_lake_check$site_id))

# find the number of times each sentinel lake's site_id appears in the nml_list
count_nml_lake_siteID <- nml_list_sentinel_lake_check %>%
  group_by(MNDOW_ID) %>%
  summarize(nml_list_sitID = list(site_id),
            num_repeated_sitID = length(site_id)) 

# check if any of the sentinel lake site_ids were repeated in the nml_list data
# by filtering for site_ids that occurred more than once
not_uniques_nml_list_siteID <- count_nml_lake_siteID %>%
  filter(num_repeated_sitID > 1)

R output:


The results:

Since the number of sentinel lake site_ids in the xwalk dataset is the same as the number of unique site_ids filtered from nml_list, we can conclude that all sentinel lakes are included in the nml_list.rds.

In the meantime, the not_uniques_nml_list_siteID results above show that each sentinel lake's site_id was extracted twice from the filtered nml_list data. Hence, we conclude that every lake's site_id is repeated in the nml_list.rds data.

jordansread commented 3 years ago

The duplication of the site_ids looks like an issue we'll need to figure out elsewhere.

But I am wondering if you considered using the %in% function in R.

Like this:

all(xwalk_sentinel_lake_check$site_id %in% names(nml_list))

This verifies that all site_ids that are in the sentinel lake crosswalk are also in the nml_list.

Checking that the two vectors have the same length is sometimes a brittle way to check that two things are the same. What if the two vectors are the same length but don't have matches (due to an error elsewhere)? I think familiarizing yourself with %in% in this case and future cases would be helpful.
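As a toy example of that failure mode:

x <- c("a", "b", "c")
y <- c("d", "e", "f")
length(x) == length(y)  # TRUE, even though none of the elements match
all(x %in% y)           # FALSE, which is the check we actually care about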

Regardless, it is clear that these lakes are in the nml_list and the project, so looking forward to seeing what we have in stage 3.

RAtshan commented 3 years ago

3) how much temperature data do these lakes have?

Part 3 (a): Bar plot of the number of observations per sentinel lake.

After fetching the temperature data 7b_temp_merge/out/temp_data_with_sources.feather, I checked whether the sentinel lakes' site_ids from mndow_nhdhr_xwalk.rds are in the temperature data. All sentinel lakes are included in the temperature data. Then I created a bar plot of the number of observations (counts) for the sentinel lakes found in temp_data_with_sources.feather using the following steps:

The R code:

library(ggplot2)

## read the feather data as an R object
temp_data_sentinel_lake <- feather::read_feather("7b_temp_merge/out/temp_data_with_sources.feather") 

# first check if all xwalk data site_ids are in the temp data
all(xwalk_sentinel_lake_check$site_id %in% temp_data_sentinel_lake$site_id)

# reading the csv file that has MN Sentinel lake names and MNDOW_ID
lake_names <- readr::read_csv("sentinel_lake_names.csv") %>% 
  mutate(MNDOW_ID = paste0("mndow_", `MN DOW`)) %>%
  select(-c("MN DOW"))

### filtering to the sentinel lake site_ids in the temp data
## getting the overall number of observations for each lake
count_sentinel_lake_obs <- temp_data_sentinel_lake %>%
  # filter the temp data to site_id in xwalk data
  filter(site_id %in% xwalk_site_id) %>%
  # group by site_id 
  group_by(site_id) %>%
  # find the count/volume of temp observations for each site_id
  summarize(n_per_lake = n()) %>%
  # left join the sentinel lake DOW ids and names to the summarized data for tracking the lakes
  left_join(xwalk_sentinel_lake_check) %>%
  left_join(lake_names, by = "MNDOW_ID") %>%
  arrange(n_per_lake)

p <- ggplot(count_sentinel_lake_obs, 
  aes(x = reorder(factor(Lake), -n_per_lake), y = n_per_lake)) +
  geom_bar(stat = 'identity', color = "black", fill = "grey90") +
  theme_bw() +
  cowplot::theme_cowplot() +
  # to center the plot title
  theme(plot.title = element_text(hjust = 0.5),
        # to rotate and space the x-axis labels 
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = 'Temperature Data Volume for MN Sentinel Lakes',
       x = 'MN Sentinel Lake',
       y = 'Observation Volume') +
  geom_text(aes(label = n_per_lake), vjust = 0)

R output:


The results:

RAtshan commented 3 years ago

The next step is creating time series plots to show the volume of temperature data for each lake separately. @lindsayplatt and I discussed it and I prepared/processed the data for plotting.

jordansread commented 3 years ago

Thanks, good to hear. It would also be helpful to see where the data are coming from for these lakes. That would be found in the source column in the temperature data file. Please plan to include that in the plot.

RAtshan commented 3 years ago

3) how much temperature data do these lakes have?

Part 3 (b): Creating time series plots of the number of observations for each sentinel lake.

To create a time series for each sentinel lake that includes the source of the measured temperatures, I first obtained descriptive statistics for the temperature data sources. These describe the number of sources, and the list of sources, associated with each sentinel lake.

The R code:

### filtering site ids from the temp data
# looking at the observations per year for each lake

# the full temperature data read in part 3 (a); renaming it for the code below
temp_data <- temp_data_sentinel_lake

# 1) looking at the number and list of sources for the recorded temps for each Sentinel lake;
# it is clear that each Sentinel lake has more than one source
source_count_per_lake <- temp_data %>%
  filter(site_id %in% xwalk_site_id) %>%
  group_by(site_id) %>%
  summarize(unique_lake_source = list(unique(source_id)),
            n_unique_sources = length(unique(source_id)))

## filtering the temp data to the data associated with sentinel lakes
# getting the number of recorded observations per year without source
annual_count_sentinel_lake_obs <- temp_data %>%
  mutate(year = lubridate::year(date)) %>%
  filter(site_id %in% xwalk_site_id) %>%
  group_by(site_id, year) %>%
  summarize(n_per_year = n()) %>%
  # adding total_per_year, which summarizes the number of observations for each 
  # site_id (lake) per year across all sources
  mutate(source_id = "total_per_year")

## filtering the temp data to the data associated with sentinel lakes
# getting the number of recorded observations per year per source
count_sentinel_lake_obs_year <- temp_data %>%
  mutate(year = lubridate::year(date)) %>%
  filter(site_id %in% xwalk_site_id) %>%
  group_by(site_id, year, source_id) %>%
  summarize(n_per_year = n()) %>%
  rbind(annual_count_sentinel_lake_obs) %>%
  left_join(xwalk_sentinel_lake_check) %>%
  left_join(lake_names, by = "MNDOW_ID") %>%
  # replace NA source_id values with a placeholder label
  mutate(source_id = ifelse(is.na(source_id), 
                            "No_source_provided", source_id))

summary(count_sentinel_lake_obs_year)

#### plotting time series of the number of observations per sentinel lake,
# with the number of observations on the y-axis
# and the source of the measured temperature as the color of the plotted points
library(ggforce) # for facet_wrap_paginate()
# declaring source_id as a factor
count_sentinel_lake_obs_year$source_id <- factor(count_sentinel_lake_obs_year$source_id)
# extracting the unique source_id values from the count data
plot_source <- unique(count_sentinel_lake_obs_year$source_id)
# assigning a color to each source_id
plot_color <- setNames(c("#0087ff", "#00005f", "#008700", "#DDAA33", "#00ff5f", "#ff00ff", "#d70000","#BBBBBB"), plot_source)

###
# number of facets per plot/page
plots_per_fig <- 4
# list of the unique lakes extracted from the data set
sentinel_lakes <- unique(count_sentinel_lake_obs_year$Lake)
# number of plots/pages we need printed:
# the number of unique lakes divided by the number of facets on each page
# (ceiling rounds a number up to the next integer)
n_plots <- ceiling(length(sentinel_lakes) / plots_per_fig) 

for (i in 1:n_plots) { 

    p_each_lake <- ggplot(count_sentinel_lake_obs_year, aes(x = year, y = n_per_year)) +
      geom_line(aes(colour = source_id), alpha = 0.45) +
      geom_point(aes(colour = source_id), size = 3, alpha = 0.55) +
      # breaks match the sources from the data to the corresponding colors
      scale_color_manual(breaks = names(plot_color), values = plot_color) +
      # using facet_wrap_paginate to plot the 25 sentinel lakes across pages:
      # 7 pages with 4 facets each, except the last page, which has 1 facet
      facet_wrap_paginate(~ Lake, strip.position = "right", 
                          ncol = ifelse(i == 7, 1, 2),
                          nrow = ifelse(i == 7, 1, 2),
                          scales = 'free', page = i) +
      theme_bw() +
      cowplot::theme_cowplot() +
      theme(plot.title = element_text(hjust = 0.5),
            legend.title = element_blank(),
            legend.position = "bottom") +
      labs(title = 'Sentinel Lakes Temperature Data Volume',
           x = 'Date: years',
           y = 'Number of observations') 

    plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
                       i)
    ggsave(plot_out, p_each_lake, height = 5, width = 9, dpi = 250)
    print(p_each_lake)
} 

R output:


The results:

aappling-usgs commented 3 years ago

Looking good! Nice to learn about facet_wrap_paginate, which I'd never seen before.

There are benefits to the line plots you've selected, but another way to visualize these annual observation counts by source would be a stacked bar plot or area plot, for which the top of the bars/areas would indicate the total number of observations across all sources (so you wouldn't need a separate total_per_year), and the vertical thickness of each color bar/area would show the contribution from each agency. Just food for thought and maybe a future data visualization.
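For reference, a rough sketch of the area-plot variant, reusing the per-year counts from the comment above (dropping the total_per_year rows so the stacked areas themselves show the total):

count_sentinel_lake_obs_year %>%
  filter(source_id != "total_per_year") %>%
  ggplot(aes(x = year, y = n_per_year, fill = source_id)) +
  geom_area(position = "stack") +
  facet_wrap(~ Lake, scales = "free_y")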

jordansread commented 3 years ago

Nice work @RAtshan. I would like to share this with the MN DNR group so they can provide additional data if they have it beyond these sources. But I think we need two updates to do that:

RAtshan commented 3 years ago

3) how much temperature data do these lakes have?

Part 3 (c): Creating a stacked bar plot of the number of observations for each sentinel lake.

The R code:

## filtering the temp data to the data associated with sentinel lakes
# getting the number of recorded observations per year per source
count_sentinel_lake_obs_year <- temp_data %>%
  mutate(year = lubridate::year(date)) %>%
  filter(site_id %in% xwalk_site_id) %>%
  group_by(site_id, year, source_id) %>%
  summarize(n_per_year = n()) %>%
  left_join(xwalk_sentinel_lake_check) %>%
  left_join(lake_names, by = "MNDOW_ID") %>%
  # replace NA source_id values, assigning the NAs to water_quality_portal
  mutate(source_id = ifelse(is.na(source_id),
                            "water_quality_portal", source_id))

# creating stacked barplots for each lake: each year is a bar,
# the x-axis is the year and the y-axis is the number of observations.
# each bar is a stack of the number of observations from the different sources:
# the top of the bar shows the total number of observations across all sources 
# and the vertical thickness of each color segment shows the number of observations from each source.
for (i in 1:n_plots) {
  # creates a page of plots; each color segment represents the source
  # of the data collection
  p_each_lake <- ggplot(count_sentinel_lake_obs_year, 
                        aes(x = year, y = n_per_year, fill = source_id)) +
    geom_bar(stat = "identity") +
    # breaks match the sources from the data to the corresponding colors
    scale_fill_manual(breaks = names(plot_color), values = plot_color) +
    # using facet_wrap_paginate to plot the 25 sentinel lakes across pages:
    # 7 pages with 4 facets each, except the last page, which has 1 facet
    facet_wrap_paginate(~ Lake, strip.position = "right",
                        ncol = ifelse(i == 7, 1, 2),
                        nrow = ifelse(i == 7, 1, 2),
                        scales = 'free', page = i) +
    theme_bw() +
    cowplot::theme_cowplot() +
    theme(plot.title = element_text(hjust = 0.5),
          legend.title = element_blank(),
          legend.position = "bottom") +
    labs(title = 'Sentinel Lakes Temperature Data Volume',
         x = 'Date: years',
         y = 'Number of observations') 
   plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
                     i)
  ggsave(plot_out, p_each_lake, height = 5, width = 10, dpi = 250)
  print(p_each_lake)
}

R output:

The results:

jordansread commented 3 years ago

This is great. The last change I'd request is locking all of the x-axis ranges to be the same; 1980 to 2020 would be best. Otherwise, all set.
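In ggplot terms that could be as simple as adding a fixed x scale to the plot call, e.g.:

scale_x_continuous(limits = c(1980, 2020))

so every facet spans the same years.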

RAtshan commented 3 years ago

Part 3 (c): Creating stacked bar plot of the number of observations for each sentinel lake.

Data processing: the data processing was the same as part 3 (c) above, except I filtered to years 1980 through 2020 (inclusive).

R code:

## filtering the temp data to the data associated with sentinel lakes
# getting the number of recorded observations per year per source
count_sentinel_lake_obs_year <- temp_data %>%
  mutate(year = lubridate::year(date)) %>%
  filter(year >= 1980 & year <= 2020) %>%
  filter(site_id %in% xwalk_site_id) %>%
  group_by(site_id, year, source_id) %>%
  summarize(n_per_year = n()) %>%
  left_join(xwalk_sentinel_lake_check) %>%
  left_join(lake_names, by = "MNDOW_ID") %>%
  # replace NA source_id values, assigning the NAs to water_quality_portal
  mutate(source_id = ifelse(is.na(source_id),
                            "water_quality_portal", source_id))

summary(count_sentinel_lake_obs_year)

# declaring source_id as a factor
count_sentinel_lake_obs_year$source_id <- factor(count_sentinel_lake_obs_year$source_id)
# extracting the unique source_id values from the count data
plot_source <- unique(count_sentinel_lake_obs_year$source_id)
# assigning a color to each source_id
plot_color <- setNames(c("#af87ff", "#00005f", "#0087ff", "#00d7ff", "#00ff5f", "#0000af", "#5f00ff"), plot_source)

###
# number of facets per plot/page
plots_per_fig <- 4
# list of the unique lakes extracted from the data set
sentinel_lakes <- unique(count_sentinel_lake_obs_year$Lake)
n_plots <- ceiling(length(sentinel_lakes) / plots_per_fig)
## extracting the min and max of the year column for locking the x-axis range
year_range <- range(count_sentinel_lake_obs_year$year)

# creating stacked barplots for each lake: each year is a bar,
# the x-axis is the year and the y-axis is the number of observations.
# each bar is a stack of the number of observations from the different sources:
# the top of the bar shows the total number of observations across all sources
# and the vertical thickness of each color segment shows the number of observations from each source.
for (i in 1:n_plots) {
  # creates a page of plots; each color segment represents the source
  # of the data collection
  p_each_lake <- ggplot(count_sentinel_lake_obs_year,
                        aes(x = year, y = n_per_year, fill = source_id)) +
    geom_bar(stat = "identity") +
    # breaks match the sources from the data to the corresponding colors
    scale_fill_manual(breaks = names(plot_color), values = plot_color) +
    # lock every facet to the same x-axis range
    scale_x_continuous(limits = year_range) +
    # using facet_wrap_paginate to plot the 25 sentinel lakes across pages:
    # 7 pages with 4 facets each, except the last page, which has 1 facet
    facet_wrap_paginate(~ Lake, strip.position = "right",
                        ncol = ifelse(i == 7, 1, 2),
                        nrow = ifelse(i == 7, 1, 2),
                        scales = 'free', page = i) +
    theme_bw() +
    cowplot::theme_cowplot() +
    theme(plot.title = element_text(hjust = 0.5),
          legend.title = element_blank(),
          legend.position = "bottom") +
    labs(title = 'Sentinel Lakes Temperature Data Volume',
         x = 'Date: years',
         y = 'Number of observations')
   plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
                     i)
  ggsave(plot_out, p_each_lake, height = 5, width = 10, dpi = 250)
  print(p_each_lake)
}

R output:

(Attached figures: Sentinal_lakes_timeseries_set_1 through Sentinal_lakes_timeseries_set_7)

jordansread commented 3 years ago

Looks good, I am satisfied with this task and we can call it done. Thanks!