Closed by jordansread 3 years ago
The 2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds file will provide a crosswalk between the DOWs (MN state identifiers) and the national hydrography dataset identifiers that are used throughout this project.
This issue is meant to capture information regarding these important lakes, verifying they are indeed included in the project and likely creating a spin-off issue if we discover we are missing some of these lakes.
Running scmake('2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds', remake_file = 'getters.yml') once the pipeline is set up locally will make sure that file is current locally. scmake('7_config_merge/out/nml_list.rds', remake_file = 'getters.yml') will fetch the current file that includes lake parameters, including maximum depth, water clarity, and lake size. scmake('7b_temp_merge/out/temp_data_with_sources.feather', remake_file = 'getters.yml') will fetch the current temperature data.
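For convenience, the three fetch commands can be run together from the project root once the pipeline is set up locally (a sketch, assuming scipiper is installed and provides scmake()):
library(scipiper)
# crosswalk between MN DOW ids and NHDHR site_ids
scmake('2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds', remake_file = 'getters.yml')
# lake parameters (maximum depth, water clarity, lake size) as nml lists
scmake('7_config_merge/out/nml_list.rds', remake_file = 'getters.yml')
# merged temperature observations with their sources
scmake('7b_temp_merge/out/temp_data_with_sources.feather', remake_file = 'getters.yml')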
DOW ids are prefixed with mndow_ in the crosswalk, and this prefix needs to be included in the vector, so Artichoke Lake should be "mndow_06000200", not 06000200 (it is also important that this is stored as a character, not a number, so it doesn't lose the "0" in front of the id).
Using a dplyr::filter or a join, check whether all sentinel lakes are included in mndow_nhdhr_xwalk.rds. Show code on this issue and comment on any missing lakes and whether any of these lakes share the same site_id (the id prefixed with "nhdhr_" in the crosswalk).
Using a similar method as above, verify the sentinel lakes are included in nml_list.rds, this time using their site_id from the crosswalk file. Show code and comment on any missing lake ids.
Create a plot of the number of observations per sentinel lake found in temp_data_with_sources.feather. I'd recommend a bar plot with the observation count on the y-axis and the lake name at the base of each bar (a single bar per lake), but something fancier to show the data volume per lake (or even through time) would be useful. Show code.
If we run into situations where some lakes are missing from mndow_nhdhr_xwalk.rds, we will want to figure out why. Likewise, if lakes are in mndow_nhdhr_xwalk.rds but not in nml_list.rds, we will want to discover which missing data are limiting inclusion (the individual data files that go into that file are here). Also, if some of these lakes only have limited temperature data, it may mean we are missing key data sources. Each of these additional questions will require more detail to turn into useful issues, but I'm providing this here for context.
Just so we have a log for the next time we bring someone in: 1) I gave @RAtshan access to the GD folder, 2) those two data download targets are in getters.yml, which is not linked to remake.yml, so we had to add that to the scmake call (I directly edited the jread comment above to reflect this change). All good now and she is off and running!
1) checking whether lakes are included in the initial project lake shapefile
All sentinel lakes are included in the mndow_nhdhr_xwalk.rds. Also, none of the lakes in the crosswalk data share the same site_id.
The R code:
### building the vector of sentinel lake DOW ids
sentinel_lake_ids <- c("mndow_06000200", "mndow_69025400",
"mndow_47004901", "mndow_21005700",
"mndow_34003200", "mndow_49014000",
"mndow_69061500", "mndow_69081000",
"mndow_15001000", "mndow_16007700",
"mndow_01014200", "mndow_07004400",
"mndow_73003700", "mndow_02000400",
"mndow_29025000", "mndow_18038600",
"mndow_41008900", "mndow_13002700",
"mndow_44001400", "mndow_83004300",
"mndow_81000300", "mndow_16038400",
"mndow_11041300", "mndow_16004900",
"mndow_69000400")
library(dplyr)

## reading the crosswalk data in as an R object.
mndow_xwalk_data <- readRDS("2_crosswalk_munge/out/mndow_nhdhr_xwalk.rds")
# using filter to check whether any of the sentinel lakes declared above are
# missing from mndow_xwalk_data.
xwalk_sentinel_lake_check <- mndow_xwalk_data %>%
filter(MNDOW_ID %in% sentinel_lake_ids)
# Check if all of the sentinel lakes are in the xwalk dataset
# by checking if number of lakes in mndow_xwalk_data data is the
# same as the number of lakes in sentinel_lake_ids
nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids)
# to return a summary of the lakes associated with each site_id
count_xwalk_lake_siteID <- xwalk_sentinel_lake_check %>%
group_by(site_id) %>%
summarize(unique_lakes_sitID = list(unique(MNDOW_ID)),
num_unique_lakes = length(unique(MNDOW_ID))) %>%
ungroup()
# to select any site_id associated with multiple lakes,
# filter for rows where num_unique_lakes has more than one element.
not_uniques_sentinel_lake_siteID <- count_xwalk_lake_siteID %>%
filter(num_unique_lakes > 1)
R output:
To verify that none of the sentinel lakes are missing from the xwalk data: check whether the number of sentinel lakes in the xwalk dataset is the same as the number of sentinel lakes in the DOW ids vector.
To verify whether any lakes share the same site_id in the xwalk data: filter for lists of unique MNDOW_ID with more than one element (DOW ids).
The results:
Since the number of sentinel lakes in the xwalk dataset is the same as the number of sentinel lakes in the DOW ids vector, we can conclude that all sentinel lakes are included in mndow_nhdhr_xwalk.rds.
Similarly, since not_uniques_sentinel_lake_siteID returned a dataframe with 0 rows (none of the unique MNDOW_ID lists have more than one element), we can conclude that none of the lakes in the crosswalk data share the same site_id.
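As an aside, the DOW id vector above could also be built from the sentinel lake names CSV used later in this thread (a sketch; it assumes sentinel_lake_names.csv has an `MN DOW` column and reads every column as character so the leading zeros survive):
library(readr)
library(dplyr)

# read every column as character so DOW ids keep their leading "0"
sentinel_lake_ids <- read_csv("sentinel_lake_names.csv",
                              col_types = cols(.default = "c")) %>%
  mutate(MNDOW_ID = paste0("mndow_", `MN DOW`)) %>%
  pull(MNDOW_ID)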
Thanks Rasha - I think it would be good to show how you verify (with code) that none of the sentinel lakes are missing. Do you assume that nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids)
means none are missing?
Likewise, I'm not seeing how the list()
of unique MNDOW_IDs is confirming none are repeated. Do you need to check the length of that list or perhaps use a different way of verifying there aren't repeats, such as
xwalk_sentinel_lake_check %>%
group_by(site_id) %>%
summarize(unique_site_id = length(unique(MNDOW_ID))) %>% filter(unique_site_id != 1)
Thanks Jordan for your feedback ~
Yes, I checked whether nrow(xwalk_sentinel_lake_check) == length(sentinel_lake_ids) to verify that none of the lakes are missing.
Similarly, I checked the length of the unique(MNDOW_ID) list to verify that none of the site_ids are repeated.
I updated my comment for part 1 to use R code and R output to verify and support the results.
Thanks. I assume you've moved on to number 2, which is good.
2) checking whether lakes have all modeling parameters
All sentinel lakes are included in nml_list.rds. Also, each sentinel lake's site_id is repeated twice in nml_list.rds.
The R code:
Note that I joined the sentinel lake DOW ids (MNDOW_ID) after filtering the nml_list site_ids.
## reading in the nml_list data in as an R object.
nml_list <- readRDS('7_config_merge/out/nml_list.rds')
# extracting the site_id using the names of nml_list.
# site ids are the same as the list element names.
nml_site_id <- tibble(site_id = names(nml_list))
## using filter to check if sentinel lakes are included in the nml_list.rds
## using the sentinel lakes site_id from xwalk_sentinel_lake_check data
# 1) extracting the site_id column from xwalk_sentinel_lake_check
xwalk_site_id<- xwalk_sentinel_lake_check$site_id
# 2) filtering the nml_site_id data site_id,
# check if sentinel lake site_id are in nml_site_id
# left joining the sentinel lakes ID
nml_list_sentinel_lake_check <- nml_site_id %>%
filter(site_id %in% xwalk_site_id) %>%
left_join(xwalk_sentinel_lake_check)
# checking whether the number of unique site_ids in nml_list_sentinel_lake_check is the
# same as the number of lakes in xwalk_sentinel_lake_check
nrow(xwalk_sentinel_lake_check) == length(unique(nml_list_sentinel_lake_check$site_id))
# to find the number of times site_id (sentinel lake's site_id) was repeated in the nml_list.
count_nml_lake_siteID <- nml_list_sentinel_lake_check %>%
group_by(MNDOW_ID) %>%
summarize(nml_list_sitID = list((site_id)),
num_repeated_sitID = length((site_id)))
# to check whether any of the sentinel lake site_ids were repeated in the nml_list data:
# after finding the length of the site_id vector extracted from nml_list,
# filter for the site_ids that occurred more than once.
not_uniques_nml_list_siteID <- count_nml_lake_siteID %>%
filter(num_repeated_sitID > 1)
R output:
Using the site_id from the xwalk data to filter nml_list, we used the number of rows (sentinel lakes) from the xwalk dataset to verify that none of the sentinel lakes are missing from the nml_list: checking whether the number of rows from the xwalk is the same as the number of unique site_ids extracted from nml_list.
To check for repeated site_ids in the nml_list data: summarizing the list and number of site_ids associated with each sentinel lake DOW.
The results:
Since the number of sentinel lake site_ids in the xwalk dataset is the same as the number of unique site_ids filtered from nml_list, we can conclude that all sentinel lakes are included in nml_list.rds.
In the meantime, from the not_uniques_nml_list_siteID results above we notice that each sentinel lake's site_id was extracted twice from the filtered nml_list data. Hence, we conclude that every lake's site_id is repeated in the nml_list.rds data.
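One quick way to look at that duplication directly would be to check for repeated names in nml_list itself (a sketch, assuming the nml_list object read above):
# site_ids that appear more than once among the nml_list element names
dup_site_ids <- names(nml_list)[duplicated(names(nml_list))]
# how many sentinel lake site_ids are among the duplicated names
sum(unique(xwalk_sentinel_lake_check$site_id) %in% dup_site_ids)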
The duplication of the site_ids looks like an issue we'll need to figure out elsewhere.
But I am wondering if you considered using the %in% operator in R. Like this:
all(xwalk_sentinel_lake_check$site_id %in% names(nml_list))
This verifies that all site_ids that are in the sentinel lake crosswalk are also in the nml_list.
Checking that the lengths of two vectors are equal is sometimes a brittle way to check that two things are the same. What if the two vectors are the same length but don't have matches (due to an error elsewhere)? I think familiarizing yourself with %in% in this case and future cases would be helpful.
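To illustrate the point (a toy example, not project data):
# two vectors of equal length with no shared elements
a <- c("nhdhr_1", "nhdhr_2")
b <- c("nhdhr_3", "nhdhr_4")

length(a) == length(b)  # TRUE, even though nothing matches
all(a %in% b)           # FALSE, which catches the mismatch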
Regardless, it is clear that these lakes are in the nml_list and the project, so looking forward to seeing what we have in stage 3.
3) how much temperature data do these lakes have? Part 3 (a): Bar plot of the number of observations per sentinel lake.
After fetching the temperature data (7b_temp_merge/out/temp_data_with_sources.feather), I checked whether the sentinel lake site_ids from mndow_nhdhr_xwalk.rds are in the temperature data. All sentinel lakes are included in the temperature data. Then, I created a bar plot of the number of observations (counts) for the sentinel lakes found in temp_data_with_sources.feather using the following steps:
I filtered the sentinel lake site_ids from the temperature data. Then, I found the number of observations associated with each filtered site_id. Next, I joined the sentinel lake names and DOW ids to the count data and sorted the data by the number of observations for each lake in increasing order.
The R code:
library(ggplot2)

## read the feather data in as an R object
temp_data_sentinel_lake <- feather::read_feather("7b_temp_merge/out/temp_data_with_sources.feather")
# first check if all xwalk data site_id are in the temp data.
all(xwalk_sentinel_lake_check$site_id %in% temp_data_sentinel_lake$site_id)
# reading the downloaded csv file that has MN Sentinel lake names and MNDOW_ID
# (note: `MN DOW` should be parsed as character so leading zeros are not dropped)
lake_names <- readr::read_csv("sentinel_lake_names.csv") %>%
mutate(MNDOW_ID = paste0("mndow_", `MN DOW`)) %>%
select(-c("MN DOW"))
### filtering the site ids in the temp data
## getting the overall number of observations for each lake
count_sentinel_lake_obs <- temp_data_sentinel_lake %>%
# filter the temp data to site_id in xwalk data
filter(site_id %in% xwalk_site_id) %>%
# group by site_id
group_by(site_id) %>%
# find the count/volume of temp observations for each site_id
summarize(n_per_lake = n()) %>%
# left join the sentinel lake DOW ids and names to the summarized data for tracking the lakes
left_join(xwalk_sentinel_lake_check) %>%
left_join(lake_names, by = "MNDOW_ID") %>%
arrange(n_per_lake)
p <- ggplot(count_sentinel_lake_obs,
aes(x = reorder(factor(Lake), -n_per_lake), y = n_per_lake))+
geom_bar(stat = 'identity', color="black", fill="grey90") +
theme_bw() +
cowplot::theme_cowplot() +
# to center the plot title
theme(plot.title = element_text(hjust = 0.5),
# to rotate and space the x-axis labels
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
#axis.text.x = element_blank()) + #legend.position = "none") +
labs(title = 'Temperature Data Volume for MN Sentinel Lakes',
x = 'MN Sentinel Lake',
y = 'Observation Volume') +
geom_text(aes(label = n_per_lake), vjust = 0)
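To render and save the bar plot object p, something like the following would work (a sketch; the output filename is hypothetical):
print(p)
ggsave("sentinel_lake_obs_counts.png", p, height = 5, width = 9, dpi = 250)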
R output:
Checking that all sentinel lake site_ids from mndow_nhdhr_xwalk.rds are in the temperature data:
The results:
All sentinel lake site_ids from the xwalk are included in the temperature data; see the R code and results above for verification. The next step is creating time series plots to show the volume of temperature data for each lake separately. @lindsayplatt and I discussed it and I prepared/processed the data for plotting.
Thanks, good to hear. It would also be helpful to see where the data are coming from for these lakes. That would be found in the source column in the temperature data file. Please plan to include that in the plot.
3) how much temperature data do these lakes have? Part 3 (b): Creating time series plots of the number of observations for each sentinel lake.
To create a time series for each sentinel lake that includes the source of the measured temperatures, I first obtained descriptive statistics of the temperature data sources. The descriptive statistics describe the number of sources, and the list of sources, associated with each sentinel lake.
I then filtered the temperature data to the sentinel lake site_ids and joined the sentinel lake names and DOW ids. In the meantime, I found the number of observations (count) associated with each filtered site_id during a given year. Then, I found the number of observations during the same years while considering the source of the measured temperature. The R code:
### filtering site ids from the temp data
library(ggforce)  # provides facet_wrap_paginate(), used in the plotting loop below
temp_data <- temp_data_sentinel_lake  # the temperature data read in part 3 (a)

# 1) looking at the number and list of sources for the recorded temps for each sentinel lake;
# it is clear that each sentinel lake has more than one source.
source_count_per_lake <- temp_data %>%
filter(site_id %in% xwalk_site_id) %>%
group_by(site_id) %>%
summarize(unique_lake_source = list(unique(source_id)),
n_unique_sources = length(unique(source_id)))
## filtering the temp data to the data associated with sentinel lakes.
# getting the number of recorded observations per year, without the source.
annual_count_sentinel_lake_obs <- temp_data %>%
mutate(year = lubridate::year(date)) %>%
filter(site_id %in% xwalk_site_id) %>%
group_by(site_id, year) %>%
summarize(n_per_year = n()) %>%
# adding total_per_year that summarize the number of observations for each
# site_id (lake) per year.
mutate(source_id = "total_per_year")
## filtering the temp data to the data associated with sentinel lakes.
# getting the number of recorded observations per year per source.
count_sentinel_lake_obs_year <- temp_data %>%
mutate(year = lubridate::year(date)) %>%
filter(site_id %in% xwalk_site_id) %>%
group_by(site_id, year, source_id) %>%
summarize(n_per_year = n()) %>%
rbind(annual_count_sentinel_lake_obs)%>%
left_join(xwalk_sentinel_lake_check) %>%
left_join(lake_names, by = "MNDOW_ID") %>%
# to remove NAs from source_id, replace the column with one of the same
# name where missing sources are labeled explicitly.
mutate(source_id = ifelse(is.na(source_id),
"No_source_provided", source_id))
summary(count_sentinel_lake_obs_year)
#### plotting time series of the number of observations per sentinel lake,
# with the number of observations on the y-axis
# and the source of the measured temperature as the color of the plotted points.
# declaring source_id as a factor
count_sentinel_lake_obs_year$source_id <- factor(count_sentinel_lake_obs_year$source_id)
# extracting the unique source_id values from the count data.
plot_source <- unique(count_sentinel_lake_obs_year$source_id)
# assigning a color to each source_id.
plot_color <- setNames(c("#0087ff", "#00005f", "#008700", "#DDAA33", "#00ff5f", "#ff00ff", "#d70000","#BBBBBB"), plot_source)
###
# number of figures per plot/page
plots_per_fig <- 4
# list of the unique lakes extracted from the data set
sentinel_lakes <- unique(count_sentinel_lake_obs_year$Lake)
# number of plots/pages we need to print:
# the number of unique lakes divided by the number of figures on each page;
# ceiling rounds up to the next integer.
n_plots <- ceiling(length(sentinel_lakes) / plots_per_fig)
for (i in 1:n_plots) {
p_each_lake <- ggplot(count_sentinel_lake_obs_year, aes(x = year, y = n_per_year)) +
geom_line(aes(colour= source_id), alpha = 0.45) +
geom_point(aes(colour= source_id), size = 3, alpha = 0.55) +
#breaks matches the sources from the data to their corresponding colors
scale_color_manual(breaks = names(plot_color), values = plot_color) +
# using facet_wrap_paginate to plot the 25 sentinel lakes on one plot:
# setting 7 plots: each plot has 4 figures but the last plot has 1 figure
facet_wrap_paginate(~ Lake, strip.position = "right",
ncol = ifelse(i == 7, 1, 2) ,
nrow = ifelse(i == 7, 1, 2)
, scales='free', page = i) +
theme_bw() +
cowplot::theme_cowplot() +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "bottom") +
labs(title = 'Sentinel Lakes Temperature Data Volume',
x = 'Date: years',
y = 'Number of observations')
plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
i)
ggsave(plot_out, p_each_lake, height = 5, width = 9, dpi = 250)
print(p_each_lake)
}
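As a side note, ggforce also provides n_pages(), which reports the total number of pages a paginated facet plot produces and can serve as a cross-check on n_plots (a sketch, using the last plot object from the loop above):
# total number of facet pages implied by the facet_wrap_paginate() settings
n_pages(p_each_lake)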
R output:
Time series plots of the number of observations per year (colored by source) for each sentinel lake site_id:
The results:
Looking good! Nice to learn about facet_wrap_paginate
, which I'd never seen before.
There are benefits to the line plots you've selected, but another way to visualize these annual observation counts by source would be a stacked bar plot or area plot, for which the top of the bars/areas would indicate the total number of observations across all sources (so you wouldn't need a separate total_per_year
), and the vertical thickness of each color bar/area would show the contribution from each agency. Just food for thought and maybe a future data visualization.
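For reference, a minimal sketch of that stacked-area alternative (assuming the count_sentinel_lake_obs_year data frame built above, with the total_per_year rows dropped):
p_area <- count_sentinel_lake_obs_year %>%
  filter(source_id != "total_per_year") %>%
  ggplot(aes(x = year, y = n_per_year, fill = source_id)) +
  # areas stack by default, so the top of each stack is the total across all sources
  geom_area() +
  facet_wrap(~ Lake, scales = "free_y") +
  labs(x = "Year", y = "Number of observations", fill = "Source")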
Nice work @RAtshan. I would like to share this with the MN DNR group so they can provide additional data if they have it beyond these sources. But I think we need two updates to do that:
The NA source ids are all water quality portal data, so you can replace that text w/ "water quality portal". I am verifying this is the case.
3) how much temperature data do these lakes have? Part 3 (c): Creating a stacked bar plot of the number of observations for each sentinel lake.
Data processing:
The data processing was the same as in part (b), except that the NA sources were assigned to "water_quality_portal".
Data visualization:
I created stacked bar plots with year on the x-axis and the number of observations on the y-axis; the stacked bars represent the number of observations from each source.
The R code:
## filtering the temp data to the data associated with the sentinel lakes.
# getting the number of recorded observations per year per source.
count_sentinel_lake_obs_year <- temp_data %>%
mutate(year = lubridate::year(date)) %>%
filter(site_id %in% xwalk_site_id) %>%
group_by(site_id, year, source_id) %>%
summarize(n_per_year = n()) %>%
left_join(xwalk_sentinel_lake_check) %>%
left_join(lake_names, by = "MNDOW_ID") %>%
# to remove NAs from source_id, replace the column with one of the same
# name and assign the NA sources to water_quality_portal.
mutate(source_id = ifelse(is.na(source_id),
"water_quality_portal", source_id))
# creating stacked barplots for each lake. each year is a bar.
# x-axis is the year, y-axis is the number of observations.
# each bar is a stack of the number of observations from the different sources;
# the top of the bars shows the total number of observations across all sources,
# and the vertical thickness of each color segment shows the number of observations from each source.
for (i in 1:n_plots) {
# creates a page of plots, one panel per lake; each color in the stacked bars
# represents the source of the data.
p_each_lake <- ggplot(count_sentinel_lake_obs_year,
aes(x = year, y = n_per_year, fill = source_id)) +
geom_bar(stat = "identity") +
#breaks matches the sources from the data to their corresponding colors
scale_fill_manual(breaks = names(plot_color), values = plot_color) +
# using facet_wrap_paginate to plot the 25 sentinel lakes on one plot:
# setting 7 plots: each plot has 4 figures but the last plot has 1 figure
facet_wrap_paginate(~ Lake, strip.position = "right",
ncol = ifelse(i == 7, 1, 2) ,
nrow = ifelse(i == 7, 1, 2)
, scales='free', page = i) +
theme_bw() +
cowplot::theme_cowplot() +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "bottom") +
labs(title = 'Sentinel Lakes Temperature Data Volume',
x = 'Date: years',
y = 'Number of observations')
plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
i)
ggsave(plot_out, p_each_lake, height = 5, width = 10, dpi = 250)
print(p_each_lake)
}
R output:
The results:
This is great. Last change I'd request is locking all of the x-axis ranges to be the same. 1980 to 2020 would be best. Otherwise, all set.
Part 3 (c), updated: Creating stacked bar plots of the number of observations for each sentinel lake.
Data processing: The data processing was the same as in part (c) above, except that I filtered the years to 1980 through 2020 (year >= 1980 & year <= 2020).
R code:
## filtering the temp data to the data associated with the sentinel lakes.
# getting the number of recorded observations per year per source.
count_sentinel_lake_obs_year <- temp_data %>%
mutate(year = lubridate::year(date)) %>%
filter(year >= 1980 & year <= 2020) %>%
filter(site_id %in% xwalk_site_id) %>%
group_by(site_id, year, source_id) %>%
summarize(n_per_year = n()) %>%
#rbind(annual_count_sentinel_lake_obs)%>%
left_join(xwalk_sentinel_lake_check) %>%
left_join(lake_names, by = "MNDOW_ID") %>%
# to remove NAs from source_id, replace the column with one of the same
# name and assign the NA sources to water_quality_portal.
mutate(source_id = ifelse(is.na(source_id),
"water_quality_portal", source_id))
summary(count_sentinel_lake_obs_year)
count_sentinel_lake_obs_year$source_id <- factor(count_sentinel_lake_obs_year$source_id)
# extracting the unique source_id values from the count data.
plot_source <- unique(count_sentinel_lake_obs_year$source_id)
# assigning a color to each source_id.
plot_color <- setNames(c("#af87ff", "#00005f", "#0087ff", "#00d7ff", "#00ff5f", "#0000af", "#5f00ff"), plot_source)
###
# number of figures per plot/page
plots_per_fig <- 4
# list of the unique lakes extracted from the data set
sentinel_lakes <- unique(count_sentinel_lake_obs_year$Lake)
n_plots <- ceiling(length(sentinel_lakes) / plots_per_fig)
## extracting the min and max of the year column for setting the x-axis limits.
year_range <- range(count_sentinel_lake_obs_year$year)
# creating stacked barplots for each lake. each year is a bar.
# x-axis is the year, y-axis is the number of observations.
# each bar is a stack of the number of observations from the different sources;
# the top of the bars shows the total number of observations across all sources,
# and the vertical thickness of each color segment shows the number of observations from each source.
for (i in 1:n_plots) {
# creates a page of plots, one panel per lake; each color in the stacked bars
# represents the source of the data.
p_each_lake <- ggplot(count_sentinel_lake_obs_year,
aes(x = year, y = n_per_year, fill = source_id)) +
geom_bar(stat = "identity") +
#breaks matches the sources from the data to their corresponding colors
scale_fill_manual(breaks = names(plot_color), values = plot_color) +
scale_x_continuous(limits = year_range) +
# using facet_wrap_paginate to plot the 25 sentinel lakes on one plot:
# setting 7 plots: each plot has 4 figures but the last plot has 1 figure
facet_wrap_paginate(~ Lake, strip.position = "right",
ncol = ifelse(i == 7, 1, 2) ,
nrow = ifelse(i == 7, 1, 2),
scales='free', page = i) +
theme_bw() +
cowplot::theme_cowplot() +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_blank(),
legend.position = "bottom") +
labs(title = 'Sentinel Lakes Temperature Data Volume',
x = 'Date: years',
y = 'Number of observations')
plot_out <- sprintf("reservoir_fetch/out/Sentinal_lakes_timeseries_set_%s.png",
i)
ggsave(plot_out, p_each_lake, height = 5, width = 10, dpi = 250)
print(p_each_lake)
}
R output:
Looks good, I am satisfied with this task and we can call it done. Thanks!
See the list of DOWs, which can be checked w/ the crosswalk: https://www.dnr.state.mn.us/fisheries/slice/sentinel-lakes.html