Closed HeatherGranger closed 4 years ago
As this is the same issue as #75, I'm going to close that one and rename this one to reflect the more general issue.
The issue is that there is a predefined list (wqbc::codes
) that is much smaller than what is in ems, and anything that is not in that list is thrown away. The tricky part is that there only limits for a small subset (wqbc::limits
), and the units vary depending on the analytical method. Started work in the fix-drop-variables
branch.
Brain dump as I won't be able to get back to this until later next week...
Two things need to happen for valid comparison against limits/guidelines:
We need to know what variables we are talking about (e.g., Nitrogen dissolved, Nitrogen total, etc compared to the "Nitrogen" guideline) - this is done via substitute_variables
inside standardize_wqdata
, which is called by clean_wqdata
Units in the data need to be comparable against the units that the limits are specified in. This is done via substitute_units
inside standardize_wqdata
, which is called by clean_wqdata
Perhaps this functionality could be moved to calc_limits
and leave clean_wqdata
to deal with the replicates and outliers and so not remove variables that can't be matched to limit tables.
The substitute_*
functions compare against the codes
table, which is hand-built with variable names and units - manually updating that with all possible parameters from EMS is not feasible.
We could also possibly get substitute_*
functions (inside calc_limits
) to act against the limits
table rather than the codes
table, and make the codes
table include all of the units and variables from the rems::ems_parameters
.
I think it would be ideal to move the substitude_variables
and substitute_units
to calc_limits
. There is not a step between tidy_ems_data
(includes all 275 air and water parameters) and clean_wqdata
to retain those parameters that don't have limits. In my clean data, I have 71 parameters out of the 81 parameters associated with limits.
This has been my wqbc
water analysis workflow:
clean_wqdata
(after getting the data, and tidying it) output to view the data, create summary tables and write my general water quality characterization. calc_limits
output to create plots with the guideline and then report on any exceedances. My proposal:
substitude_variables
and substitute_units
to calc_limits
. I'm at a standstill in my analysis until the conductivity, carbon etc. parameters are retained in clean_wqdata
.
- Move the substitude_variables and substitute_units to calc_limits. I'm at a standstill in my analysis until the conductivity, carbon etc. parameters are retained in clean_wqdata.
I think this is right approach too. I'll see if I can give you a quick fix and then work on it more fully a bit later.
- I think I could finish my analysis in a few weeks, and while it's being reviewed, I can figure out what EMS parameter names correspond to what guidelines/limits (hopefully get a start before our work planning session!).
- For this I will need the EMS Parameter dictionary, limits.csv and codes.csv. Pretty sure you've told me this before, but can I get all these in an actual csv file? I know I can generate them in R and see them on github...
limits.csv: https://github.com/bcgov/wqbc/raw/master/data-raw/limits.csv codes.csv: https://github.com/bcgov/wqbc/raw/master/data-raw/codes.csv
For the ems paramater dictionary (or any of the other ems data dictionaries), probably the easiest is to export them from rems:
library(rems)
write.csv(ems_parameters, "parameters.csv", na = "")
Ok, a testable prototype in the clean-nodrop
branch.
I've moved standardize_wqdata
, which calls both substitute_variables
and substitute_units
, into calc_limits
, so now clean_wqdata
doesn't drop rows when it can't match a parameter to the codes
or limits
tables.
I've also made clean_wqdata
drop rows where Value == NA
as we discussed @HeatherGranger, as well as added a clean
argument to calc_limits
(default TRUE
) which you can set to FALSE
if you've already run clean_wqdata
so that calc_limits
doesn't do it again.
You can install with:
devtools::install_github("bcgov/wqbc", ref = "clean-nodrop")
Then this works for me:
library(wqbc)
library(rems)
foo <- get_ems_data()
filtered_data <- filter_ems_data(foo, emsid = "E307225",
from_date = "2017-01-01",
to_date = "2017-04-30")
## create some NA values to test that rows with NA Values are dropped
set.seed(42)
filtered_data$RESULT[sample(nrow(filtered_data), 50)] <- NA
tidy_filtered_data <- tidy_ems_data(filtered_data)
clean_filtered_data <- clean_wqdata(tidy_filtered_data, by = c("EMS_ID", "Station"))
calc_limits(clean_filtered_data, term = "long-daily", clean = FALSE)
I did this pretty quickly and haven't written any tests yet to make sure it's doing what we hope, so I'd love to hear how it works for you, and please don't put too much faith in the results yet! 😉
In using the clean-nodrop
branch so far, it appears to be doing what we want. I think next steps are building back up clean_wqdata
now that we've taken out the link to limits.csv
.
My suggested edits are:
Retain SAMPLE_STATE and SAMPLE_DESCRIPTOR columns during tidy_ems_data
.
If possible, create argument in clean_wqdata
(SAMPLE_STATE == Fresh Water, Wastewater, Ground Water or Marine Water). I don't think SAMPLE_DESCRIPTOR needs an argument, but is handy for information on the data frame.
Delete the parameters below all_data
as part of clean_wqdata
:
twoyear <- get_ems_data(which = "2yr", ask = FALSE) filtered_twoyear <- filter_ems_data(twoyear, emsid = c("410060","410094"), to_date = "2017/05/24") filtered_historic <- read_historic_data(emsid = c("410060","410094"), to_date = "2017/05/24", cols = "wq") bind_data <- bind_ems_data(filtered_twoyear, filtered_historic) tidy_data <- tidy_ems_data(bind_data, mdl_action = "mdl") all_data <- filter(all_data,!grepl('Barometric|Biomass|Chlorophyll|Flow|Silica|Air|Streptococcus|Salinity|Tannin|Surfactant|Moisture|Phaeophytin|Extractable|Extrac.|Extractble|Extractbl', Variable))
Change units for 30 parameters to ug/L and 3 to ng/L. I think there's 3 ways of doing this.
a) Referencing the limits table to match units. Of the parameters to change to ug/L, all the parameter names are the same in EMS as in the limits table except for Cyanide WAD, Chlorine Res:Total, Naphthalene C10H8 and Naphthalene d8 (this is as they appear from EMS).
b) Filter out these parameters, do the conversions, then bind them back to the data frame.
c) Use ifelse
or convert_values
?
I didn't get to writing any code for this as I know you've done some work already with units.
Let me know if you need anything @ateucher
addressing this in issue #143
'Alkalinity Total' and 'Alkalinity pH 4.5/4.2' (D102) are the only parameters kept during `clean_wqdata'. Thrown away are:
0101: Alkalinity Phen. 8.3 0102: Alkalinity Total 4.5
AK-T: Alkalinity:Total (I'm pretty sure this parameter is what is renamed to "Alkalinity Total" and retained during cleaning)