`clean_wqdata` throws away parameters not in `wqbc::codes`

bcgov / wqbc

An R package for water quality thresholds and index calculation for British Columbia

http://bcgov.github.io/wqbc/

Apache License 2.0

20 stars 9 forks source link

`clean_wqdata` throws away parameters not in `wqbc::codes` #76

Closed HeatherGranger closed 4 years ago

HeatherGranger commented 7 years ago

'Alkalinity Total' and 'Alkalinity pH 4.5/4.2' (D102) are the only parameters kept during `clean_wqdata'. Thrown away are:

0101: Alkalinity Phen. 8.3 0102: Alkalinity Total 4.5

AK-T: Alkalinity:Total (I'm pretty sure this parameter is what is renamed to "Alkalinity Total" and retained during cleaning)

ateucher commented 7 years ago

As this is the same issue as #75, I'm going to close that one and rename this one to reflect the more general issue.

The issue is that there is a predefined list (wqbc::codes) that is much smaller than what is in ems, and anything that is not in that list is thrown away. The tricky part is that there only limits for a small subset (wqbc::limits), and the units vary depending on the analytical method. Started work in the fix-drop-variables branch.

ateucher commented 7 years ago

Brain dump as I won't be able to get back to this until later next week...

Two things need to happen for valid comparison against limits/guidelines:

We need to know what variables we are talking about (e.g., Nitrogen dissolved, Nitrogen total, etc compared to the "Nitrogen" guideline) - this is done via substitute_variables inside standardize_wqdata, which is called by clean_wqdata
Units in the data need to be comparable against the units that the limits are specified in. This is done via substitute_units inside standardize_wqdata, which is called by clean_wqdata

Perhaps this functionality could be moved to calc_limits and leave clean_wqdata to deal with the replicates and outliers and so not remove variables that can't be matched to limit tables.

The substitute_* functions compare against the codes table, which is hand-built with variable names and units - manually updating that with all possible parameters from EMS is not feasible.

We could also possibly get substitute_* functions (inside calc_limits) to act against the limits table rather than the codes table, and make the codes table include all of the units and variables from the rems::ems_parameters.

HeatherGranger commented 7 years ago

I think it would be ideal to move the substitude_variables and substitute_units to calc_limits. There is not a step between tidy_ems_data (includes all 275 air and water parameters) and clean_wqdata to retain those parameters that don't have limits. In my clean data, I have 71 parameters out of the 81 parameters associated with limits.

This has been my wqbc water analysis workflow:

Using clean_wqdata (after getting the data, and tidying it) output to view the data, create summary tables and write my general water quality characterization.
Using calc_limits output to create plots with the guideline and then report on any exceedances.

My proposal:

Move the substitude_variables and substitute_units to calc_limits. I'm at a standstill in my analysis until the conductivity, carbon etc. parameters are retained in clean_wqdata.
I think I could finish my analysis in a few weeks, and while it's being reviewed, I can figure out what EMS parameter names correspond to what guidelines/limits (hopefully get a start before our work planning session!).
For this I will need the EMS Parameter dictionary, limits.csv and codes.csv. Pretty sure you've told me this before, but can I get all these in an actual csv file? I know I can generate them in R and see them on github...

ateucher commented 7 years ago

Move the substitude_variables and substitute_units to calc_limits. I'm at a standstill in my analysis until the conductivity, carbon etc. parameters are retained in clean_wqdata.

I think this is right approach too. I'll see if I can give you a quick fix and then work on it more fully a bit later.

I think I could finish my analysis in a few weeks, and while it's being reviewed, I can figure out what EMS parameter names correspond to what guidelines/limits (hopefully get a start before our work planning session!).

For this I will need the EMS Parameter dictionary, limits.csv and codes.csv. Pretty sure you've told me this before, but can I get all these in an actual csv file? I know I can generate them in R and see them on github...

limits.csv: https://github.com/bcgov/wqbc/raw/master/data-raw/limits.csv codes.csv: https://github.com/bcgov/wqbc/raw/master/data-raw/codes.csv

For the ems paramater dictionary (or any of the other ems data dictionaries), probably the easiest is to export them from rems:

library(rems)
write.csv(ems_parameters, "parameters.csv", na = "")

ateucher commented 7 years ago

Ok, a testable prototype in the clean-nodrop branch.

I've moved standardize_wqdata, which calls both substitute_variables and substitute_units, into calc_limits, so now clean_wqdata doesn't drop rows when it can't match a parameter to the codes or limits tables.

I've also made clean_wqdata drop rows where Value == NA as we discussed @HeatherGranger, as well as added a clean argument to calc_limits (default TRUE) which you can set to FALSE if you've already run clean_wqdata so that calc_limits doesn't do it again.

You can install with:

devtools::install_github("bcgov/wqbc", ref = "clean-nodrop")

Then this works for me:

library(wqbc)
library(rems)

foo <- get_ems_data()
filtered_data <- filter_ems_data(foo, emsid = "E307225",
from_date = "2017-01-01",
to_date = "2017-04-30")

## create some NA values to test that rows with NA Values are dropped
set.seed(42)
filtered_data$RESULT[sample(nrow(filtered_data), 50)] <- NA

tidy_filtered_data <- tidy_ems_data(filtered_data)

clean_filtered_data <- clean_wqdata(tidy_filtered_data, by = c("EMS_ID", "Station"))

calc_limits(clean_filtered_data, term = "long-daily", clean = FALSE)

I did this pretty quickly and haven't written any tests yet to make sure it's doing what we hope, so I'd love to hear how it works for you, and please don't put too much faith in the results yet! 😉

HeatherGranger commented 7 years ago

In using the clean-nodrop branch so far, it appears to be doing what we want. I think next steps are building back up clean_wqdata now that we've taken out the link to limits.csv.

My suggested edits are:

Retain SAMPLE_STATE and SAMPLE_DESCRIPTOR columns during tidy_ems_data.
If possible, create argument in clean_wqdata (SAMPLE_STATE == Fresh Water, Wastewater, Ground Water or Marine Water). I don't think SAMPLE_DESCRIPTOR needs an argument, but is handy for information on the data frame.
Delete the parameters below all_data as part of clean_wqdata: twoyear <- get_ems_data(which = "2yr", ask = FALSE) filtered_twoyear <- filter_ems_data(twoyear, emsid = c("410060","410094"), to_date = "2017/05/24") filtered_historic <- read_historic_data(emsid = c("410060","410094"), to_date = "2017/05/24", cols = "wq") bind_data <- bind_ems_data(filtered_twoyear, filtered_historic) tidy_data <- tidy_ems_data(bind_data, mdl_action = "mdl") all_data <- filter(all_data,!grepl('Barometric|Biomass|Chlorophyll|Flow|Silica|Air|Streptococcus|Salinity|Tannin|Surfactant|Moisture|Phaeophytin|Extractable|Extrac.|Extractble|Extractbl', Variable))
Change units for 30 parameters to ug/L and 3 to ng/L. I think there's 3 ways of doing this.

a) Referencing the limits table to match units. Of the parameters to change to ug/L, all the parameter names are the same in EMS as in the limits table except for Cyanide WAD, Chlorine Res:Total, Naphthalene C10H8 and Naphthalene d8 (this is as they appear from EMS).

b) Filter out these parameters, do the conversions, then bind them back to the data frame.

c) Use ifelse or convert_values ?

I didn't get to writing any code for this as I know you've done some work already with units.

Let me know if you need anything @ateucher

HeatherGranger commented 4 years ago

addressing this in issue #143