Closed: jakegreif closed this issue 1 year ago
Applying the logic for this function is more complicated than expected. Currently, the function runs, but without the complete functionality we need (clean = FALSE has not been coded; it just returns the input data frame unchanged). This is enough to test and move forward with Shiny development of the Result Flags page.
Takeaways from first attempt:
We'll ask Kevin for the list of existing values/characters in the ResultMeasureValue field and use that to inform conditional steps for each scenario to change the class of the column.
From Jake: "All should be working, but these functions bring up a critical problem we need to address: how do we deal with the class of ResultMeasureValue (and all other fields that we need to do calculations on, like depth data)? It's read in as character, and when we change it to numeric, some values are coerced to NA. Is this OK? Let's address this sooner rather than later."
Cristina: Hopefully we can address this issue with this function. This is probably where we first want to retain the original values in a separate column. That will retain the non-numeric results without converting them, and flag those rows as "non-numeric results". Over time we can try to move the non-numeric information for each special character type to a separate column so the result values can be used (i.e., move the <, >, ~, *, etc.).
This function could happen on retrieval and/or as part of the other relevant functions that impact these result fields:
- ResultMeasureValue
- DetectionQuantitationLimitMeasure/MeasureValue
- ActivityBottomDepthHeightMeasure/MeasureValue
- ActivityDepthHeightMeasure/MeasureValue
- ActivityTopDepthHeightMeasure/MeasureValue
- ResultDepthHeightMeasure/MeasureValue
From Kevin: It is difficult to provide a list of ALL special characters (not numeric):
- “>” present above quantification limit
- “<” present below quantification limit
- Not-Detected
- ND
- Etc.: any acronym or text
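One way around not having a complete list is to enumerate the non-numeric values empirically, per dataset. A minimal sketch (not TADA code; assumes a data frame named `df` with a character ResultMeasureValue column):

```r
# Values that survive as.numeric() are numeric; everything else is a
# "special" value worth inspecting (e.g. "<0.25", "ND", "Not Detected").
nonnumeric <- unique(
  df$ResultMeasureValue[
    !is.na(df$ResultMeasureValue) &
      is.na(suppressWarnings(as.numeric(df$ResultMeasureValue)))
  ]
)
nonnumeric
```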
Potentially move this function to the utilities.R file and use it within other functions.
Decision to focus on these two for now:
- ResultMeasureValue
- DetectionQuantitationLimitMeasure.MeasureValue
Create copies:
Create flag & edit fields as needed to remove/edit non-numeric values:
Add later: "~" = "approximate"; "ND" or "Non Detect" = "ND"
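A minimal sketch of these copy/flag/edit steps (the column and flag names are illustrative, not the TADA API):

```r
library(dplyr)

df <- df %>%
  mutate(
    # 1. Copy: retain the original character values in a separate column
    ResultMeasureValue.Original = ResultMeasureValue,
    # 2. Flag: record why a value is non-numeric (hypothetical flag column)
    TADA.MeasureValueFlag = case_when(
      grepl("<", ResultMeasureValue) ~ "less than",
      grepl(">", ResultMeasureValue) ~ "greater than",
      grepl("~", ResultMeasureValue) ~ "approximate",
      TRUE ~ NA_character_
    ),
    # 3. Edit: strip the characters we can interpret, then convert;
    #    anything still non-numeric is coerced to NA
    ResultMeasureValue = suppressWarnings(
      as.numeric(gsub("[<>~,]", "", ResultMeasureValue))
    )
  )
```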
WQX does have the following guidance for data submissions regarding the < or > issue:
- Convert Result Values that start with "<" or ">" into an appropriate Detection Condition and Detection Limit Value.
  - For example: a Result Value of "<0.25" would be converted into a Detection Condition of "Present Below Quantification Limit", a Detection Limit Value of "0.25", and a Detection Limit Type of "Lower Quantitation Limit".
We can implement the "<" = "less than" and ">" = "greater than" logic for this function and then fix it fully when we do the nondetect substitutions, using the flag field ("less than" or "greater than").
I agree the ones we can reasonably interpret are <, >, ~, and commas. We plan to implement the logic above for the < and > symbols, and to simply remove the ~ and commas where possible. If we learn of others that are important, we can always make the function smarter in the future. For now, values we do not know of or have a solution for will be coerced to NA (this includes all text values).
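A hedged sketch of that WQX-style conversion for "<" values, using the retained original column from the sketch above (the detection-field names follow the WQX guidance wording and may not match the exact schema):

```r
library(dplyr)

df <- df %>%
  mutate(
    censored = grepl("^<", ResultMeasureValue.Original),
    # "<0.25" -> condition "Present Below Quantification Limit",
    #            limit value 0.25, limit type "Lower Quantitation Limit"
    DetectionCondition = if_else(censored,
      "Present Below Quantification Limit", NA_character_),
    DetectionLimitValue = if_else(censored,
      suppressWarnings(as.numeric(sub("^<", "", ResultMeasureValue.Original))),
      NA_real_),
    DetectionLimitType = if_else(censored,
      "Lower Quantitation Limit", NA_character_)
  )
```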
Edits/updates from the last comment have been made, in addition to including a flag argument that only generates a flag column when flag = TRUE. The flag columns require cumbersome code that slows down the function, so cutting them out by default (flag = FALSE) improves the function's efficiency.
Some things to consider adding down the line:
@jakegreif What happens when units are meant to be blank, unitless, or "none" (e.g., pH)?
Scientific notation and issues with certain decimal numbers:
I don't have time to try this, but consider looking into how to check for the letter 'e' after a numeric character and flag as 'Scientific Notation'.
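A possible check, as a sketch (note that R's as.numeric() already parses values like "1e-5", so the flag would be informational rather than preventing coercion to NA; TADA.MeasureValueFlag is the hypothetical flag column from the sketch above):

```r
# Digits, then "e" or "E", then an optional sign and more digits
sci_pattern <- "^[0-9.]+[eE][-+]?[0-9]+$"
df$TADA.MeasureValueFlag[grepl(sci_pattern, df$ResultMeasureValue.Original)] <-
  "Scientific Notation"
```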
```r
library(dplyr)

nitrogen <- nitrogen %>%
  mutate(ResultComment = case_when(
    grepl("<", Result) ~ "<",
    grepl(">", Result) ~ ">",
    grepl("BDL", Result) ~ "<",
    grepl("bdl", Result) ~ "<",
    grepl("below detection limits", Result) ~ "<",
    grepl("Not Detected", ResultDetectionConditionText) ~ "<",
    grepl("Present Below Quantification Limit", ResultDetectionConditionText) ~ "<"
  ))
```
Add the following for both the result and depth fields:
```r
mutate(across(where(is.character), ~ na_if(., ""))) %>%
  mutate(across(where(is.character), ~ na_if(., " "))) %>%
  mutate(across(where(is.character), ~ na_if(., "<Blank>")))
```
@ehinman I wanted to flag this issue as well since it is about the MeasureValueSpecialCharacters function, which will impact the new censored data functions too
Do you want to consider percentages for conversion to numeric? e.g. 50% becomes 50 and the flag is "Percentage"? I noticed in this query, nearly all of the result measures are from a "choice list" that is clearly percentages:
```r
tada3 <- TADABigdataRetrieval(
  characteristicName = "Algae, substrate rock/bank cover (choice list)",
  siteType = "Stream"
)
tada3 <- ConvertSpecialChars(tada3, "ResultMeasureValue")
```
I'm not sure how prevalent percentages are in WQ assessment data, but if the aim is to squeeze as much as we can out of each record, this might be another way to make more data usable quantitatively.
What is the unit supplied in this case? Can we simply remove the % from the result value and add that as a unit?
When I've seen it, I think the unit is "choice list" or similar. I suppose we could convert the % to a unit. For now, the function I am working on notes that it is a percentage in the flag column that comes out with the resultant df.
That sounds like a good approach for now. In a future update, if we know for sure it is a percentage, I would feel comfortable changing the TADA unit field to %. That way, the Shiny app and other functions down the line will have % there instead of "choice list", which is ambiguous/not very useful in this case.
@cristinamullin Sounds good to me! That would be easy to implement.
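A sketch of the percentage handling discussed above (TADA.MeasureValueFlag and ResultMeasureValue.Original are the hypothetical columns from earlier sketches; ResultMeasure.MeasureUnitCode is assumed to be the unit column):

```r
library(dplyr)

df <- df %>%
  mutate(
    is_pct = grepl("^[0-9.]+%$", ResultMeasureValue.Original),
    TADA.MeasureValueFlag = if_else(is_pct, "Percentage", TADA.MeasureValueFlag),
    # Possible future step: overwrite the ambiguous "choice list" unit with %
    ResultMeasure.MeasureUnitCode = if_else(is_pct, "%",
      ResultMeasure.MeasureUnitCode),
    # "50%" -> 50
    ResultMeasureValue = if_else(is_pct,
      suppressWarnings(as.numeric(sub("%$", "", ResultMeasureValue.Original))),
      ResultMeasureValue)
  ) %>%
  select(-is_pct)
```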
@cristinamullin Another sticky example entry to discuss: "# - #". I'm noticing that in the depth height measure columns, people sometimes put in "0-2" as a depth measurement. Currently, this is "Coerced to NA". I'd be interested to hear (a) how people use these ranges in their analyses and (b) how they would suggest handling them in TADA. My first instinct is to calculate an average so these values are usable, e.g. "0-2" becomes 1. But I may be misinterpreting the entry. For example, maybe the result value is representative of all depths between 0 and 2, as opposed to representative of a depth somewhere between 0 and 2.
Great catch!
Yes and Yes: it is safe to assume in this case the value is BOTH "representative of all depths between 0 and 2" and "representative of a depth somewhere between 0 and 2".
The upper 2m is typically referred to as the "surface" layer, and I think it is safe to assume measure values with 0-2m here are representative of the "surface". However, if we want to plot this on a depth profile, we could use the avg like you suggest, or assume the value is the same from 0 to 2m down (either option works depending on whether we are drawing a line or a scatter plot).
However for making this field numeric, let's go with the avg (1m in this example) as the NUMERIC value for the depth field. There is also a metadata element where users can specify that this is a "surface" result, and we could/should probably make sure that is filled out here too in this step.
Let me know if you find others (not just 0-2), as we likely need slightly different logic for each scenario
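A sketch of the midpoint approach for "# - #" entries (range_to_midpoint is a hypothetical helper, and only simple non-negative "a-b" ranges are matched):

```r
# "0-2" -> 1 (mean of the two endpoints)
range_to_midpoint <- function(x) {
  vapply(strsplit(x, "-", fixed = TRUE),
         function(p) mean(as.numeric(p)), numeric(1))
}

is_range <- grepl("^[0-9.]+-[0-9.]+$", df$ActivityDepthHeightMeasure.MeasureValue)
# Column is still character at this point; conversion to numeric
# would happen in the special-characters step
df$ActivityDepthHeightMeasure.MeasureValue[is_range] <-
  range_to_midpoint(df$ActivityDepthHeightMeasure.MeasureValue[is_range])
```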
@ehinman I think most of this is complete now. Can you please review & create new issues on specific topics here if there is anything outstanding? Thank you!
Sounds good!
Page Requirements/Standards
All of the flag page functions should be consistent in the following ways:
Function Requirements
Development Notes
ResultFlagsDependent.R