Closed: jakegreif closed this issue 1 year ago
Applying the logic for this function is more complicated than expected. Currently, the function runs, but without the complete functionality we need (clean = FALSE has not been coded; it just returns the input data frame unchanged). This is enough to test and move forward with Shiny development of the Result Flags page.
Takeaways from first attempt:
We'll ask Kevin for the list of existing values/characters in the ResultMeasureValue field and use that to inform conditional steps for each scenario to change the class of the column.
From Jake: "All should be working, but these functions bring up a critical problem we need to address: how do we deal with the class of ResultMeasureValue (and all other fields that we need to do calculations on, like depth data)? It's read in as character, and when we change it to numeric, some values are coerced to NA. Is this OK? Let's address this sooner rather than later."
Cristina: Hopefully we can address this issue with this function. This is probably where we first want to retain the original values in a separate column. That will retain the non-numeric results without converting them, and flag those rows as "non-numeric results". Over time we can try to move the non-numeric information for each special character type to a separate column so the result values can be used (i.e., move the <, >, ~, *, etc.).
This function could happen on retrieval and/or as part of the other relevant functions that impact these result fields:
- ResultMeasureValue
- DetectionQuantitationLimitMeasure/MeasureValue
- ActivityBottomDepthHeightMeasure/MeasureValue
- ActivityDepthHeightMeasure/MeasureValue
- ActivityTopDepthHeightMeasure/MeasureValue
- ResultDepthHeightMeasure/MeasureValue
From Kevin: It is difficult to provide a list of ALL special characters (not numeric):
- “>” present above quantification limit
- “<” present below quantification limit
- Not-Detected
- ND
- Etc.: any acronym or text
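One way around not having a complete list is to enumerate the non-numeric values empirically, per dataset. A minimal sketch (not TADA code; assumes a data frame named `df` with a character ResultMeasureValue column):

```r
# Values that survive as.numeric() are numeric; everything else is a
# "special" value worth inspecting (e.g. "<0.25", "ND", "Not Detected").
nonnumeric <- unique(
  df$ResultMeasureValue[
    !is.na(df$ResultMeasureValue) &
      is.na(suppressWarnings(as.numeric(df$ResultMeasureValue)))
  ]
)
nonnumeric
```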
Potentially move this function to the utilities.R file and use it within other functions.
Decision to focus on these two for now:
- ResultMeasureValue
- DetectionQuantitationLimitMeasure.MeasureValue
Create copies:
Create flag & edit fields as needed to remove/edit non-numeric values:
Add later: "~" = "approximate"; "ND" or "Non Detect" = "ND"
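A minimal sketch of these copy/flag/edit steps (the column and flag names are illustrative, not the TADA API):

```r
library(dplyr)

df <- df %>%
  mutate(
    # 1. Copy: retain the original character values in a separate column
    ResultMeasureValue.Original = ResultMeasureValue,
    # 2. Flag: record why a value is non-numeric (hypothetical flag column)
    TADA.MeasureValueFlag = case_when(
      grepl("<", ResultMeasureValue) ~ "less than",
      grepl(">", ResultMeasureValue) ~ "greater than",
      grepl("~", ResultMeasureValue) ~ "approximate",
      TRUE ~ NA_character_
    ),
    # 3. Edit: strip the characters we can interpret, then convert;
    #    anything still non-numeric is coerced to NA
    ResultMeasureValue = suppressWarnings(
      as.numeric(gsub("[<>~,]", "", ResultMeasureValue))
    )
  )
```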
WQX does have the following guidance for data submissions regarding the < or > issue:
- Convert Result Values that start with "<" or ">" into an appropriate Detection Condition and Detection Limit Value.
  - For example: a Result Value of "<0.25" would be converted into a Detection Condition of "Present Below Quantification Limit", a Detection Limit Value of "0.25", and a Detection Limit Type of "Lower Quantitation Limit".
We can implement the "<" = "less than" and ">" = "greater than" logic for this function and then fix it fully when we do the nondetect substitutions, using the flag field ("less than" or "greater than").
I agree the ones we can reasonably interpret are <, >, ~, and commas. We plan to implement the logic above for the < and > symbols, and to simply remove the ~ and commas where possible. If we learn of others that are important, we can always make the function smarter in the future. For now, values we do not know of or have a solution for will be coerced to NA (this includes all text values).
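A hedged sketch of that WQX-style conversion for "<" values, using the retained original column from the sketch above (the detection-field names follow the WQX guidance wording and may not match the exact schema):

```r
library(dplyr)

df <- df %>%
  mutate(
    censored = grepl("^<", ResultMeasureValue.Original),
    # "<0.25" -> condition "Present Below Quantification Limit",
    #            limit value 0.25, limit type "Lower Quantitation Limit"
    DetectionCondition = if_else(censored,
      "Present Below Quantification Limit", NA_character_),
    DetectionLimitValue = if_else(censored,
      suppressWarnings(as.numeric(sub("^<", "", ResultMeasureValue.Original))),
      NA_real_),
    DetectionLimitType = if_else(censored,
      "Lower Quantitation Limit", NA_character_)
  )
```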
Edits/updates from the last comment have been made, in addition to including a flag argument that only generates a flag column when flag = TRUE. The flag columns require cumbersome code that slows down the function, so cutting them out by default (flag = FALSE) improves the function's efficiency.
Some things to consider adding down the line:
@jakegreif What happens when units are meant to be blank, unitless, or "none" (e.g., pH)?
Scientific notation and issues with certain decimal numbers:
I don't have time to try this, but consider looking into how to check for the letter 'e' after a numeric character and flag as 'Scientific Notation'.
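A possible check, as a sketch (note that R's as.numeric() already parses values like "1e-5", so the flag would be informational rather than preventing coercion to NA; TADA.MeasureValueFlag is the hypothetical flag column from the sketch above):

```r
# Digits, then "e" or "E", then an optional sign and more digits
sci_pattern <- "^[0-9.]+[eE][-+]?[0-9]+$"
df$TADA.MeasureValueFlag[grepl(sci_pattern, df$ResultMeasureValue.Original)] <-
  "Scientific Notation"
```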
```r
library(dplyr)

nitrogen <- nitrogen %>%
  mutate(ResultComment = case_when(
    grepl("<", Result) ~ "<",
    grepl(">", Result) ~ ">",
    grepl("BDL", Result) ~ "<",
    grepl("bdl", Result) ~ "<",
    grepl("below detection limits", Result) ~ "<",
    grepl("Not Detected", ResultDetectionConditionText) ~ "<",
    grepl("Present Below Quantification Limit", ResultDetectionConditionText) ~ "<"
  ))
```
Add the following for both the result and depth fields:
```r
mutate(across(where(is.character), ~ na_if(., ""))) %>%
  mutate(across(where(is.character), ~ na_if(., " "))) %>%
  mutate(across(where(is.character), ~ na_if(., "<Blank>")))
```
@ehinman I wanted to flag this issue as well since it is about the MeasureValueSpecialCharacters function, which will impact the new censored data functions too
Do you want to consider percentages for conversion to numeric? e.g. 50% becomes 50 and the flag is "Percentage"? I noticed in this query, nearly all of the result measures are from a "choice list" that is clearly percentages:
```r
tada3 <- TADABigdataRetrieval(
  characteristicName = "Algae, substrate rock/bank cover (choice list)",
  siteType = "Stream"
)
tada3 <- ConvertSpecialChars(tada3, "ResultMeasureValue")
```
I'm not sure how prevalent percentages are in WQ assessment data, but if the aim is to squeeze as much as we can out of each record, this might be another way to make more data usable quantitatively.
What is the unit supplied in this case? Can we simply remove the % from the result value and add that as a unit?
When I've seen it, I think the unit is "choice list" or similar. I suppose we could convert the % to a unit. For now, the function I am working on notes that it is a percentage in the flag column that comes out with the resultant df.
That sounds like a good approach for now. In a future update, if we know for sure it is a percentage, I would feel comfortable changing the TADA unit field to %. That way, the Shiny app and other functions down the line will have % there instead of "choice list", which is ambiguous/not very useful in this case.
@cristinamullin Sounds good to me! That would be easy to implement.
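A sketch of the percentage handling discussed above (TADA.MeasureValueFlag and ResultMeasureValue.Original are the hypothetical columns from earlier sketches; ResultMeasure.MeasureUnitCode is assumed to be the unit column):

```r
library(dplyr)

df <- df %>%
  mutate(
    is_pct = grepl("^[0-9.]+%$", ResultMeasureValue.Original),
    TADA.MeasureValueFlag = if_else(is_pct, "Percentage", TADA.MeasureValueFlag),
    # Possible future step: overwrite the ambiguous "choice list" unit with %
    ResultMeasure.MeasureUnitCode = if_else(is_pct, "%",
      ResultMeasure.MeasureUnitCode),
    # "50%" -> 50
    ResultMeasureValue = if_else(is_pct,
      suppressWarnings(as.numeric(sub("%$", "", ResultMeasureValue.Original))),
      ResultMeasureValue)
  ) %>%
  select(-is_pct)
```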
@cristinamullin Another sticky example entry to discuss: "# - #". I'm noticing that in the depth height measure columns, people sometimes put in "0-2" as a depth measurement. Currently, this is "Coerced to NA". I'd be interested to hear (a) how people use these ranges in their analyses and (b) how they would suggest handling them in TADA. My first instinct is to calculate an average so these values are usable, e.g. "0-2" becomes 1. But I may be misinterpreting the entry. For example, maybe the result value is representative of all depths between 0 and 2, as opposed to representative of a depth somewhere between 0 and 2.
Great catch!
Yes and Yes: it is safe to assume in this case the value is BOTH "representative of all depths between 0 and 2" and "representative of a depth somewhere between 0 and 2".
The upper 2m is typically referred to as the "surface" layer, and I think it is safe to assume measure values with 0-2m here are representative of the "surface". However, if we want to plot this on a depth profile, we could use the avg like you suggest, or assume the value is the same from 0 to 2m down (either option works depending on whether we are drawing a line or a scatter plot).
However for making this field numeric, let's go with the avg (1m in this example) as the NUMERIC value for the depth field. There is also a metadata element where users can specify that this is a "surface" result, and we could/should probably make sure that is filled out here too in this step.
Let me know if you find others (not just 0-2), as we likely need slightly different logic for each scenario
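A sketch of the midpoint approach for "# - #" entries (range_to_midpoint is a hypothetical helper, and only simple non-negative "a-b" ranges are matched):

```r
# "0-2" -> 1 (mean of the two endpoints)
range_to_midpoint <- function(x) {
  vapply(strsplit(x, "-", fixed = TRUE),
         function(p) mean(as.numeric(p)), numeric(1))
}

is_range <- grepl("^[0-9.]+-[0-9.]+$", df$ActivityDepthHeightMeasure.MeasureValue)
# Column is still character at this point; conversion to numeric
# would happen in the special-characters step
df$ActivityDepthHeightMeasure.MeasureValue[is_range] <-
  range_to_midpoint(df$ActivityDepthHeightMeasure.MeasureValue[is_range])
```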
@ehinman I think most of this is complete now. Can you please review & create new issues on specific topics here if there is anything outstanding? Thank you!
Sounds good!
Page Requirements/Standards
All of the flag page functions should be consistent in the following ways:
Function Requirements
Development Notes
ResultFlagsDependent.R