bcgov / wqbc

An R package for water quality thresholds and index calculation for British Columbia
http://bcgov.github.io/wqbc/
Apache License 2.0
20 stars 9 forks source link

`summarize_wqdata()` function throws with some datasets #163

Open aylapear opened 2 years ago

aylapear commented 2 years ago

Two examples using the same EMS_ID/Station but different parameters/variables and in one case the function summarize_wqdata() works and provides a summary table while in the other case the table throws an error

Example where it works properly
``` data_works <- tibble::tibble( EMS_ID = c("0200016", "0200016", "0200016"), Station = c("ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"), Variable = c("Nitrogen Total", "Nitrogen Total", "Nitrogen Total"), Code = c("0114", "0114", "0114"), Value = c(0.844, 0.949, 0.754), Units = c("mg/L", "mg/L", "mg/L"), DetectionLimit = c(0.03, 0.03, 0.03), ResultLetter = c(NA, NA, NA), Date = c("2021-11-07", "2021-11-21", "2021-12-05"), Outlier = c(FALSE, FALSE, FALSE), Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93"), UPPER_DEPTH = as.factor(c(NA, NA, NA)), Detected = as.factor(c("TRUE", "TRUE", "TRUE")), Timeframe = as.factor(c("2021", "2021", "2021")) ) wqbc::summarise_wqdata( data_works, by = c("EMS_ID"), censored = TRUE, na.rm = TRUE ) ``` Output ``` # A tibble: 1 × 14 Variable EMS_ID n ncen min max mean median lowerQ upperQ sd se lowerCL upperCL 1 Nitrogen Total 0200016 3 0 0.754 0.949 0.849 0.845 0.793 0.901 0.0799 0.0461 0.763 0.944 ```
Example where it fails
``` data_fails <- tibble::tibble( EMS_ID = c("0200016", "0200016", "0200016", "0200016", "0200016"), Station = c("ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93"), Variable = c("Aluminum Total", "Aluminum Total","Aluminum Total","Aluminum Total","Aluminum Total"), Code = c("AL-T", "AL-T", "AL-T", "AL-T", "AL-T"), Value = c(0.031, 0.0192, 0.397, 0.0183, 0.1), Units = c("mg/L", "mg/L", "mg/L", "mg/L", "mg/L"), DetectionLimit = c(0.5, 0.5, 0.5, 0.5, 0.5), ResultLetter = c(NA, NA, NA, NA, NA), Date = c("2020-01-05","2020-01-27", "2020-02-02","2020-02-17","2020-03-01"), Outlier = c(FALSE, FALSE, FALSE, FALSE, FALSE), Site_Renamed = c("ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93","ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93", "ELK RIVER ABOVE HIGHWAY 93"), UPPER_DEPTH = as.factor(c(NA, NA, NA, NA, NA)), Detected = as.factor(c("FALSE", "FALSE", "FALSE", "FALSE", "FALSE")), Timeframe = as.factor(c("2020", "2020", "2020", "2020", "2020")) ) wqbc::summarise_wqdata( data_fails, by = c("EMS_ID"), censored = TRUE, na.rm = TRUE ) ``` Output ``` Error in names(ret) <- c("mean", "se", LCL(x), UCL(x)) : 'names' attribute [4] must be the same length as the vector [0] In addition: Warning message: In survreg.fit(X, Y, weights, offset, init = init, controlvals = control, : Ran out of iterations and did not converge Backtrace: ▆ 1. └─wqbc::summarise_wqdata(...) 2. └─plyr::ddply(...) 3. └─plyr::ldply(...) 4. └─plyr::llply(...) 5. ├─plyr:::loop_apply(n, do.ply) 6. └─plyr ``(1L) 7. └─wqbc .fun(piece, ...) 8. ├─base::mean(ml) 9. └─NADA::mean(ml) 10. └─NADA .local(x, ...) ```
aylapear commented 2 years ago

@joethorley

joethorley commented 2 years ago

thanks @aylapear - I'll look into

HeatherGranger commented 2 years ago

@joethorley do you remember if this is still ongoing? if so, perhaps something to examine what the dependencies are and what's worth updating in it's current format.

aylapear commented 1 year ago

The cause of this is error when all the data points are censored, specifically when the Censored = TRUE for every row.

# real data output
EMS_ID                    Station       Variable Code Value Units DetectionLimit ResultLetter       Date
1 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-05
2 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-01-27
3 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-02
4 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-02-17
5 0200016 ELK RIVER ABOVE HIGHWAY 93 Aluminum Total AL-T   0.5  mg/L            0.5           NA 2020-03-01
  Outlier               Site_Renamed UPPER_DEPTH Detected Timeframe Censored
1   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
2   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
3   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
4   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE
5   FALSE ELK RIVER ABOVE HIGHWAY 93        <NA>    FALSE      2020     TRUE

Example to show which code is throwing error

# this throws error
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(TRUE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)
# As long as one censored value is false it will run
df <- tibble(
  Value = c(0.844, 0.949, 0.754),
  Censored = c(FALSE, TRUE, TRUE)
)

ml <- with(
  df,
  cenmle(
    Value,
    Censored,
    dist = "lognormal",
    conf.int = 0.95
  )
)
ml

est <- mean(ml)
aylapear commented 1 year ago

This error is caused by the internal function summarise_wqdata_by() in the summaries-wqdata.R file

aylapear commented 1 year ago

It also fails on a single value

> df <- tibble(
+   Value = c(0.011),
+   Censored = c(TRUE)
+ )
> ml <- with(
+   df,
+   cenmle(
+     Value,
+     Censored,
+     dist = "lognormal",
+     conf.int = 0.95
+   )
+ )
> ml
Error in exp(x@survreg$coef[1]) : 
  non-numeric argument to mathematical function
# fails
df <- tibble(
  Value = c(0.011),
  Censored = c(FALSE)
)
# passes when two values even if one is censored 
df <- tibble(
  Value = c(0.011, 0.0001),
  Censored = c(FALSE, TRUE)
)
aylapear commented 1 year ago

In these edge cases when either all values are censored or only a single value is given the function should return NA' s instead of an error.

aylapear commented 1 year ago

This would then generate this table when the site that has no data display NA's instead of throwing an error.

# A tibble: 2 × 14
  Variable       EMS_ID      n  ncen   min   max  mean median lowerQ upperQ     sd     se lowerCL upperCL
  <chr>          <chr>   <int> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
1 Nitrogen Total 0200016     3     0 0.754 0.949 0.849  0.845  0.793  0.901 0.0799 0.0461   0.763   0.944
1 Nitrogen Total 0478416     NA          NA      NA      NA       NA       NA      NA        NA       NA        NA        NA
joethorley commented 1 year ago

I agree - and when no data it should return a table with the same columns and classes and no rows.