bcgov / ssdtools

An R package to fit and plot Species Sensitivity Distributions (SSDs)
https://bcgov.github.io/ssdtools/
Apache License 2.0
30 stars 16 forks source link

Use of weighted results in surprising behaviour in the estimate #344

Open joethorley opened 7 months ago

joethorley commented 7 months ago
library(ssdtools)
#> Please replace the following in your scripts:
#> - `ssdtools::boron_data` with `ssddata::ccme_boron`
#> - `ssdtools::ccme_data` with `ssddata::ccme_data`
library(ssddata)

data <- ssddata::ccme_boron

data$Weight <- 1
data$Weight[rank(data$Conc) > 6] <- 1/10

fitall <- ssd_fit_dists(data, dists="lnorm")
ssd_hc(fitall)
#> # A tibble: 1 × 10
#>   dist    percent   est    se   lcl   ucl    wt method     nboot pboot
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <int> <dbl>
#> 1 average       5  1.68    NA    NA    NA     1 parametric     0    NA

fit1 <- ssd_fit_dists(subset(data, Weight == 1), dists="lnorm")
ssd_hc(fit1)
#> # A tibble: 1 × 10
#>   dist    percent   est    se   lcl   ucl    wt method     nboot pboot
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <int> <dbl>
#> 1 average       5  1.04    NA    NA    NA     1 parametric     0    NA

fit1w <- ssd_fit_dists(subset(data, Weight == 1), dists="lnorm", weight = "Weight")
ssd_hc(fit1w)
#> # A tibble: 1 × 10
#>   dist    percent   est    se   lcl   ucl    wt method     nboot pboot
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <int> <dbl>
#> 1 average       5  1.04    NA    NA    NA     1 parametric     0    NA

fitallw10 <- ssd_fit_dists(data, dists="lnorm", weight = "Weight")
ssd_hc(fitallw10)
#> # A tibble: 1 × 10
#>   dist    percent   est    se   lcl   ucl    wt method     nboot pboot
#>   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <int> <dbl>
#> 1 average       5 0.547    NA    NA    NA     1 parametric     0    NA

Created on 2024-01-22 with reprex v2.1.0

joethorley commented 7 months ago

The last value with all the unequally weighted data appears to be incorrect.

joethorley commented 7 months ago

However the calculations are correct.

As David Fox confirmed

I can further confirm that the introduction of weights in the log-likelihood function has been done correctly. I have validated this several ways: (i) writing my own code in R; (ii) performing the calculations outside of R with different mathematical/statistical software; (iii) and comparing with fitdistrplus.

I believe the matter of SSD weighting is still an unresolved issue because what ssdtools (and fitdistrplus) are doing is not what ecotoxciologists want or expect.

joethorley commented 6 months ago

We will turn off this functionality until a better solution is found.

atillmanns commented 6 months ago

I disagree with what David Fox has written above. Weighting the data in a tail end of a distribution will likely increase the length of the tails. This is because these data points will be the focus of fitting the distribution. This could be useful if the species in the tails are of great importance and if there are few data points to represent the class of taxa that are found to be sensitive. In this case it would be a useful too to have to weight the distribution more on these species so an estimate can be made that will be protective of that group of species. I do not think this function should be removed as it is not erroneous. It just does not behave similar to weighting when it is done in a regression. I therefore think this functionality should not be removed.