davidcarslaw / openair

Tools for air quality data analysis
https://davidcarslaw.github.io/openair/
GNU General Public License v2.0

TheilSen trend statistics #384

Open raffaele-morelli opened 6 months ago

raffaele-morelli commented 6 months ago

Question

Hi,

I am working on data with missing months:

[image not preserved]

Looking at MKresults$data[[2]], I see a table with two rows, one dated 2019-06-27 and the other 2018-11-08:

column           row 1        row 2
default          default      default
p.stars          **           NA
date             2019-06-27   2018-11-08
conc             16.11156     NaN
a                55.92575     NaN
b                -0.0023145   NaN
upper.a          28.85148     NaN
upper.b          -0.0007367   NaN
lower.a          86.20022     NaN
lower.b          -0.0040051   NaN
p                0.0033389    NaN
slope            -0.8447789   NaN
intercept        55.92575     NaN
intercept.lower  86.20022     NaN
intercept.upper  28.85148     NaN
lower            -1.461859    NaN
upper            -0.2688846   NaN
slope.percent    -4.957488    -4.957488
lower.percent    -7.730342    -7.730342
upper.percent    -1.632108    -1.632108

Why are there two rows, the second with all NaN except for slope.percent, lower.percent and upper.percent?

The data around November 2018 follow:

obs  date        meteo      v_norm
11   2018-08-14  12.686618  -1.686618
7    2018-08-26  10.537685  -3.537685
10   2018-10-29  12.546518  -2.546518
14   2018-10-30  12.760691   1.239309
55   2019-02-28  18.897843  36.102157
34   2019-03-01  13.481293  20.518707
42   2019-03-02  16.437182  25.562818
35   2019-03-03  16.784798  18.215202
14   2019-03-04  12.479097   1.520903
6    2019-03-05  10.300994  -4.300993
4    2019-03-06   8.724356  -4.724356
7    2019-03-07   9.382038  -2.382038
6    2019-03-08  10.920722  -4.920722
5    2019-03-09  12.816521  -7.816521
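
For reference, a rough base-R sketch of how the gap in this excerpt can be listed. It only uses the dates shown above, so it says nothing about the rest of the series, and the variable names are made up for the illustration.

```r
# Dates copied from the excerpt above (not the full series)
dates <- as.Date(c(
  "2018-08-14", "2018-08-26", "2018-10-29", "2018-10-30",
  "2019-02-28", "2019-03-01", "2019-03-02", "2019-03-03",
  "2019-03-04", "2019-03-05", "2019-03-06", "2019-03-07",
  "2019-03-08", "2019-03-09"
))

# Every calendar month between the first and last observation
first_month <- as.Date(format(min(dates), "%Y-%m-01"))
last_month <- as.Date(format(max(dates), "%Y-%m-01"))
all_months <- format(seq(first_month, last_month, by = "month"), "%Y-%m")

# Months in that range with no observation at all
missing_months <- setdiff(all_months, format(dates, "%Y-%m"))
print(missing_months)
```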

Regards

mooibroekd commented 3 months ago

Reprex to confirm:

library(openair) 
mary <- importAURN(site = "my1", year = c(seq(2000, 2009, 1), seq(2011, 2019, 1)))
result <- TheilSen(mary, pollutant = "no2")
#> Taking bootstrap samples. Please wait.


result$data[[2]]
#> # A tibble: 2 × 20
#>   default p.stars date        conc     a         b upper.a   upper.b lower.a
#>   <chr>   <chr>   <date>     <dbl> <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
#> 1 default ***     2009-12-06  94.7  137.  -0.00305    118.  -0.00182    158.
#> 2 default <NA>    2010-06-16 NaN    NaN  NaN          NaN  NaN          NaN 
#> # ℹ 11 more variables: lower.b <dbl>, p <dbl>, slope <dbl>, intercept <dbl>,
#> #   intercept.lower <dbl>, intercept.upper <dbl>, lower <dbl>, upper <dbl>,
#> #   slope.percent <dbl>, lower.percent <dbl>, upper.percent <dbl>

Created on 2024-08-07 with reprex v2.1.1

However, the slope.percent, lower.percent and upper.percent values in the NaN row are the same as in the first row. I suspect this is a bug in how the output data are combined, in the sense that those percentages are added twice. When calculating a single trend line, I would explicitly target the first row (i.e. result$data[[2]][1, ]).
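
A minimal sketch of that workaround, runnable without downloading any data. The data frame `res` is a hand-built stand-in for result$data[[2]], with a few of the 20 columns and the values quoted in the first post. Filtering on a non-missing slope is only an assumption about a sturdier alternative to hard-coding row 1; it is not documented openair behaviour.

```r
# Hypothetical stand-in for result$data[[2]]; values copied from the first post
res <- data.frame(
  date = as.Date(c("2019-06-27", "2018-11-08")),
  slope = c(-0.8447789, NaN),
  slope.percent = c(-4.957488, -4.957488),
  lower.percent = c(-7.730342, -7.730342),
  upper.percent = c(-1.632108, -1.632108)
)

# Option 1: take the first row explicitly, as suggested above
first_row <- res[1, ]

# Option 2 (assumption): keep only rows where a slope was actually estimated;
# is.na() is TRUE for NaN as well as NA
valid <- res[!is.na(res$slope), ]

print(valid)
```

For this two-row output both options return the same single row. The second one should also cope if the NaN row happens not to come last, though that has not been checked against openair itself.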