davidcarslaw / openair

Tools for air quality data analysis
https://davidcarslaw.github.io/openair/
GNU General Public License v2.0

Unexpectedly lenient behaviour of the data.thresh argument in timeAverage for data with poor data capture but no NAs. #271

Open liamswan opened 2 years ago

liamswan commented 2 years ago

When using the timeAverage function on data with poor data capture, the data.thresh parameter does not seem to work as expected.

When missing values in the time series are filled with NA, it works. But when there are gaps in the time series that are not filled with NA (the rows are simply absent), it no longer works. Setting the interval argument does not resolve the issue either.

Using mydata from openair I can reproduce the issue.

library(openair) #openair       * 2.8-4    2021-09-15 [1] CRAN (R 4.1.1)
library(lubridate)
library(dplyr)

openair::mydata %>% select(date, pm25) %>% 
  filter(date %within% lubridate::interval("1998-01-01 00:00:00", 
                                           "1999-12-31 23:59:00")) %>% 
  filter(!is.na(pm25)) %>% 
  group_by(year(date)) %>% 
  count(.) %>% 
  mutate(data.capture = n/8760*100)

output:

# A tibble: 2 × 3
# Groups:   year(date) [2]
  `year(date)`     n data.capture
         <dbl> <int>        <dbl>
1         1998  4848         55.3
2         1999  7204         82.2

From this output I would expect timeAverage to return NA for 1998 with data.thresh = 75, since data capture for that year is only 55.3%. Instead, 1998 passes the 75% data.thresh when run through timeAverage.

openair::mydata %>% select(date, pm25) %>% 
  filter(date %within% lubridate::interval("1998-01-01 00:00:00", 
                                           "1999-12-31 23:59:00")) %>%
  filter(!is.na(pm25)) %>% 
  timeAverage(avg.time = "year", data.thresh = 75, interval = "hour")

output:

# A tibble: 2 × 2
  date                 pm25
  <dttm>              <dbl>
1 1998-01-01 00:00:00  20.6
2 1999-01-01 00:00:00  22.3

This can of course be mended by supplying timeAverage with a full time series, in this case without removing NAs from mydata. This results in the expected output:

openair::mydata %>% select(date, pm25) %>% 
  filter(date %within% lubridate::interval("1998-01-01 00:00:00", 
                                           "1999-12-31 23:59:00")) %>%
  timeAverage(avg.time = "year", data.thresh = 75, interval = "hour")

output:

# A tibble: 2 × 2
  date                 pm25
  <dttm>              <dbl>
1 1998-01-01 00:00:00  NA  
2 1999-01-01 00:00:00  22.3

It is my understanding that the interval argument should allow the function to determine the data capture percentage even if it cannot correctly guess the interval. Is this a real issue, or am I expecting too much from the function?

Thanks!

schonhose commented 2 years ago

You are expecting too much from the function. Internally, in the utilities.R file, there is the function date_pad, which is sometimes applied to data to pad out missing time data. However, this function is not exported, so it is for internal use only (internal meaning available to the functions in the package, but not in the workspace).

The interval parameter of timeAverage is only there to supply the interval of the timeseries whenever automatic detection fails. In your case, the function behaves as expected. It detects the correct interval (hour), but it doesn't pad the timeseries. As a result, you get a data coverage of 100% for the 1998 data, because every hour that is present in your set has a value.

The parameter interval does not calculate the maximum number of hours available in a calendar year; it only supplies the interval (hour in this case). In your first snippet you provided that information yourself by dividing by 8760.
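To make the difference between the two denominators concrete, here is a minimal sketch in base R. It is not openair's internal code; the toy values and the variable names (expected_n, capture_rows_present, capture_expected) are made up for illustration, and it only mirrors the behaviour described above.

```r
# Toy series: 10 hours were expected, but only 6 rows exist, and none of them is NA
expected_n <- 10
pm25 <- c(5, 7, 6, 8, 9, 4)

# Capture measured against the rows that are present (no padding)
capture_rows_present <- 100 * sum(!is.na(pm25)) / length(pm25)

# Capture measured against the number of hours that should exist
capture_expected <- 100 * sum(!is.na(pm25)) / expected_n

capture_rows_present  # 100
capture_expected      # 60
```

With a threshold of 75, the first figure passes and the second does not, which is the same pattern as the 1998 result in the original report.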

Bottom line: if you want to use timeAverage correctly, you need to supply a complete timeseries, with NA for missing values. If the automatic detection of the interval within the timeseries fails for whatever reason, you can force it by setting interval. All calculations are performed without padding, so they are based on what you provide. Hence, the length of the timeseries = the number of possible hours.
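One way to build such a complete timeseries yourself is sketched below in base R. The helper pad_hourly is hypothetical (it is not part of openair), and the sketch assumes hourly data whose date column is POSIXct in the same time zone as the generated sequence.

```r
# Hypothetical helper (not an openair function): make sure every hour between
# `start` and `end` has a row, with NA wherever no measurement exists.
pad_hourly <- function(df, start, end, tz = "GMT") {
  full <- data.frame(
    date = seq(as.POSIXct(start, tz = tz), as.POSIXct(end, tz = tz), by = "hour")
  )
  # all.x = TRUE keeps every expected hour; unmatched hours get NA
  merge(full, df, by = "date", all.x = TRUE)
}

# Toy example: 24 hours expected, only every second hour observed
obs <- data.frame(
  date = seq(as.POSIXct("1998-01-01 00:00:00", tz = "GMT"),
             by = "2 hours", length.out = 12),
  pm25 = 1:12
)

padded <- pad_hourly(obs, "1998-01-01 00:00:00", "1998-01-01 23:00:00")

nrow(padded)             # 24 rows, one per expected hour
sum(is.na(padded$pm25))  # 12 hours without a measurement
```

The padded data frame can then be passed to timeAverage(avg.time = "year", data.thresh = 75), so that the missing hours count against the threshold.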