Statistics for groups with 1 (or only a few) values

retostauffer commented 10 months ago

Dear @gabroko

something for Friday!

The problem

Whilst processing our huge data set I ran into an issue today. When calculating the statistics it is possible that we have a sample size of 1 for a specific group, and I am not sure what would be best to do here.

The reason (simplified)

An example: We have data for a specific study/home/room/variable that starts on "December 31, 23:55" and ends a few months later. If we assume one observation every 10 minutes, we end up with exactly one observation for "December" of that year.
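To make the example concrete, here is a minimal Python sketch (not code from our package). The year, the number of records, and the variable names are assumptions chosen for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical logger: first record on December 31 at 23:55 (year assumed),
# followed by one record every 10 minutes.
start = datetime(2021, 12, 31, 23, 55)
timestamps = [start + timedelta(minutes=10 * i) for i in range(1000)]

# Count observations per (year, month) group.
counts = Counter((t.year, t.month) for t in timestamps)

for group, n in sorted(counts.items()):
    print(group, n)
```

The second record already falls on January 1 at 00:05, so the December group contains a single observation.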

The effects

In this case our software properly calculates that we have N = 1 and we can no longer guess/estimate the interval of the logger. Thus, we end up with no interval_* which currently causes an error in annex_validate().

In addition, the standard deviation is obviously also missing and the percentiles (based on one value) are all the same constant value.
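A small Python sketch of these effects, assuming the interval is estimated from the differences between consecutive timestamps (an assumption about the approach, not a copy of our implementation). The helper names and the timestamp are hypothetical; `None` stands in for NA.

```python
import statistics
from datetime import datetime

def interval_estimates(timestamps):
    """Differences between consecutive timestamps in seconds; empty with fewer than two."""
    ts = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]

def sample_sd(values):
    """Sample standard deviation, or None (standing in for NA) with fewer than two values."""
    return statistics.stdev(values) if len(values) >= 2 else None

def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    v = sorted(values)
    pos = (len(v) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (v[hi] - v[lo]) * (pos - lo)

# A group with a single valid observation (timestamp is hypothetical).
single_time = [datetime(2022, 5, 11, 12, 0)]
single_value = [685.0]

diffs = interval_estimates(single_time)   # no differences, so no interval_* statistics
sd = sample_sd(single_value)              # undefined for one value
pcts = [percentile(single_value, p) for p in (0, 25, 50, 75, 100)]

print(diffs, sd, pcts)
```

With one value there are no consecutive differences to summarize, the sample standard deviation is undefined, and every percentile collapses to that one value.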

Real-life

This happens for one of the loggers I am processing right now, for one room where we don't have N = 1 but only one non-missing value. The software properly reports that we have (as an example) N = 825 values, but NAs = 824 of them are missing; thus, there is only one non-missing value, causing the exact same problem as outlined above.

Solutions to think about

Tell annex_validate() to consider it OK if interval_* and Sd are missing when N - NAs == 1
Remove these cases from the stats completely
Set the percentiles (p*) to NA as well and tell annex_validate() that this is what we expect if N - NAs == 1.

retostauffer commented 10 months ago

Btw, here is a simplified example which might help to discuss this later this week.

user  study  home                               room  year  month tod
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  5     all
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  all   07-23
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  5     07-23

variable quality_lower quality_upper quality_start quality_end
CO2      0             0             44692         44692
CO2      0             0             44692         44692

interval_Min interval_Q1 interval_Median interval_Mean interval_Q3 interval_Max
NA           NA          NA              NA            NA          NA
NA           NA          NA              NA            NA          NA
NA           NA          NA              NA            NA          NA

Nestim   N    NAs  Mean  Sd   p00    ...    p100
NA       825  824  685   NA   685    ...    685
NA       582  581  685   NA   685    ...    685
NA       582  581  685   NA   685    ...    685

retostauffer commented 10 months ago

@gabroko fyi; in the current version of the master branch I've (kind of arbitrarily) set the minimum number of required valid observations to 10.

Thus, if N - NAs (the number of observations/records in the data set minus the number of missing values among them) is below 10, I am leaving the following columns in the statistics empty, and I have trained annex_validate() to consider this as "fine".

Commit #638adb1b88b3a45922a6e9ffb44b472445ede246
Forcing to zero (regex): "^(interval_.*|Nestim|Mean|Sd|p[0-9\\.]+)$", i.e., interval estimates, number of estimated observations (relies on the estimated interval), mean, sd, as well as all percentiles. Thus, all that is left is quality flags and sample sizes. TBD.
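To show which columns that regular expression selects, here is a short Python sketch. It assumes the doubled backslash in the quoted pattern is string escaping for a literal dot. The column list is illustrative; "p02.5" in particular is a hypothetical percentile name, not taken from the table above.

```python
import re

# Pattern from the comment above, written as a Python raw string.
pattern = re.compile(r"^(interval_.*|Nestim|Mean|Sd|p[0-9\.]+)$")

# Illustrative column names; "p02.5" is a hypothetical percentile column.
columns = ["quality_lower", "quality_upper", "interval_Min", "interval_Max",
           "Nestim", "N", "NAs", "Mean", "Sd", "p00", "p02.5", "p100"]

masked = [c for c in columns if pattern.match(c)]
kept = [c for c in columns if not pattern.match(c)]

print("masked:", masked)
print("kept:  ", kept)
```

Only the quality flags and the sample size columns (N, NAs) fall outside the pattern, which agrees with the description above.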

retostauffer commented 10 months ago

As decided today, we will not provide Mean and Sd if the number of valid observations is < 30 (asymptotic).

retostauffer commented 10 months ago

@gabroko implemented and updated the current version of our package (HEAD main). Below you can see a screenshot of a subset of the columns from the stats object.

If N - NAs >= 30: provide Mean, Sd as well as interval_* estimates, Nestim and percentiles.
If N - NAs < 30 & N - NAs > 1: set Mean and Sd to NA but keep the rest.
If N - NAs == 1: no longer able to properly estimate intervals, interval_* is NA as well as Nestim.
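The three rules above can be sketched in Python as follows. This is an illustration under assumptions, not the package code: `mask_stats` is a hypothetical helper, `None` stands in for NA, and the sketch assumes that Mean and Sd are also NA when N - NAs == 1 (in line with the < 30 decision). The case N - NAs == 0 is not described in this thread and is not handled specially.

```python
NA = None  # stand-in for a missing value

def mask_stats(row):
    """Return a copy of a stats row masked by its number of valid observations."""
    valid = row["N"] - row["NAs"]
    out = dict(row)
    if valid < 30:
        # Too few valid observations for Mean and Sd.
        out["Mean"] = NA
        out["Sd"] = NA
    if valid == 1:
        # A single value gives no information about the logger interval.
        for key in out:
            if key.startswith("interval_") or key == "Nestim":
                out[key] = NA
    return out

# The first row mirrors the example from this issue; the other two are made up.
one_valid = {"N": 825, "NAs": 824, "interval_Median": NA, "Nestim": NA,
             "Mean": 685, "Sd": NA, "p50": 685}
few_valid = {"N": 100, "NAs": 80, "interval_Median": 600, "Nestim": 120,
             "Mean": 700.5, "Sd": 12.3, "p50": 699}
many_valid = {"N": 100, "NAs": 10, "interval_Median": 600, "Nestim": 120,
              "Mean": 700.5, "Sd": 12.3, "p50": 699}

for row in (one_valid, few_valid, many_valid):
    print(mask_stats(row))
```

Percentiles are kept in all three cases here, matching "keep the rest" in the second rule.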

A minimal example to reproduce this can be found as a gist: https://gist.github.com/retostauffer/481297231c70efccf5158a7e654314f6

Screenshot from 2023-12-22 07-28-01

If this is fine with you, we can close this issue.

IEA-EBC-Annex86 / annex