IEA-EBC-Annex86 / annex

R package to process and store data for the IEA EBC - Annex 86 - Energy Efficient Indoor Air Quality Management in Residential Buildings project. For details visit https://annex86.iea-ebc.org/. Documentation and examples available on https://iea-ebc-annex86.github.io/annex/

Statistics for groups with 1 (or only few) values #17

Closed retostauffer closed 6 months ago

retostauffer commented 10 months ago

Dear @gabroko

something for Friday!

The problem

Whilst processing our huge data set I ran into an issue today. When calculating the statistics it is possible that we end up with a sample size of 1 for a specific group, and I am not sure what would be best to do here.

The reason (simplified)

An example: We have data for a specific study/home/room/variable which starts on "December 31, 23:55" and ends a few months later. If we assume one observation every 10 minutes, we end up with exactly one observation for "December" of that year.
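To illustrate, here is a minimal base R sketch of that situation. It does not use the annex package; the start time, the 10-minute interval and the (shortened) series length are made up for the illustration.

```r
# Hypothetical logger: one record every 10 minutes, starting Dec 31, 23:55 (UTC).
# Only 1000 records are generated to keep the example small.
start <- as.POSIXct("2022-12-31 23:55:00", tz = "UTC")
time  <- seq(start, by = "10 min", length.out = 1000)

# Number of observations per year-month group
n_per_month <- table(format(time, "%Y-%m"))
print(n_per_month)
```

The "2022-12" group contains a single record; all remaining records fall into "2023-01".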

The effects

In this case our software properly calculates that we have N = 1 and we can no longer guess/estimate the interval of the logger. Thus, we end up with no interval_* which currently causes an error in annex_validate().

In addition, the standard deviation is obviously missing as well, and the percentiles (based on one value) all equal that one value.
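These effects can be reproduced with base R alone. The sketch below is only an illustration with an arbitrary CO2 value and timestamp; it is not how the package computes its statistics.

```r
# A group with exactly one observation (arbitrary value and timestamp)
value <- 685
time  <- as.POSIXct("2022-12-31 23:55:00", tz = "UTC")

# Standard deviation of a single value is undefined
sd_value <- sd(value)

# All percentiles collapse to the one value we have
q <- quantile(value, probs = c(0, 0.25, 0.5, 0.75, 1))

# With one timestamp there are no time differences left,
# so the logger interval cannot be estimated
dt <- diff(time)

print(sd_value)
print(q)
print(length(dt))
```

With no time differences available, any summary of the interval (minimum, median, mean, ...) has nothing to work on.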

Real-life

This happens for one of the loggers I am processing right now, for one room where we don't have N = 1 but only one non-missing value. The software properly reports that we have (as an example) N = 825 values, but thereof NAs = 824 are missing. Thus, only one non-missing value is left, which causes the exact same problem as outlined above.
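A small base R sketch of this case, using the counts from above (N = 825, NAs = 824). The vector is constructed by hand and the value 685 is only a placeholder.

```r
# 824 missing values and one valid CO2 reading
co2 <- c(rep(NA_real_, 824), 685)

N       <- length(co2)        # number of records
NAs     <- sum(is.na(co2))    # number of missing values
n_valid <- N - NAs            # number of valid observations

# Mean is defined, but the standard deviation is not
co2_mean <- mean(co2, na.rm = TRUE)
co2_sd   <- sd(co2, na.rm = TRUE)

print(c(N = N, NAs = NAs, valid = n_valid))
```

So what matters for the statistics is the number of valid observations (N - NAs), not N itself.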

Solutions to think about

retostauffer commented 10 months ago

Btw, here is a simplified example which might help to discuss this later this week.

user  study  home                               room  year  month tod
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  5     all
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  all   07-23
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5  2022  5     07-23

variable quality_lower quality_upper quality_start quality_end
CO2      0             0             44692         44692
CO2      0             0             44692         44692

interval_Min interval_Q1 interval_Median interval_Mean interval_Q3 interval_Max
NA           NA          NA              NA            NA          NA
NA           NA          NA              NA            NA          NA
NA           NA          NA              NA            NA          NA

Nestim   N    NAs  Mean  Sd   p00    ...    p100
NA       825  824  685   NA   685    ...    685
NA       582  581  685   NA   685    ...    685
NA       582  581  685   NA   685    ...    685
retostauffer commented 10 months ago

@gabroko fyi; in the current version of the master branch I've (kind of arbitrarily) set the minimum number of required valid observations to 10.

Thus, if N - NAs (the number of observations/records in the data set minus the number of missing values) is below 10, I leave the following columns in the statistics empty, and I have trained annex_validate() to consider this as "fine".

retostauffer commented 10 months ago

As decided today, we will not provide Mean and Sd if the number of valid observations is < 30 (asymptotic).
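A rough sketch of how such a rule could look in base R. `summarize_values()` and its `min_valid` argument are hypothetical names for this illustration and not part of the package; only Mean and Sd are handled here.

```r
# Hypothetical helper (not the annex implementation): report Mean and Sd
# only if at least `min_valid` non-missing observations are available.
summarize_values <- function(x, min_valid = 30L) {
    n_valid <- sum(!is.na(x))
    enough  <- n_valid >= min_valid
    c(N    = length(x),
      NAs  = sum(is.na(x)),
      Mean = if (enough) mean(x, na.rm = TRUE) else NA_real_,
      Sd   = if (enough) sd(x, na.rm = TRUE)   else NA_real_)
}

# One valid value out of 825: Mean and Sd are left empty
sparse <- summarize_values(c(rep(NA_real_, 824), 685))

# Exactly 30 valid values: Mean and Sd are reported
dense <- summarize_values(rep(c(400, 500, 600), times = 10))

print(sparse)
print(dense)
```

Setting `min_valid = 10` would correspond to the earlier (arbitrary) threshold mentioned above.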

retostauffer commented 10 months ago

@gabroko this is implemented and the current version of our package (HEAD main) is updated. Below you can see a screenshot of a subset of the columns from the stats object.

A minimal example to reproduce this can be found as a gist: https://gist.github.com/retostauffer/481297231c70efccf5158a7e654314f6

Screenshot from 2023-12-22 07-28-01

If this is fine with you we can close this issue.