retostauffer closed this issue 6 months ago
Btw, here is a simplified example which might help when we discuss this later this week.
user study home room year month tod
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5 2022 5 all
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5 2022 all 07-23
0008 XXXXX 01aa8ba5-1f4b-4560-ac6a-c515c06cdefa BAT5 2022 5 07-23
variable quality_lower quality_upper quality_start quality_end
CO2 0 0 44692 44692
CO2 0 0 44692 44692
interval_Min interval_Q1 interval_Median interval_Mean interval_Q3 interval_Max
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
Nestim N NAs Mean Sd p00 ... p100
NA 825 824 685 NA 685 ... 685
NA 582 581 685 NA 685 ... 685
NA 582 581 685 NA 685 ... 685
@gabroko fyi; in the current version of the master branch I've (kind of arbitrarily) set the minimum number of required valid observations to 10.
Thus, if N - NAs (the number of observations/records in the data set minus the number of missing values among them) is below 10, I leave the following columns in the statistics empty, and I have trained annex_validate() to consider this "fine":
"^(interval_.*|Nestim|Mean|Sd|p[0-9\\.]+)$"
That is: the interval estimates, the number of estimated observations (which relies on the estimated interval), mean, sd, and all percentiles. All that is left are the quality flags and sample sizes. TBD.
As decided today, we will not provide Mean and Sd if the number of valid observations is < 30 (asymptotic).
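To make the column pattern quoted in this thread concrete, here is a minimal sketch of which columns it selects. It is written in Python for illustration only and is not code from the annex package (which is R); the column list is a hypothetical subset taken from the example table in this thread.

```python
import re

# Pattern quoted in the thread. The R string "p[0-9\\.]+" is the regex p[0-9\.]+
EMPTIED = re.compile(r"^(interval_.*|Nestim|Mean|Sd|p[0-9\.]+)$")

# Hypothetical subset of the statistics columns (names from the example table)
columns = [
    "variable", "quality_lower", "quality_upper", "quality_start", "quality_end",
    "interval_Min", "interval_Median", "Nestim", "N", "NAs",
    "Mean", "Sd", "p00", "p100",
]

# Columns that would be left empty versus columns that are always filled
emptied = [c for c in columns if EMPTIED.match(c)]
kept = [c for c in columns if not EMPTIED.match(c)]

print("emptied:", emptied)
print("kept:   ", kept)
```

Only the quality flags and the sample sizes (N, NAs) fall outside the pattern, which agrees with the description above.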
@gabroko implemented and updated the current version of our package (HEAD main). Below you can see a screenshot of a subset of the columns from the stats object.
N - NAs >= 30: provide Mean, Sd as well as the interval_* estimates, Nestim and percentiles.
N - NAs < 30 & N - NAs > 1: set Mean and Sd to NA but keep the rest.
N - NAs == 1: no longer able to properly estimate intervals; interval_* is NA, as is Nestim.
A minimal example to reproduce this can be found as a gist: https://gist.github.com/retostauffer/481297231c70efccf5158a7e654314f6
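The three cases can be summarised as a small decision function. This is a hedged sketch in Python, not the package's R implementation; the function name `provided_columns` and the grouping keys are hypothetical. It assumes the rules are cumulative (Mean and Sd stay NA when only one valid value is left) and that nothing is provided when there are no valid values at all, a case this thread does not cover.

```python
def provided_columns(n, n_nas):
    """Return which groups of statistics are provided for a group of records.

    n      total number of records (column N)
    n_nas  number of missing values among them (column NAs)
    """
    n_valid = n - n_nas
    return {
        # Mean and Sd only from 30 valid observations upwards
        "Mean_Sd": n_valid >= 30,
        # interval_* and Nestim need at least two valid observations
        "interval_Nestim": n_valid > 1,
        # Assumption: percentiles need at least one valid observation
        "percentiles": n_valid >= 1,
    }


# Rows from the example in this thread: N = 825 with NAs = 824
print(provided_columns(825, 824))
# A well-populated group for comparison (made-up numbers)
print(provided_columns(100, 50))
```

For the rows from the example table (N = 825, NAs = 824 and N = 582, NAs = 581), only the percentiles remain.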
If this is fine with you, we can close this issue.
Dear @gabroko
something for Friday!
The problem
Whilst processing our huge data set I ran into an issue today. When calculating the statistics it is possible that we have a sample size of 1 for a specific group, and I am not sure what would be best to do here.
The reason (simplified)
An example: We have data for a specific study/home/room/variable starting "December 31, 23:55" and ending a few months later. If we assume we have 1 observation every 10 minutes, we will end up with one observation for "December" that year.
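The effect is easy to reproduce with synthetic timestamps. The following Python sketch is illustrative only and not taken from the package; the year and the number of records are made up. It groups a 10-minute series starting on December 31, 23:55 by year and month.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical logger: first record on December 31, 23:55, then one every 10 minutes
start = datetime(2021, 12, 31, 23, 55)
step = timedelta(minutes=10)
timestamps = [start + i * step for i in range(1000)]

# Count the records per (year, month) group
counts = Counter((t.year, t.month) for t in timestamps)

for (year, month), n in sorted(counts.items()):
    print(year, month, n)
```

The December group contains a single record, while every following record falls into January.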
The effects
In this case our software properly calculates that we have N = 1, and we can no longer guess/estimate the interval of the logger. Thus, we end up with no interval_*, which currently causes an error in annex_validate().
In addition, the standard deviation is obviously also missing, and the percentiles (based on one value) are all the same constant value.
Real-life
This happens for one of the loggers I am processing right now, for one room where we don't have N = 1 but only one non-missing value. The software properly reports that (as an example) we have N = 825 values, but of these NAs = 824 and, thus, only one non-missing value, causing the exact same problem outlined above.
Solutions to think about
Adjust annex_validate() to consider it OK if interval_* and Sd are missing if N - NAs == 1.
Set the percentiles (p*) to NA as well and tell annex_validate() that this is what we expect if N - NAs == 1.