bgctw / REddyProc

Processing data from micrometeorological Eddy-Covariance systems

Daily uncertainty aggregation does not consider filtering NEE_uStar_fqc == 0 #33

Closed · lsigut closed this issue 3 years ago

lsigut commented 5 years ago

Hi, I was comparing with my results using base R functions and could not arrive at the values presented for daily aggregation in the vignette. After some testing I noticed that %>% filter(NEE_uStar_fqc == 0) is missing from the pipe, i.e. all daily values of NEE_uStar_fsd are used for aggregation without filtering. I confirmed this when I removed the filtering in my own code. Unfortunately, the solution is not as simple as including the above filter in the pipe, because that would also filter the records used for nEff, which would again affect the uncertainty results.

bgctw commented 5 years ago

When dealing with empirical autocorrelation, it is important to keep the equidistant time steps. Filtering the data beforehand will most probably introduce some faults. Do you work with equidistant steps in your results using the base functions?
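As a minimal sketch of the distinction (column names as used further down in this thread, with results assumed to come from EProc$sExportResults()), filtering drops rows and distorts the lag structure, whereas masking gap-filled records with NA keeps every half-hourly step in place:

library(dplyr)

# filtering removes rows: the remaining residuals are no longer equidistant,
# so the empirical autocorrelation is estimated on a distorted lag structure
residFiltered <- results %>%
  filter(NEE_uStar_fqc == 0) %>%
  mutate(resid = NEE_uStar_orig - NEE_uStar_fall) %>%
  pull(resid)

# masking with NA keeps all time steps; computeEffectiveAutoCorr() and
# computeEffectiveNumObs() handle the NAs
residMasked <- results %>%
  mutate(resid = ifelse(NEE_uStar_fqc == 0,
                        NEE_uStar_orig - NEE_uStar_fall, NA)) %>%
  pull(resid)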

lsigut commented 5 years ago

Yes, the functions are now available in openeddy. You can check the openeddy::agg_fsd and openeddy::agg_DT_SD functions and their documentation in the package's GitHub dev version. They are finished now. A regular timestamp is required, and only the measured data are used for the nEff estimation. The only problem is that in some cases there are not enough values per aggregation period; in that case NAs are produced. The main advantage of these functions is that they evaluate uncertainty for all available variables using autodetection. That way you provide the full REddyProc output and extract uncertainties for all meteo and uStar scenarios at once for a given aggregation interval (see the examples in the documentation).
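A rough usage sketch, based on the agg_fsd calls that appear further down in this thread (see ?openeddy::agg_fsd for the authoritative interface; EProc and EddyDataWithPosix are the objects defined in the code below):

library(openeddy)

results <- EProc$sExportResults()
results$timestamp <- EddyDataWithPosix$DateTime - 60*15  # shifted timestamps, as in the code below

# uncertainty aggregation for every *_fsd column found, per day and per year
day  <- agg_fsd(results, "%Y-%m-%d", agg_per = "day-1")
year <- agg_fsd(results, "%Y", agg_per = "year-1")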

bgctw commented 5 years ago

Do you have any specific suggestions about what still needs to be done?

lsigut commented 5 years ago

It does not seem to me that the vignette code has changed. If I remember correctly, it may require two separate pipes: one for nEff, and a second for the fsd aggregation. I am not familiar with pipes though.
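A rough sketch of what two separate pipes could look like (column names and formulas as used elsewhere in this thread; the actual vignette code may differ):

# pipe 1: nEff from residuals of the full, equidistant series (no filtering);
# computeEffectiveNumObs() comes from the lognorm package loaded below
resid <- EProc$sExportResults() %>%
  mutate(resid = ifelse(NEE_uStar_fqc == 0,
                        NEE_uStar_orig - NEE_uStar_fall, NA)) %>%
  pull(resid)
nEff <- computeEffectiveNumObs(resid, na.rm = TRUE)

# pipe 2: fsd aggregation restricted to good-quality records
EProc$sExportResults() %>%
  filter(NEE_uStar_fqc == 0) %>%
  summarise(seMean = sqrt(mean(NEE_uStar_fsd^2, na.rm = TRUE) / (nEff - 1)))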

bgctw commented 3 years ago

@lsigut: I updated the aggregation vignette.

Could you please have a look at commit 9b929cef7dc1656e99525e621aab73e5fb8cc38d and see whether it resolves the disagreement with your calculations?

lsigut commented 3 years ago

The fix is not successful, because the quality filter does not work as intended. Note that the outcome is actually identical to not including it at all:

results <- EProc$sExportResults() %>% 
  mutate(resid = ifelse(NEE_uStar_fqc == 0, NEE_uStar_orig - NEE_uStar_fall, NA)) %>% 
  mutate(resid2 = NEE_uStar_orig - NEE_uStar_fall) 
identical(results$resid, results$resid2[1:17520])  
# resid2 has only additional attributes (stripped here by subsetting)

When comparing with your partial results on the daily scale, I get identical nEff. Therefore the problem must be that QC filtering should be applied only to the _fsd columns (e.g. NEE_uStar_fsd). The problem can be better understood if you check head(results): since NEE_uStar_orig is already effectively filtered for quality, the filtering also carries over to the difference NEE_uStar_orig - NEE_uStar_fall. Additionally, if you check aggDay, our results are identical at the 6th and 7th rows, where nRec == 48, i.e. no QC filtering is needed.
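A minimal sketch of what filtering only the _fsd column could mean, keeping all rows and masking just the uncertainty of gap-filled records (NEE_uStar_fsd_meas is only an illustrative name):

results <- EProc$sExportResults() %>%
  mutate(
    # *_orig is already QC filtered, so the residual is NA for gap-filled records anyway
    resid = NEE_uStar_orig - NEE_uStar_fall,
    # mask fsd of gap-filled records instead of filtering rows (illustrative name)
    NEE_uStar_fsd_meas = ifelse(NEE_uStar_fqc == 0, NEE_uStar_fsd, NA)
  )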

Here is a more complete code where you can check the differences:

library(REddyProc)
library(dplyr)
library(lognorm)
library(openeddy)

EddyDataWithPosix <- Example_DETha98 %>% 
  filterLongRuns("NEE") %>% 
  fConvertTimeToPosix('YDH',Year = 'Year',Day = 'DoY', Hour = 'Hour')
EProc <- sEddyProc$new(
  'DE-Tha', EddyDataWithPosix, c('NEE','Rg','Tair','VPD', 'Ustar'))
EProc$sMDSGapFillAfterUstar('NEE', uStarTh = 0.3, FillAll = TRUE)
results <- EProc$sExportResults() 
summary(results$NEE_uStar_fsd)

results %>% filter(NEE_uStar_fqc == 0) %>% summarise(
  nRec = sum(is.finite(NEE_uStar_f))
  , varSum = sum(NEE_uStar_fsd^2, na.rm = TRUE)
  , seMean = sqrt(varSum) / nRec
  , seMeanApprox = mean(NEE_uStar_fsd, na.rm = TRUE) / sqrt(nRec)
) %>% select(nRec, seMean, seMeanApprox)

# this step can be removed (or better amended to filter NEE_uStar_fsd instead)
results <- EProc$sExportResults() %>% 
  mutate(
    resid = ifelse(NEE_uStar_fqc == 0, NEE_uStar_orig - NEE_uStar_fall, NA )
  )

autoCorr <- computeEffectiveAutoCorr(results$resid)
nEff <- computeEffectiveNumObs(results$resid, na.rm = TRUE)
c( nEff = nEff, nObs = sum(is.finite(results$resid)))

results %>% filter(NEE_uStar_fqc == 0) %>% summarise(
  nRec = sum(is.finite(NEE_uStar_fsd))
  , varMean = sum(NEE_uStar_fsd^2, na.rm = TRUE) / nRec / (!!nEff - 1)
  , seMean = sqrt(varMean) 
  #, seMean2 = sqrt(mean(NEE_uStar_fsd^2, na.rm = TRUE)) / sqrt(!!nEff - 1)
  , seMeanApprox = mean(NEE_uStar_fsd, na.rm = TRUE) / sqrt(!!nEff - 1)
) %>% select(seMean, seMeanApprox)

results <- results %>% mutate(
  DateTime = EddyDataWithPosix$DateTime
  , DoY = as.POSIXlt(DateTime - 15*60)$yday # midnight belongs to the previous day
)

# openeddy results for comparison (yearly scale):
results$timestamp <- results$DateTime - 60*15
year <- agg_fsd(results, "%Y", agg_per = "year-1")
year

aggDay <- results %>% group_by(DoY) %>% 
  summarise(
    DateTime = first(DateTime)
    , nRec = sum( NEE_uStar_fqc == 0, na.rm = TRUE)
    , nEff = computeEffectiveNumObs(
      resid, effAcf = !!autoCorr, na.rm = TRUE)
    , NEE = mean(NEE_uStar_f, na.rm = TRUE)
    , sdNEE = if (nEff <= 1) NA_real_ else sqrt(
      mean(NEE_uStar_fsd^2, na.rm = TRUE) / (nEff - 1)) 
    , sdNEEuncorr = if (nRec == 0) NA_real_ else sqrt(
      mean(NEE_uStar_fsd^2, na.rm = TRUE) / (nRec - 1))
  )
aggDay

# openeddy results for comparison (daily scale):
results$timestamp <- results$DateTime - 60*15
day <- agg_fsd(results, "%Y-%m-%d", agg_per = "day-1")
plot(day$mean$NEE_uStar_fsd_mean, aggDay$sdNEE)
day$mean[1:10, "NEE_uStar_fsd_mean"]
aggDay$sdNEE[1:10]
lines(0:6, 0:6) # add a 1:1 line for comparison

bgctw commented 3 years ago

For computing the mean and sd we need to filter by the quality flag. However, for computing the effective number of observations we are not allowed to remove records from the time series, i.e. we must not filter. Hence, I introduced a bug with the previous commit. Please have a look at commit fd448e2d60915082857936b5c2e3a8749e9a58a and check whether it is now correct and better explained.

bgctw commented 3 years ago

That is still wrong. Getting into the topic again, I realize that we aim for:

lsigut commented 3 years ago

I agree that it is a bit tricky, and I spent some time implementing it in openeddy based on your advice. I also had a hard time with the statistics, not only with the data wrangling, so thank you for the explanations! I think what really helps is that *_orig is always already QC filtered, so which records are measured and which are "predicted" is already specified, and that fact can be used during the computations.

What I did was to write a single computation algorithm that works for arbitrary time intervals (including yearly and daily) and then check it against your results. I think what I noticed was that my results agreed with the yearly estimates, but not the daily ones. It was a long time ago, but I assume that if I changed the algorithm to reproduce your results at daily resolution, I would likely no longer arrive at the same yearly results as yours. I obviously still cannot say with 100% certainty that I did not overlook anything. I was just hoping that if both of us used different algorithms and arrived at the same results while only agreeing on the principles/requirements, it would validate the approach.

I think the main complication was the requirement to compute the autocorrelation on the complete dataset. Another question could be what happens if 10 years of data are processed, but I imagine that computing the autocorrelation from the whole period will only make the estimate more robust, as it should be a rather general property of the type of measurement. Note that I do not want to force any change, and I would prefer that you arrive at the same conclusion independently, because that would be the real validation. If I understand correctly, your last comment here is in line with what I was suggesting in my previous comment.

bgctw commented 3 years ago

I created a version (28505eb08746616c31293c7669b9f1a5c09036d9) where I removed all the filtering and instead created columns with explicit NA for the uncertainty of gap-filled records. This is simpler and hopefully explains the issues more clearly. I am looking forward to critical feedback.
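A minimal sketch of that no-filtering variant, assuming the column names used above (NEE_uStar_fsdGood is only an illustrative name; the actual commit may name things differently):

results <- EProc$sExportResults() %>%
  mutate(
    # residuals only exist for measured records; NA elsewhere keeps the series equidistant
    resid = ifelse(NEE_uStar_fqc == 0, NEE_uStar_orig - NEE_uStar_fall, NA),
    # explicit NA for the uncertainty of gap-filled records (illustrative name)
    NEE_uStar_fsdGood = ifelse(NEE_uStar_fqc == 0, NEE_uStar_fsd, NA)
  )
nEff <- computeEffectiveNumObs(results$resid, na.rm = TRUE)

# aggregation then needs no filter(); na.rm skips the gap-filled records
results %>% summarise(
  seMean = sqrt(mean(NEE_uStar_fsdGood^2, na.rm = TRUE) / (nEff - 1))
)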

bgctw commented 3 years ago

Thanks for taking the time to sit together and confirm that the issue is resolved with the last commit.