bgctw / REddyProc

Processing data from micrometeorological Eddy-Covariance systems

deal with unreasonable uncertainty estimates #21

Closed bgctw closed 5 years ago

bgctw commented 5 years ago

REddyProc sometimes fails or provides strange results if unreasonable NEE uncertainty estimates are provided. E.g. at some sites, NEE repeatedly takes the numerically exact same value at similar daytimes, leading to a zero uncertainty estimate. This in turn leads to non-finite values of cost functions, e.g. in LRC fitting, and to further problems downstream whose cause is very hard to track down.

Therefore, REddyProc should early on detect such problematic sdNEE records, warn the user and do something about it before further processing.

I suggest the following:

1) after gap-filling, REddyProc should check and warn the user
2) day-time partitioning should exit if the lowest sdNEE after controlling for leverage is too low
3) gap-filling based uncertainty estimation should provide an option to replace lower estimates by a user-specified minimum uncertainty.
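Suggestion 3 could look roughly like the following sketch (not an existing REddyProc function; the name `apply_min_uncertainty` and the default floor of 0.7 are only illustrative):

```r
# Sketch of suggestion 3: raise gap-filling based uncertainty estimates
# that fall below a user-specified floor, and warn about how many records
# were affected. The default min_sd = 0.7 is a made-up example value.
apply_min_uncertainty <- function(sdNEE, min_sd = 0.7) {
  n_low <- sum(sdNEE < min_sd, na.rm = TRUE)
  if (n_low > 0)
    warning(n_low, " sdNEE records below minimum of ", min_sd, "; replacing them.")
  pmax(sdNEE, min_sd)   # NA estimates stay NA
}

sd_fixed <- suppressWarnings(apply_min_uncertainty(c(0, 0.2, 1.5, NA)))
# zero and too-small estimates are raised to the floor; NA stays NA
```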

lsigut commented 5 years ago

In general, the chance of having a sequence of constant flux values is very low, and such data should be checked very carefully in the quality control step. The chance of a zero flux, even as a single case, is practically zero if no rounding is applied. In EddyPro the covariance is computed with a precision of up to 16 or 17 decimals. If both the scalar and the vertical wind component are measured correctly, it should not be possible to produce an integer 0 value. In my experience the lowest computed fluxes are on the order of |0.001|, so likely only rounding to 2 decimals or fewer could lead to a valid zero flux. Thus such rounding should not be used during data processing. These zero fluxes sometimes appear also in our measurements, but it is quite rare, and in the last case such values were produced after an instrument failure (data gap). I never managed to have a deeper look at the cause. Such data should be excluded.
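The rounding argument is easy to demonstrate: fluxes near the |0.001| level survive at full precision but collapse to exact zeros when rounded to 2 decimals, producing the spurious constant values discussed above (the values below are made up for illustration):

```r
# Near-zero fluxes at full precision vs. after rounding to 2 decimals.
nee <- c(0.0043, -0.0021, 0.0017, -0.0049)
round(nee, 2)   # all collapse to exactly 0
any(nee == 0)   # FALSE at full precision
```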

I would suggest:

  1. To include a check for zero fluxes (NEE, H, LE) during class initiation and warn the user. It is too late after gap-filling.
  2. I agree with your point 2, in case the user decides to proceed anyway.
  3. I would omit point 3. Users should not be able to set the uncertainty; it should only be computed. Users got a warning that the input included spurious records; they should just get an informative error during partitioning, as proposed in step 2.

I have never seen a sequence of constant non-zero flux values, so I would not consider that issue in the code.

bgctw commented 5 years ago

During class initiation, there is no meaning attached to column names. Hence, it is difficult to implement these checks during class initiation. We could provide the user with an option to specify the column names to check, but only users already aware of the problem would use it. So we either need to defer these checks or adopt conventions on variable naming.

During gap-filling or flux partitioning the user specifies the column to fill or the NEE column name. This provides a place to check.

Instead of checking for zero fluxes, I suggest checking for runs of subsequent numerically equal fluxes. This addresses a more general set of problems, all of which cause improper uncertainty estimates.
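Such a check can be done with base R's run-length encoding; a minimal sketch (the function name and the threshold of 6 half-hours are only examples, not the implemented API):

```r
# Detect whether a flux series contains a run of numerically equal values
# of at least min_run records (after dropping NAs).
has_long_run <- function(x, min_run = 6) {
  x <- x[!is.na(x)]
  any(rle(x)$lengths >= min_run)
}

has_long_run(c(1.2, 0, 0, 0, 0, 0, 0, 3.1))  # TRUE: six equal zeros
has_long_run(c(1, 2, 3, 4, 5, 6, 7))          # FALSE: no repeats
```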

Point 3 (a user option for a minimum uncertainty) would provide a workaround for users to work with datasets that otherwise cause failures. The alternative treatments would be either setting all the data from such periods to missing or adding noise.

bgctw commented 5 years ago

Please have a look at the developments, still in a branch: #24. I provided a function to filter long runs of equal values before constructing the class, and implemented checks during gap-filling and daytime partitioning.

lsigut commented 5 years ago

I understood your solution only when I tried to develop the filter myself. I wasn't sure if you want to only detect the problem and warn the user, or if you now plan to automatically treat the data by removing repeated runs. I wanted to have a look at the raw data to see the reason for the repeated runs, but needed to mark those half-hours first. I tried to use diff(na.omit(x)) but realized this is good for registering the issue (if you look for zero diffs) but not so elegant for identifying the whole run (you miss the first one). On the other hand, rle() does not accept the omit class, so I might use a combination. I like your tests; they give an impression of what you consider (in)correct behaviour. I would say that NA removal before registering repeated runs is important. Consider this input, where I think you should register a run of 5 zeros:

x <- rep(c(0, NA), 5)

Due to the malfunctioning instruments such intermittent runs could be possible.
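The difference between the two approaches mentioned above can be shown on a short vector without NAs: diff() marks only the repeats, while rle() combined with inverse.rle() marks the complete run:

```r
# diff() flags only positions 3 and 4; the first value of the run (position 2)
# is missed. Expanding the rle() run lengths back flags the whole run.
x <- c(1, 0, 0, 0, 2)
which(diff(x) == 0) + 1          # 3 4: misses the start of the run
r <- rle(x)
flag <- inverse.rle(list(lengths = r$lengths, values = r$lengths >= 2))
which(flag)                      # 2 3 4: the complete run
```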

bgctw commented 5 years ago

The current version only checks for long runs and warns the user. Moreover, it offers a routine to remove the runs.

In your example, rep(c(0, NA), 5): if I remove NAs first and search for runs of length >= 6, I would miss this run. But probably it is a run of length 9 or length 10.

I tried to develop a solution where NAs interspersing runs count, e.g. c(5,NA,5,5) -> length 4. But neighboring NAs do not extend the run, e.g. c(5,5,NA,NA,NA,8) -> length 2 instead of length 5. Otherwise, I got many runs of NAs neighboring a single value: e.g. c(1,2,NA,NA,NA,-3) to me is probably not indicating a run.
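These semantics can be sketched by computing runs over the non-NA values only, and then measuring each run's span in the original vector, so that interior NAs count but boundary NAs do not (a sketch, not the branch implementation):

```r
# Run lengths where NAs *inside* a run of equal values extend it,
# but NAs at a run boundary do not.
run_lengths_with_na <- function(x) {
  idx <- which(!is.na(x))          # positions of non-NA values
  r <- rle(x[idx])                 # runs over the non-NA values
  last <- cumsum(r$lengths)        # end of each run (non-NA index space)
  first <- last - r$lengths + 1    # start of each run
  idx[last] - idx[first] + 1       # span in the original vector
}

run_lengths_with_na(c(5, NA, 5, 5))          # 4: interior NA counts
run_lengths_with_na(c(5, 5, NA, NA, NA, 8))  # 2 1: trailing NAs do not
run_lengths_with_na(c(1, 2, NA, NA, NA, -3)) # 1 1 1: no run
```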

lsigut commented 5 years ago

If you check runs of length >= 6, then yes. I meant if you check length >= 2. But rep(c(0, NA), 10) should produce a warning even with length >= 6.

lsigut commented 5 years ago

I modified your function and would use it for flagging runs with length >= 2:

flag_runs <- function(x, name_out, length = 2) { # x: numeric vector
  out <- rep(NA, length(x))
  not_NA <- !is.na(x)
  x <- x[not_NA]                           # drop NAs before run detection
  rl <- rle(x)$lengths                     # lengths of runs of equal values
  keep <- rl >= length                     # runs long enough to flag
  run_start <- (cumsum(rl) + 1 - rl)[keep]
  run_end <- cumsum(rl)[keep]
  df <- data.frame(run_start, run_end)
  runs <- unlist(apply(df, 1, function(x) x[1]:x[2]))  # indices within runs
  out[not_NA] <- 0                         # 0 = checked, not part of a run
  out[not_NA][runs] <- 2                   # 2 = part of a flagged run
  attributes(out) <- list(varnames = name_out, units = "-")
  return(out)
}
NEE <- rep(c(0, NA), 5)
flag_runs(NEE, "qc_NEE_runs")
lsigut commented 5 years ago

I did some testing on a specific dataset from our institute and have not actually seen any repeating runs. But I will keep the test and will add another one to check for fluxes too close to zero. This is actually quite typical for winter months, and standard QC does not catch these cases. Note that you can observe these also in DETha98:

library(REddyProc)
data(Example_DETha98)
runs <- which(flag_runs(Example_DETha98$NEE, "qc_NEE_runs") == 2)
Example_DETha98$NEE[runs]
low_cov <- which(abs(Example_DETha98$NEE) < 0.01)
Example_DETha98$NEE[low_cov]
table(runs %in% low_cov)

I find it interesting that these cases do not form runs. I assume that such records will not affect REddyProc algorithms, so it is not necessary to handle them. But these are hardly sensible measurements; they should be checked at the level of the raw data and excluded. In the case of your datasets, the zeros are likely reported due to rounding.
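A near-zero flux check of the kind described above could be flagged similarly to the runs (a sketch; the function name and the 0.01 threshold are only examples, matching the snippet above, not a REddyProc default):

```r
# Flag fluxes whose absolute value is below a threshold: 2 = suspect,
# 0 = passed; NA records are left unflagged (0).
flag_low_flux <- function(x, thr = 0.01) {
  ifelse(!is.na(x) & abs(x) < thr, 2L, 0L)
}

flag_low_flux(c(-0.5, 0.004, NA, 0))  # 0 2 0 2
```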