Suffix assignment to sGLFluxPartition output is not complete and leads to overwriting

lsigut commented 6 years ago

Hi, I am trying to do a workaround for the behaviour that is not allowing me to store all the parameters of sGLFluxPartition when run over multiple suffixes. If that is not too problematic due to the internal structure, it would be better if it can be corrected in the first place. If I use suffixes c('uStar', 'U05', 'U50', 'U95', 'UNone'), I will get some of the columns with suffixes after sExportResults, while some not. Some of them are identical for different runs, some of them not (PotRad, FP_Temp should not get suffixes as they are identical, also in sMRFluxPartition). Those variables without suffix will be overwritten, I will get only the results for 'UNone' and Warning message is shown:

Warning message:
In EC$sGLFluxPartition(TempVar.s = paste(FP_temp, "_f", sep = ""),  :
  replacing existing output columnsFP_VARnight, FP_VARday, NEW_FP_Temp, NEW_FP_VPD, FP_RRef_Night, FP_qc, FP_dRecPar, FP_errorcode, FP_GPP2000, FP_OPT_VPD, FP_OPT_NoVPD, FP_k, FP_beta, FP_alpha, FP_RRef, FP_E0, FP_k_sd, FP_beta_sd, FP_alpha_sd, FP_RRef_sd, FP_E0_sd, Reco_DT_uStar, GPP_DT_uStar, Reco_DT_uStar_SD, GPP_DT_uStar_SD

When I go through the suffix loop, resave the object and run sGLFluxPartition for 'uStar' suffix again I can check that content of these columns is not identical and thus should have respective suffix:

 [1] "Reco_DT_uStar_SD" "GPP_DT_uStar_SD"  "FP_VARnight"      "FP_VARday"        "FP_RRef_Night"   
 [6] "FP_qc"            "FP_dRecPar"       "FP_errorcode"     "FP_GPP2000"       "FP_k"            
[11] "FP_beta"          "FP_alpha"         "FP_RRef"          "FP_E0"            "FP_k_sd"         
[16] "FP_beta_sd"       "FP_alpha_sd"      "FP_RRef_sd"       "FP_E0_sd"

Please ignore "Reco_DT_uStar_SD", "GPP_DT_uStar_SD". These are probably result of some random number generation during SD reestimation for the second time. Also the warning message is complaining about rewriting of variables with 'uStar' suffix which is correct behaviour. You can notice that warning message is also mentioning collumns that are not finally exported. I would suggest to include suffix support for the above vector of variable names and relax the expectations for the unexported ones. I assume it will not affect the correct working of the method. Maybe you could also include suffixes in the internal collumns and get rid of warning() whatsoever - replacing original content with almost identical one should not cause issues. Thanks!

bgctw commented 6 years ago

Originally, REddyProc has been designed for a defined workflow. It is not recommended to run gapfilling or partitioning more than once with the same class, instead, instantiate a new object by sEddyProc$new from the same data. Currently, the code adds new columns to the results data.frame. This may cause conflicts when columns with the same name are existing. At some instances, the code checks for existing columns, deletes those, and issues the mentioned warning to notify the user that this is probably not what the user wanted. A redesign would require quite some work and needs to be thoroughly tested.

Hence, for repeated processing, please, instantiate a new object before, or use the functions that are not part of sEddyProc class. In the second case you need to take care yourself for storing the outputs. In your case you would use function partitionNEEGL.

Regarding the additional output columns: The many GPP_suffix and Reco_suffix columns, and the many columns of parameters and intermediate results are already confusing to some users. Hence, we decided to restrict the suffix to the final results. If you need quantiles of the intermediate output, you need to take care for the output yourself for now between runs with different suffixes.

If you make a strong case for a common use case, that requires the distribution of some specific intermediate results, we could negotiate to add suffixes for some of additional output columns.

lsigut commented 6 years ago

Thank you for explanations! For me the most important are the parameters of the light response curve as they can be used for further analysis. Though I do not need the quantiles of the parameters, I want to make sure that the one I want to use are not overwritten. I see that it could be difficult to implement. I add my workaround:

# Save colnames to a list after sMRFluxPartition method
colnames <- list()
colnames$FP_MR05 <- colnames(EddyProc.C$sExportResults())

# Lasslop et al. (2010)
suffixes <- c('uStar', 'U05', 'U50', 'U95', 'UNone')
FP_temp <- 'Tsoil'
saveRDS(EddyProc.C, file = "EddyProc.C.rds")
FP_GL10_out <- EddyProc.C$sExportResults()[, 0]
for (i in suffixes) {
  rm(EddyProc.C)
  EddyProc.C <- readRDS("EddyProc.C.rds")
  EddyProc.C$sGLFluxPartition(
    TempVar.s = paste(FP_temp, "_f", sep = ""), 
    QFTempVar.s = paste(FP_temp, "_fqc", sep = ""),
    Suffix.s = i)
  if (i == 'uStar') saveRDS(EddyProc.C, file = "EddyProc.C_GL10_uStar.rds")
  out <- EddyProc.C$sExportResults()
  cols <- colnames(out)
  cols_out <- cols[!cols %in% colnames$FP_MR05]
  out <- out[cols_out]
  corr_filter <- !cols_out %in% grep(paste(suffixes, collapse = "|"), 
                                     cols_out, value = TRUE)
  names(out)[corr_filter] <- paste(cols_out[corr_filter], i, sep = '_')
  FP_GL10_out <- cbind(FP_GL10_out, out)
}
rm(EddyProc.C)
EddyProc.C <- readRDS("EddyProc.C_GL10_uStar.rds")
colnames$FP_GL10 <- colnames(FP_GL10_out)

The output that I want is then kept in FP_GL10_out. I also did not notice before that the active binding in EddyProc.C leads to "cloning" when the object is resaved under different name. The results of computations done on the new object will be also transferred to the original one. Thus I better save the object as a file and reload it after each iteration. I hope it could also help to others.

bgctw commented 5 years ago

Hello @lsigut. Have a look at pull request #35. I added argument uStarScenKeep to sEddyProc_sApplyUStarScen and the bootstrapped partitioning methods. By default results for the first uStar threshold scenario (usually the non-bootstrapped) are kept. Please, comment on if this resolves your needs and on the interface (argument naming ...).

lsigut commented 5 years ago

Hello @progtw, thanks for having a look into it! I was just expecting that the solution would be adequate to nighttime-based partitioning where all scenario-specific variables are renamed using the suffix system. I was expecting that this would be the only needed step - to evaluate which columns are scenario specific, attach suffix and let those that are the same for all scenarios to rewrite without warning. All such outputs would be then possible to get when extracting results. But of course I do not know the internal design of sGLFluxPartition. I am not sure if it is not too complicated to use it now. The test_that example is not well readable for me. Some example in vignette could help!

bgctw commented 5 years ago

The complications for this proposal for the design are one issue, but the main I repeat:

The many GPP_suffix and Reco_suffix columns, and the many columns of parameters and intermediate results are already confusing to some users. Hence, we decided to restrict the suffix to the final results.

In hindsight, it might have been better also for the night-time partitioning to report only the gross fluxes with suffixes and the others outputs only for one scenario, but I do not want to change the interface now.

In your workaround you demonstrate how to take care of this using the Suffix argument for other ouptuts in your user-loop:

for (suffix in scenarios) {
  EProc$sGLFluxPartition(..., Suffix = suffix)
  myStorageOfCols[,suffix] <- EProc$sExportResults()[[interestingColName]]
}

But for your purpose, as I understood, you need only the output of a single specific uStar threshold scenario and this is now supported with the suggested PR directly.

In the PR (pull request) #35 I extended the example in the vignette uStarCases.Rmd by daytime partitioning and added some explanation.

lsigut commented 5 years ago

Great, thanks for the explanations, also for the vignette!

EarthyScience / REddyProc

Suffix assignment to sGLFluxPartition output is not complete and leads to overwriting #17