Incorporate daily aggregation functions from rSFSW2 into rSFSTEP2

kpalmqui commented 7 years ago

Currently, the daily SOILWAT2 output variables within rSFSTEP2 are saved for each DOY for every year the model runs (typically 300), which creates extremely large tables. We would like to incorporate the functions/code that already exist in rSFSW2 to create aggregated daily output for SOILWAT2 variables. These aggregated daily variables represent mean and standard deviation values for each DOY across all years (essentially resulting in 365 values per variable, instead of 109,500 for a 300-year run). This will greatly reduce the amount of daily data we are storing.

However, it would be nice to retain the ability to save all of the daily values for each year if the user wants to. We could use flags set by the user to specify what type of daily output they would like.

Karan will take the lead on this, with help from Kyle and Daniel.

dschlaep commented 7 years ago

Kyle and I talked last week about the task to aggregate daily SOILWAT2 output variables in rSFSTEP2. I suggested that the best approach may be to re-use code that we have already written and implemented in the R package rSFSW2. Thus, I suggest that rSFSTEP2 loads the rSFSW2 namespace and calls its aggregation functions. With this approach, we don’t need to copy-paste code and code maintenance should be easier.

However, rSFSW2 uses the R package version of SoilWat (rSOILWAT2), whereas rSFSTEP2 uses STEPWAT2, into which SOILWAT2 is integrated during compilation. Thus, the needs of rSFSW2 and rSFSTEP2 may differ substantially; I am not currently familiar enough with rSFSTEP2. Please explain to me the data objects you have from which to calculate mean daily values.

rSFSW2 offers the (currently) non-exported functions ‘get_Response_aggL’ and ‘get_SWPmatric_aggL’ (R/rSOILWAT2_DataAccess.R), which do some of the daily aggregations. Also, there is code (currently not in the form of a function, but see https://github.com/Burke-Lauenroth-Lab/rSFSW2/issues/62 for work in progress) in the function ‘do_OneSite’ (file R/Simulation_Run.R), beginning at if (prj_todos[["adaily"]][["N"]] > 0 && tasks$aggregate[sc] > 0L) {, which does most of the daily aggregations. Please tell me what type of daily aggregations you need to code.

I am pretty sure that the current version of rSFSW2 is not ready for you to use easily, but I hope that we can figure out how to make these daily aggregation functions usable for both rSFSW2 and rSFSTEP2.

dschlaep commented 7 years ago

This issue may need to be considered in conjunction with [STEPWAT2's issue #23](https://github.com/Burke-Lauenroth-Lab/STEPWAT2/issues/23) because of the data format and potential redundancies.

Potentially, it may be best/fastest if the C code of STEPPE does these aggregations 'on the go', i.e., as the daily values are coming in from SOILWAT2 (using so-called 'online' algorithms):

- the online mean is trivial
- the online SD can be calculated with the Welford algorithm

kpalmqui commented 7 years ago

Great points, Daniel; this interdependency is exactly what Karan and I were discussing yesterday. Before we move forward, we agreed we need to take a step back and assess where the daily aggregation should be done. The options:

- Daily values for SOILWAT2 variables are written to disk by STEPWAT2 (issue #23) and then aggregated by rSFSTEP2 afterwards.
- Daily output is aggregated within STEPWAT2 and then saved to disk when a flag is set that requests this type of output.

It is important to also keep in mind that these daily values will also be aggregated over iterations (each DOY value will represent the mean across all iterations that STEPWAT2 runs), which adds an additional layer of complexity. So essentially, if we want a mean value for each DOY then it will represent the mean across all years and all iterations that STEPWAT2 runs for.
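One possible way to pool over both years and iterations, sketched here in C, is to keep a Welford-style accumulator (count, mean, sum of squared deviations) per DOY within each iteration and then merge the accumulators across iterations with the standard pairwise combination formula. The type `RunningStat` and the function `rs_merge` are hypothetical names for this example, and the sketch assumes every year-by-iteration value carries equal weight.

```c
#include <stddef.h>

/* Accumulator for one DOY (hypothetical type): count, mean, and
 * sum of squared deviations from the mean. */
typedef struct {
    size_t n;
    double mean;
    double m2;
} RunningStat;

/* Merge accumulator b into a, so that a describes the pooled values
 * of both (e.g., all years of two iterations for the same DOY). */
void rs_merge(RunningStat *a, const RunningStat *b)
{
    if (b->n == 0) return;            /* nothing to add */
    if (a->n == 0) { *a = *b; return; }

    size_t n = a->n + b->n;
    double delta = b->mean - a->mean; /* difference of the two means */

    a->mean += delta * (double)b->n / (double)n;
    a->m2 += b->m2 + delta * delta * (double)a->n * (double)b->n / (double)n;
    a->n = n;
}
```

The pooled sample variance is then `m2 / (n - 1)` of the merged accumulator, which is the same value one would get by computing the variance over all years and iterations in a single pass.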

An additional question: should we calculate separate SD values that capture across year variability and across iteration variability for each DOY separately? Or a single SD value for each DOY that represents variability across both years and iterations?

dschlaep commented 7 years ago

> An additional question: should we calculate separate SD values that capture across year variability and across iteration variability for each DOY separately? Or a single SD value for each DOY that represents variability across both years and iterations?

It is not clear to me how to calculate SDs across iterations separately. Aren't you asking for a variance partitioning that estimates the percentage of variation attributed to years and to iterations based on the total pooled variance?

kpalmqui commented 7 years ago

- SD across years: SD for each DOY across all years for a single iteration
- SD across iterations: SD for each DOY across all iterations for a single year

This may not make sense.

dschlaep commented 7 years ago

> SD across iterations:

Wouldn’t you normally get different values for each selection of year?

For instance, let SDx = SD for each DOY across all iterations for year x of each iteration. Then, in general, SDx1 ≠ SDx2 for any two different years x1 and x2.

Which year x do you select then?

kpalmqui commented 7 years ago

Yes, you would get different values depending on the year. Similarly, you would get different values for different iterations. Perhaps it is best to calculate a single SD that represents variability across both iterations and years.

dschlaep commented 7 years ago

Your idea to separate SD across years from SD across iterations also has merit, however, if there are non-stationary response dynamics. Such temporal trends would inflate SD among years, but not SD among iterations.

Maybe we could calculate SD across years (and, if pooled, across iterations) only for the last S years of a simulation run? Where S = user input?

So what about something along the lines of:

Simulation output columns [(#rows) = (#iteration) x (#years) x (#doy)]:

iteration, year (or a stationary subset of the last S years), doy, value

Aggregation:
- Grand mean across all iterations and years (or a stationary subset) [(#rows) = (#doy)]:
  doy, mean, sd
- Variance partitioning by ANOVA's SS: value ~ iteration * years
  - SS(iteration) / SS(total) vs. SS(years) / SS(total)
  - with SS(doy) = SS(residual)? Or, instead of SS(total), use only SS(total) - SS(residual)?
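The sums-of-squares ratios outlined above could be computed along the following lines for a single DOY. This C sketch is one possible reading, not a settled method: it assumes a balanced layout with exactly one value per iteration-by-year cell, treats both factors as fixed main effects, and ignores any autocorrelation among years. With one value per cell, the iteration-by-year interaction cannot be separated from the residual, so both end up in `ss_resid`. The type `SSParts` and the function `partition_ss` are hypothetical names.

```c
#include <stddef.h>

/* Sums of squares for one DOY (hypothetical type). */
typedef struct {
    double ss_total;  /* total SS around the grand mean */
    double ss_iter;   /* SS of iteration means around the grand mean */
    double ss_year;   /* SS of year means around the grand mean */
    double ss_resid;  /* remainder: interaction plus residual */
} SSParts;

/* v: one DOY's values, n_iter x n_years, row-major, no missing cells. */
SSParts partition_ss(const double *v, size_t n_iter, size_t n_years)
{
    SSParts p = {0.0, 0.0, 0.0, 0.0};
    size_t n = n_iter * n_years;

    double grand = 0.0;
    for (size_t k = 0; k < n; k++)
        grand += v[k];
    grand /= (double)n;

    for (size_t k = 0; k < n; k++)
        p.ss_total += (v[k] - grand) * (v[k] - grand);

    for (size_t i = 0; i < n_iter; i++) {   /* main effect of iteration */
        double m = 0.0;
        for (size_t y = 0; y < n_years; y++)
            m += v[i * n_years + y];
        m /= (double)n_years;
        p.ss_iter += (double)n_years * (m - grand) * (m - grand);
    }

    for (size_t y = 0; y < n_years; y++) {  /* main effect of year */
        double m = 0.0;
        for (size_t i = 0; i < n_iter; i++)
            m += v[i * n_years + y];
        m /= (double)n_iter;
        p.ss_year += (double)n_iter * (m - grand) * (m - grand);
    }

    p.ss_resid = p.ss_total - p.ss_iter - p.ss_year;
    return p;
}
```

The shares attributed to iterations and to years would then be `ss_iter / ss_total` and `ss_year / ss_total`, or the same numerators over `ss_total - ss_resid` if the second variant is preferred.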

kpalmqui commented 7 years ago

Can you clarify what you mean by:

> Maybe we could calculate SD across years (and, if pooled, across iterations)

And by:

> Simulation output columns [(#rows) = (#iteration) x (#years) x (#doy)]: iteration, year (or a stationary subset of the last S years), doy, value

Are you saying you want to save output for each iteration for each DOY?

dschlaep commented 7 years ago

> Maybe we could calculate SD across years (and, if pooled, across iterations) only for the last S years of a simulation run? Where S = user input?

Sorry, the phrase 'years (and, if pooled, across iterations)' was confusing. The main point I attempted to make here was 'only for the last S years of a simulation run'.

> Simulation output columns [(#rows) = (#iteration) x (#years) x (#doy)]: iteration, year (or a stationary subset of the last S years), doy, value

I attempted to describe what values we would have to aggregate. I suggest not saving the 'simulation output', but instead only the 'aggregation' parts, which would be a table with #doy = 366 rows and 5 columns (doy, overall mean, total sd, % due to iteration, % due to years).

I don't think that my specific suggestion for the variance partitioning is correct. Maybe Rui has more insights (@2hua)? My questions are whether we treat years as nested in replication, and whether we consider the autocorrelative nature of years. Plus, it appears that we may want both years and iteration to be considered as random factors?

2hua commented 7 years ago

I have some experience calculating the SE of hourly temperatures over a full year. The solution was to treat one day as the step length: average the temperatures within each day, and then aggregate the SD or SE from the daily means to describe the yearly temperature variance. “SD across iterations: SD for each DOY across all iterations for a single year” seems more difficult to calculate than the SD across years.

In my opinion, maybe we should treat iteration as a random factor, treat years as a fixed factor, and nest day within year? Like VALUE~iteration*year(doy).
