@sknox01 The RF model previously included two custom variables, sin(DOY) and cos(DOY), which I removed from the RF routine. My reasoning is that these inputs are hidden from the user when they define the input parameters in the ThirdStage configuration. They can be put back if desired, but I think a more transparent approach would be to include any seasonality-based traces in the database (created in stage 2, or somewhere earlier in stage 3) and let the user choose whether to use them when defining the model inputs in the configuration file. Also, although it probably wouldn't make much of a difference, it may be better to describe seasonality with a more physically based variable such as solar declination, which is tied to the solstices, rather than DOY, which is slightly offset from the solstices and shifts with leap years.
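To make the comparison between the two seasonality encodings concrete, here is a minimal Python sketch. It is not code from ThirdStage.R: the function names are hypothetical, the sin/cos terms assume the usual cyclic encoding of DOY over a 365-day year, and the declination uses Cooper's simple approximation, which is only one of several available formulas.

```python
import math


def doy_features(doy, days_in_year=365):
    """Cyclic encoding of day of year (assumed form of the removed inputs)."""
    angle = 2.0 * math.pi * doy / days_in_year
    return math.sin(angle), math.cos(angle)


def solar_declination_deg(doy):
    """Approximate solar declination in degrees (Cooper's approximation)."""
    return 23.45 * math.sin(2.0 * math.pi * (284 + doy) / 365.0)


if __name__ == "__main__":
    # Compare both encodings near the equinoxes and solstices
    for doy in (81, 172, 264, 355):
        s, c = doy_features(doy)
        print(doy, round(s, 3), round(c, 3), round(solar_declination_deg(doy), 2))
```

The declination peaks near the solstices (around DOY 172 and 355), whereas sin(DOY) and cos(DOY) peak relative to the calendar year boundaries.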
Changes made with this pull request:
All existing site-years are now read into ThirdStage by default. However, ThirdStage will only write outputs to the database for years within the range startyear:endyear (if only startyear is provided, endyear = startyear), as per @znesic's request.
I made the database reading routine a bit more robust: it now skips empty traces instead of crashing. Instead of altering the existing read_database.R function, I added a simplified version of the routine directly to ThirdStage.R, so that it doesn't alter the expected behaviour for anyone using ReadDatabase.R for other purposes.
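The skip-instead-of-crash behaviour can be sketched roughly as follows in Python. This is an illustration only, not the routine added to ThirdStage.R: the function name is hypothetical, traces are assumed to be flat binary files of 4-byte floats, and "empty" is taken to mean a missing or zero-length file.

```python
import os
from array import array


def read_traces(folder, trace_names):
    """Read each trace as 32-bit floats, skipping missing or empty files.

    Returns (traces, skipped): a dict of name -> list of values, and the
    list of trace names that were skipped.
    """
    traces, skipped = {}, []
    for name in trace_names:
        path = os.path.join(folder, name)
        # Record and skip the trace rather than raising an error
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            skipped.append(name)
            continue
        values = array("f")
        with open(path, "rb") as fh:
            values.frombytes(fh.read())
        traces[name] = list(values)
    return traces, skipped
```

Returning the skipped names lets the caller log which traces were unavailable instead of failing the whole run.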
REddyProc will be given input data spanning Dec 1 (startyear - 1) to Feb 28/29 (endyear + 1) wherever possible, to maintain consistency with REddyProc season definitions. This ensures the uStar filtering routine has full seasons where they exist. However, data outside of startyear:endyear will not be altered in the database. Note that this could result in inconsistent application of uStar thresholds for December of previous years if past data are not re-run.
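As a rough illustration of the date logic described above, the following Python sketch computes the input window and checks whether a date falls in the years that get written. The function names are hypothetical and this is not the R code in ThirdStage.R.

```python
import calendar
from datetime import date


def processing_window(startyear, endyear=None):
    """Input window: Dec 1 of (startyear - 1) to the last day of Feb of (endyear + 1)."""
    if endyear is None:
        endyear = startyear  # a single year was requested
    if endyear < startyear:
        raise ValueError("endyear must not be earlier than startyear")
    last_feb_day = 29 if calendar.isleap(endyear + 1) else 28
    return date(startyear - 1, 12, 1), date(endyear + 1, 2, last_feb_day)


def in_write_range(day, startyear, endyear=None):
    """True if results for this date would be written to the database."""
    if endyear is None:
        endyear = startyear
    return startyear <= day.year <= endyear


if __name__ == "__main__":
    print(processing_window(2024))                     # padded input window
    print(in_write_range(date(2023, 12, 15), 2024))    # inside window, not written
```

Dates in the padded December and January-February months are used as input but fall outside the written range.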
The random forest routine has been set up to load pre-existing models when possible, to minimize run time. A setting was added to the global configuration file to automatically re-train the model every n months as new data become available. The procedure loads the model, checks when it was created, and re-trains if the model is older than n months. For now, the default interval is set to 1 month, but it can be changed as desired. When re-training, the RF model is trained on all available site-years. This increases run time when training, but ensures the model is given the most information available. As with all other outputs, only startyear:endyear results will be updated in the database. This could lead to disparate results between years, so in the future the re-training routine should probably be set up to write RF outputs for all years whenever re-training occurs. This would be a fairly straightforward change to implement, and could be done alongside a routine to save a copy (or compressed .zip file) of the previous ThirdStage RF outputs for backwards compatibility purposes. But we'll need to decide where to put any backups.
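The age check described above could look something like this Python sketch. It is an assumption-laden illustration rather than the actual routine: the function and parameter names are hypothetical, age is counted in whole calendar months, and a model is treated as due for re-training once at least n whole months have elapsed.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month has not fully elapsed yet
    return months


def needs_retraining(model_created, today, retrain_interval_months=1):
    """Decide whether to re-train or reuse the saved model."""
    if model_created is None:
        return True  # no saved model exists, so one must be trained
    return months_between(model_created, today) >= retrain_interval_months


if __name__ == "__main__":
    print(needs_retraining(date(2024, 1, 15), date(2024, 2, 14)))  # still fresh
    print(needs_retraining(date(2024, 1, 15), date(2024, 2, 15)))  # due
```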
I made some slight changes to the db_root.R script to allow more ways of defining the root of the database. This should maintain compatibility with whatever folks already have working, while leaving room for more explicit definitions later.
@znesic this shouldn't require any big changes for calling from Matlab; I didn't end up needing to change the input arguments. If you wanted to run ThirdStage for only 2024, the command line call from your Matlab script would be:
Rscript --vanilla C:/Biomet.net/R/database_functions/ThirdStage.R siteID 2024
If you wanted to run a range of years, e.g. 2020:2024, it would be:
Rscript --vanilla C:/Biomet.net/R/database_functions/ThirdStage.R siteID 2020 2024
Alternatively, you could have the Matlab script loop the call one year at a time. This would produce the same output, but with a modest increase in run time, because you'd end up re-processing the winter seasons multiple times (see the REddyProc input window described above).
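For reference, the year-by-year alternative amounts to issuing one single-year call per year. The Python sketch below only builds and prints those command lines; the helper name is hypothetical, the script path is the one used in the examples above, and the same loop would be written in Matlab in practice.

```python
SCRIPT = "C:/Biomet.net/R/database_functions/ThirdStage.R"


def thirdstage_commands(site_id, startyear, endyear, script=SCRIPT):
    """Build one single-year ThirdStage call per year in startyear:endyear."""
    return [
        ["Rscript", "--vanilla", script, site_id, str(year)]
        for year in range(startyear, endyear + 1)
    ]


if __name__ == "__main__":
    # Print the commands instead of running them
    for cmd in thirdstage_commands("siteID", 2020, 2024):
        print(" ".join(cmd))
```

Each of these calls reads its own padded winter months, which is where the extra run time comes from compared with a single 2020 2024 call.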