Tillotson Model -- Aligning methods

caitobrien commented 4 days ago

Hey @nickobeer & @JLGnotes,

Here's a running list of some of the discrepancies spotted and, following a meeting with Jenn, the course of action to align some of the methods. I am pulling from three sources: 1) Tillotson et al. 2022 methods, 2) NB shared Calibrate.R code, and 3) BOR shared code for track-a-cohort.

1. Ntrees: BOR selected 500, Tillotson mentions using the default but sets it at 300, and Calibrate.R has both 500 (train model) and 1000 (test model). Going forward, the default option with quantregForest() seems prudent. Looking through the source code, the default is ntree = 500. @nickobeer, please update the code in calibrate.R to match.
2. Train Data: as Jenn mentioned in the BOR meeting, we were hoping to clarify which training dataset to use for Track-a-cohort. It seems they are leaning towards including all years, so I am hoping to pull the data directly from DART to keep the process transparent. @nickobeer, I'm not entirely sure if you are already doing that for both the training and calibration part of your code. If you could confirm that would be great. Looking through all the training data sets across data sources, it appears that even with similar years, the values are different so it would be good to get the same data featured in all code/plots.
3. kNNImputation: Additionally, calibrate.R uses kNNimputation() to fill in missing covariate data, which isn't referenced in the paper. However, when trying to run it in R, it appears that the package is defunct. @nickobeer, could you determine if this is used in calibrate.R? And if so, is there a rationale to support using it in the track-a-cohort code as well?
4. Precipitation: precipitation values varied across all the code, but going forward, following Tillotson et al. 2022, a sum() should be used for the weekly values. Note that the weekly sum is a 5-day sum, not a 7-day sum. See Table 1 for reference. When Susannah is back, I will check to see if this data is already being brought in and if it could be made available in the DART River query. I assume this is the case because I believe that is how Nick is obtaining new data. @nickobeer, could you confirm you are getting a 5-day summed precipitation value? The code in Calibrate.R mentions a mean(), but if you are already getting a sum, that might be a factor.
5. Model selection: Tillotson does both a quantregForest() and then a binary regressionForest() along with leave-one-out cross validation. The code shared via BOR and in Calibrate.R includes only the quantregForest(). I plan to perform a simple comparison of the outputs and will share the results here for discussion. This will help us determine if using quantregForest() alone is sufficient, which is hinted at in the paper. Stand by for those results...
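
For the precipitation item above, here is a minimal base-R sketch of how a 5-day sum differs from a 7-day sum or a mean over the same days. The daily values and the name `daily_precip` are made up for illustration and are not from DART or Calibrate.R; treating the 5-day window as the last 5 days of the week is also an assumption.

```r
# Hypothetical daily precipitation for one week (illustrative values only)
daily_precip <- c(0.0, 0.2, 0.5, 0.0, 0.1, 0.3, 0.0)

# Assumption: the 5-day window is the last 5 days of the week
last5 <- tail(daily_precip, 5)

sum_5day  <- sum(last5)         # 5-day sum, the quantity discussed above
sum_7day  <- sum(daily_precip)  # 7-day sum, for comparison
mean_5day <- mean(last5)        # what mean() over the same 5 days would give

print(c(sum_5day = sum_5day, sum_7day = sum_7day, mean_5day = mean_5day))
```

The three numbers differ, so plots built from different code paths would only agree if every path uses the same window and the same summary function.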

Let me know if anything needs clarification.

nickobeer commented 4 days ago

here's the psql query to generate loss and salvage data frame

On Tue, Oct 8, 2024 at 10:35 AM caitobrien @.***> wrote:

Assigned #2 (https://github.com/Columbia-Basin-Research-CBR/track-a-cohort/issues/2) to @nickobeer.

caitobrien commented 4 days ago

recap from chat with NB:

2. Train Data: as Jenn mentioned in the BOR meeting, we were hoping to clarify which training dataset to use for Track-a-cohort. It seems they are leaning towards including all years, so I am hoping to pull the data directly from DART to keep the process transparent. @nickobeer, I'm not entirely sure if you are already doing that for both the training and calibration part of your code. If you could confirm that would be great. Looking through all the training data sets across data sources, it appears that even with similar years, the values are different so it would be good to get the same data featured in all code/plots.

Update to calibrate.R training dataset: The training data used for calibrate.R was sourced by Chris V. because the exact Tillotson training dataset could not be recreated. Therefore, the training dataset should match any data sourced from DART unless the data has been changed at the data source level. Currently, the train data from 1999:2000 in Calibrate.R is a static file, and each new year is appended. The file is read as: df.main <- read.csv(here("data-raw/AllYears.Intake.csv"))

4. Precipitation: precipitation values varied across all the code, but going forward, following Tillotson et al. 2022, a sum() should be used for the weekly values. Note that the weekly sum is a 5-day sum, not a 7-day sum. See Table 1 for reference. When Susannah is back, I will check to see if this data is already being brought in and if it could be made available in the DART River query. I assume this is the case because I believe that is how Nick is obtaining new data. @nickobeer, could you confirm you are getting a 5-day summed precipitation value? The code in Calibrate.R mentions a mean(), but if you are already getting a sum, that might be a factor.

@nickobeer and @caitobrien compared the precip data in the training datasets shared by BOR and used in Calibrate.R; they seem to match. In the postgres script used to source new data for calibrate.R, precip is a 5-day sum that is then converted to csf. Within calibrate.R, a mean of the sum returns the summed value for the week. @caitobrien to recreate the code for track-a-cohort and confirm that it matches AllYears.Intake.csv in Calibrate.R.
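
One possible reading of why the mean() in calibrate.R still returns the summed value: if the query already delivers a single 5-day sum per week, averaging that one value leaves it unchanged. The sketch below illustrates this in base R. The data frame and column names (`new_data`, `week`, `precip_5d_sum`) are hypothetical and are not taken from the postgres script or Calibrate.R.

```r
# Hypothetical query output: one row per week, precip already summed over 5 days
new_data <- data.frame(
  week          = c(1, 2, 3),
  precip_5d_sum = c(0.9, 0.0, 1.4)
)

# Weekly mean of a single, already-summed value per week
weekly <- aggregate(precip_5d_sum ~ week, data = new_data, FUN = mean)

# The mean of one value is that value, so the sums pass through unchanged
same_as_input <- isTRUE(all.equal(weekly$precip_5d_sum, new_data$precip_5d_sum))
print(same_as_input)
```

The same all.equal() style of comparison could be used when checking recreated track-a-cohort values against AllYears.Intake.csv, since it tolerates small floating-point differences.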

nickobeer commented 1 day ago

Typos on the Loss and Salvage page are corrected to properly identify the years of data that went into the calibrations for each of the species. These are: 1999-2020 or 2009-2020 for both Chinook and steelhead. The introductory text to the right when the page opens alludes to the 2009-2020 data set as the default. This was missing some context since the text with the radio buttons read either '1999-2024' or '2009-2024'.