edits for 02_fit_gradient_forests.ipynb

DrK-Lo commented 2 years ago

Some of the intro language is confusing. Is the purpose of the script to train the GF, make a prediction from the GF, or test the GF? For example:

"fit trained models from gradient forests to the climate of a transplant location"
"Given a trained gradient forest, fit model to input climate data, garden_data."
Also confusing: "fit to common gardens"

I think what I find confusing is the use of the word "fit", which usually refers to training, but here you use it for making a prediction.

The file directory and setup need explanation:

garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')

Maybe it was because I was not reviewing the final script, but the output from testing the fitting script wasn't what I was expecting:

tting/gradient_forests/fitting_outfiles/1231094_pooled_all_100_gradient_forest_offset.txt")

       offset
1    0.716106
2    0.716106
3    0.716105
4    0.716105
5    0.716105
...       ...
96   0.000645
97   0.000509
98   0.000360
99   0.000221
100  0.000000

100 rows × 1 columns

Why is this 100 rows x 1 column? Each source location (100) should have an offset prediction to a different common garden (100), so I was expecting a 100 x 100 matrix.

Let's make sure to plot some visualizations of the CI curves for each environment and overall CI

brandonlind commented 2 years ago

> Some of the intro language is confusing. Is the purpose of the script to train the GF, make a prediction from the GF, or test the GF? For example:
>
> "fit trained models from gradient forests to the climate of a transplant location"
> "Given a trained gradient forest, fit model to input climate data, garden_data."
> Also confusing: "fit to common gardens"
>
> I think what I find confusing is the use of the word "fit", which usually refers to training, but here you use it for making a prediction.

The way I've been using train/fit is this: I use the genetic/climate data to train a model (e.g., GF), then I use future climate to fit that model (this gives a prediction). After fitting comes validation, where I test the fitted model (i.e., the prediction) using known fitness.

> The file directory and setup need explanation:
>
> garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
> fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
> training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')

Note that I use a different directory structure for the command line script (see the README in 01_src: https://github.com/ModelValidationProgram/MVP-offsets/tree/practice/01_src), which I think makes more sense than this layout. In this layout there is a fitting directory and a training directory, each with subfolders for GF, RONA, etc. In the new layout for the command line scripts, the method is the top layer, and fitting/training/validation are subdirectories of the method.

Notebook directory layout (note that a lot of the time I print full paths when the code actually saves things, so I don't have to do mental gymnastics with the op.join / op.dirname stuff):

DIR
|
|------ fitting
|           |
|           |------ gradient_forests
|           |           |------garden_files   # garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
|           |           |
|           |           |------fitting_outfiles   # fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
|
|------ training
|           |
|           |------ gradient_forests
|           |           |------training_outfiles   # training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')
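The layout above can be reproduced with a few lines of Python. This is only a sketch: `makedir` here is a hypothetical stand-in for the notebook's helper (assumed to create a directory and return its path), and `DIR` is a throwaway temporary directory rather than the project's real root.

```python
import os
import os.path as op
import tempfile


def makedir(path):
    """Hypothetical stand-in for the notebook's makedir: create `path` if needed, return it."""
    os.makedirs(path, exist_ok=True)
    return path


# DIR is a temporary directory here; the notebook defines its own DIR.
DIR = tempfile.mkdtemp()

garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
# training_outdir is only joined, not created, mirroring the quoted code.
training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')

# Print each path relative to DIR so the layout is visible at a glance.
for name, path in [('garden_dir', garden_dir),
                   ('fitting_dir', fitting_dir),
                   ('training_outdir', training_outdir)]:
    print(f'{name:16s} DIR/{op.relpath(path, DIR)}')
```

Printing the resolved paths makes it clear that `fitting_dir` is a sibling of `garden_dir`, which is easy to miss when reading the nested `op.dirname` call.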

> Maybe it was because I was not reviewing the final script, but the output from testing the fitting script wasn't what I was expecting:
>
> tting/gradient_forests/fitting_outfiles/1231094_pooled_all_100_gradient_forest_offset.txt")
>
>        offset
> 1    0.716106
> 2    0.716106
> 3    0.716105
> 4    0.716105
> 5    0.716105
> ...       ...
> 96   0.000645
> 97   0.000509
> 98   0.000360
> 99   0.000221
> 100  0.000000
>
> 100 rows × 1 columns
>
> Why is this 100 rows x 1 column? Each source location (100) should have an offset prediction to a different common garden (100), so I was expecting a 100 x 100 matrix.

This is because I parallelized the fitting to each common garden, so I haven't combined the fitted files (and I can therefore read them in parallel as well). The file you referenced contains the offsets to garden 100 (100 is in the filename). Note also that I almost always use row names instead of default indices (it looks like the output shows the first 5 and last 5 rows of the file). I'm always scared indices will get shuffled, so I use names for look-ups etc. Here the row names are subpopIDs.
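For anyone who does want the 100 x 100 view, the per-garden files could be stitched together along these lines. This is a sketch under assumptions, not the project's code: it assumes each file is tab-separated with a header row, a subpopID column, and an offset column, and that the garden ID is the token just before `_gradient_forest_offset.txt` (inferred from the single filename above). It uses only the standard library and made-up toy values.

```python
import csv
import glob
import os.path as op
import tempfile

SUFFIX = '_gradient_forest_offset.txt'


def garden_id(path):
    """Pull the garden ID from a name like 1231094_pooled_all_100_gradient_forest_offset.txt."""
    return op.basename(path)[:-len(SUFFIX)].split('_')[-1]


def read_offsets(path):
    """Read one per-garden file into {subpopID: offset} (assumed: tab-separated, one header row)."""
    with open(path, newline='') as handle:
        rows = csv.reader(handle, delimiter='\t')
        next(rows)  # skip the header row
        return {subpop: float(offset) for subpop, offset in rows}


def combine(fitting_dir):
    """Build {subpopID: {gardenID: offset}}: source populations as rows, gardens as columns."""
    matrix = {}
    for path in sorted(glob.glob(op.join(fitting_dir, '*' + SUFFIX))):
        garden = garden_id(path)
        for subpop, offset in read_offsets(path).items():
            matrix.setdefault(subpop, {})[garden] = offset
    return matrix


# Toy demonstration: 3 populations x 3 gardens with made-up offsets.
fitting_dir = tempfile.mkdtemp()
for garden in ['1', '2', '3']:
    name = f'0000_pooled_all_{garden}{SUFFIX}'
    with open(op.join(fitting_dir, name), 'w', newline='') as handle:
        writer = csv.writer(handle, delimiter='\t')
        writer.writerow(['subpopID', 'offset'])
        for subpop in ['1', '2', '3']:
            writer.writerow([subpop, abs(int(subpop) - int(garden)) * 0.1])

matrix = combine(fitting_dir)
print(len(matrix), 'rows x', len(matrix['1']), 'columns')
```

Keying both dimensions by name (subpopID and garden ID) rather than by position keeps the look-ups safe even if files or rows are read in a different order.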

> Let's make sure to plot some visualizations of the CI curves for each environment and overall CI

TODO

DrK-Lo commented 2 years ago

Sounds good!

The only thing is that I'm not sure everyone will understand the use of "fit" with calculating a prediction. People are more familiar with "fit" in this context:

"Model fitting is a measure of how well a machine learning model (https://www.datarobot.com/wiki/model/) generalizes to similar data to that on which it was trained (https://www.datarobot.com/wiki/training-validation-holdout/). A model that is well-fitted produces more accurate outcomes. A model that is overfitted (https://www.datarobot.com/wiki/overfitting/) matches the data too closely. A model that is underfitted (https://www.datarobot.com/wiki/underfitting/) doesn’t match closely enough."

What do you think of this for a common language:

use the genetic & climate data to "train" a model
use the future/transplant climate and the model to "calculate an offset prediction"
use the ground-truth fitness and offset prediction to "evaluate" the model (not "validation"; when you read the AREES review you will understand the difference)
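To make the three proposed terms concrete, here is a toy sketch with one function per stage. It is not gradient forests: the "model" is a crude stand-in (one weight per climate variable), all data are made up, and every function name is hypothetical. It only shows which inputs go into which stage.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def train(allele_freqs, climate):
    """'Train': weight each climate variable by |r| with allele frequency (toy stand-in for a GF)."""
    n_vars = len(climate[0])
    return [abs(pearson(allele_freqs, [row[j] for row in climate])) for j in range(n_vars)]


def calculate_offset(weights, home_climate, garden_climate):
    """'Calculate an offset prediction': weighted Euclidean distance from home to garden climate."""
    return math.sqrt(sum((w * (h - g)) ** 2
                         for w, h, g in zip(weights, home_climate, garden_climate)))


def evaluate(offsets, fitness):
    """'Evaluate': correlate the predicted offsets with ground-truth fitness."""
    return pearson(offsets, fitness)


# Made-up data: 4 source populations, 2 climate variables.
climate = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 6.0]]
allele_freqs = [0.1, 0.3, 0.6, 0.9]
garden = [3.0, 6.0]             # transplant climate (same as the last population's home)
fitness = [0.2, 0.5, 0.7, 1.0]  # made-up fitness of each population in that garden

model = train(allele_freqs, climate)                                    # genetic + climate data
offsets = [calculate_offset(model, home, garden) for home in climate]  # model + transplant climate
score = evaluate(offsets, fitness)                                      # prediction + known fitness
print(offsets, score)
```

In this toy setup the population whose home climate matches the garden gets an offset of zero, and a negative `score` means larger predicted offsets go with lower fitness.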

From: Brandon Lind @.> Sent: 18 April 2022 18:02 To: ModelValidationProgram/MVP-offsets @.> Cc: Lotterhos, Katie @.>; Author @.> Subject: Re: [ModelValidationProgram/MVP-offsets] edits for 02_fit_gradient_forests.ipynb (Issue #14)


DrK-Lo / MVP-offsets

edits for 02_fit_gradient_forests.ipynb #14