Closed DrK-Lo closed 2 years ago
- Some of the intro language is confusing: is the purpose of the script to train the GF, make a prediction from the GF, or test the GF? For example:
"fit trained models from gradient forests to the climate of a transplant location"
"Given a trained gradient forest, fit model to input climate data, garden_data."
"fit to common gardens"
I think what I find confusing is the use of the word "fit" here. It usually refers to training, but you use it here for making a prediction?
The way I've been using train/fit is like this: I use the genetic/climate data to train a model (e.g., GF), then I use future climate to fit that model (this gives a prediction). After fitting comes validation, where I test the fitted model (i.e., where I test the prediction) using known fitness.
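To make the three terms concrete, here is a minimal sketch of train / fit / validate as used in this reply. It is not the notebook's code: a least-squares regression stands in for gradient forests, and the function names (`train`, `fit`, `validate`), the toy climate values, allele frequencies, and fitness values are all made up for illustration.

```python
import numpy as np


def _design(env):
    """Add an intercept column to a (n_pops, n_env) climate matrix."""
    return np.column_stack([np.ones(len(env)), env])


def train(env, freqs):
    """Training: learn a climate -> allele frequency mapping (stand-in for GF)."""
    coef, *_ = np.linalg.lstsq(_design(env), freqs, rcond=None)
    return coef


def fit(coef, home_env, garden_env):
    """'Fitting' in this thread's sense: apply the trained model to a garden's
    climate and return one predicted offset per source population."""
    home = _design(home_env) @ coef
    garden = _design(garden_env[None, :]) @ coef
    return np.linalg.norm(home - garden, axis=1)


def validate(offset, fitness):
    """Validation: compare the prediction with known fitness in the garden."""
    return np.corrcoef(offset, fitness)[0, 1]


# Toy data (hypothetical): 5 populations, 1 climate variable, 2 loci.
env = np.arange(5, dtype=float).reshape(-1, 1)
freqs = np.column_stack([0.1 + 0.2 * env[:, 0], 0.9 - 0.2 * env[:, 0]])

model = train(env, freqs)                    # train
offsets = fit(model, env, env[4])            # fit to the garden at population 5's site
fitness = 1.0 - 0.2 * np.abs(env[:, 0] - 4)  # made-up fitness measured in that garden
r = validate(offsets, fitness)               # validate
print(offsets, r)
```

In this toy case the population that already lives at the garden site gets an offset of zero, and larger offsets go with lower fitness.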
- The file directory and setup need explanation:
garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')
Note I use a different directory structure for the command line script (see README in 01_src) that I think makes more sense than this layout. In this layout there is a fitting directory and a training directory where each has subfolders regarding GF, RONA etc. In the new layout for the command line scripts, the method is the top layer and fitting/training/validation are subdirectories of the method.
Notebook directory layout (note that a lot of the time I print out full paths when the code actually saves things, so I don't have to do mental gymnastics with the op.join / op.dirname stuff):
DIR
|
|------ fitting
|       |
|       |------ gradient_forests
|               |------ garden_files       # garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
|               |
|               |------ fitting_outfiles   # fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
|
|------ training
        |
        |------ gradient_forests
                |------ training_outfiles  # training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')
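As a rough illustration of how the three path variables map onto this layout, the sketch below rebuilds them in a throwaway directory and prints each path relative to DIR. The `makedir` defined here is a hypothetical stand-in (create the directory, return the path); the notebook's own helper may behave differently, and the temporary DIR is only a placeholder for the real project directory.

```python
import os
import os.path as op
import tempfile


def makedir(path):
    """Hypothetical stand-in for the notebook's makedir helper:
    create the directory if it is missing and return the path."""
    os.makedirs(path, exist_ok=True)
    return path


DIR = tempfile.mkdtemp()  # placeholder for the real project directory

# Same three assignments as in the notebook.
garden_dir = makedir(op.join(DIR, 'fitting/gradient_forests/garden_files'))
fitting_dir = makedir(op.join(op.dirname(garden_dir), 'fitting_outfiles'))
training_outdir = op.join(DIR, 'training/gradient_forests/training_outfiles')

# Print paths relative to DIR so the layout is visible at a glance.
for name, path in [('garden_dir', garden_dir),
                   ('fitting_dir', fitting_dir),
                   ('training_outdir', training_outdir)]:
    print(f'{name}: {op.relpath(path, DIR)}')
```

Note that `op.dirname(garden_dir)` resolves to `fitting/gradient_forests`, so `fitting_outfiles` ends up as a sibling of `garden_files`. `training_outdir` is only joined, not created, because the fitting notebook reads from it.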
- Maybe it was because I was not reviewing the final script, but the output from testing the fitting script wasn't what I was expecting tting/gradient_forests/fitting_outfiles/1231094_pooled_all_100_gradient_forest_offset.txt")
       offset
1    0.716106
2    0.716106
3    0.716105
4    0.716105
5    0.716105
...       ...
96   0.000645
97   0.000509
98   0.000360
99   0.000221
100  0.000000

100 rows × 1 columns
Why is this 100 rows x 1 column? Each source location (100) should have an offset prediction to a different common garden (100), so I was expecting a 100 x 100 matrix.
This is because I parallelized the fitting to each common garden, so I haven't combined the fitted files (and I can therefore read them in parallel as well). So the file you referenced contains offsets to garden 100 (100 is in the filename). Note also that I almost always use row names instead of default indices (it looks like the output is the first 5 and last 5 rows of the file). I'm always scared indices will get shuffled, so I almost always use names for look-ups etc. Here the row names are subpopIDs.
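For anyone who does want the 100 x 100 view, here is a hedged sketch of stitching the per-garden files into one source-by-garden table with pandas. It assumes each file is tab-delimited with subpopIDs as row names and a single `offset` column, and that the garden ID is the token just before `_gradient_forest_offset.txt` in the filename. The delimiter, the `combine_offsets` and `garden_id` helpers, and the toy files are assumptions for illustration, not the notebook's code.

```python
import os.path as op
import tempfile
from glob import glob

import pandas as pd

SUFFIX = '_gradient_forest_offset.txt'


def garden_id(path):
    """Garden ID = last underscore-separated token before the suffix (assumed)."""
    return op.basename(path)[:-len(SUFFIX)].split('_')[-1]


def combine_offsets(fitting_dir):
    """Rows are subpopIDs, columns are gardens; alignment is by row name."""
    columns = {}
    for path in sorted(glob(op.join(fitting_dir, '*' + SUFFIX))):
        df = pd.read_csv(path, sep='\t', index_col=0)
        df.index = df.index.astype(str)  # treat subpopIDs as names, not positions
        columns[garden_id(path)] = df['offset']
    return pd.DataFrame(columns)


# Toy files (made-up values): 3 source populations, 3 gardens.
toy_dir = tempfile.mkdtemp()
toy = {
    '1': {'1': 0.0, '2': 0.3, '3': 0.7},
    '2': {'3': 0.4, '1': 0.3, '2': 0.0},  # rows deliberately in a different order
    '3': {'2': 0.4, '3': 0.0, '1': 0.7},
}
for garden, offsets_to_garden in toy.items():
    out = pd.DataFrame({'offset': offsets_to_garden})
    out.to_csv(op.join(toy_dir, f'toy_pooled_all_{garden}{SUFFIX}'), sep='\t')

offsets = combine_offsets(toy_dir)
print(offsets.loc['2', '3'])  # offset of subpop 2 to garden 3
```

Because each column is aligned on subpopID rather than row position, a file whose rows come back in a different order still lands in the right cells, which is the reason given above for using row names.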
- Let's make sure to plot some visualizations of the CI curves for each environment and overall CI
TODO
Sounds good!
The only thing is that I'm not sure everyone will understand the use of "fit" with calculating a prediction. People are more familiar with "fit" in this context:
"Model fitting is a measure of how well a machine learning model (https://www.datarobot.com/wiki/model/) generalizes to similar data to that on which it was trained (https://www.datarobot.com/wiki/training-validation-holdout/). A model that is well-fitted produces more accurate outcomes. A model that is overfitted (https://www.datarobot.com/wiki/overfitting/) matches the data too closely. A model that is underfitted (https://www.datarobot.com/wiki/underfitting/) doesn't match closely enough."
What do you think of this for a common language:
From: Brandon Lind @.> Sent: 18 April 2022 18:02 To: ModelValidationProgram/MVP-offsets @.> Cc: Lotterhos, Katie @.>; Author @.> Subject: Re: [ModelValidationProgram/MVP-offsets] edits for 02_fit_gradient_forests.ipynb (Issue #14)