emilycantrell / stork_oracle_cbs


Set up infrastructure for running on OSSC #4

Open emilycantrell opened 3 months ago

emilycantrell commented 3 months ago

Emily will work on this.

Questions Emily will ask Flavio and Malte:

emilycantrell commented 3 months ago

NOTES

These are notes from a working meeting, not fully cleaned up.

The current steps in the code are:

Steps 1, 2, and 3 just need to be run once. Then the files in step 4 will be run multiple times, with different sample sizes, subgroups, etc.

The arguments that control the way step 4 is run should correspond to some of the column names in the export file, e.g.:

A function might look something like this pseudocode:

for (each train x selection x model combination) {
  fitted <- fit(train = x, selection = x, model = x)
  for (eval_data in c(x, y, z)) { # different eval data sets correspond to different subgroups
    get_score(fitted, eval_data) # takes a fitted model and a given eval set & calculates various performance statistics
  }
}
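
For concreteness, a minimal runnable R sketch of that loop (fit_one_model() and get_score() below are hypothetical stand-ins, not the project's actual functions, and only two of the three loop dimensions are shown for brevity):

# Hypothetical stand-ins for the real fitting and scoring steps
fit_one_model <- function(train, model) list(train = train, model = model)
get_score <- function(fitted, eval_sample) data.frame(model = fitted$model, train = fitted$train, eval = eval_sample, auc = NA)

train_samples <- c("train_sample_n_100", "train_sample_n_1000")
models <- c("catboost", "xgboost", "elastic_net")
eval_samples <- c("evaluation_sample_n_1000", "male_sample_n_1000", "female_sample_n_1000")

results <- list()
for (train in train_samples) {
  for (model in models) {
    fitted <- fit_one_model(train = train, model = model)
    for (eval_sample in eval_samples) {
      # score the fitted model on each eval set (i.e., each subgroup)
      results[[length(results) + 1]] <- get_score(fitted, eval_sample)
    }
  }
}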

Think about whether any data prep steps that are currently embedded in the modeling files can be moved to a separate step (if they are shared between model types).

msalganik commented 3 months ago

@emilycantrell Awesome. When you talk to Flavio and Malte over email, please cc Lisa. She likes to have visibility into that, at least at the beginning.

I hope that Cruijff can learn from Stork Oracle about OSSC, and vice versa.

emilycantrell commented 2 months ago

Proposed spec for running code on OSSC:

@HanzhangRen and I discussed the infrastructure for running code on OSSC. Here is a proposed spec. We want to set things up so that once Flavio and Malte transfer the code to OSSC, all they have to do is press one button. We propose having two main files that control all the other files: jobfile and run_all. @msalganik @jcperdomo we'd love your feedback on this.

jobfile

jobfile is a file in which we specify various arguments to control the major characteristics of the job (any characteristics related to central concepts we want to test, or that are critical to controlling the computational size of the job). The arguments include which samples to use (in terms of sample size and demographic subgroup), which sampling files to use (i.e., which seed was used in sampling), which models to use (including the specific file version we want), and perhaps other details, TBD. Arguments can be listed as a vector, and the job will run all combinations of the vectors. For example:

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv, pmt_train_and_evaluation_samples_seed_2.csv)
modeling_files = c(catboost_2024-08-27.R, xgboost_2024-08-26.R, elastic_net_2024-08-26.R)
train_samples = c(train_sample_n_100, train_sample_n_1000, train_sample_n_2000, train_sample_n_3000)
evaluation_samples = c(evaluation_sample_n_1000, male_sample_n_1000, female_sample_n_1000)

(We would not necessarily want to run this specific combination of train & eval samples; this is just an example of the format.)
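
As a hedged sketch of how "all combinations of the vectors" could be enumerated in R (using base R's expand.grid; the argument names follow the example above):

# Enumerate every combination of the jobfile vectors
jobs <- expand.grid(
  sampling_file = c("pmt_train_and_evaluation_samples_seed_1.csv",
                    "pmt_train_and_evaluation_samples_seed_2.csv"),
  modeling_file = c("catboost_2024-08-27.R", "xgboost_2024-08-26.R", "elastic_net_2024-08-26.R"),
  train_sample = c("train_sample_n_100", "train_sample_n_1000",
                   "train_sample_n_2000", "train_sample_n_3000"),
  evaluation_sample = c("evaluation_sample_n_1000", "male_sample_n_1000", "female_sample_n_1000"),
  stringsAsFactors = FALSE
)
nrow(jobs) # 2 x 3 x 4 x 3 = 72 combinations to run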

run_all

run_all is a file that specifies the order in which the other files should be run. Files will be run in this order:

1) jobfile

Steps 1-4 will just need to be run once for a given job. Then steps 5, 6, and 7 will be looped over multiple times as needed, based on the inputs in the jobfile.
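
A rough sketch of the shape run_all could take; the intermediate script names below are hypothetical placeholders, not the project's actual file names:

# run_all: one-time steps first, then the looped steps
source("jobfile.R") # step 1: defines the job arguments, including the combinations to run

source("step2_placeholder.R") # steps 2-4: run once per job
source("step3_placeholder.R")
source("step4_placeholder.R")

for (i in seq_len(nrow(jobs))) { # steps 5-7: looped over the jobfile combinations
  job <- jobs[i, ]
  source("step5_placeholder.R", local = TRUE)
  source("step6_placeholder.R", local = TRUE)
  source("step7_placeholder.R", local = TRUE)
}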

miscellaneous notes

emilycantrell commented 2 months ago

Here is an updated list of arguments to put in the jobfile, which Tommy and I discussed this afternoon while working on the spec for the results file that will be exported. Next to each argument, I wrote examples of what the argument could contain.

This is not yet finalized; it's just our most recent draft.

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv)
models = c(catboost, xgboost, elastic_net)
train_samples = c(train_sample_n_100, train_sample_n_1000)
selection_samples = c(eval_selection, eval_selection_female, eval_selection_male)
test_samples = c(eval_test, eval_test_female, eval_test_male)
feature_sets = list(
    c(persoontab, householdbus, prefer_train), 
    c(persoontab, householdbus), 
    c(sex_and_birthyear)
    )
metrics_for_selecting_winning_hyperparameters = c(logloss, AUC) 

catboost_file = catboost_2024-08-27.R
xgboost_file = xgboost_2024-08-26.R 
elastic_net_file = elastic_net_2024-08-26.R

Maybe: include file paths to data files (this is only necessary if the file paths differ in OSSC vs. CBS; make sure to think about train.csv when we figure out this detail)

Open question: Should we always choose winning hyperparameter values based on logloss (or some other metric that we want to use 100% of the time for winner selection)? Or do we want to be able to specify in the jobfile which metric to use for choosing the winner? (Above, I included the line metrics_for_selecting_winning_hyperparameters in case we want this.)
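
For illustration, selecting the winner based on a jobfile-specified metric could look roughly like this (the hyperparameter_results data frame and its values are hypothetical):

# Pick the winning hyperparameter draw based on the metric named in the jobfile
metric <- "logloss" # e.g., taken from metrics_for_selecting_winning_hyperparameters

hyperparameter_results <- data.frame( # hypothetical search results, one row per draw
  draw = 1:3,
  logloss = c(0.41, 0.38, 0.44),
  AUC = c(0.71, 0.74, 0.69)
)

# Lower is better for logloss; higher is better for AUC
best_row <- if (metric == "logloss") which.min(hyperparameter_results$logloss) else which.max(hyperparameter_results[[metric]])
winner <- hyperparameter_results[best_row, ]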

emilycantrell commented 2 months ago

Discussed 2024-08-29 (Tommy, Emily, and Malte meeting):

Discussed 2024-08-29 (Tommy & Emily meeting):

emilycantrell commented 2 months ago

Discussed 2024-08-29 (Stork Oracle meeting):

msalganik commented 2 months ago

Also, @varunsatish, given our conversation this afternoon, you might want to read this before the meeting on Friday. It's very related to what we were talking about.

varunsatish commented 2 months ago

@emilycantrell Jobfile is an awesome idea. We are going to use something like this in Cruijff too!

emilycantrell commented 2 months ago

Meeting notes 2024-08-24 (EC HR VS MS):

Think about how to best track which hyperparameter grids were used in a certain job. Do we want a separate file with hyperparameter grids? That would allow us to track versioning on the grids separately from versioning on other model code changes.

emilycantrell commented 2 months ago

Tommy & Emily discussed 2024-09-01: If we need control over the size of the jobfile due to export rules (i.e., cost of number of cells), add a jobfile option to specify whether non-winning hyperparameter rows should be saved. (If we want extra control, we could make this even more specific, e.g., save top 30% best hyperparameter draws)

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv)
models = c(catboost, xgboost, elastic_net)
train_samples = c(train_sample_n_100, train_sample_n_1000)
selection_samples = c(eval_selection, eval_selection_female, eval_selection_male)
test_samples = c(eval_test, eval_test_female, eval_test_male)
feature_sets = list(
    c(persoontab, householdbus, prefer_train), 
    c(persoontab, householdbus), 
    c(sex_and_birthyear)
    )
metrics_for_selecting_winning_hyperparameters = c(logloss, AUC) 
save_only_winning_hyperparameter_draw_results = FALSE

catboost_file = catboost_2024-08-27.R
xgboost_file = xgboost_2024-08-26.R 
elastic_net_file = elastic_net_2024-08-26.R
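
As a rough illustration of the new save_only_winning_hyperparameter_draw_results flag, the export step could filter rows like this (the results data frame and its winning_draw column are hypothetical):

save_only_winning_hyperparameter_draw_results <- FALSE # from the jobfile above

# Hypothetical results: one row per hyperparameter draw, with the winner flagged
results <- data.frame(draw = 1:3, logloss = c(0.41, 0.38, 0.44), winning_draw = c(FALSE, TRUE, FALSE))

if (save_only_winning_hyperparameter_draw_results) {
  results_to_export <- subset(results, winning_draw) # keep only the winning draw
} else {
  results_to_export <- results # keep all hyperparameter draws
}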

HanzhangRen commented 2 months ago

Think about how to best track which hyperparameter grids were used in a certain job. Do we want a separate file with hyperparameter grids? That would allow us to track versioning on the grids separately from versioning on other model code changes.

I thought about this a little. My personal preference for how to handle hyperparameter grids is to include them as part of a nested list in the jobfile. You would have a list of models, each model would have a list of hyperparameters, and each hyperparameter would have a list of values. It feels nice to have everything we iterate over in one place in the jobfile. It would be ideal if, when the code is finished, there is only one file we ever need to edit.
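
For concreteness, a sketch of that nested-list structure (the hyperparameter names and candidate values here are illustrative, not the final grids):

# models -> hyperparameters -> candidate values, all inside the jobfile
hyperparameter_grids <- list(
  catboost = list(
    depth = c(4, 6, 8),
    learning_rate = c(0.03, 0.1)
  ),
  xgboost = list(
    max_depth = c(3, 6),
    eta = c(0.05, 0.1, 0.3)
  ),
  elastic_net = list(
    alpha = c(0, 0.5, 1),
    lambda = c(0.001, 0.01, 0.1)
  )
)

# Example: expand one model's grid into every value combination
catboost_grid <- expand.grid(hyperparameter_grids$catboost)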

emilycantrell commented 2 months ago

My personal preference for how to handle hyperparameter grids is to include them as part of a nested list in the jobfile. You would have a list of models, each model would have a list of hyperparameters, and each hyperparameter would have a list of values. It feels nice to have everything we iterate over in one place in the jobfile. It would be ideal if, when the code is finished, there is only one file we ever need to edit.

My personal preference would be to put the grid in a separate file from the jobfile, because I think versioning would be easiest this way. However, I don't think there is an objective right answer, so I am willing to defer to you on this @HanzhangRen. Let's do the first draft of the infrastructure by putting the grid in the jobfile as you described. Then we'll probably keep it that way; however, I'd like to figure out how all the other pieces fit together and have a draft of the setup before we 100% finalize this decision, if that sounds okay to you?

Note: Above, we wrote: "jobfile is a file in which we specify various arguments to control the major characteristics of the job (any characteristics related to central concepts we want to test, or that are critical to controlling the computational size of the job)." I think when we originally wrote this, we were trying to come up with a definition that justifies NOT including the hyperparameter grid. Hyperparameter values are not central concepts we want to test. However, they do affect the computational size of the job. So by our description of the jobfile above, the grid can reasonably be included in the jobfile.

emilycantrell commented 2 months ago

Tommy drafted an overview of what run_all will look like. Now that we have the overview, next steps are:

emilycantrell commented 2 months ago

Emily work on code to check that the format of a given jobfile is correct

I put a draft of this in the commit above. The goal of this code is to ensure that when we start a job, the jobfile is in the correct format and only contains valid entries, so that we don't waste time on a jobfile that won't work. This will be especially important when we submit jobfiles for Flavio and Malte to run while we are in the U.S.
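
A minimal sketch of the kind of check this validation code performs (the argument names and allowed values below are illustrative; the actual draft is in the commit referenced above):

# Check that a jobfile only contains recognized entries before submitting the job
allowed_models <- c("catboost", "xgboost", "elastic_net")

validate_jobfile <- function(jobfile) {
  errors <- character(0)
  if (!all(jobfile$models %in% allowed_models)) {
    errors <- c(errors, "models contains an entry that is not a recognized model")
  }
  if (!all(grepl("\\.csv$", jobfile$sampling_files))) {
    errors <- c(errors, "sampling_files must all be .csv files")
  }
  if (length(errors) > 0) stop(paste(errors, collapse = "\n"))
  invisible(TRUE)
}

# Example usage, with the jobfile represented as a named list
validate_jobfile(list(
  models = c("catboost", "xgboost"),
  sampling_files = c("pmt_train_and_evaluation_samples_seed_1.csv")
))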

@HanzhangRen Once you do the next export of the code from the CBS environment, I'll read your code and make any necessary adjustments to this jobfile validation code. Then at some point I will want to talk with you about a few details of the validation file, but that can wait until after you are back in the U.S.

This jobfile validation code can be imported into the CBS environment at some point if desired, but it will mainly be used outside of the CBS environment, since we will write jobfiles outside of the CBS environment and validate them before sending them to Flavio and Malte.