emilycantrell / stork_oracle_cbs


Set up infrastructure for running on OSSC #4

Open emilycantrell opened 3 months ago

emilycantrell commented 3 months ago

Emily will work on this.

Questions Emily will ask Flavio and Malte:

emilycantrell commented 3 months ago

NOTES

These are notes from a working meeting, not fully cleaned up.

The current steps in the code are:

Steps 1, 2, and 3 just need to be run once. Then the files in step 4 will be run multiple times, with different sample sizes, subgroups, etc.

The arguments that control the way step 4 is run should correspond to some of the column names in the export file, e.g.:

A function might look something like this pseudocode:

for (each train x selection x model combination) {
  fitted <- fit(train = x, selection = x, model = x)
  for (eval_data in c(x, y, z)) { # different eval data sets correspond to different subgroups
    get_score(fitted, eval_data) # takes a fitted model and a given eval set & calculates various performance statistics
  }
}
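
For concreteness, a minimal runnable R sketch of that loop (fit_one_model() and get_score() below are hypothetical stand-ins, not the project's actual functions, and only two of the three loop dimensions are shown for brevity):

# Hypothetical stand-ins for the real fitting and scoring steps
fit_one_model <- function(train, model) list(train = train, model = model)
get_score <- function(fitted, eval_sample) data.frame(model = fitted$model, train = fitted$train, eval = eval_sample, auc = NA)

train_samples <- c("train_sample_n_100", "train_sample_n_1000")
models <- c("catboost", "xgboost", "elastic_net")
eval_samples <- c("evaluation_sample_n_1000", "male_sample_n_1000", "female_sample_n_1000")

results <- list()
for (train in train_samples) {
  for (model in models) {
    fitted <- fit_one_model(train = train, model = model)
    for (eval_sample in eval_samples) {
      # score the fitted model on each eval set (i.e., each subgroup)
      results[[length(results) + 1]] <- get_score(fitted, eval_sample)
    }
  }
}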

Think about whether any data prep steps that are currently embedded in the modeling files can be moved to a separate step (if they are shared between model types).

msalganik commented 3 months ago

@emilycantrell Awesome. When you talk to Flavio and Malte over email, please cc Lisa. She likes to have visibility into that, at least at the beginning.

I hope that Cruijff can learn from Stork Oracle about OSSC, and vice versa.

emilycantrell commented 2 months ago

Proposed spec for running code on OSSC:

@HanzhangRen and I discussed the infrastructure for running code on OSSC. Here is a proposed spec. We want to set things up so that once Flavio and Malte transfer the code to OSSC, all they have to do is press one button. We propose having two main files that control all the other files: jobfile and run_all. @msalganik @jcperdomo we'd love your feedback on this.

jobfile

jobfile is a file in which we specify various arguments to control the major characteristics of the job (any characteristics related to central concepts we want to test, or that are critical to controlling the computational size of the job). The arguments include which samples to use (in terms of sample size and demographic subgroup), which sampling files to use (i.e., which seed was used in sampling), which models to use (including the specific file version we want), and perhaps other details, TBD. Arguments can be listed as a vector, and the job will run all combinations of the vectors. For example:

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv, pmt_train_and_evaluation_samples_seed_2.csv)
modeling_files = c(catboost_2024-08-27.R, xgboost_2024-08-26.R, elastic_net_2024-08-26.R)
train_samples = c(train_sample_n_100, train_sample_n_1000, train_sample_n_2000, train_sample_n_3000)
evaluation_samples = c(evaluation_sample_n_1000, male_sample_n_1000, female_sample_n_1000)

(We would not necessarily want to run this specific combination of train & eval samples; this is just an example of the format.)
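
As a hedged sketch of how "all combinations of the vectors" could be enumerated in R (using base R's expand.grid; the argument names follow the example above):

# Enumerate every combination of the jobfile vectors
jobs <- expand.grid(
  sampling_file = c("pmt_train_and_evaluation_samples_seed_1.csv",
                    "pmt_train_and_evaluation_samples_seed_2.csv"),
  modeling_file = c("catboost_2024-08-27.R", "xgboost_2024-08-26.R", "elastic_net_2024-08-26.R"),
  train_sample = c("train_sample_n_100", "train_sample_n_1000",
                   "train_sample_n_2000", "train_sample_n_3000"),
  evaluation_sample = c("evaluation_sample_n_1000", "male_sample_n_1000", "female_sample_n_1000"),
  stringsAsFactors = FALSE
)
nrow(jobs) # 2 x 3 x 4 x 3 = 72 combinations to run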

run_all

run_all is a file that specifies the order in which the other files should be run. Files will be run in this order:

1) jobfile

Steps 1-4 will just need to be run once for a given job. Then steps 5, 6, and 7 will be looped over multiple times as needed, based on the inputs in the jobfile.
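
A rough sketch of the shape run_all could take; the intermediate script names below are hypothetical placeholders, not the project's actual file names:

# run_all: one-time steps first, then the looped steps
source("jobfile.R") # step 1: defines the job arguments, including the combinations to run

source("step2_placeholder.R") # steps 2-4: run once per job
source("step3_placeholder.R")
source("step4_placeholder.R")

for (i in seq_len(nrow(jobs))) { # steps 5-7: looped over the jobfile combinations
  job <- jobs[i, ]
  source("step5_placeholder.R", local = TRUE)
  source("step6_placeholder.R", local = TRUE)
  source("step7_placeholder.R", local = TRUE)
}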

miscellaneous notes

emilycantrell commented 2 months ago

Here is an updated list of arguments to put in the jobfile, which Tommy and I discussed this afternoon while working on the spec for the results file that will be exported. Next to each argument, I wrote examples of what the argument could contain.

This is not yet finalized; it's just our most recent draft.

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv)
models = c(catboost, xgboost, elastic_net)
train_samples = c(train_sample_n_100, train_sample_n_1000)
selection_samples = c(eval_selection, eval_selection_female, eval_selection_male)
test_samples = c(eval_test, eval_test_female, eval_test_male)
feature_sets = list(
    c(persoontab, householdbus, prefer_train), 
    c(persoontab, householdbus), 
    c(sex_and_birthyear)
    )
metrics_for_selecting_winning_hyperparameters = c(logloss, AUC) 

catboost_file = catboost_2024-08-27.R
xgboost_file = xgboost_2024-08-26.R 
elastic_net_file = elastic_net_2024-08-26.R

Maybe: include file paths to data files (this is only necessary if the file paths differ in OSSC vs. CBS; make sure to think about train.csv when we figure out this detail)

Open question: Should we always choose winning hyperparameter values based on logloss (or some other metric that we want to use 100% of the time for winner selection)? Or do we want to be able to specify in the jobfile which metric to use for choosing the winner? (Above, I included the line metrics_for_selecting_winning_hyperparameters in case we want this.)
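
For illustration, selecting the winner based on a jobfile-specified metric could look roughly like this (the hyperparameter_results data frame and its values are hypothetical):

# Pick the winning hyperparameter draw based on the metric named in the jobfile
metric <- "logloss" # e.g., taken from metrics_for_selecting_winning_hyperparameters

hyperparameter_results <- data.frame( # hypothetical search results, one row per draw
  draw = 1:3,
  logloss = c(0.41, 0.38, 0.44),
  AUC = c(0.71, 0.74, 0.69)
)

# Lower is better for logloss; higher is better for AUC
best_row <- if (metric == "logloss") which.min(hyperparameter_results$logloss) else which.max(hyperparameter_results[[metric]])
winner <- hyperparameter_results[best_row, ]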

emilycantrell commented 2 months ago

Discussed 2024-08-29 (Tommy, Emily, and Malte meeting):

Discussed 2024-08-29 (Tommy & Emily meeting):

emilycantrell commented 2 months ago

Discussed 2024-08-29 (Stork Oracle meeting):

msalganik commented 2 months ago

Also, @varunsatish, given our conversation this afternoon, you might want to read this before the meeting on Friday. It's very related to what we were talking about.

varunsatish commented 2 months ago

@emilycantrell Jobfile is an awesome idea. We are going to use something like this in Cruijff too!

emilycantrell commented 2 months ago

Meeting notes 2024-08-24 (EC HR VS MS):

Think about how to best track which hyperparameter grids were used in a certain job. Do we want a separate file with hyperparameter grids? That would allow us to track versioning on the grids separately from versioning on other model code changes.

emilycantrell commented 2 months ago

Tommy & Emily discussed 2024-09-01: If we need control over the size of the jobfile due to export rules (i.e., cost of number of cells), add a jobfile option to specify whether non-winning hyperparameter rows should be saved. (If we want extra control, we could make this even more specific, e.g., save top 30% best hyperparameter draws)

sampling_files = c(pmt_train_and_evaluation_samples_seed_1.csv)
models = c(catboost, xgboost, elastic_net)
train_samples = c(train_sample_n_100, train_sample_n_1000)
selection_samples = c(eval_selection, eval_selection_female, eval_selection_male)
test_samples = c(eval_test, eval_test_female, eval_test_male)
feature_sets = list(
    c(persoontab, householdbus, prefer_train), 
    c(persoontab, householdbus), 
    c(sex_and_birthyear)
    )
metrics_for_selecting_winning_hyperparameters = c(logloss, AUC) 
save_only_winning_hyperparameter_draw_results = FALSE

catboost_file = catboost_2024-08-27.R
xgboost_file = xgboost_2024-08-26.R 
elastic_net_file = elastic_net_2024-08-26.R
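
As a rough illustration of the new save_only_winning_hyperparameter_draw_results flag, the export step could filter rows like this (the results data frame and its winning_draw column are hypothetical):

save_only_winning_hyperparameter_draw_results <- FALSE # from the jobfile above

# Hypothetical results: one row per hyperparameter draw, with the winner flagged
results <- data.frame(draw = 1:3, logloss = c(0.41, 0.38, 0.44), winning_draw = c(FALSE, TRUE, FALSE))

if (save_only_winning_hyperparameter_draw_results) {
  results_to_export <- subset(results, winning_draw) # keep only the winning draw
} else {
  results_to_export <- results # keep all hyperparameter draws
}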

HanzhangRen commented 2 months ago

Think about how to best track which hyperparameter grids were used in a certain job. Do we want a separate file with hyperparameter grids? That would allow us to track versioning on the grids separately from versioning on other model code changes.

I thought about this a little. My personal preference for how to handle hyperparameter grids is to include them as part of a nested list in the jobfile. You would have a list of models, each model would have a list of hyperparameters, and each hyperparameter would have a list of values. It feels nice to have everything we iterate over in one place in the jobfile. It would be ideal if, when the code is finished, there is only one file we ever need to edit.
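
For concreteness, a sketch of that nested-list structure (the hyperparameter names and candidate values here are illustrative, not the final grids):

# models -> hyperparameters -> candidate values, all inside the jobfile
hyperparameter_grids <- list(
  catboost = list(
    depth = c(4, 6, 8),
    learning_rate = c(0.03, 0.1)
  ),
  xgboost = list(
    max_depth = c(3, 6),
    eta = c(0.05, 0.1, 0.3)
  ),
  elastic_net = list(
    alpha = c(0, 0.5, 1),
    lambda = c(0.001, 0.01, 0.1)
  )
)

# Example: expand one model's grid into every value combination
catboost_grid <- expand.grid(hyperparameter_grids$catboost)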

emilycantrell commented 2 months ago

My personal preference for how to handle hyperparameter grids is to include them as part of a nested list in the jobfile. You would have a list of models, each model would have a list of hyperparameters, and each hyperparameter would have a list of values. It feels nice to have everything we iterate over in one place in the jobfile. It would be ideal if, when the code is finished, there is only one file we ever need to edit.

My personal preference would be to put the grid in a separate file from the jobfile, because I think versioning would be easiest this way. However, I don't think there is an objective right answer, so I am willing to defer to you on this @HanzhangRen. Let's do the first draft of the infrastructure by putting the grid in the jobfile as you described. Then we'll probably keep it that way; however, I'd like to figure out how all the other pieces fit together and have a draft of the setup before we 100% finalize this decision, if that sounds okay to you?

Note: Above, we wrote: "jobfile is a file in which we specify various arguments to control the major characteristics of the job (any characteristics related to central concepts we want to test, or that are critical to controlling the computational size of the job)." I think when we originally wrote this, we were trying to come up with a definition that justifies NOT including the hyperparameter grid. Hyperparameter values are not central concepts we want to test. However, they do affect the computational size of the job. So by our description of the jobfile above, the grid can reasonably be included in the jobfile.

emilycantrell commented 2 months ago

Tommy drafted an overview of what run_all will look like. Now that we have the overview, next steps are:

emilycantrell commented 2 months ago

Emily work on code to check that the format of a given jobfile is correct

I put a draft of this in the commit above. The goal of this code is to ensure that when we start a job, the jobfile is in the correct format and only contains valid entries, so that we don't waste time on a jobfile that won't work. This will be especially important when we submit jobfiles for Flavio and Malte to run while we are in the U.S.
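
A minimal sketch of the kind of check this validation code performs (the argument names and allowed values below are illustrative; the actual draft is in the commit referenced above):

# Check that a jobfile only contains recognized entries before submitting the job
allowed_models <- c("catboost", "xgboost", "elastic_net")

validate_jobfile <- function(jobfile) {
  errors <- character(0)
  if (!all(jobfile$models %in% allowed_models)) {
    errors <- c(errors, "models contains an entry that is not a recognized model")
  }
  if (!all(grepl("\\.csv$", jobfile$sampling_files))) {
    errors <- c(errors, "sampling_files must all be .csv files")
  }
  if (length(errors) > 0) stop(paste(errors, collapse = "\n"))
  invisible(TRUE)
}

# Example usage, with the jobfile represented as a named list
validate_jobfile(list(
  models = c("catboost", "xgboost"),
  sampling_files = c("pmt_train_and_evaluation_samples_seed_1.csv")
))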

@HanzhangRen Once you do the next export of the code from the CBS environment, I'll read your code and make any necessary adjustments to this jobfile validation code. Then at some point I will want to talk with you about a few details of the validation file, but that can wait until after you are back in the U.S.

This jobfile validation code can be imported into the CBS environment at some point if desired, but it will mainly be used outside of the CBS environment, since we will write jobfiles outside of the CBS environment and validate them before sending them to Flavio and Malte.