This PR adds flexibility for training different missense badness and MPC models depending on the input training transcript set and model components, plus some miscellaneous updates.
Major changes:
MPC-related:
Addition of capacity to train MPC models on different input training transcript sets (e.g. all training transcripts vs. training transcripts from a specific fold) and model components
E.g. bringing in missense badness scores trained on a specific training set in construction of an MPC model for that specific training set
Turning MPC-related resource definitions into functions that can return different paths for different input training transcript sets and model components
Addition of support for utils functions in the pipeline file calculate_mpc.py
Addition of capacity to train MPC models either from a specific input formula or by fitting models on mix-and-match combinations of a set of input variables/components and keeping the one with the minimum AIC.
The multiple functionalities previously contained within run_regressions have now been separated into run_glm (for just fitting a single regression), get_min_aic_model (to get the model with the minimum AIC from mix-and-matching an input set of variables) and run_regressions (the top-level function).
Addition of capacity to annotate MPC scores onto a given Table of input variants using the MPC release HT (containing all VEP context missense variants and their MPC scores) or computing scores from a saved model without using a prewritten release HT. This allows us to try out different models without the burden of writing out the entire length of the context HT for each model.
Restructuring of calculate_fitted_scores for code efficiency and flexibility for different models
Removal of the version of the MPC release Table that has multiple rows for variants in multiple transcripts. Only the "deduplicated" version of this Table is now retained.
This necessitated some changes to how fitted scores are annotated in calculate_fitted_scores, i.e. handling of what happens when a variant is located in multiple transcripts.
Addition of a suffix label on temporary files (temp_label) generated during the process of model generation to avoid conflicting file writes. (This will be further updated in later commits to be converted to a subfolder structure.)
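As an illustration of the run_glm / get_min_aic_model split described above, here is a minimal sketch of mix-and-match model selection by minimum AIC. It is not the code in this PR: the function names candidate_formulas and get_min_aic_formula, the fit_aic callback, and the variable names in the usage note are all hypothetical, and the sketch only considers additive terms (the actual implementation may also try interactions).

```python
from itertools import combinations
from typing import Callable, Iterator, List, Optional, Tuple


def candidate_formulas(response: str, variables: List[str]) -> Iterator[str]:
    """Yield one additive formula per non-empty subset of the input variables."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield f"{response} ~ " + " + ".join(subset)


def get_min_aic_formula(
    response: str,
    variables: List[str],
    fit_aic: Callable[[str], float],
) -> Optional[Tuple[str, float]]:
    """Return the (formula, AIC) pair with the lowest AIC.

    fit_aic is a hypothetical callback standing in for a single-model fit
    (the role run_glm plays in the PR): it takes a formula string, fits the
    model, and returns its AIC.
    """
    best: Optional[Tuple[str, float]] = None
    for formula in candidate_formulas(response, variables):
        aic = fit_aic(formula)
        # Keep the first formula seen, then replace it only on a strictly lower AIC.
        if best is None or aic < best[1]:
            best = (formula, aic)
    return best
```

A caller would pass something like get_min_aic_formula("pathogenic", ["oe", "misbad", "polyphen"], fit_aic), where fit_aic wraps whatever regression fit is in use. Keeping the single fit behind a callback mirrors the separation between fitting one regression and searching over many.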
Missense badness-related:
Removal of code related to creation of missense badness models with anything other than training transcripts, i.e. testing transcripts or validation transcripts from a specific fold. This code was mistakenly added in previous PRs.
Code efficiency improvements in prepare_amino_acid_ht and calculate_misbad
Minor changes:
Addressing code from previous PRs:
Addition of a temporary save point for an aggregation that had been made for logging purposes on the HC LoF filter. This update means that the aggregation does not have to be performed every time prepare_amino_acid_ht is run (e.g., useful when this function is rerun with different training transcript sets).
Merging of functions returning paths to training, validation, and test transcripts in reference_data.py
Removal of the overwrite_output parameter which was a vestige from previous experimentation with RMC regions. All output resources are now always written out (overwrite=True). This and the points regarding line spacing and documentation should be the only changes made to constraint.py and regional_constraint.py (code for RMC regions and not missense badness or MPC).
Addition of subfolder structure to paths for missense badness and MPC resources in rmc.py, so that resources corresponding to models trained on a particular transcript set are located in the same folder.
Formatting/spacing for various lines
Updates to documentation for various function/argparse arguments
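To illustrate the change from fixed resource definitions to path-returning functions with a per-transcript-set subfolder, here is a minimal sketch. Everything in it is an assumption for illustration: the root path, the function names, the folder naming scheme, and the file names do not come from rmc.py.

```python
from typing import Optional

# Hypothetical root; the real bucket and prefix are defined in the repo's resources.
RESOURCE_ROOT = "gs://example-bucket/rmc"


def training_set_folder(fold: Optional[int] = None) -> str:
    """Return the subfolder name for a training transcript set.

    fold=None means all training transcripts; an integer selects the
    training transcripts from that fold.
    """
    return "train_all" if fold is None else f"train_fold{fold}"


def misbad_path(fold: Optional[int] = None) -> str:
    """Path to the missense badness table for the given training set."""
    return f"{RESOURCE_ROOT}/{training_set_folder(fold)}/misbad.ht"


def mpc_model_path(fold: Optional[int] = None, model_label: str = "default") -> str:
    """Path to a saved MPC model for the given training set and model label."""
    return f"{RESOURCE_ROOT}/{training_set_folder(fold)}/mpc_model_{model_label}.pkl"
```

With this layout, the missense badness scores and the MPC model trained on the same transcript set resolve to the same folder, so an MPC model for a given fold can pick up the matching missense badness table by passing the same fold argument.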