What
- Implement cross-validation-like training (`pipelines/cv_training`)
  - train the model on n-1 folds of the samples
  - compute the burdens on the held-out fold using this model
  - repeat this for all n folds
  - for this, allow a sample file to be passed to `dense_gt.py`
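The fold logic described above can be sketched as follows. This is a minimal illustration, not the code in `pipelines/cv_training`: `make_folds`, `cross_validated_burdens`, `train_fn` and `burden_fn` are hypothetical names, and the toy stand-ins at the bottom only check that no sample receives a burden from a model that saw it during training.

```python
import random


def make_folds(sample_ids, n_folds, seed=0):
    """Shuffle the sample IDs and split them into n_folds disjoint folds."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    # Striding yields folds whose sizes differ by at most one.
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validated_burdens(sample_ids, n_folds, train_fn, burden_fn):
    """Train on n-1 folds, compute burdens for the held-out fold, repeat for all folds."""
    folds = make_folds(sample_ids, n_folds)
    burdens = {}
    for held_out_idx, held_out in enumerate(folds):
        train_samples = [
            s for i, fold in enumerate(folds) if i != held_out_idx for s in fold
        ]
        model = train_fn(train_samples)  # hypothetical training step
        burdens.update(burden_fn(model, held_out))  # hypothetical burden step
    return burdens


# Toy stand-ins: the "model" is just the set of samples it was trained on.
def toy_train(train_samples):
    return frozenset(train_samples)


def toy_burden(model, held_out_samples):
    # Record whether the model saw each held-out sample during training.
    return {s: (s in model) for s in held_out_samples}


samples = [f"s{i}" for i in range(10)]
burdens = cross_validated_burdens(samples, 5, toy_train, toy_burden)
```

In this sketch, writing each fold's sample IDs to a file would give the kind of sample file that the last sub-item says can be passed to `dense_gt.py`.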
- Allow different orders of samples in `phenotype_df` and `genotypes.h5` in `dense_gt.py`
  - so far, samples in `phenotype_df` and `genotypes.h5` had to be in the same order
  - this is changed by introducing an additional index map for `genotypes.h5`, which retrieves samples in the order of `self.samples`
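The idea of the index map can be illustrated with plain Python lists standing in for the HDF5 datasets. This is only a sketch under assumptions: `build_index_map` and `rows_in_requested_order` are hypothetical helpers, not functions from `dense_gt.py`.

```python
def build_index_map(h5_sample_ids, requested_samples):
    """Return, for each requested sample, its row position in the genotype file."""
    position = {sample_id: i for i, sample_id in enumerate(h5_sample_ids)}
    missing = [s for s in requested_samples if s not in position]
    if missing:
        raise KeyError(f"samples not found in genotype file: {missing}")
    return [position[s] for s in requested_samples]


def rows_in_requested_order(rows, index_map):
    """Fetch genotype rows in the order given by the index map."""
    return [rows[i] for i in index_map]


# Genotype file order differs from the phenotype order.
h5_sample_ids = ["s3", "s1", "s2"]
genotype_rows = [[3, 3], [1, 1], [2, 2]]  # one row per entry of h5_sample_ids
phenotype_order = ["s1", "s2", "s3"]      # plays the role of self.samples

index_map = build_index_map(h5_sample_ids, phenotype_order)
reordered = rows_in_requested_order(genotype_rows, index_map)
```

Note that when reading directly from an h5py dataset, index lists generally have to be in increasing order, so a real implementation would likely sort the indices for the read and restore the requested order afterwards.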
- Average burdens from multiple repeats and run association testing afterwards (`deeprvat/associate.py`), as opposed to running the association testing on each cv individually
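Averaging before testing can be pictured as an element-wise mean over the repeat axis. The sketch below uses nested lists and a hypothetical `average_burdens` helper; the actual layout of the burden arrays in `deeprvat/associate.py` may differ.

```python
def average_burdens(burdens_per_repeat):
    """Element-wise mean over repeats.

    burdens_per_repeat[r][s][g] is the burden of sample s for gene g in repeat r
    (an assumed layout, chosen for illustration).
    """
    n_repeats = len(burdens_per_repeat)
    averaged = []
    # zip(*...) groups the rows of the same sample across all repeats.
    for sample_rows in zip(*burdens_per_repeat):
        # The inner zip groups the values of the same gene across repeats.
        averaged.append([sum(values) / n_repeats for values in zip(*sample_rows)])
    return averaged


repeats = [
    [[0.0, 1.0], [2.0, 3.0]],  # repeat 0: two samples, two genes
    [[1.0, 1.0], [4.0, 5.0]],  # repeat 1
]
mean_burdens = average_burdens(repeats)
```

Association testing is then run once on `mean_burdens` instead of once per repeat.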
- Restructure the snakefiles, continuing what had already been started in the main branch
  - require baseline results only for training phenotypes
- Update `evaluate.py`
  - no repeats required any more
  - make Bonferroni correction the default multiple testing correction
  - don't combine baseline discoveries with DeepRVAT discoveries
- Also re-test the 'seed genes', since, thanks to the CV-based training procedure, we no longer evaluate on the same sample-gene combinations we trained on
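For reference, the Bonferroni correction that is now the default multiplies each p-value by the number of tests, capped at 1. The function below is a generic sketch with a hypothetical name and an assumed significance level of 0.05; it is not the code in `evaluate.py`.

```python
def bonferroni(pvalues, alpha=0.05):
    """Return Bonferroni-adjusted p-values and a significance flag per test."""
    n_tests = len(pvalues)
    # Multiply by the number of tests and cap at 1.
    adjusted = [min(1.0, p * n_tests) for p in pvalues]
    significant = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, significant


pvalues = [0.001, 0.02, 0.5, 0.04]
adjusted, significant = bonferroni(pvalues)
```

With four tests, only the first p-value stays below the threshold after correction.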
- Use additional covariates `age2` and `age*sex` and correct for statin usage
  - updated the example data to have these fields
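As an illustration of the derived covariates, the sketch below assumes that `age2` denotes age squared and that `age*sex` is the product of age and a numerically coded sex. The field names `age`, `sex` and `age_sex` and the helper `add_derived_covariates` are hypothetical; the statin correction is not covered here.

```python
def add_derived_covariates(record):
    """Return a copy of the record with the derived covariates added."""
    out = dict(record)
    out["age2"] = record["age"] ** 2  # assumed meaning: age squared
    out["age_sex"] = record["age"] * record["sex"]  # assumed: age-by-sex interaction
    return out


sample = {"age": 50, "sex": 1}
with_covariates = add_derived_covariates(sample)
```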
- Update example data to have bit sample ids in `genotypes.h5` and string sample ids in `phenotype_df`
- Implement conditional analysis for common variants (`pipelines/association_testing_control_for_common_variants.snakefile`)
Testing
- Tested quite extensively (many reasonable experiments run), but we still need to check why the GitHub tests currently fail