Make the covariate calculation part of RF

BauerLab commented 3 years ago

Implement covariates in vs hail interface function varspark.hail.random_forest_model with the interface analogous to hail's logistic_regression_rows() methos (see: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows)

Currently random_forest_model takes the covariates argument but it is ignored.

The argument should be a list of expression of type float64 (same as logistic_regression_rows) and these expressions should be evaluated and included as continuous variables for random forest model building along the ordered factor genotypic variables.

Because we want to keep the variant importance output table as is (that is indexed by locus and list of alleles) we cannot include the importances of covariates in it. Instead we should a method covariates_importance() on RandomForestModel that returns a table with the following schema:

'covariate': str - the name of the covariate (as in the input expression)
'importance':float64 - the importance
'splitCount':int64 - the split count

indexed with the 'covariate'.

So for example we can have a phenotype file in csv format like this (this is the modified hipster_lableles.txt):

samples,score,label, age, PC0,PC1, PC2
HG00096,6,0,64,0.332,0.444,0.45
HG00097,15.5,1,32,0.444,0.334,0.128 
HG00099,8,0,17,0.983,0.344,0.4343
...

Then the code with the use of covariates age, PC0 and PC1 should look like this:

    data = hl.import_vcf(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster.vcf.bgz'))
    labels = hl.import_table(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster_labels.txt'),
                             delimiter=',',
                             types=dict(label='float64', score='float64', age='float64', PC0='float64',
                                 PC1='float64',, PC2='float64'
                             )).key_by('samples')

    mt = data.annotate_cols(hipster=labels[data.s])
    print(mt.count())

    with vshl.random_forest_model(y=mt.hipster.label,
                                  x=mt.GT.n_alt_alleles(), 
                                  covariates = [mt.hipster.age, mt.hipster.PC0, mt.hipster.PC1],
                                  seed=13, mtry_fraction=0.05,
                                  min_node_size=5, max_depth=10) as rf_model:
        rf_model.fit_trees(100, 50)

        ...
        rf_model.covriate_importance().show() # display the covariate importance table

Output:

covariate importance splitCount
age         10.0          200
PC0         1.0             4
...

Things to consider:

in the first instance all covariates can be treated as continuous variables but in the future we could try to use more fine-grained mapping of the hail schema type to VariantSpark feature types which would for example allow to include some covariates as OrderedFactor or Factor(not implemented yet) variables.

Implementation notes

The examples in 'python` directory and possibly a notebook example should be created to demonstrate this new functionality.

The python example can be based on the examples/local_run-importance-ch22_with_pheno.sh. The transposed version of the data/chr22_1000_pheno-wide.csv may need to be created to support this (that should also include the classification response variable).

For the notebook datasets a more biologically relevant dataset should be used that possibly includes principal component analysis factors. Maybe hipster index dataset can be adapted for this (maybe with PC factors or some other random covariates that do not need to associated with the response)

piotrszul commented 2 years ago

Hi @lm-fsng what do you think about the spec above. In particular how would you see getting the importances of the covariates back (and is this even necessary)

piotrszul commented 2 years ago

Hi @lm-fsng I have included the changes we discussed. Please confirm that you are happy with the current spec. And also please follow up with Rob on whether the covariates need to be included in p-value caluclation.

lm-fsng commented 2 years ago

Hi @piotrszul and @rocreguant , just spoke to Rob about the covariates and the p-value calculation. He said that the continuous covariates would be systematically biased to be more important, which would 'push' the genotypes down on the variable importance scale. This would result in less significant SNPs if we include the covariates in the p-value calculation. The workaround that is to ignore the covariates and just use the variable importance of the SNPs in the p-value method.

aehrc / VariantSpark

Make the covariate calculation part of RF #193