Closed BauerLab closed 2 years ago
Hi @lm-fsng, what do you think about the spec above? In particular, how would you see getting the importances of the covariates back (and is this even necessary)?
Hi @lm-fsng, I have included the changes we discussed. Please confirm that you are happy with the current spec. Also, please follow up with Rob on whether the covariates need to be included in the p-value calculation.
Hi @piotrszul and @rocreguant, I just spoke to Rob about the covariates and the p-value calculation. He said that the continuous covariates would be systematically biased to be more important, which would 'push' the genotypes down on the variable importance scale. This would result in less significant SNPs if we include the covariates in the p-value calculation. The workaround is to ignore the covariates and just use the variable importance of the SNPs in the p-value method.
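A minimal sketch of that workaround, assuming the importances are available as a plain mapping from feature name to score and that the covariate names are known up front. The helper name `snp_importances_only`, the feature names and the scores are all made up for illustration and are not part of the VariantSpark API:

```python
def snp_importances_only(importances, covariate_names):
    """Drop covariate entries so that only SNP importances reach the p-value step."""
    excluded = set(covariate_names)
    return {name: score for name, score in importances.items() if name not in excluded}


# Hypothetical importances: two covariates followed by three SNPs.
importances = {
    "age": 0.41,
    "PC0": 0.22,
    "22_16050408_T_C": 0.05,
    "22_16050612_C_G": 0.03,
    "22_16051107_C_A": 0.01,
}

# Names that are not present in the mapping (PC1 here) are simply ignored.
snp_only = snp_importances_only(importances, ["age", "PC0", "PC1"])
print(sorted(snp_only))
```

The covariates still take part in model building; they are only filtered out before the importances are handed to the p-value method.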
Implement covariates in vs hail interface function varspark.hail.random_forest_model with the interface analogous to hail's logistic_regression_rows() method (see: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows).
Currently random_forest_model takes the covariates argument but it is ignored. The argument should be a list of expressions of type float64 (same as logistic_regression_rows), and these expressions should be evaluated and included as continuous variables for random forest model building alongside the ordered factor genotypic variables.
Because we want to keep the variant importance output table as is (that is, indexed by locus and list of alleles), we cannot include the importances of covariates in it. Instead we should add a method covariates_importance() on RandomForestModel that returns a table with the following schema: indexed with the 'covariate'.
So, for example, we can have a phenotype file in csv format like this (this is the modified hipster_labels.txt):
Then the code with the use of covariates age, PC0 and PC1 should look like this:
Output:
Things to consider:
Implementation notes
Examples in the 'python' directory, and possibly a notebook example, should be created to demonstrate this new functionality.
The python example can be based on examples/local_run-importance-ch22_with_pheno.sh. A transposed version of data/chr22_1000_pheno-wide.csv may need to be created to support this (it should also include the classification response variable).
For the notebook, a more biologically relevant dataset should be used, one that possibly includes principal component analysis factors. Maybe the hipster index dataset can be adapted for this (maybe with PC factors or some other random covariates that do not need to be associated with the response).
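The transposition step could be sketched with the standard library as below. This assumes the wide file holds one row per variable and one column per sample, which has not been checked against the actual data/chr22_1000_pheno-wide.csv. The helper name `transpose_csv`, the `index_name` parameter and the values are made up for illustration:

```python
import csv
import io


def transpose_csv(text, index_name="sample"):
    """Turn a variable-per-row CSV into a sample-per-row CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same number of columns")

    # The top-left cell becomes the header of the new sample id column.
    rows[0][0] = index_name
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()


# Made-up wide input: variables in rows, samples in columns.
wide = (
    "variable,HG00096,HG00097\n"
    "age,35,52\n"
    "PC0,0.12,-0.40\n"
    "label,1,0\n"
)

long_form = transpose_csv(wide)
print(long_form)
```

Here the last row of the wide input stands in for the classification response, so it ends up as an ordinary column in the transposed file next to the covariates.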