oalavijeh closed this issue 1 year ago
Hi Omid,
Thanks for reaching out. If this script has no issues and is run on the cloud with a sufficient number of workers, it should take no more than 10-20 minutes to download summary statistics for one phenotype and one variant class. So, two items:
1) Are you running locally or on the cloud?
2) Here is a script for downloading ultra-rare pLoF and synonymous variants for a single phenotype (standing height, phenocode = 50); you can adjust the phenocode as needed. This script should have no issues, so if it runs for a long time, the problem is likely your compute environment (local vs. cloud).
```python
import hail as hl

def bhr_genebass_variant_to_gene_lof_syn(var_type, upper_bound, lower_bound, name):
    # Load the Genebass variant-level results and rename dotted fields
    genebass_variant = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/variant_results.mt')
    genebass_variant = genebass_variant.rename({'AF.Cases': 'AF_Cases', 'AF.Controls': 'AF_Controls'})

    # Keep a single phenotype (standing height, phenocode 50)
    genebass_variant = genebass_variant.filter_cols(genebass_variant.phenocode == 50)

    # Filter variants by allele frequency and functional annotation, then count
    genebass_variant = genebass_variant.filter_rows(
        (genebass_variant.call_stats.AF < upper_bound) &
        (genebass_variant.call_stats.AF >= lower_bound), keep=True)
    genebass_variant = genebass_variant.filter_rows(genebass_variant.annotation == var_type)
    nvar = genebass_variant.count()[0]

    # Add BHR-specific parameters
    genebass_variant = genebass_variant.annotate_entries(
        af_overall = ((genebass_variant.n_cases * genebass_variant.AF_Cases) +
                      (genebass_variant.n_controls * genebass_variant.AF_Controls)) /
                     (genebass_variant.n_cases + genebass_variant.n_controls),
        prevalence = genebass_variant.n_cases / (genebass_variant.n_cases + genebass_variant.n_controls))
    genebass_variant = genebass_variant.annotate_entries(
        beta_binary = (2 * genebass_variant.prevalence * (genebass_variant.AF_Cases - genebass_variant.af_overall)) /
                      hl.sqrt(2 * genebass_variant.af_overall * (1 - genebass_variant.af_overall) *
                              genebass_variant.prevalence * (1 - genebass_variant.prevalence)))
    genebass_variant = genebass_variant.annotate_entries(
        variant_variance = 2 * genebass_variant.af_overall * (1 - genebass_variant.af_overall))
    genebass_variant = genebass_variant.annotate_entries(
        beta_per_allele = genebass_variant.beta_binary / hl.sqrt(genebass_variant.variant_variance))

    # Export gene summary statistics file
    genebass_variant.entries().export(
        '[YOUR PATH]/bhr_' + str(var_type) + '_nvar' + str(nvar) + '_low' + str(lower_bound) +
        '_high' + str(upper_bound) + '_' + str(name) + '.txt.bgz')

# Variant frequency and function filters
upper = [1e-5]
lower = [0]
names = ['group1']
var_type = ['pLoF', 'synonymous']

for variant_type in range(len(var_type)):
    for grouping in range(len(names)):
        bhr_genebass_variant_to_gene_lof_syn(var_type[variant_type], upper[grouping], lower[grouping], names[grouping])
```
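For reference, the quantities computed in the `annotate_entries` calls above are plain arithmetic on the case/control allele frequencies. Here is a minimal pure-Python sketch of the same formulas, using made-up numbers (not real Genebass values), which can help sanity-check a single variant by hand:

```python
import math

# Made-up example values for one variant (not real Genebass numbers)
n_cases, n_controls = 1000, 9000
af_cases, af_controls = 1e-5, 2e-5

# Overall allele frequency, weighted by sample sizes
af_overall = (n_cases * af_cases + n_controls * af_controls) / (n_cases + n_controls)

# Sample prevalence of the phenotype
prevalence = n_cases / (n_cases + n_controls)

# Effect size on the observed (binary) scale
beta_binary = (2 * prevalence * (af_cases - af_overall)) / math.sqrt(
    2 * af_overall * (1 - af_overall) * prevalence * (1 - prevalence))

# Variant variance and the per-allele effect size
variant_variance = 2 * af_overall * (1 - af_overall)
beta_per_allele = beta_binary / math.sqrt(variant_variance)

print(af_overall, prevalence, beta_binary, beta_per_allele)
```

With these inputs, `af_overall` is 1.9e-5 and `prevalence` is 0.1; because the case frequency sits below the overall frequency, both betas come out negative.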
Let us know if it helps, thanks! Dan
Dear Dan,
Thanks as always for the help! I think it was still running locally despite my setting up gcloud, because I hadn't configured a region for Hail. I'll close this for now, since it's not a BHR-specific issue, and will get back to you if problems persist.
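For anyone who hits the same thing: Hail only uses cloud workers if the job is actually submitted to a Dataproc cluster; otherwise it silently falls back to local Spark. A rough sketch of the fix with `hailctl` (the region, cluster name, worker count, and script name here are placeholders, not values from this thread):

```shell
# Set a default Dataproc region so cluster commands don't fail or fall back
gcloud config set dataproc/region us-central1

# Start a cluster, submit the script to it, and shut it down afterwards
hailctl dataproc start my-cluster --num-workers 8
hailctl dataproc submit my-cluster genebass.py
hailctl dataproc stop my-cluster
```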
All the best
Omid
Dear BHR team,
I am running the genebass.py script below, where the bhr_phenotypes file is similar to the one you provided in the paper but contains the two phenotypes I am interested in.
The script has been running on Hail for 10 hours (see screenshot at the bottom), and I was wondering how long you would expect such a query to take for two phenotypes?
Many thanks