oalavijeh closed this issue 1 year ago
Hi Omid,
Thanks for reaching out. If this script has no issues and is run on the cloud with a sufficient number of workers, it should take no more than 10-20 minutes to download summary statistics for one phenotype and one variant class. So, two items:
1) Are you running locally or on the cloud?
2) Here is a script for downloading ultra-rare pLoF and synonymous variants for a single phenotype (standing height, phenocode = 50); you can adjust the phenocode as needed. This script should have no issues, so if it runs for a long time, the problem is likely your compute environment (local vs. cloud).
```python
import hail as hl

def bhr_genebass_variant_to_gene_lof_syn(var_type, upper_bound, lower_bound, name):
    # Load the Genebass variant-level results and rename dotted fields
    genebass_variant = hl.read_matrix_table('gs://ukbb-exome-public/500k/results/variant_results.mt')
    genebass_variant = genebass_variant.rename({'AF.Cases': 'AF_Cases', 'AF.Controls': 'AF_Controls'})

    # Keep a single phenotype (standing height, phenocode 50)
    genebass_variant = genebass_variant.filter_cols(genebass_variant.phenocode == 50)

    # Filter variants by allele frequency and functional annotation, then count
    genebass_variant = genebass_variant.filter_rows(
        (genebass_variant.call_stats.AF < upper_bound) &
        (genebass_variant.call_stats.AF >= lower_bound), keep=True)
    genebass_variant = genebass_variant.filter_rows(genebass_variant.annotation == var_type)
    nvar = genebass_variant.count()[0]

    # Add BHR-specific parameters
    genebass_variant = genebass_variant.annotate_entries(
        af_overall = ((genebass_variant.n_cases * genebass_variant.AF_Cases) +
                      (genebass_variant.n_controls * genebass_variant.AF_Controls)) /
                     (genebass_variant.n_cases + genebass_variant.n_controls),
        prevalence = genebass_variant.n_cases / (genebass_variant.n_cases + genebass_variant.n_controls))
    genebass_variant = genebass_variant.annotate_entries(
        beta_binary = (2 * genebass_variant.prevalence * (genebass_variant.AF_Cases - genebass_variant.af_overall)) /
                      hl.sqrt(2 * genebass_variant.af_overall * (1 - genebass_variant.af_overall) *
                              genebass_variant.prevalence * (1 - genebass_variant.prevalence)))
    genebass_variant = genebass_variant.annotate_entries(
        variant_variance = 2 * genebass_variant.af_overall * (1 - genebass_variant.af_overall))
    genebass_variant = genebass_variant.annotate_entries(
        beta_per_allele = genebass_variant.beta_binary / hl.sqrt(genebass_variant.variant_variance))

    # Export gene summary statistics file
    genebass_variant.entries().export(
        '[YOUR PATH]/bhr_' + str(var_type) + '_nvar' + str(nvar) + '_low' + str(lower_bound) +
        '_high' + str(upper_bound) + '_' + str(name) + '.txt.bgz')

# Variant frequency and function filters
upper = [1e-5]
lower = [0]
names = ['group1']
var_type = ['pLoF', 'synonymous']

for variant_type in range(len(var_type)):
    for grouping in range(len(names)):
        bhr_genebass_variant_to_gene_lof_syn(var_type[variant_type], upper[grouping], lower[grouping], names[grouping])
```
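For reference, the quantities computed in the `annotate_entries` calls above are plain arithmetic on the case/control allele frequencies. Here is a minimal pure-Python sketch of the same formulas, using made-up numbers (not real Genebass values), which can help sanity-check a single variant by hand:

```python
import math

# Made-up example values for one variant (not real Genebass numbers)
n_cases, n_controls = 1000, 9000
af_cases, af_controls = 1e-5, 2e-5

# Overall allele frequency, weighted by sample sizes
af_overall = (n_cases * af_cases + n_controls * af_controls) / (n_cases + n_controls)

# Sample prevalence of the phenotype
prevalence = n_cases / (n_cases + n_controls)

# Effect size on the observed (binary) scale
beta_binary = (2 * prevalence * (af_cases - af_overall)) / math.sqrt(
    2 * af_overall * (1 - af_overall) * prevalence * (1 - prevalence))

# Variant variance and the per-allele effect size
variant_variance = 2 * af_overall * (1 - af_overall)
beta_per_allele = beta_binary / math.sqrt(variant_variance)

print(af_overall, prevalence, beta_binary, beta_per_allele)
```

With these inputs, `af_overall` is 1.9e-5 and `prevalence` is 0.1; because the case frequency sits below the overall frequency, both betas come out negative.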
Let us know if it helps, thanks! Dan
Dear Dan,
Thanks as always for the help! I think it was still running locally despite my setting up gcloud, because I hadn't configured a region for Hail. I'll close this for now, since it's not a BHR-specific issue, and will get back to you if problems persist.
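For anyone who hits the same thing: Hail only uses cloud workers if the job is actually submitted to a Dataproc cluster; otherwise it silently falls back to local Spark. A rough sketch of the fix with `hailctl` (the region, cluster name, worker count, and script name here are placeholders, not values from this thread):

```shell
# Set a default Dataproc region so cluster commands don't fail or fall back
gcloud config set dataproc/region us-central1

# Start a cluster, submit the script to it, and shut it down afterwards
hailctl dataproc start my-cluster --num-workers 8
hailctl dataproc submit my-cluster genebass.py
hailctl dataproc stop my-cluster
```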
All the best
Omid
Dear BHR team,
I am running the genebass.py script below, where the bhr_phenotypes file is similar to the one you provided in the paper but contains the two phenotypes I am interested in.
The script has been running on Hail for 10 hours (see screenshot at the bottom), and I was wondering how long you would expect such a query to take for two phenotypes?
Many thanks