farhat-lab / gentb-site

The genTB project, the Django site, variant calling and prediciton pipeline, and mapping pipeline with hooks to two ravens
https://gentb.hms.harvard.edu
Other
8 stars 11 forks source link

Neural net predictor replacement for Tbpredict.R random forest predictor #119

Closed mahafarhat closed 3 years ago

mahafarhat commented 6 years ago

Python script and sample data from prediction from genotypic data (that can be obtained from the vcf) to a prediction in probabilities that can be displayed on the predict page. To replace TBpredict.R portion of the pipeline

This script requires Keras 1.2.0 and TensorFlow 0.10.0

We need to likely develop new code to extract relevant format from VCF

We need to likely develop new code to extract relevant output data into json format for display on the webtool in the API.

saved_model_0319.zip

doctormo commented 6 years ago

I've created a repository and upgraded the script to work with the website:

https://github.com/farhat-lab/wdnn

doctormo commented 6 years ago

The tensorflow requirement is now python 3.6 ONLY which means this code needs converting.

mahafarhat commented 6 years ago

Can we test it using python 3.6, it may be ok.

doctormo commented 6 years ago

tensorflow package has all sorts of issues. I managed to get it working with tensorflow==0.12.1 which is the oldest possible version available. Anything newer would need the models to be regenerated.

mahafarhat commented 6 years ago

After Megapipe addition, second task for David

mahafarhat commented 5 years ago

Avika has written some R code to extract a matrix for resistance prediction from vcf files.

avdixit commented 5 years ago
##This script takes vcf files and identifies presence of predictive variants (author: Roger)
##########################################################################################################################################################
#Important Packages
import vcf
import os
import pandas as pd
import numpy as np
import ast
import time
import sys
import pickle
import shutil

import Bio
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio import SeqIO
from StringIO import StringIO
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Seq import MutableSeq
##########################################################################################################################################################

##########################################################################################################################################################

#INPUTS
isolate_tag = sys.argv[1] #INPUT 1 - sample isolate tag 
VCF_path = sys.argv[2] #INPUT 2 - path to VCF file to extract variants from
OUT_directory = sys.argv[3] #INPUT 3 - output directory to save CSV file

##########################################################################################################################################################

#Load DataFrames that contain information for specific variants
##########################################################################################################################################################
DEL_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_deletion_variants.pkl')
INS_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_insertion_variants.pkl')
SNP_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_SNP_variants.pkl')
MSNP_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_MSNP_variants.pkl')
##########################################################################################################################################################

#Load DataFrame for general variant regions
##########################################################################################################################################################
variants_of_interest_broad_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/general_predictive_variants.pkl')
##########################################################################################################################################################

#Use Michael's helper functions to broadly classify any specific mutations detected into a derived mutation category after searching VCFs
##########################################################################################################################################################
# Gene names
genes = ['ahpC', 'alr', 'ddl', 'embA', 'embB', 'embC', 'ethA', 'gid', 'gyrA', 'gyrB', 'inhA', 'iniA',
         'iniB', 'iniC', 'kasA', 'katG', 'murA-rrs', 'fabG1', 'ndh', 'pncA', 'rpoB', 'rpsL', 'rrl',
         'rrs', 'thyA', 'tlyA', 'gyrB-gyrA', 'gyrA-Rv0007', 'iniB-iniA-iniC', 'iniB-iniA', 'iniC-lpqJ',
         'rpoB-rpoC', 'fabG1-inhA', 'rrs-rrl', 'rrl-rrf', 'inhA-hemZ', 'katG-furA', 'kasA-kasB',
         'ahpC-ahpD', 'dfrA-thyA', 'alr-Rv3792', 'embA-embB', 'embB-Rv3796', 'menG-ethA', 'rpsA', 'eis',
         "oxyR'", 'acpM']

# Get gene associated with particular mutation
def get_gene(x):
    splitted = x.split("_")
    if list(set(splitted).intersection(genes)) == []:
        if 'ndhA' in splitted or 'mfd' in splitted or 'whiB6':
            return None
        #return None
        raise Exception(splitted)
    type = '-'.join(list(set(splitted).intersection(genes)))
    return '-'.join(sorted(type.split('-')))

# Get gene dictionary
def get_gene_dict(derived_names):
    gene_dict = {}
    for snp in derived_names:
        if not get_gene(snp):
            continue
        gene_dict.setdefault(get_gene(snp), []).append(snp)
    return gene_dict

# Get dictionary by mutation location and type of mutation
def get_final_dict(gene_dict):
    final_dict = {}
    for gene, muts in gene_dict.iteritems():
        for mut in muts:
            split = mut.split("_")
            if 'P' in split or 'I' in split:
                if 'DEL' in split or 'INS' in split:
                    final_dict.setdefault(gene + '_NC_indel', []).append(mut)
                    # is noncoding indel
                else:
                    final_dict.setdefault(gene + '_NC_snp', []).append(mut)
                    # is noncoding snp
            elif 'F' in split or 'CF' in split:
                final_dict.setdefault(gene + '_F_indel', []).append(mut)
                # is coding frameshift

            #ROGER's EDIT: have to make a modification for rrs & rrl ribosomal proteins since 'N' within these genes doesn't imply they are indel variants
            elif ('CI' in split) or ('N' in split and gene != 'rrl' and gene != 'rrs') or ('NF' in split):
                final_dict.setdefault(gene + '_NF_indel', []).append(mut)
                # is coding not frameshift
            else:
                final_dict.setdefault(gene + '_C_snp', []).append(mut)
                # is coding snp
    return final_dict
##########################################################################################################################################################

#Relevant Information for H37Rv sequence and SNP functional annotation
##########################################################################################################################################################
####### Collect all DNA and Amino Acid sequences corresponding to genes on H37Rv #######
#load reference genome and reference annotation
reference_genome = '/n/data1/hms/dbmi/farhat/bin/work-horse/bin/h37rv.fasta'
for reference_genome in SeqIO.parse(reference_genome, "fasta"):
    reference_genome.seq.alphabet = IUPAC.unambiguous_dna

reference_genome_annotation = pd.read_csv('/n/data1/hms/dbmi/farhat/bin/work-horse/bin/h37rv_genome_summary.txt', '\t').set_index('name')

####### Function to translate coding DNA sequences ####### 
def translate(gene_id, sequence):

    #find which strand the gene is located on and translate
    strand = reference_genome_annotation.loc[gene_id, 'strand']
    if strand == '+':
        protein_sequence = sequence.translate(table="Bacterial", cds=False)
    elif strand == '-':
        protein_sequence = sequence.reverse_complement().translate(table="Bacterial", cds=False)

    return protein_sequence

####### Load in dictionaries for SNP annotation #######
with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_gene_seq_records.pickle', 'rb') as handle:
    ref_gene_sequences_records = pickle.load(handle)

with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_protein_seq_records.pickle', 'rb') as handle:
    ref_protein_sequences_records = pickle.load(handle)

with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_coord_gene_mapping.pickle', 'rb') as handle:
    ReferencePosition_Gene_mapping = pickle.load(handle)
##########################################################################################################################################################

#Function to determine whether SNPs are Synonymous or Non-Synonymous
##########################################################################################################################################################
def SNP_type(ref_seq_position , alt_allele_i):

    '''
    This function takes as input a reference position on H37Rv located within a 
    gene and an alternate allele and returns whether the base change 
    would correspond to a different Amino Acid sequence that results 
    from translating the DNA sequence into an AA sequence.

    '''
    Syn_NSyn_list = []

    #find the gene that SNP occurs on

    #check list corresponding to H37Rv coordinate to see if there are any genes associated with RefPosition
    if len(ReferencePosition_Gene_mapping[ref_seq_position]) > 0:

        #iterate through all genes that ReferencePosition is mapped to (i.e. SNP might correspond to 2 genes)
        for gene_intergenic_id in ReferencePosition_Gene_mapping[ref_seq_position]:

            #find genomic coordinate of SNP relative to gene (subtract 1 since reference seq starts counting at 1)
            gene_relative_coord = (ref_seq_position - 1) - min( reference_genome_annotation.loc[gene_intergenic_id , 'chromStart'] , reference_genome_annotation.loc[gene_intergenic_id , 'chromEnd'] )

            #translate into protein sequence with the SNP in place if not InDel or intergenic region
            SNP_change = alt_allele_i

            #ALTERNATE allele (is it Syn or NSyn?)
            #get sequence from dictionary of sequences (and convert to mutable object)
            test_gene_sequence = ref_gene_sequences_records[gene_intergenic_id].seq.tomutable()

            #change reference gene sequence by the SNP in the query sequence
            test_gene_sequence[int(gene_relative_coord)] = SNP_change

            #convert back immutable object
            test_gene_sequence = test_gene_sequence.toseq()

            #translate sequence into amino acid seq
            test_protein_sequence = translate(gene_intergenic_id , test_gene_sequence)

            #compare to AA seq of original gene
            if test_protein_sequence == ref_protein_sequences_records[gene_intergenic_id].seq:
                SNP_type = 'S'

            else:
                SNP_type = 'N'

            Syn_NSyn_list.append(SNP_type)

    #if reference position corresponds to 2 overlapping genes, then call SNP an nSNP if at least 1/2 gene changes alters the AA sequence
    if 'N' in Syn_NSyn_list:
        return 'Non-Synonymous'
    else:
        return 'Synonymous'
##########################################################################################################################################################

#Function that checks whether a variant fits the criteria for a general/derived predictive variant
##########################################################################################################################################################
def general_predicative_variant_check(ref_pos_i , variant_type_check , alt_allele_i):

    '''
    This function takes as input the H37Rv reference position for a SNP or an INDEL 
    and checks to see if the variant meets the classification criteria for a derived/general
    predictive variant.
    '''

    derived_labels_for_variant = [] #there may be a variant at this reference position that fits multiple derived labels

    ####### SNPs ####### 
    if variant_type_check == 'SNP':

        general_variant_regions_SNPs_df = variants_of_interest_broad_df[variants_of_interest_broad_df.type == 'snp']

        #check each variant region to see if reference position falls within
        for general_variant_i in general_variant_regions_SNPs_df.index:

            H37Rv_coords_for_region = general_variant_regions_SNPs_df.loc[general_variant_i , 'H37Rv_coords']

            #does the reference position fall within bounds?
            if (H37Rv_coords_for_region[0] + 1) <= ref_pos_i <= (H37Rv_coords_for_region[1] + 1):

                #variant falls within bounds and is NON-CODING
                if general_variant_regions_SNPs_df.loc[general_variant_i , 'coding_noncoding'] != 'coding':

                    derived_labels_for_variant.append( general_variant_regions_SNPs_df.loc[general_variant_i , 'derived_label'] )

                #If CODING region, only store NON-SYNONYMOUS SNPs    
                elif general_variant_regions_SNPs_df.loc[general_variant_i , 'coding_noncoding'] == 'coding':

                    if SNP_type(ref_pos_i , alt_allele_i) == 'Non-Synonymous':

                        derived_labels_for_variant.append( general_variant_regions_SNPs_df.loc[general_variant_i , 'derived_label'] )

    ####### INDELs ####### 
    if variant_type_check == 'INDEL':

        general_variant_regions_INDELs_df = variants_of_interest_broad_df[variants_of_interest_broad_df.type == 'indel']

        #check each variant region to see if reference position falls within
        for general_variant_i in general_variant_regions_INDELs_df.index:

            H37Rv_coords_for_region = general_variant_regions_INDELs_df.loc[general_variant_i , 'H37Rv_coords']

            #does the reference position fall within bounds?
            if (H37Rv_coords_for_region[0] + 1) <= ref_pos_i <= (H37Rv_coords_for_region[1] + 1):

                #variant falls within bounds and is NON-CODING
                if (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] != 'frameshift') and (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] != 'non_frameshift'):

                    derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )

                #If CODING region, check classification as FRAMESHIFT
                elif (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] == 'frameshift') and (float(len(alt_allele_i)) % 3.0 != 0.0):

                    derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )

                #If CODING region, check classification as NON-FRAMESHIFT
                elif (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] == 'non_frameshift') and (float(len(alt_allele_i)) % 3.0 == 0.0):

                    derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )

    #since some variants may be classified into multiple general variant regions with the same derived label (i.e. SNPs in intergenic regions / promoter regions), get rid of duplicates
    derived_labels_for_variant = list(set(derived_labels_for_variant))

    if len(derived_labels_for_variant) > 0:

        return derived_labels_for_variant[0]

    elif len(derived_labels_for_variant) == 0:

        return 'not_predictive_variant'
##########################################################################################################################################################

#Extract Mutations of Interest from VCFs
##########################################################################################################################################################
#directory that stores files for each sequenced isolate
#antibiogram_dir = '/n/data1/hms/dbmi/farhat/avika/antibiogram/feature_vcfs/'
#VCF_path = antibiogram_dir + isolate_tag + '/' + isolate_tag + '_feature.vcf'

#create a DataFrame that stores the variants of interest
#dictionaries to construct DataFrame from later that hold information about each high-quality SNP found for a specific isolate from rollingDB
variant_index_list = []
variant_label_dict = {}
ref_pos_dict = {}
variant_alt_allele_dict = {}
variant_alt_freq_dict = {}
BQ_dict = {}
MQ_dict = {}

#iterate through all mixed calls for all isolates
variant_index = 0

#load in VCF file
vcf_reader = vcf.Reader( open(VCF_path , 'r') )

#iterate through each Variant Call 
for record in vcf_reader:

    ######################################################################
    #check to see if DELETION
    ######################################################################

    # > length of the reference sequence is > 1
    # > there is only 1 alternate allele
    # > the alternate allele is not NONE
    # > length of the alternate sequence = 1

    if (len(record.REF) > 1) and (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) == 1):

        #get relevant stats
        ref_pos_i = record.POS
        deleted_seq_i = record.REF[1:]

        ########################################################################
        #### check to see if this is a predictive deletion from the list of SPECIFIC variants of interest
        ########################################################################
        variant_classified = False

        #reference position matches one of the deletions
        if ref_pos_i in list(DEL_variants_of_interest_df.ref_pos):

            #list of predictive deleted sequences at this reference position
            alt_alleles_at_ref_pos_i = list( DEL_variants_of_interest_df[DEL_variants_of_interest_df.ref_pos == ref_pos_i].seq_deleted )

            #check that the deleted sequence matches one of the deletion variants at this reference position
            if deleted_seq_i in alt_alleles_at_ref_pos_i:

                alt_allele_i = deleted_seq_i

                #check to see if DEL has standard INFO information
                if 'DC' in record.INFO.keys():
                    #calculate allele frequency of deletion
                    del_freq = float(record.INFO['DC']) / float(record.INFO['DP'])
                    BQ = record.INFO['BQ']
                    MQ = record.INFO['MQ']

                else:
                    del_freq = np.nan
                    BQ = np.nan
                    MQ = np.nan

                #store info for predictive variants found into a DF
                variant_index_list.append(variant_index)

                #make sure that reference position & alt allele match
                flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(DEL_variants_of_interest_df.ref_pos == ref_pos_i) , list(DEL_variants_of_interest_df.seq_deleted == alt_allele_i))]
                variant_label_dict[variant_index] = DEL_variants_of_interest_df[flat_annot_label_filter].label.values[0]

                ref_pos_dict[variant_index] = ref_pos_i
                variant_alt_allele_dict[variant_index] = alt_allele_i
                variant_alt_freq_dict[variant_index] = del_freq
                BQ_dict[variant_index] = BQ
                MQ_dict[variant_index] = MQ

                variant_index += 1
                variant_classified = True

        ########################################################################
        #### check to see if this is a predictive deletion from the list of GENERAL variants of interest
        ########################################################################
        if variant_classified == False:

            derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'INDEL' , deleted_seq_i)

            #variant is classified as a general predicitive variant
            if derived_general_variant_label != 'not_predictive_variant':

                alt_allele_i = deleted_seq_i

                #check to see if DEL has standard INFO information
                if 'DC' in record.INFO.keys():
                    #calculate allele frequency of deletion
                    del_freq = float(record.INFO['DC']) / float(record.INFO['DP'])
                    BQ = record.INFO['BQ']
                    MQ = record.INFO['MQ']

                else:
                    del_freq = np.nan
                    BQ = np.nan
                    MQ = np.nan

                #store info for predictive variants found into a DF
                variant_index_list.append(variant_index)

                #make sure that reference position & alt allele match
                variant_label_dict[variant_index] = derived_general_variant_label

                ref_pos_dict[variant_index] = ref_pos_i
                variant_alt_allele_dict[variant_index] = alt_allele_i
                variant_alt_freq_dict[variant_index] = del_freq
                BQ_dict[variant_index] = BQ
                MQ_dict[variant_index] = MQ

                variant_index += 1

    ######################################################################
    #check to see if INSERTION
    ######################################################################

    # > length of the reference sequence = 1
    # > there is only 1 alternate allele
    # > the alternate allele is not NONE
    # > length of the alternate sequence > 1

    elif (len(record.REF) == 1) and (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) > 1):

        #get relevant stats
        ref_pos_i = record.POS
        inserted_seq_i = str(record.ALT[0])[1:]

        ########################################################################
        #### check to see if this is a predictive insertion from the list of SPECIFIC variants of interest
        ########################################################################
        variant_classified = False

        #reference position matches one of the insertions
        if ref_pos_i in list(INS_variants_of_interest_df.ref_pos):

            #list of predictive inserted sequences at this reference position
            alt_alleles_at_ref_pos_i = list( INS_variants_of_interest_df[INS_variants_of_interest_df.ref_pos == ref_pos_i].seq_inserted )

            #check that the inserted sequence matches one of the insertion variants at this reference position
            if inserted_seq_i in alt_alleles_at_ref_pos_i:

                alt_allele_i = inserted_seq_i

                #check to see if INS has standard INFO information
                if 'IC' in record.INFO.keys():
                    #calculate allele frequency of deletion
                    ins_freq = float(record.INFO['IC']) / float(record.INFO['DP'])
                    BQ = record.INFO['BQ']
                    MQ = record.INFO['MQ']

                else:
                    ins_freq = np.nan
                    BQ = np.nan
                    MQ = np.nan

                #store info for predictive variants found into a DF
                variant_index_list.append(variant_index)

                #make sure that reference position & alt allele match
                flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(INS_variants_of_interest_df.ref_pos == ref_pos_i) , list(INS_variants_of_interest_df.seq_inserted == alt_allele_i))]
                variant_label_dict[variant_index] = INS_variants_of_interest_df[flat_annot_label_filter].label.values[0]

                ref_pos_dict[variant_index] = ref_pos_i
                variant_alt_allele_dict[variant_index] = alt_allele_i
                variant_alt_freq_dict[variant_index] = ins_freq
                BQ_dict[variant_index] = BQ
                MQ_dict[variant_index] = MQ

                variant_index += 1
                variant_classified = True

        ################################################
        #### check to see if this is a predictive insertion from the list of GENERAL variants of interest
        ################################################
        if variant_classified == False:

            derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'INDEL' , inserted_seq_i)

            #variant is classified as a general predicitive variant
            if derived_general_variant_label != 'not_predictive_variant':

                alt_allele_i = inserted_seq_i

                #check to see if DEL has standard INFO information
                if 'IC' in record.INFO.keys():
                    #calculate allele frequency of deletion
                    ins_freq = float(record.INFO['IC']) / float(record.INFO['DP'])
                    BQ = record.INFO['BQ']
                    MQ = record.INFO['MQ']

                else:
                    ins_freq = np.nan
                    BQ = np.nan
                    MQ = np.nan

                #store info for predictive variants found into a DF
                variant_index_list.append(variant_index)

                #make sure that reference position & alt allele match
                variant_label_dict[variant_index] = derived_general_variant_label

                ref_pos_dict[variant_index] = ref_pos_i
                variant_alt_allele_dict[variant_index] = alt_allele_i
                variant_alt_freq_dict[variant_index] = ins_freq
                BQ_dict[variant_index] = BQ
                MQ_dict[variant_index] = MQ

                variant_index += 1

    ######################################################################
    #check to see if SNP (single nucleotide change)
    ######################################################################

    # > length of the reference sequence = 1
    # > there is only 1 alternate allele
    # > the alternate allele is either NONE (if alternate allele has few reads supporting it)
    # or 
    # > length of alternate sequence = 1

    elif (len(record.REF) == 1) and (len(record.ALT) == 1) and ( (record.ALT == [None]) or (len(str(record.ALT[0])) == 1) ):

        #get relevant stats
        ref_pos_i = record.POS

        ################################################
        #### check to see if this reference position is included in the list of predictive SNPs from the list of SPECIFIC variants of interest
        ################################################
        variant_classified = False

        #reference position matches one of the SNPs
        if ref_pos_i in list(SNP_variants_of_interest_df.ref_pos):

            # Quality-Percentage - Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus
            QP = record.INFO['QP']

            #boolean for alleles that have at least 1% of reads supporting
            alleles_supported_by_reads_filter = np.array(QP) > 0

            #alleles that are supported by reads
            alleles_supported_by_reads = list( np.array(['A' , 'C' , 'G' , 'T'])[alleles_supported_by_reads_filter] )

            #list of predictive SNPs (base changes) at this reference position
            alt_alleles_at_ref_pos_i = list( SNP_variants_of_interest_df[SNP_variants_of_interest_df.ref_pos == ref_pos_i].base_change )

            #iterate through all alleles (bases) supported at this reference position
            for base_change_i in alleles_supported_by_reads:

                #check that the base change matches one of the SNP variants at this reference position
                if base_change_i in alt_alleles_at_ref_pos_i:

                    alt_allele_i = base_change_i

                    #get the quality-weighted proportion of reads that support the alternate base (if any)

                    #position-base dictionaries (order of output in Pilon)
                    base_order = {0:'A' , 1:'C' , 2:'G', 3:'T'}
                    base_order_r = {'A':0 , 'C':1 , 'G':2 , 'T':3}

                    #position of the Alternate Allele in [A,C,G,T]
                    alt_allele_base_i = base_order_r[alt_allele_i]

                    alt_allele_frequency = float(QP[alt_allele_base_i]) / 100.0 #weighted percentage of reads supporting Alternate Allele

                    #check that at least 5% of the reads support the alternate allele, if so store information for this detected SNP
                    if alt_allele_frequency >= 0.05:

                        BQ = record.INFO['BQ']
                        MQ = record.INFO['MQ']

                        #store info for predictive variants found into a DF
                        variant_index_list.append(variant_index)

                        #make sure that reference position & alt allele match
                        flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(SNP_variants_of_interest_df.ref_pos == ref_pos_i) , list(SNP_variants_of_interest_df.base_change == alt_allele_i))]
                        variant_label_dict[variant_index] = SNP_variants_of_interest_df[flat_annot_label_filter].label.values[0]

                        ref_pos_dict[variant_index] = ref_pos_i
                        variant_alt_allele_dict[variant_index] = alt_allele_i
                        variant_alt_freq_dict[variant_index] = alt_allele_frequency
                        BQ_dict[variant_index] = BQ
                        MQ_dict[variant_index] = MQ

                        variant_index += 1
                        variant_classified = True

        ################################################
        #### check to see if this is a predictive SNP from the list of GENERAL variants of interest
        ################################################
        if variant_classified == False:

            # Quality-Percentage - Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus
            QP = record.INFO['QP']

            #boolean for alleles that have at least 1% of reads supporting
            alleles_supported_by_reads_filter = np.array(QP) > 0

            #alleles that are supported by reads
            alleles_supported_by_reads = list( np.array(['A' , 'C' , 'G' , 'T'])[alleles_supported_by_reads_filter] )

            #list of predictive SNPs (base changes) at this reference position
            alt_alleles_at_ref_pos_i = list( SNP_variants_of_interest_df[SNP_variants_of_interest_df.ref_pos == ref_pos_i].base_change )

            #iterate through all alleles (bases) supported at this reference position
            for base_change_i in alleles_supported_by_reads:

                #only check Alternate Alleles
                if base_change_i != str(record.REF):

                    derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'SNP' , base_change_i)

                    #variant is classified as a general predicitive variant
                    if derived_general_variant_label != 'not_predictive_variant':

                        alt_allele_i = base_change_i

                        #get the quality-weighted proportion of reads that support the alternate base (if any)

                        #position-base dictionaries (order of output in Pilon)
                        base_order = {0:'A' , 1:'C' , 2:'G', 3:'T'}
                        base_order_r = {'A':0 , 'C':1 , 'G':2 , 'T':3}

                        #position of the Alternate Allele in [A,C,G,T]
                        alt_allele_base_i = base_order_r[alt_allele_i]

                        alt_allele_frequency = float(QP[alt_allele_base_i]) / 100.0 #weighted percentage of reads supporting Alternate Allele

                        #check that at least 5% of the reads support the alternate allele, if so store information for this detected SNP
                        if alt_allele_frequency >= 0.05:

                            BQ = record.INFO['BQ']
                            MQ = record.INFO['MQ']

                            #store info for predictive variants found into a DF
                            variant_index_list.append(variant_index)

                            #make sure that reference position & alt allele match
                            variant_label_dict[variant_index] = derived_general_variant_label

                            ref_pos_dict[variant_index] = ref_pos_i
                            variant_alt_allele_dict[variant_index] = alt_allele_i
                            variant_alt_freq_dict[variant_index] = alt_allele_frequency
                            BQ_dict[variant_index] = BQ
                            MQ_dict[variant_index] = MQ

                            variant_index += 1

    ######################################################################
    #check to see if SNP (multiple nucleotide change)
    ######################################################################

    # > there is only 1 alternate allele
    # > the alternate allele is not NONE
    # > length of the alternate sequence > 1

    elif (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) > 1):

        #get relevant stats
        ref_pos_i = record.POS
        changed_seq_i = str(record.ALT[0])

        #check to see if this is a predictive MSNP from the list of variants of interest

        #reference position (+/- 1) matches one of the MSNP ref positions
        if ref_pos_i in list(MSNP_variants_of_interest_df.ref_pos):
            pass

        elif (ref_pos_i - 1) in list(MSNP_variants_of_interest_df.ref_pos):
            ref_pos_i = ref_pos_i - 1

        elif (ref_pos_i + 1) in list(MSNP_variants_of_interest_df.ref_pos):
            ref_pos_i = ref_pos_i + 1

        else:
            ref_pos_i = 'NOT FOUND' #reference position +/- 1 for this variant doesn't match any MSNPs

        #if the reference position +/- 1 matches any of the MSNPs, check to see if alternate allele sequences match
        if ref_pos_i != 'NOT FOUND':

            #list of MSNPs (seq changes) at this reference position
            alt_alleles_at_ref_pos_i = list( MSNP_variants_of_interest_df[MSNP_variants_of_interest_df.ref_pos == ref_pos_i].seq_change )

            #check that the sequence change matches one of the MSNP variants at this reference position
            if changed_seq_i in alt_alleles_at_ref_pos_i:

                alt_allele_i = changed_seq_i

                #check to see if MSNP has standard INFO information
                if 'AF' in record.INFO.keys():
                    #calculate allele frequency of multi-nucleotide SNP
                    MSNP_freq = float(record.INFO['AF'][0])
                    BQ = record.INFO['BQ']
                    MQ = record.INFO['MQ']

                else:
                    MSNP_freq = np.nan
                    BQ = np.nan
                    MQ = np.nan

                #store info for predictive variants found into a DF
                variant_index_list.append(variant_index)

                #make sure that reference position & alt allele match
                flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(MSNP_variants_of_interest_df.ref_pos == ref_pos_i) , list(MSNP_variants_of_interest_df.seq_change == alt_allele_i))]
                variant_label_dict[variant_index] = MSNP_variants_of_interest_df[flat_annot_label_filter].label.values[0]

                ref_pos_dict[variant_index] = ref_pos_i
                variant_alt_allele_dict[variant_index] = alt_allele_i
                variant_alt_freq_dict[variant_index] = MSNP_freq
                BQ_dict[variant_index] = BQ
                MQ_dict[variant_index] = MQ

                variant_index += 1

#construct DataFrame from dictionaries
isolate_variant_df = pd.DataFrame( index = variant_index_list )
isolate_variant_df['flat_annot_label'] = pd.Series(variant_label_dict)
isolate_variant_df['ref_pos'] = pd.Series(ref_pos_dict)
isolate_variant_df['alt_allele'] = pd.Series(variant_alt_allele_dict)
isolate_variant_df['alt_allele_freq'] = pd.Series(variant_alt_freq_dict)
isolate_variant_df['base_quality'] = pd.Series(BQ_dict)
isolate_variant_df['map_quality'] = pd.Series(MQ_dict)

#drop duplicate rows & re-index
isolate_variant_df = isolate_variant_df[ [not duplicated for duplicated in list(isolate_variant_df.duplicated(subset = 'flat_annot_label' , keep = 'first'))] ]
isolate_variant_df.reset_index(inplace = True , drop = True)

#check to make sure that there was at least 1 variant that was picked up in VCF
if np.shape(isolate_variant_df)[0] > 0:

    #classify each variant with a derived variant label from Michael's functions
    ##################################################################################################################################
    #get the flat annotator labels for all mutations that were picked up within the VCF
    flat_annotator_labels_for_variants_in_VCF = list( isolate_variant_df.flat_annot_label )

    #use Michael's functions to create a dictionary of mutations: keys - derived mutation labels , values - flat annotator labels for each mutation picked up in VCF
    derived_variant_label_dict = get_final_dict(get_gene_dict(flat_annotator_labels_for_variants_in_VCF))

    #reverse the dictionary so that: keys - flat annotator labels , values - derived mutation labels
    flat_annot_variant_label_dict = {}

    #iterate through derived variants
    for derived_variant_label in derived_variant_label_dict.keys():

        #for each derived variant, iterate through variants from VCF that classify
        for flat_annot_variant_label in derived_variant_label_dict[derived_variant_label]:

            #append flat annot label <-> derived label mapping to dict
            flat_annot_variant_label_dict[flat_annot_variant_label] = derived_variant_label

    #create a column in variant DataFrame to store the derived variant label
    derived_variant_label_list = []
    for flat_annot_label in flat_annotator_labels_for_variants_in_VCF:

        #get derived variant label, if it exists
        try:
            derived_variant_label_list.append( flat_annot_variant_label_dict[flat_annot_label] )

        #store placeholder if derived variant label doesn't exist (just in case...)
        except KeyError:
            derived_variant_label_list.append( 'no_derived_variant_label' )

    #append derived variant label to variant DataFrame for variants picked up from VCF
    isolate_variant_df['derived_label'] = derived_variant_label_list
    ##################################################################################################################################

#store as a CSV file in the same directory as the VCF file
## isolate_variant_df.to_csv('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/feature_CSV_files/' + isolate_tag + '_verified_features.csv')
isolate_variant_df.to_csv( OUT_directory '/' + isolate_tag + '_verified_features.csv')
##########################################################################################################################################################
avdixit commented 5 years ago

Files that are needed for above script (from Roger) H37Rv_annotation.zip predictive_variant_dataframes.zip dicts_for_SNP_annotation.zip

avdixit commented 5 years ago
##This R script takes the csv files output from the python script above and creates a genotype matrix to be used as an input for the WDNN model (author: Avika)

library(data.table)
temp <- list.files(path = "feature_CSV_files", pattern="*_verified_features.csv", recursive = TRUE)

myfiles <- lapply(temp, function(x) fread(file = paste("feature_CSV_files", x, sep = "/"), stringsAsFactors = FALSE))

feature_list <- readLines("feature_list_222.txt")

feature_matrix <- matrix(data = "0", nrow = length(myfiles), ncol = length(feature_list))

for(i in 1:length(myfiles)) {
  df <- data.frame(myfiles[[i]])
  for (j in 1:length(feature_list)) {
    if (!is.na(df[1,1])) {
      for(k in 1:nrow(df)) {
        if(feature_list[j] == df$flat_annot_label[k]){
          feature_matrix[i,j] <- df[k, "alt_allele_freq"] 
        } 
      }
    } 
  }
}

isolate_names <- unlist(lapply(strsplit(temp, "_"), `[[`, 1))
writeLines(isolate_names, "isolate_names.txt")
write.csv(feature_matrix, "feature_matrix_222.csv", row.names = FALSE)