Closed mahafarhat closed 3 years ago
I've created a repository and upgraded the script to work with the website:
The tensorflow requirement is now python 3.6 ONLY
which means this code needs converting.
Can we test it using python 3.6, it may be ok.
tensorflow package has all sorts of issues. I managed to get it working with tensorflow==0.12.1 which is the oldest possible version available. Anything newer would need the models to be regenerated.
After Megapipe addition, second task for David
Avika has written some R code to extract a matrix for resistance prediction from vcf files.
##This script takes vcf files and identifies presence of predictive variants (author: Roger)
##########################################################################################################################################################
#Important Packages
import vcf
import os
import pandas as pd
import numpy as np
import ast
import time
import sys
import pickle
import shutil
import Bio
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio import SeqIO
from StringIO import StringIO
from Bio import AlignIO
from Bio.Align import AlignInfo
from Bio.Seq import MutableSeq
##########################################################################################################################################################
##########################################################################################################################################################
#INPUTS
isolate_tag = sys.argv[1] #INPUT 1 - sample isolate tag
VCF_path = sys.argv[2] #INPUT 2 - path to VCF file to extract variants from
OUT_directory = sys.argv[3] #INPUT 3 - output directory to save CSV file
##########################################################################################################################################################
#Load DataFrames that contain information for specific variants
##########################################################################################################################################################
DEL_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_deletion_variants.pkl')
INS_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_insertion_variants.pkl')
SNP_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_SNP_variants.pkl')
MSNP_variants_of_interest_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/specific_predictive_MSNP_variants.pkl')
##########################################################################################################################################################
#Load DataFrame for general variant regions
##########################################################################################################################################################
variants_of_interest_broad_df = pd.read_pickle('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/CSV_files/predictive_variant_dataframes/general_predictive_variants.pkl')
##########################################################################################################################################################
#Use Michael's helper functions to broadly classify any specific mutations detected into a derived mutation category after searching VCFs
##########################################################################################################################################################
# Gene names
genes = ['ahpC', 'alr', 'ddl', 'embA', 'embB', 'embC', 'ethA', 'gid', 'gyrA', 'gyrB', 'inhA', 'iniA',
'iniB', 'iniC', 'kasA', 'katG', 'murA-rrs', 'fabG1', 'ndh', 'pncA', 'rpoB', 'rpsL', 'rrl',
'rrs', 'thyA', 'tlyA', 'gyrB-gyrA', 'gyrA-Rv0007', 'iniB-iniA-iniC', 'iniB-iniA', 'iniC-lpqJ',
'rpoB-rpoC', 'fabG1-inhA', 'rrs-rrl', 'rrl-rrf', 'inhA-hemZ', 'katG-furA', 'kasA-kasB',
'ahpC-ahpD', 'dfrA-thyA', 'alr-Rv3792', 'embA-embB', 'embB-Rv3796', 'menG-ethA', 'rpsA', 'eis',
"oxyR'", 'acpM']
# Get gene associated with particular mutation
def get_gene(x):
splitted = x.split("_")
if list(set(splitted).intersection(genes)) == []:
if 'ndhA' in splitted or 'mfd' in splitted or 'whiB6':
return None
#return None
raise Exception(splitted)
type = '-'.join(list(set(splitted).intersection(genes)))
return '-'.join(sorted(type.split('-')))
# Get gene dictionary
def get_gene_dict(derived_names):
gene_dict = {}
for snp in derived_names:
if not get_gene(snp):
continue
gene_dict.setdefault(get_gene(snp), []).append(snp)
return gene_dict
# Get dictionary by mutation location and type of mutation
def get_final_dict(gene_dict):
final_dict = {}
for gene, muts in gene_dict.iteritems():
for mut in muts:
split = mut.split("_")
if 'P' in split or 'I' in split:
if 'DEL' in split or 'INS' in split:
final_dict.setdefault(gene + '_NC_indel', []).append(mut)
# is noncoding indel
else:
final_dict.setdefault(gene + '_NC_snp', []).append(mut)
# is noncoding snp
elif 'F' in split or 'CF' in split:
final_dict.setdefault(gene + '_F_indel', []).append(mut)
# is coding frameshift
#ROGER's EDIT: have to make a modification for rrs & rrl ribosomal proteins since 'N' within these genes doesn't imply they are indel variants
elif ('CI' in split) or ('N' in split and gene != 'rrl' and gene != 'rrs') or ('NF' in split):
final_dict.setdefault(gene + '_NF_indel', []).append(mut)
# is coding not frameshift
else:
final_dict.setdefault(gene + '_C_snp', []).append(mut)
# is coding snp
return final_dict
##########################################################################################################################################################
#Relevant Information for H37Rv sequence and SNP functional annotation
##########################################################################################################################################################
####### Collect all DNA and Amino Acid sequences corresponding to genes on H37Rv #######
#load reference genome and reference annotation
reference_genome = '/n/data1/hms/dbmi/farhat/bin/work-horse/bin/h37rv.fasta'
for reference_genome in SeqIO.parse(reference_genome, "fasta"):
reference_genome.seq.alphabet = IUPAC.unambiguous_dna
reference_genome_annotation = pd.read_csv('/n/data1/hms/dbmi/farhat/bin/work-horse/bin/h37rv_genome_summary.txt', '\t').set_index('name')
####### Function to translate coding DNA sequences #######
def translate(gene_id, sequence):
#find which strand the gene is located on and translate
strand = reference_genome_annotation.loc[gene_id, 'strand']
if strand == '+':
protein_sequence = sequence.translate(table="Bacterial", cds=False)
elif strand == '-':
protein_sequence = sequence.reverse_complement().translate(table="Bacterial", cds=False)
return protein_sequence
####### Load in dictionaries for SNP annotation #######
with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_gene_seq_records.pickle', 'rb') as handle:
ref_gene_sequences_records = pickle.load(handle)
with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_protein_seq_records.pickle', 'rb') as handle:
ref_protein_sequences_records = pickle.load(handle)
with open('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/pickled_objects/dicts_for_SNP_annotation/H37Rv_coord_gene_mapping.pickle', 'rb') as handle:
ReferencePosition_Gene_mapping = pickle.load(handle)
##########################################################################################################################################################
#Function to determine whether SNPs are Synonymous or Non-Synonymous
##########################################################################################################################################################
def SNP_type(ref_seq_position , alt_allele_i):
'''
This function takes as input a reference position on H37Rv located within a
gene and an alternate allele and returns whether the base change
would correspond to a different Amino Acid sequence that results
from translating the DNA sequence into an AA sequence.
'''
Syn_NSyn_list = []
#find the gene that SNP occurs on
#check list corresponding to H37Rv coordinate to see if there are any genes associated with RefPosition
if len(ReferencePosition_Gene_mapping[ref_seq_position]) > 0:
#iterate through all genes that ReferencePosition is mapped to (i.e. SNP might correspond to 2 genes)
for gene_intergenic_id in ReferencePosition_Gene_mapping[ref_seq_position]:
#find genomic coordinate of SNP relative to gene (subtract 1 since reference seq starts counting at 1)
gene_relative_coord = (ref_seq_position - 1) - min( reference_genome_annotation.loc[gene_intergenic_id , 'chromStart'] , reference_genome_annotation.loc[gene_intergenic_id , 'chromEnd'] )
#translate into protein sequence with the SNP in place if not InDel or intergenic region
SNP_change = alt_allele_i
#ALTERNATE allele (is it Syn or NSyn?)
#get sequence from dictionary of sequences (and convert to mutable object)
test_gene_sequence = ref_gene_sequences_records[gene_intergenic_id].seq.tomutable()
#change reference gene sequence by the SNP in the query sequence
test_gene_sequence[int(gene_relative_coord)] = SNP_change
#convert back immutable object
test_gene_sequence = test_gene_sequence.toseq()
#translate sequence into amino acid seq
test_protein_sequence = translate(gene_intergenic_id , test_gene_sequence)
#compare to AA seq of original gene
if test_protein_sequence == ref_protein_sequences_records[gene_intergenic_id].seq:
SNP_type = 'S'
else:
SNP_type = 'N'
Syn_NSyn_list.append(SNP_type)
#if reference position corresponds to 2 overlapping genes, then call SNP an nSNP if at least 1/2 gene changes alters the AA sequence
if 'N' in Syn_NSyn_list:
return 'Non-Synonymous'
else:
return 'Synonymous'
##########################################################################################################################################################
#Function that checks whether a variant fits the criteria for a general/derived predictive variant
##########################################################################################################################################################
def general_predicative_variant_check(ref_pos_i , variant_type_check , alt_allele_i):
'''
This function takes as input the H37Rv reference position for a SNP or an INDEL
and checks to see if the variant meets the classification criteria for a derived/general
predictive variant.
'''
derived_labels_for_variant = [] #there may be a variant at this reference position that fits multiple derived labels
####### SNPs #######
if variant_type_check == 'SNP':
general_variant_regions_SNPs_df = variants_of_interest_broad_df[variants_of_interest_broad_df.type == 'snp']
#check each variant region to see if reference position falls within
for general_variant_i in general_variant_regions_SNPs_df.index:
H37Rv_coords_for_region = general_variant_regions_SNPs_df.loc[general_variant_i , 'H37Rv_coords']
#does the reference position fall within bounds?
if (H37Rv_coords_for_region[0] + 1) <= ref_pos_i <= (H37Rv_coords_for_region[1] + 1):
#variant falls within bounds and is NON-CODING
if general_variant_regions_SNPs_df.loc[general_variant_i , 'coding_noncoding'] != 'coding':
derived_labels_for_variant.append( general_variant_regions_SNPs_df.loc[general_variant_i , 'derived_label'] )
#If CODING region, only store NON-SYNONYMOUS SNPs
elif general_variant_regions_SNPs_df.loc[general_variant_i , 'coding_noncoding'] == 'coding':
if SNP_type(ref_pos_i , alt_allele_i) == 'Non-Synonymous':
derived_labels_for_variant.append( general_variant_regions_SNPs_df.loc[general_variant_i , 'derived_label'] )
####### INDELs #######
if variant_type_check == 'INDEL':
general_variant_regions_INDELs_df = variants_of_interest_broad_df[variants_of_interest_broad_df.type == 'indel']
#check each variant region to see if reference position falls within
for general_variant_i in general_variant_regions_INDELs_df.index:
H37Rv_coords_for_region = general_variant_regions_INDELs_df.loc[general_variant_i , 'H37Rv_coords']
#does the reference position fall within bounds?
if (H37Rv_coords_for_region[0] + 1) <= ref_pos_i <= (H37Rv_coords_for_region[1] + 1):
#variant falls within bounds and is NON-CODING
if (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] != 'frameshift') and (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] != 'non_frameshift'):
derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )
#If CODING region, check classification as FRAMESHIFT
elif (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] == 'frameshift') and (float(len(alt_allele_i)) % 3.0 != 0.0):
derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )
#If CODING region, check classification as NON-FRAMESHIFT
elif (general_variant_regions_INDELs_df.loc[general_variant_i , 'coding_noncoding'] == 'non_frameshift') and (float(len(alt_allele_i)) % 3.0 == 0.0):
derived_labels_for_variant.append( general_variant_regions_INDELs_df.loc[general_variant_i , 'derived_label'] )
#since some variants may be classified into multiple general variant regions with the same derived label (i.e. SNPs in intergenic regions / promoter regions), get rid of duplicates
derived_labels_for_variant = list(set(derived_labels_for_variant))
if len(derived_labels_for_variant) > 0:
return derived_labels_for_variant[0]
elif len(derived_labels_for_variant) == 0:
return 'not_predictive_variant'
##########################################################################################################################################################
#Extract Mutations of Interest from VCFs
##########################################################################################################################################################
#directory that stores files for each sequenced isolate
#antibiogram_dir = '/n/data1/hms/dbmi/farhat/avika/antibiogram/feature_vcfs/'
#VCF_path = antibiogram_dir + isolate_tag + '/' + isolate_tag + '_feature.vcf'
#create a DataFrame that stores the variants of interest
#dictionaries to construct DataFrame from later that hold information about each high-quality SNP found for a specific isolate from rollingDB
variant_index_list = []
variant_label_dict = {}
ref_pos_dict = {}
variant_alt_allele_dict = {}
variant_alt_freq_dict = {}
BQ_dict = {}
MQ_dict = {}
#iterate through all mixed calls for all isolates
variant_index = 0
#load in VCF file
vcf_reader = vcf.Reader( open(VCF_path , 'r') )
#iterate through each Variant Call
for record in vcf_reader:
######################################################################
#check to see if DELETION
######################################################################
# > length of the reference sequence is > 1
# > there is only 1 alternate allele
# > the alternate allele is not NONE
# > length of the alternate sequence = 1
if (len(record.REF) > 1) and (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) == 1):
#get relevant stats
ref_pos_i = record.POS
deleted_seq_i = record.REF[1:]
########################################################################
#### check to see if this is a predictive deletion from the list of SPECIFIC variants of interest
########################################################################
variant_classified = False
#reference position matches one of the deletions
if ref_pos_i in list(DEL_variants_of_interest_df.ref_pos):
#list of predictive deleted sequences at this reference position
alt_alleles_at_ref_pos_i = list( DEL_variants_of_interest_df[DEL_variants_of_interest_df.ref_pos == ref_pos_i].seq_deleted )
#check that the deleted sequence matches one of the deletion variants at this reference position
if deleted_seq_i in alt_alleles_at_ref_pos_i:
alt_allele_i = deleted_seq_i
#check to see if DEL has standard INFO information
if 'DC' in record.INFO.keys():
#calculate allele frequency of deletion
del_freq = float(record.INFO['DC']) / float(record.INFO['DP'])
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
else:
del_freq = np.nan
BQ = np.nan
MQ = np.nan
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(DEL_variants_of_interest_df.ref_pos == ref_pos_i) , list(DEL_variants_of_interest_df.seq_deleted == alt_allele_i))]
variant_label_dict[variant_index] = DEL_variants_of_interest_df[flat_annot_label_filter].label.values[0]
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = del_freq
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
variant_classified = True
########################################################################
#### check to see if this is a predictive deletion from the list of GENERAL variants of interest
########################################################################
if variant_classified == False:
derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'INDEL' , deleted_seq_i)
#variant is classified as a general predicitive variant
if derived_general_variant_label != 'not_predictive_variant':
alt_allele_i = deleted_seq_i
#check to see if DEL has standard INFO information
if 'DC' in record.INFO.keys():
#calculate allele frequency of deletion
del_freq = float(record.INFO['DC']) / float(record.INFO['DP'])
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
else:
del_freq = np.nan
BQ = np.nan
MQ = np.nan
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
variant_label_dict[variant_index] = derived_general_variant_label
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = del_freq
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
######################################################################
#check to see if INSERTION
######################################################################
# > length of the reference sequence = 1
# > there is only 1 alternate allele
# > the alternate allele is not NONE
# > length of the alternate sequence > 1
elif (len(record.REF) == 1) and (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) > 1):
#get relevant stats
ref_pos_i = record.POS
inserted_seq_i = str(record.ALT[0])[1:]
########################################################################
#### check to see if this is a predictive insertion from the list of SPECIFIC variants of interest
########################################################################
variant_classified = False
#reference position matches one of the insertions
if ref_pos_i in list(INS_variants_of_interest_df.ref_pos):
#list of predictive inserted sequences at this reference position
alt_alleles_at_ref_pos_i = list( INS_variants_of_interest_df[INS_variants_of_interest_df.ref_pos == ref_pos_i].seq_inserted )
#check that the inserted sequence matches one of the insertion variants at this reference position
if inserted_seq_i in alt_alleles_at_ref_pos_i:
alt_allele_i = inserted_seq_i
#check to see if INS has standard INFO information
if 'IC' in record.INFO.keys():
#calculate allele frequency of deletion
ins_freq = float(record.INFO['IC']) / float(record.INFO['DP'])
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
else:
ins_freq = np.nan
BQ = np.nan
MQ = np.nan
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(INS_variants_of_interest_df.ref_pos == ref_pos_i) , list(INS_variants_of_interest_df.seq_inserted == alt_allele_i))]
variant_label_dict[variant_index] = INS_variants_of_interest_df[flat_annot_label_filter].label.values[0]
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = ins_freq
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
variant_classified = True
################################################
#### check to see if this is a predictive insertion from the list of GENERAL variants of interest
################################################
if variant_classified == False:
derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'INDEL' , inserted_seq_i)
#variant is classified as a general predicitive variant
if derived_general_variant_label != 'not_predictive_variant':
alt_allele_i = inserted_seq_i
#check to see if DEL has standard INFO information
if 'IC' in record.INFO.keys():
#calculate allele frequency of deletion
ins_freq = float(record.INFO['IC']) / float(record.INFO['DP'])
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
else:
ins_freq = np.nan
BQ = np.nan
MQ = np.nan
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
variant_label_dict[variant_index] = derived_general_variant_label
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = ins_freq
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
######################################################################
#check to see if SNP (single nucleotide change)
######################################################################
# > length of the reference sequence = 1
# > there is only 1 alternate allele
# > the alternate allele is either NONE (if alternate allele has few reads supporting it)
# or
# > length of alternate sequence = 1
elif (len(record.REF) == 1) and (len(record.ALT) == 1) and ( (record.ALT == [None]) or (len(str(record.ALT[0])) == 1) ):
#get relevant stats
ref_pos_i = record.POS
################################################
#### check to see if this reference position is included in the list of predictive SNPs from the list of SPECIFIC variants of interest
################################################
variant_classified = False
#reference position matches one of the SNPs
if ref_pos_i in list(SNP_variants_of_interest_df.ref_pos):
# Quality-Percentage - Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus
QP = record.INFO['QP']
#boolean for alleles that have at least 1% of reads supporting
alleles_supported_by_reads_filter = np.array(QP) > 0
#alleles that are supported by reads
alleles_supported_by_reads = list( np.array(['A' , 'C' , 'G' , 'T'])[alleles_supported_by_reads_filter] )
#list of predictive SNPs (base changes) at this reference position
alt_alleles_at_ref_pos_i = list( SNP_variants_of_interest_df[SNP_variants_of_interest_df.ref_pos == ref_pos_i].base_change )
#iterate through all alleles (bases) supported at this reference position
for base_change_i in alleles_supported_by_reads:
#check that the base change matches one of the SNP variants at this reference position
if base_change_i in alt_alleles_at_ref_pos_i:
alt_allele_i = base_change_i
#get the quality-weighted proportion of reads that support the alternate base (if any)
#position-base dictionaries (order of output in Pilon)
base_order = {0:'A' , 1:'C' , 2:'G', 3:'T'}
base_order_r = {'A':0 , 'C':1 , 'G':2 , 'T':3}
#position of the Alternate Allele in [A,C,G,T]
alt_allele_base_i = base_order_r[alt_allele_i]
alt_allele_frequency = float(QP[alt_allele_base_i]) / 100.0 #weighted percentage of reads supporting Alternate Allele
#check that at least 5% of the reads support the alternate allele, if so store information for this detected SNP
if alt_allele_frequency >= 0.05:
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(SNP_variants_of_interest_df.ref_pos == ref_pos_i) , list(SNP_variants_of_interest_df.base_change == alt_allele_i))]
variant_label_dict[variant_index] = SNP_variants_of_interest_df[flat_annot_label_filter].label.values[0]
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = alt_allele_frequency
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
variant_classified = True
################################################
#### check to see if this is a predictive SNP from the list of GENERAL variants of interest
################################################
if variant_classified == False:
# Quality-Percentage - Percentage of As, Cs, Gs, Ts weighted by Q & MQ at locus
QP = record.INFO['QP']
#boolean for alleles that have at least 1% of reads supporting
alleles_supported_by_reads_filter = np.array(QP) > 0
#alleles that are supported by reads
alleles_supported_by_reads = list( np.array(['A' , 'C' , 'G' , 'T'])[alleles_supported_by_reads_filter] )
#list of predictive SNPs (base changes) at this reference position
alt_alleles_at_ref_pos_i = list( SNP_variants_of_interest_df[SNP_variants_of_interest_df.ref_pos == ref_pos_i].base_change )
#iterate through all alleles (bases) supported at this reference position
for base_change_i in alleles_supported_by_reads:
#only check Alternate Alleles
if base_change_i != str(record.REF):
derived_general_variant_label = general_predicative_variant_check(ref_pos_i , 'SNP' , base_change_i)
#variant is classified as a general predicitive variant
if derived_general_variant_label != 'not_predictive_variant':
alt_allele_i = base_change_i
#get the quality-weighted proportion of reads that support the alternate base (if any)
#position-base dictionaries (order of output in Pilon)
base_order = {0:'A' , 1:'C' , 2:'G', 3:'T'}
base_order_r = {'A':0 , 'C':1 , 'G':2 , 'T':3}
#position of the Alternate Allele in [A,C,G,T]
alt_allele_base_i = base_order_r[alt_allele_i]
alt_allele_frequency = float(QP[alt_allele_base_i]) / 100.0 #weighted percentage of reads supporting Alternate Allele
#check that at least 5% of the reads support the alternate allele, if so store information for this detected SNP
if alt_allele_frequency >= 0.05:
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
variant_label_dict[variant_index] = derived_general_variant_label
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = alt_allele_frequency
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
######################################################################
#check to see if SNP (multiple nucleotide change)
######################################################################
# > there is only 1 alternate allele
# > the alternate allele is not NONE
# > length of the alternate sequence > 1
elif (len(record.ALT) == 1) and (record.ALT != [None]) and (len(str(record.ALT[0])) > 1):
#get relevant stats
ref_pos_i = record.POS
changed_seq_i = str(record.ALT[0])
#check to see if this is a predictive MSNP from the list of variants of interest
#reference position (+/- 1) matches one of the MSNP ref positions
if ref_pos_i in list(MSNP_variants_of_interest_df.ref_pos):
pass
elif (ref_pos_i - 1) in list(MSNP_variants_of_interest_df.ref_pos):
ref_pos_i = ref_pos_i - 1
elif (ref_pos_i + 1) in list(MSNP_variants_of_interest_df.ref_pos):
ref_pos_i = ref_pos_i + 1
else:
ref_pos_i = 'NOT FOUND' #reference position +/- 1 for this variant doesn't match any MSNPs
#if the reference position +/- 1 matches any of the MSNPs, check to see if alternate allele sequences match
if ref_pos_i != 'NOT FOUND':
#list of MSNPs (seq changes) at this reference position
alt_alleles_at_ref_pos_i = list( MSNP_variants_of_interest_df[MSNP_variants_of_interest_df.ref_pos == ref_pos_i].seq_change )
#check that the sequence change matches one of the MSNP variants at this reference position
if changed_seq_i in alt_alleles_at_ref_pos_i:
alt_allele_i = changed_seq_i
#check to see if MSNP has standard INFO information
if 'AF' in record.INFO.keys():
#calculate allele frequency of multi-nucleotide SNP
MSNP_freq = float(record.INFO['AF'][0])
BQ = record.INFO['BQ']
MQ = record.INFO['MQ']
else:
MSNP_freq = np.nan
BQ = np.nan
MQ = np.nan
#store info for predictive variants found into a DF
variant_index_list.append(variant_index)
#make sure that reference position & alt allele match
flat_annot_label_filter = [(ref_pos_for_variant_i and alt_allele_for_variant_i) for ref_pos_for_variant_i,alt_allele_for_variant_i in zip(list(MSNP_variants_of_interest_df.ref_pos == ref_pos_i) , list(MSNP_variants_of_interest_df.seq_change == alt_allele_i))]
variant_label_dict[variant_index] = MSNP_variants_of_interest_df[flat_annot_label_filter].label.values[0]
ref_pos_dict[variant_index] = ref_pos_i
variant_alt_allele_dict[variant_index] = alt_allele_i
variant_alt_freq_dict[variant_index] = MSNP_freq
BQ_dict[variant_index] = BQ
MQ_dict[variant_index] = MQ
variant_index += 1
#construct DataFrame from dictionaries
isolate_variant_df = pd.DataFrame( index = variant_index_list )
isolate_variant_df['flat_annot_label'] = pd.Series(variant_label_dict)
isolate_variant_df['ref_pos'] = pd.Series(ref_pos_dict)
isolate_variant_df['alt_allele'] = pd.Series(variant_alt_allele_dict)
isolate_variant_df['alt_allele_freq'] = pd.Series(variant_alt_freq_dict)
isolate_variant_df['base_quality'] = pd.Series(BQ_dict)
isolate_variant_df['map_quality'] = pd.Series(MQ_dict)
#drop duplicate rows & re-index
isolate_variant_df = isolate_variant_df[ [not duplicated for duplicated in list(isolate_variant_df.duplicated(subset = 'flat_annot_label' , keep = 'first'))] ]
isolate_variant_df.reset_index(inplace = True , drop = True)
#check to make sure that there was at least 1 variant that was picked up in VCF
if np.shape(isolate_variant_df)[0] > 0:
#classify each variant with a derived variant label from Michael's functions
##################################################################################################################################
#get the flat annotator labels for all mutations that were picked up within the VCF
flat_annotator_labels_for_variants_in_VCF = list( isolate_variant_df.flat_annot_label )
#use Michael's functions to create a dictionary of mutations: keys - derived mutation labels , values - flat annotator labels for each mutation picked up in VCF
derived_variant_label_dict = get_final_dict(get_gene_dict(flat_annotator_labels_for_variants_in_VCF))
#reverse the dictionary so that: keys - flat annotator labels , values - derived mutation labels
flat_annot_variant_label_dict = {}
#iterate through derived variants
for derived_variant_label in derived_variant_label_dict.keys():
#for each derived variant, iterate through variants from VCF that classify
for flat_annot_variant_label in derived_variant_label_dict[derived_variant_label]:
#append flat annot label <-> derived label mapping to dict
flat_annot_variant_label_dict[flat_annot_variant_label] = derived_variant_label
#create a column in variant DataFrame to store the derived variant label
derived_variant_label_list = []
for flat_annot_label in flat_annotator_labels_for_variants_in_VCF:
#get derived variant label, if it exists
try:
derived_variant_label_list.append( flat_annot_variant_label_dict[flat_annot_label] )
#store placeholder if derived variant label doesn't exist (just in case...)
except KeyError:
derived_variant_label_list.append( 'no_derived_variant_label' )
#append derived variant label to variant DataFrame for variants picked up from VCF
isolate_variant_df['derived_label'] = derived_variant_label_list
##################################################################################################################################
#store as a CSV file in the same directory as the VCF file
## isolate_variant_df.to_csv('/n/data1/hms/dbmi/farhat/Roger/antibiogram_project/feature_CSV_files/' + isolate_tag + '_verified_features.csv')
isolate_variant_df.to_csv( OUT_directory '/' + isolate_tag + '_verified_features.csv')
##########################################################################################################################################################
Files that are needed for above script (from Roger) H37Rv_annotation.zip predictive_variant_dataframes.zip dicts_for_SNP_annotation.zip
##This R script takes the csv files output from the python script above and creates a genotype matrix to be used as an input for the WDNN model (author: Avika)
library(data.table)
temp <- list.files(path = "feature_CSV_files", pattern="*_verified_features.csv", recursive = TRUE)
myfiles <- lapply(temp, function(x) fread(file = paste("feature_CSV_files", x, sep = "/"), stringsAsFactors = FALSE))
feature_list <- readLines("feature_list_222.txt")
feature_matrix <- matrix(data = "0", nrow = length(myfiles), ncol = length(feature_list))
for(i in 1:length(myfiles)) {
df <- data.frame(myfiles[[i]])
for (j in 1:length(feature_list)) {
if (!is.na(df[1,1])) {
for(k in 1:nrow(df)) {
if(feature_list[j] == df$flat_annot_label[k]){
feature_matrix[i,j] <- df[k, "alt_allele_freq"]
}
}
}
}
}
isolate_names <- unlist(lapply(strsplit(temp, "_"), `[[`, 1))
writeLines(isolate_names, "isolate_names.txt")
write.csv(feature_matrix, "feature_matrix_222.csv", row.names = FALSE)
Python script and sample data from prediction from genotypic data (that can be obtained from the vcf) to a prediction in probabilities that can be displayed on the predict page. To replace TBpredict.R portion of the pipeline
This script requires Keras 1.2.0 and TensorFlow 0.10.0
We need to likely develop new code to extract relevant format from VCF
We need to likely develop new code to extract relevant output data into json format for display on the webtool in the API.
saved_model_0319.zip