MikeDacre opened 8 years ago
Here's a function I wrote to read GFF files into pandas DataFrames:
import pandas as pd
from collections import defaultdict

def gff_to_df(gff, up=True):
    """Return a pandas DataFrame containing all of the data in a GFF file.

    See http://uswest.ensembl.org/info/website/upload/gff.html for the GFF
    specification.

    Notes
    -----
    The first eight column headers are standard:
        ['CHROM', 'SOURCE', 'FEATURE', 'START', 'END', 'SCORE', 'STRAND', 'FRAME']
    An additional column is created for each attribute name found in the
    ninth (attribute) column. If up=True (the default), the column headers
    are uppercased versions of the attribute names, e.g.:
        ['GENE_ID', 'TRANSCRIPT_ID', 'EXON']
    Lines that lack an attribute corresponding to a particular column are
    assigned None in that column.
    """
    constant_headers = ['CHROM', 'SOURCE', 'FEATURE', 'START', 'END',
                        'SCORE', 'STRAND', 'FRAME']
    # Build a dictionary of columns
    df_dict = defaultdict(list)
    opt_fields = []
    lines = 0
    with open(gff, 'r') as infile:
        for line in infile:
            # Skip comment/header lines
            if line.startswith('#'):
                continue
            line_t = line.rstrip('\n').split('\t')
            for i in range(8):
                df_dict[constant_headers[i]].append(line_t[i])
            # Parse the attribute column (GTF-style 'key "value";' pairs)
            opt_dict = {}
            for opt in line_t[8].split(';'):
                opt = opt.strip()
                if opt and '"' in opt:
                    key, value = opt.split('"')[0:2]
                    opt_dict[key.strip()] = value.strip()
            # Assign values for the current line, creating new columns as needed
            curr_fields = []
            for key, value in opt_dict.items():
                curr_fields.append(key)
                if key not in opt_fields:
                    # New attribute: back-fill earlier lines with None
                    df_dict[key].extend([None] * lines)
                    opt_fields.append(key)
                df_dict[key].append(value)
            # Assign None for attributes missing from the current line
            for key in set(opt_fields) - set(curr_fields):
                df_dict[key].append(None)
            lines += 1
    # Convert to a DataFrame
    df = pd.DataFrame.from_dict(df_dict, orient='columns')
    if up:
        # Uppercase all column names
        df.columns = [x.upper() for x in df.columns]
    # Make sure START and END are integers
    df['START'] = df['START'].astype(int)
    df['END'] = df['END'].astype(int)
    return df
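To illustrate the attribute-parsing step in isolation, here is the same split-on-`;`, split-on-`"` logic applied to a single GTF-style attribute field (the sample string is illustrative):

```python
# Parse one GTF-style attribute field into a dict, the same way the
# function above does per line (sample string is made up):
attrs = 'gene_id "ENSG001"; transcript_id "ENST001";'
opt_dict = {}
for opt in attrs.split(';'):
    opt = opt.strip()
    if opt and '"' in opt:
        key, value = opt.split('"')[0:2]
        opt_dict[key.strip()] = value.strip()
print(opt_dict)  # → {'gene_id': 'ENSG001', 'transcript_id': 'ENST001'}
```

Note this assumes GTF-style quoted attributes (`key "value";`); GFF3-style `key=value` attributes would need a different split.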
For VCFs, pyvcf is probably a good place to start. However, its data structures are somewhat difficult to follow and we've found that reading VCFs as pandas tables and writing our own parsers is often easier and more straightforward.
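A minimal sketch of the read-VCFs-as-pandas-tables approach, assuming only the standard VCF layout (the inline VCF text and the `vcf_to_df` name are illustrative, not an existing API):

```python
import io
import pandas as pd

# Tiny illustrative VCF: '##' meta lines, then a '#CHROM' header line,
# then tab-separated records.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=20\n"
    "1\t200\trs2\tC\tT\t99\tPASS\tDP=35\n"
)

def vcf_to_df(handle):
    # Drop '##' meta lines, keep the '#CHROM' line as the header,
    # and let pandas parse the rest.
    lines = [l for l in handle if not l.startswith('##')]
    df = pd.read_csv(io.StringIO(''.join(lines)), sep='\t')
    return df.rename(columns={'#CHROM': 'CHROM'})

df = vcf_to_df(io.StringIO(vcf_text))
print(df['POS'].tolist())  # → [100, 200]
```

Splitting out the per-sample genotype columns would take more work, but for site-level fields this is often all that's needed.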
Great, thanks so much @carloartieri. I agree with you that it's probably better just to parse the VCF directly; the format is pretty simple. I think I'm going to add this function of yours, as well as a similar one for VCFs that returns an identical data frame.
It should be possible to write a simple library to transparently loop through any of those files, and it will make the pipeline easier to use.
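One way such a library could dispatch transparently is by sniffing the file extension; this is just a sketch, and the function name and extension list are assumptions, not anything in the pipeline:

```python
import os

def sniff_format(path):
    """Return 'gff' or 'vcf' based on the file extension.

    A sketch only; a real implementation might also peek at the
    first line of the file to confirm the format.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext in ('.gff', '.gtf', '.gff3'):
        return 'gff'
    if ext == '.vcf':
        return 'vcf'
    raise ValueError('Unrecognized file type: %s' % path)

print(sniff_format('sample.gff3'))  # → gff
print(sniff_format('calls.vcf'))    # → vcf
```

The sniffed format can then be mapped to the matching parser so callers get the same DataFrame layout either way.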