TheFraserLab / ASEr

Get ASE counts from BAMs or raw fastq data -- repackage of pipeline by Carlo Artieri
MIT License

Allow use of GFF/GTF/VCF files instead of bed files. #7

Open MikeDacre opened 8 years ago

MikeDacre commented 8 years ago

It should be possible to write a simple library to transparently loop through any of those files, and it will make the pipeline easier to use.
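One way such a library could look (hypothetical names, not ASEr code): a generator that dispatches on the file extension and normalizes every record to a common `(chrom, start, end)` tuple. Note the coordinate conventions differ: BED is 0-based half-open, while GFF/GTF and VCF are 1-based.

```python
import os

def iter_intervals(path):
    """Yield (chrom, start, end) tuples from a BED, GFF/GTF, or VCF file.

    A hypothetical sketch of the transparent-looping idea: dispatch on the
    file extension and normalize every record to 0-based half-open intervals.
    """
    ext = os.path.splitext(path)[1].lower()
    with open(path) as infile:
        for line in infile:
            if line.startswith('#') or not line.strip():
                continue
            fields = line.rstrip('\n').split('\t')
            if ext == '.bed':
                # BED is already 0-based, half-open
                yield fields[0], int(fields[1]), int(fields[2])
            elif ext in ('.gff', '.gff3', '.gtf'):
                # GFF/GTF coordinates are 1-based, inclusive
                yield fields[0], int(fields[3]) - 1, int(fields[4])
            elif ext == '.vcf':
                # VCF POS is 1-based; the REF allele length gives the span
                pos = int(fields[1])
                yield fields[0], pos - 1, pos - 1 + len(fields[3])
            else:
                raise ValueError('Unsupported file type: ' + path)
```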

carloartieri commented 8 years ago

Here's a function I wrote for the purpose of reading GFF files in as pandas dataframes:

def gff_to_df(gff, up=True):
    """Return a pandas dataframe containing all of the data in a GFF file.
    See http://uswest.ensembl.org/info/website/upload/gff.html for GFF specifications

    Notes
    -----
    The first eight column headers are standard:

    ['CHROM', 'SOURCE', 'FEATURE', 'START', 'END', 'SCORE', 'STRAND', 'FRAME']

    Additional columns will be created for each value found in the ninth, attribute column.
    If up = True (default), the column headers will all be uppercase versions of the attribute
    names, e.g.:

    ['GENE_ID', 'TRANSCRIPT_ID', 'EXON']

    Lines that do not have attributes corresponding to a particular column are assigned None.

    REQUIREMENTS:
    import pandas as pd
    from collections import defaultdict
    """

    # Build a dictionary of columns; attribute keys become additional columns
    df_dict = defaultdict(list)
    opt_fields = []
    lines = 0
    constant_headers = ['CHROM', 'SOURCE', 'FEATURE', 'START', 'END',
                        'SCORE', 'STRAND', 'FRAME']
    with open(gff, 'r') as infile:
        for line in infile:
            # Skip comment/header lines, which lack the nine tab-delimited columns
            if line.startswith('#'):
                continue
            line_t = line.rstrip('\n').split('\t')
            for i in range(8):
                df_dict[constant_headers[i]].append(line_t[i])

            # Parse the attribute column, which uses GTF-style 'key "value";' pairs
            opt_dict = {}
            for opt in line_t[8].split(';'):
                if opt != '' and '"' in opt:
                    opt_dict[opt.split('"')[0].strip()] = opt.split('"')[1].strip()

            # Assign values for the current line, adding columns for new attribute keys
            curr_fields = []
            for key in opt_dict.keys():
                curr_fields.append(key)
                if key not in opt_fields:
                    df_dict[key].extend([None]*lines)
                    df_dict[key].append(opt_dict[key])
                    opt_fields.append(key)
                else:
                    df_dict[key].append(opt_dict[key])

            # Assign None for attribute keys missing from the current line
            for key in list(set(opt_fields) - set(curr_fields)):
                df_dict[key].append(None)

            lines += 1

    # Convert to dataframe
    df = pd.DataFrame.from_dict(df_dict, orient='columns')

    # Uppercase all column names
    if up is True:
        df.columns = [x.upper() for x in df.columns]

    # Make sure START and END are integers
    df['START'] = df['START'].astype(int)
    df['END'] = df['END'].astype(int)

    return df
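
One caveat on the attribute parsing: the split on double quotes assumes GTF-style `key "value";` pairs (the format described on the Ensembl page linked above). GFF3 files use `key=value` attributes and would need a different split. A minimal standalone illustration of what the quote-splitting extracts:

```python
# Demonstrates the attribute parsing used above: GTF-style 'key "value";' pairs.
attr_field = 'gene_id "YAL069W"; transcript_id "YAL069W_mRNA"; exon "1";'
opt_dict = {}
for opt in attr_field.split(';'):
    # Guard on the quote so stray fragments don't raise an IndexError
    if opt != '' and '"' in opt:
        opt_dict[opt.split('"')[0].strip()] = opt.split('"')[1].strip()
print(opt_dict)
# → {'gene_id': 'YAL069W', 'transcript_id': 'YAL069W_mRNA', 'exon': '1'}
```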

For VCFs, PyVCF is probably a good place to start. However, its data structures are somewhat difficult to follow, and we've found that reading VCFs into pandas tables and writing our own parsers is often easier and more straightforward.
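
As a sketch of that approach (hypothetical function name), the VCF body can be read straight into pandas by dropping the `##` meta-information lines and letting the `#CHROM` header line supply the column names:

```python
import io
import pandas as pd

def vcf_to_df(vcf_path):
    """Read the body of a VCF into a pandas dataframe.

    A minimal sketch: '##' meta-information lines are skipped, the '#CHROM'
    line supplies the header, and per-sample genotype columns are kept as
    raw strings for downstream parsing.
    """
    with open(vcf_path) as infile:
        # Keep only the header line and the records
        body = [line for line in infile if not line.startswith('##')]
    df = pd.read_csv(io.StringIO(''.join(body)), sep='\t')
    # Strip the leading '#' so the column matches the GFF dataframe's CHROM
    df = df.rename(columns={'#CHROM': 'CHROM'})
    return df
```

With the attribute-free columns named to match the GFF dataframe above, both formats could then feed the same downstream code.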

MikeDacre commented 8 years ago

Great, thanks so much @carloartieri. I agree with you that it is probably better just to parse the VCF directly; the format is pretty simple. I think I am going to add this function of yours, as well as a similar one for VCFs that returns an identical data frame.