Specification for GTF file

dylkot commented 6 years ago

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL: ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems: Missing transcript_name at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97) at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

Hoohm commented 6 years ago

Hey ! Glad you enjoy the tool. I've had another user having similar issue with arabidopsis annotation.

It would be nice to have some safety checks to deal with those issues. Have you finally managed to fix this problem? If yes, would share what you have done so that we can integrate it?

On Mon, Apr 23, 2018, 14:50 Dylan Kotliar notifications@github.com wrote:

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL: ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems: Missing transcript_name at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97) at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/30, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaCXqC84rNG6RXtCtXQCH8jlWBKA9ks5trc3zgaJpZM4Tf1KX .

dylkot commented 6 years ago

Hi. Yep, I fixed it by manually appending gene_name and transcript_name entries to the rows that were missing them in the GTF. I simply used the gene_id and transcript_id for those that were lacking them. I then found that gene_name (which is used by the pipeline as the gene identifier) was not unique in the Macaque annotation. So I overwrote the gene_name entry in the attribute column in the format geneid__genename so it would propagate information for both features through the pipeline. I'm not sure which of these features you would want to add into the package or if you would want to just indicate that gene_name and transcript_name are required attributes. In any case, I'm copying some relevant code here below:

` import pandas as pd import numpy as np import re

def append_gtf_attr(ingtf, outgtf):
    '''Adds gene_name and transcript_name fields to rows that are missing them. Saves gene_name 
       in the format geneid__genename'''
    gtf = pd.read_csv(ingtf, sep='\t', header=None, comment='#')

    ## Extract gene_id and transcript_id
    gtf['gid'] = gtf[8].apply(lambda x: re.findall('gene_id "(.+?)"', x)[0])
    gtf['tid'] = np.nan
    tidind = gtf[2] != 'gene'
    gtf.loc[tidind, 'tid'] = gtf.loc[tidind,8].apply(lambda x: re.findall('transcript_id "(.+?)"', x)[0])

    ## Add attr column to hold our updated attribute info
    gtf['attr'] = gtf[8].copy()

    ## Use gene_id as gene_name for rows that don't have gene_name in attr
    hasgn = gtf['attr'].apply(lambda x: 'gene_name' in x)
    gtf.loc[~hasgn, 'attr'] += ' gene_name "' + gtf.loc[~hasgn,'gid'] + '";'

    ## Use transcript_id as transcript_name for rows missing transcript_name
    hastn = gtf['attr'].apply(lambda x: 'transcript_name' in x)
    notnidx = ~hastn & (gtf[2]!='gene')
    gtf.loc[notnidx, 'attr'] += ' transcript_name "' + gtf.loc[notnidx,'tid'] + '";'

    ## Update gene_name to be geneid__genename format
    gtf['gene_name'] = gtf['attr'].apply(lambda x: re.findall('gene_name "(.+?)"', x)[0])
    gtf['newname'] = gtf['gid'] + '__' + gtf['gene_name']
    strbase = 'gene_name "%s"'
    gtf['attr'] = gtf.apply(lambda x: x['attr'].replace(strbase % x['gene_name'], strbase % x['newname']), axis=1)

    ## Drop extraneous columns and save resulting file
    gtf = gtf.drop([8, 'gid', 'tid', 'gene_name', 'newname'], axis=1)
    gtf.to_csv(outgtf, sep='\t', index=False, header=False, quoting=3)

`

Hoohm commented 5 years ago

Hey @dylkot ! I'm working on automating the reference and genome download generation. I'm not familiar with all possible species that people use for single cell and I would greatly appreciate help for specific ones such as Rhesus Macaque.

Would you be able to help integrate this once the new version is out?