Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Specification for GTF file #30

Open dylkot opened 6 years ago

dylkot commented 6 years ago

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL: ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems: Missing transcript_name at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97) at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

Hoohm commented 6 years ago

Hey ! Glad you enjoy the tool. I've had another user having similar issue with arabidopsis annotation.

It would be nice to have some safety checks to deal with those issues. Have you finally managed to fix this problem? If yes, would share what you have done so that we can integrate it?

On Mon, Apr 23, 2018, 14:50 Dylan Kotliar notifications@github.com wrote:

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I seek to run this pipeline on a Rhesus Macaque sample. I'm downloading the GTF from ENSEMBL: ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column. This was leading to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF

I manually added a gene_name attribute but now I'm getting an exception "Missing transcript_name"

Problems: Missing transcript_name at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97) at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39) at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71) at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57) at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166) at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94) at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there was a similar error message for gene_name missing and if there was documentation in the "reference files" section indicating that the transcript_name and gene_name attributes are required and may not be included directly in the ensembl download.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Hoohm/dropSeqPipe/issues/30, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaCXqC84rNG6RXtCtXQCH8jlWBKA9ks5trc3zgaJpZM4Tf1KX .

dylkot commented 6 years ago

Hi. Yep, I fixed it by manually appending gene_name and transcript_name entries to the rows that were missing them in the GTF. I simply used the gene_id and transcript_id for those that were lacking them. I then found that gene_name (which is used by the pipeline as the gene identifier) was not unique in the Macaque annotation. So I overwrote the gene_name entry in the attribute column in the format geneid__genename so it would propagate information for both features through the pipeline. I'm not sure which of these features you would want to add into the package or if you would want to just indicate that gene_name and transcript_name are required attributes. In any case, I'm copying some relevant code here below:

` import pandas as pd import numpy as np import re

def append_gtf_attr(ingtf, outgtf):
    '''Adds gene_name and transcript_name fields to rows that are missing them. Saves gene_name 
       in the format geneid__genename'''
    gtf = pd.read_csv(ingtf, sep='\t', header=None, comment='#')

    ## Extract gene_id and transcript_id
    gtf['gid'] = gtf[8].apply(lambda x: re.findall('gene_id "(.+?)"', x)[0])
    gtf['tid'] = np.nan
    tidind = gtf[2] != 'gene'
    gtf.loc[tidind, 'tid'] = gtf.loc[tidind,8].apply(lambda x: re.findall('transcript_id "(.+?)"', x)[0])

    ## Add attr column to hold our updated attribute info
    gtf['attr'] = gtf[8].copy()

    ## Use gene_id as gene_name for rows that don't have gene_name in attr
    hasgn = gtf['attr'].apply(lambda x: 'gene_name' in x)
    gtf.loc[~hasgn, 'attr'] += ' gene_name "' + gtf.loc[~hasgn,'gid'] + '";'

    ## Use transcript_id as transcript_name for rows missing transcript_name
    hastn = gtf['attr'].apply(lambda x: 'transcript_name' in x)
    notnidx = ~hastn & (gtf[2]!='gene')
    gtf.loc[notnidx, 'attr'] += ' transcript_name "' + gtf.loc[notnidx,'tid'] + '";'

    ## Update gene_name to be geneid__genename format
    gtf['gene_name'] = gtf['attr'].apply(lambda x: re.findall('gene_name "(.+?)"', x)[0])
    gtf['newname'] = gtf['gid'] + '__' + gtf['gene_name']
    strbase = 'gene_name "%s"'
    gtf['attr'] = gtf.apply(lambda x: x['attr'].replace(strbase % x['gene_name'], strbase % x['newname']), axis=1)

    ## Drop extraneous columns and save resulting file
    gtf = gtf.drop([8, 'gid', 'tid', 'gene_name', 'newname'], axis=1)
    gtf.to_csv(outgtf, sep='\t', index=False, header=False, quoting=3)

`

Hoohm commented 5 years ago

Hey @dylkot ! I'm working on automating the reference and genome download generation. I'm not familiar with all possible species that people use for single cell and I would greatly appreciate help for specific ones such as Rhesus Macaque.

Would you be able to help integrate this once the new version is out?

dylkot commented 5 years ago

Sure! I'd be happy to help with integration for Rhesus Macaque. By 'new version', are you referring to incorporating Drop-seq tools 2.0? I was just starting to wonder about how I will use that going forward because I would especially like to have quantitation of the intronic reads in my data

Hoohm commented 5 years ago

Version 0.4 and many more things.

Which functionnality would you specifically need?

dylkot commented 5 years ago

Exciting! Really just the ability to generate a count matrices for reads falling in introns in addition to reads falling in the coding sequence.

Hoohm commented 5 years ago

Hey! If you want to try out the new feature for automatic mixed species download, merging and generation, you can test out the new 0.4 version!

dylkot commented 5 years ago

Awesome, I just got a new load of data and so I'll run it with the new version! I'll let you know how it goes.

dylkot commented 5 years ago

Is there documentation anywhere on how to setup the binaries to run the new version? When I try to run it with the environment that was working for the previous version of dropSeqPipe, I get the error:

/bin/bash: cutadapt: command not found

I can install cutadapt but I imagine there might be several changes to the environment including downloading the next version of Drop-seq_tools and I just want to make sure I do all of that before I try to run things.

Hoohm commented 5 years ago

are you adding --use-conda to the command? Because that would download and activate the envs for each rule and you should not get this kind of errors.

dylkot commented 5 years ago

Nope, I wasn't. Sorry I missed that! Will give it a try.