XPRESSyourself / XPRESSpipe

An alignment and analysis pipeline for Ribosome Profiling and RNA-seq data
https://xpresspipe.readthedocs.io/en/latest/
GNU General Public License v3.0
12 stars 4 forks source link

curateReference gtfModify issue #39

Closed cjdecker closed 4 years ago

cjdecker commented 4 years ago

I ran the following command to curate reference for analyzing ribosome profiling data.

xpresspipe curateReference --output ./ --fasta genome_fasta/ --gtf ./transcripts.gtf --protein_coding --truncate --longest_transcript --sjdbOverhang 37

And I got this message before the process ended File "/Users/carolyndecker/anaconda2/envs/xpresspipe/lib/python3.7/site-packages/XPRESSpipe-0.6.0-py3.7.egg/xpresspipe/gtfModify.py", line 244, in get_chunks raise Exception('No gene or transcript records found in GTF') Exception: No gene or transcript records found in GTF

I noticed in the geneInfo.tab file that it says "MissingGeneType" $ head -3 geneInfo.tab 70481 NM_001276351 NM_001276351 MissingGeneType NR_075077 NR_075077 MissingGeneType

I am collaborating with a group on this project and they use hg38_refseq gtf file rather than ensembl gtf. I am wondering if this is an issue with XPRESSpipe not being able to parse required information from the refseq gtf (see below example lines from transcripts.gtf file).

Would I be able to get around this issue by modifying the geneInfo.tab file so that transcripts with NM prefix are assigned as protein coding and NR non coding?

Appreciate your help in resolving this issue, Carolyn $ head transcripts.gtf chr1 hg38_refGene stop_codon 67093005 67093007 0.000000 - . gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene CDS 67093008 67093604 0.000000 - 0 gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene exon 67092166 67093604 0.000000 - . gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene CDS 67095235 67095421 0.000000 - 1 gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene exon 67095235 67095421 0.000000 - . gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene CDS 67096252 67096321 0.000000 - 2 gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene exon 67096252 67096321 0.000000 - . gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene CDS 67115352 67115464 0.000000 - 1 gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene exon 67115352 67115464 0.000000 - . gene_id "NM_001276351"; transcript_id "NM_001276351"; chr1 hg38_refGene CDS 67125752 67125909 0.000000 - 0 gene_id "NM_001276351"; transcript_id "NM_001276351";

j-berg commented 4 years ago

Hi @cjdecker,

The GTF modification step was designed specifically with Ensembl formatting in mind. This error is arising due to a difference in formatting between UCSC and Ensembl GTF formatting, where Ensembl has a separate record summarizing the entire gene, labeled gene in column 3 of the GTF, whereas UCSC appears to exclude this record (at least by default). The geneInfo.tab file is output by STAR and is not involved in the GTF modification for CDS truncation, etc.

I think the most straightforward solution to this issue would be for me to include a UCSC flag during this step that can be provided and will help this module handle this difference in formatting. I should be able to get to this in the next few days and can at least get a sense of how tractable this solution will be. I will keep you updated.

Jordan

cjdecker commented 4 years ago

Hi Jordan, I appreciate your help in using the alternative gtf file. I am going ahead and doing the analysis with ensembl gtf just to get familiar with running your pipeline. Carolyn

From: Jordan Berg notifications@github.com Reply-To: XPRESSyourself/XPRESSpipe reply@reply.github.com Date: Sunday, April 26, 2020 at 11:40 AM To: XPRESSyourself/XPRESSpipe XPRESSpipe@noreply.github.com Cc: Carolyn Decker carolyn.decker@colorado.edu, Mention mention@noreply.github.com Subject: Re: [XPRESSyourself/XPRESSpipe] curateReference gtfModify issue (#39)

Hi @cjdeckerhttps://github.com/cjdecker,

The GTF modification step was designed specifically with Ensembl formatting in mind. This error is arising due to a difference in formatting between UCSC and Ensembl GTF formatting, where Ensembl has a separate record summarizing the entire gene, labeled gene in column 3 of the GTF, whereas UCSC appears to exclude this record (at least by default). The geneInfo.tab file is output by STAR and is not involved in the GTF modification for CDS truncation, etc.

I think the most straightforward solution to this issue would be for me to include a UCSC flag during this step that can be provided and will help this module handle this difference in formatting. I should be able to get to this in the next few days and can at least get a sense of how tractable this solution will be. I will keep you updated.

Jordan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/XPRESSyourself/XPRESSpipe/issues/39#issuecomment-619592609, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC44YYXAROVPDLDSE7XQ52LRORWZPANCNFSM4MPDPJOQ.

j-berg commented 4 years ago

Hi @cjdecker,

I have a tentative UCSC GTF sub-module built-in and will be writing and running tests on these modifications today. Would you mind attaching a copy of the GTF you plan on using here in the issues so I can make sure I got all the formatting correct so the behavior is stable. I have currently tested the modifyGTF sub-module using the attached file format. hg38.refGene.gtf.gz

As an additional warning, the definitions used for gene records with the --longest_transcript option are dependent on Ensembl annotations in the GTF, so I would not recommend providing that option when using a UCSC GTF. In most cases, leaving the argument is fine and will not affect downstream counts. This is mostly for a few QC packages that require such a file.

j-berg commented 4 years ago

See 64ad259 and 61a5c77 for updates pertaining to this request. Updates available in XPRESSpipe v0.6.1. Please download and re-install to access these features.

Please re-open this issue if there are related problems with this addition to the software. Thanks for your interest in XPRESSpipe! Feel free to reach out with any additional questions or requests!

cjdecker commented 4 years ago

Hi Jordan, I am sorry I missed this email yesterday. I appreciate you making modifications so that refseq gtf files should be able to be used. I will reinstall and test new version in the next few days. Thanks for your time, Carolyn

From: Jordan Berg notifications@github.com Reply-To: XPRESSyourself/XPRESSpipe reply@reply.github.com Date: Wednesday, April 29, 2020 at 1:38 PM To: XPRESSyourself/XPRESSpipe XPRESSpipe@noreply.github.com Cc: Carolyn Decker carolyn.decker@colorado.edu, Mention mention@noreply.github.com Subject: Re: [XPRESSyourself/XPRESSpipe] curateReference gtfModify issue (#39)

Hi @cjdeckerhttps://github.com/cjdecker,

I have a tentative UCSC GTF sub-module built-in and will be writing and running tests on these modifications today. Would you mind attaching a copy of the GTF you plan on using here in the issues so I can make sure I got all the formatting correct so the behavior is stable. I have currently tested the modifyGTF sub-module using the attached file format. hg38.refGene.gtf.gzhttps://github.com/XPRESSyourself/XPRESSpipe/files/4554069/hg38.refGene.gtf.gz

As an additional warning, the definitions used for gene records with the --longest_transcript option are dependent on Ensembl annotations in the GTF, so I would not recommend providing that option when using a UCSC GTF. In most cases, leaving the argument is fine and will not affect downstream counts. This is mostly for a few QC packages that require such a file.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/XPRESSyourself/XPRESSpipe/issues/39#issuecomment-621415136, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AC44YYUJ6KAGZ4FUJUH2XJ3RPB63ZANCNFSM4MPDPJOQ.

j-berg commented 4 years ago

No worries! Let me know if there's anything else I can help out with!