COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
771 stars 161 forks source link

Transcript to Gene map: returning transcript as it's own gene #122

Closed demis001 closed 7 years ago

demis001 commented 7 years ago

@COMBINE-lab

Generate gene level summary count:

Got the following files: wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz

zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz >Homo_sapiens.GRCh38.87.cdna.ncrna.fa

salmon index -t Homo_sapiens.GRCh38.87.cdna.ncrna.fa -i Homo_sapiens.GRCh38.cdna.ncrna.quasi.index

salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.chr.gtf

[2017-02-22 17:28:03.310] [jointLog] [warning] couldn't find transcript named [ENST00000384587.1] in transcript <-> gene map; returning transcript as it's own gene

I see the fasta file has a transcript name as "ENST00000448914.1", but the gtf has in the form of "ENST00000448914", any idea how salmon resolve this? There is no "." after transcript name in gtf file.

Thanks @demis001

mdshw5 commented 7 years ago

You can handle this using https://github.com/mikelove/tximport. I actually made a pull request to add this feature (ignore versioning of the transcripts/genes): https://github.com/mikelove/tximport/pull/6. Works great.

demis001 commented 7 years ago

@mdshw5 I am planning to include "salmon" into one of my application. It seems tximport is a biocoductor package. I am coding in Python and don't want to include R in the design.

I prefer command line solution with -g option. Any idea if that works for gene level count?

Thanks demis001

mdshw5 commented 7 years ago

@demis001 salmon currently makes exact matches between the fasta headers and transcript annotations in the GTF, so no - it doesn't work. Since gene level summarization is pretty simple you could just use something like https://github.com/daler/gffutils to read your GTF, drop the version numbers from the GTF entries, then drop the version numbers from the salmon quant.sf file, and join the two yourself. The summarization from tx->gene is just summing each gene's transcripts' TPM values.

@rob-p: I do think this is a common enough issue that salmon could handle dropping accession.version numbers with an extra option.

rob-p commented 7 years ago

@demis001, @mdshw5 -- this seems reasonable and easy enough to do. I just don't want to go too deep into the rabbit hole since tximport is my goto recommendation for this problem in the general case. Also, Mike and I have recently added some nice features there (the ability to import gibbs / bootstrap results) that will make it an even more useful way to interact with and process salmon data downstream.

My question for implementing this feature is what version "split string" should I consider. Is it always '.'? Do some schemes use '_'? Does this string itself have to be a parameter?

demis001 commented 7 years ago

@rob-p

I think it would be nice if you write a function that accept any standard gtf/gff3 format from the command line using the -g option and generate a two column file suggested by @mdshw5 that map Transcript to Gene (used internally by salmon).

In my case, I am going to just do that (Accept any gtf from any source, generate a formatted Transcript-Gene map -> pass that to Salmon internally-> get gene count -> pass that to something else).

@demis001

mdshw5 commented 7 years ago

@rob-p I think it's reasonable to support only "accession.version" since that seems common (it's used for ensemble at least), and allowing more user parameters means more work implementing, testing, fixing...

I'm sure you don't need it, but I implemented something similar to allow a user to pass a custom "key function" in one of my python packages: https://github.com/mdshw5/pyfaidx#keyfn. I'm not sure how you would allow custom functions since you're using C++, and you might have to come up with an entire domain-specific language for this, but maybe something exists for this purpose...

demis001 commented 7 years ago

@rob-p I don't know what you mean by "accession.version", my initial question was, if you pass " -g Homo_sapiens.GRCh38.87.chr.gtf" to salmon to get a gene level count, it will fail to map Transcript to Gene Name.

Instead of asking the user to pass a two column annotation that map "Transcript to Gene", it would be nice to support standard format from Ensembl gtf file to generate a gene level count from a solmon CLI.

Am I missing something?

@demis001

rob-p commented 7 years ago

Hi @demis001,

What I mean is that the standard Ensemble format of transcript_name.transcript_version, separates the name from the version using a .. This is, in no way, a standard or universal rule. For example, what if the gtf file is coming from another source (Gencode, RefSeq, UCSC, something else)? We can't safely assume that one can universally just remove the portion of the identifier after the last . to get the un-versioned name. In this case, the question is, what is the right way to handle this, since users are employing all manner of different sources for their reference transcriptomes and not just Ensembl.

--Rob

mdshw5 commented 7 years ago

@demis001 Salmon's --geneMap | -g option does take a GTF file (although you're correct that it can also take a two-column text file, and I think accepting both types of files in the same option is a bit confusing). See some context in this old issue. I believe the issue you're seeing is that Ensembl will update transcript (and gene I think) versions like so:

!! this is not real data, just a toy example

ensembl v24
-------------
ENST0000001.1
ENST0000002.1

ensembl v25
-------------
ENST0000001.2
ENST0000002.2

and the GTF file you used to build the salmon index is "ensembl v24", but now you only have "ensembl v25" available during the salmon quant run.

The most correct thing to do at this point would be to either rebuild the salmon index using "ensembl v25" and rerun salmon quant -g ensembl_v25.gtf, or get the "ensembl v24" GTF file (re-download from Ensembl website) and pass this to salmon quant -g ensembl_v24.gtf.

The thing would probably be okay is to allow passing GTF files that do not exactly match the transcript accessions, where "ENST0000001" is the "accession" and ".1" is the "version" - hence "accession.version". This way you can just ignore the version part of the transcript/gene names in the GTF file for the purposes of constructing a tx<>gene map.

demis001 commented 7 years ago

@rob-p,

Sorry for a delayed response, I will try and let you know. I was out of office for a week. One more thing, does salmon allow proportional read assignment ( just curious)?

@demis001

rob-p commented 7 years ago

Hi @demis001,

Sure; let me know how things work out. Regarding your question, yes, Salmon allows proportional read assignments. It runs (by default) and expectation maximization algorithm to determine the allocations of reads to transcripts—allowing for proportional / fractional allocations—that maximizes the likelihood of the observed set of reads.

demis001 commented 7 years ago

Hi Rob,

I have downloaded the cdna and ncrna from ENSEMBL ftp site and merged, got the gtf file from the same site:

http://useast.ensembl.org/info/data/ftp/index.html

Replaced the version issue as follows: forward slash before dot, git removed it

sed -i 's/.[123456789]//g' Homo_sapiens.GRCh38.87.cdna.ncrna.fa`

Indexed it:

salmon index -t Homo_sapiens.GRCh38.87.cdna.ncrna.fa -i Homo_sapiens.GRCh38.cdna.ncrna.quasi.index

Did alignment:

salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.chr.gtf

Tried this too:

salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.gtf

Got this error:

[2017-03-01 14:52:23.285] [jointLog] [warning] couldn't find transcript named [ENST00000615101] in transcript <-> gene map; returning transcript as it's own gene Did this: (rseqpipe) [ddjima@jima:1013_quant]$ grep "ENST" quant.genes.sf | wc -l 21370 (rseqpipe) [ddjima@jima:1013_quant]$ grep "ENSG" quant.genes.sf | wc -l 56934

Seems don't find gene name for 21370 transcript ... Any idea?

Thank you very much!!!!!

k3yavi commented 7 years ago

Hey @demis001,

I was wondering if the transcript id is even present in the GTF at the first place or not. If I understand it correctly then if ENST00000615101 transcript Id is not present in the GTF, there is no way of knowing the transcript to gene map anyway, isn't it?

demis001 commented 7 years ago

Yes, all have entry in the .fa file but are missing from .gtf file

grep "ENST00000615101" Homo_sapiens.GRCh38.87.cdna.ncrna.fa ENST00000615101 ncrna chromosome:GRCh38:CHR_HSCHR15_4_CTG8:30951585:30951821:1 gene:ENSG00000276152 gene_biotype:misc_RNA transcript_biotype:misc_RNA gene_symbol:Metazoa_SRP description:Metazoan signal recognition particle RNA [Source:RFAM;Acc:RF00017]

grep "ENST00000615101" Homo_sapiens.GRCh38.87.gtf NA

Did you ever test salmon using ENSEMBL build?

k3yavi commented 7 years ago

Yes, we have used ENSEMBL data couple of times but mainly with coding DNA/RNA. I guess this particular GTF is not complete, in a way that the GTF does not have non-coding RNA. Can you verify that if any transcript from ncRNA file is present in your GTF?

rob-p commented 7 years ago

Hi @demis001,

I think what @k3yavi is asking is a little different. Specifically, he's saying that it seems to be the case that a mapping for this specific transcript name is missing from the .gtf file itself, in which case it isn't possible to assign it to a gene (the information must exist in the gtf to be parsed and used). We use salmon with ENSEMBL regularly, though as Avi states, usually with a complete / matched cdna .gtf file. Can you post a link to the exact .fa and .gtf file you're using and we can take a look?

--Rob

demis001 commented 7 years ago

Sure!

wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Combine the above two files

zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.87.cdna.ncrna.fa

Get annotation only on known chr

wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz

Get all annotation

wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz

demis001 commented 7 years ago

@k3yavi

The annotation contain ncrna:

grep "MIR" Homo_sapiens.GRCh38.87.chr.gtf | head -n 2 1 mirbase gene 17369 17436 . - . gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "mirbase"; gene_biotype "miRNA"; 1 mirbase transcript 17369 17436 . - . gene_id "ENSG00000278267"; gene_version "1"; transcript_id "ENST00000619216"; transcript_version "1"; gene_name "MIR6859-1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-1-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; transcript_support_level "NA";

rob-p commented 7 years ago

Hi @demis001,

Indeed, while the annotation contains mappings for some non-coding transcripts, it doesn't contain them for all such transcripts. For example, the transcript you mention above ENST00000615101 doesn't seem to be present in either gtf file:

~ ❯❯❯ grep "ENST00000615101" Homo_sapiens.GRCh38.87.gtf
~ ❯❯❯ grep "ENST00000615101" Homo_sapiens.GRCh38.87.chr.gtf

In this case, there is no way to extract a transcript <-> gene mapping for this transcript from the GTF, since the GTF does not actually contain any such mapping. The Ensembl website does seem to list a target gene for this transcript, but that information is absent from the GTF file itself. Unfortunately, there are only three options for handling such cases:

I think the current behavior is the most reasonable, but I'm open to being convinced of other strategies.

--Rob

demis001 commented 7 years ago

@rob-p Thank you very much! which salmon option give me this:

Don't include the transcript at all in the gene-level output.

This the correct option in this scenario.

I think setting the default to Don't include the transcript at all in the gene-level output. is the right choice. If the transcripts aren't exist in the annotation file, there is no need to bother.

rob-p commented 7 years ago

Hi @demis001,

I see. There is only one behavior built in (i.e., report the transcript as it's own gene). You can easily filter the gene -level quantification file to get rid of transcripts like this though. The easiest way would be something like:

> cat <(head -1 quant.genes.sf) <(grep "ENSG*" quant.genes.sf) > quant.genes.filtered.sf

That is, simply filter the quant.genes.sf file for any line that matches ENSG, leaving the rest behind (and adding the header line).

--Rob

demis001 commented 7 years ago

I can do that, but wish if the option exist in salmon... mark for future upgrade.

I can't thank you enough both of you for your time.

Thank you very much for your time! You may close the issue now. @demis001

rob-p commented 7 years ago

I'll create another issue with this as a feature request.

Thanks! Rob