Closed demis001 closed 7 years ago
You can handle this using https://github.com/mikelove/tximport. I actually made a pull request to add this feature (ignore versioning of the transcripts/genes): https://github.com/mikelove/tximport/pull/6. Works great.
@mdshw5 I am planning to include "salmon" into one of my application. It seems tximport is a biocoductor package. I am coding in Python and don't want to include R in the design.
I prefer command line solution with -g option. Any idea if that works for gene level count?
Thanks demis001
@demis001 salmon currently makes exact matches between the fasta headers and transcript annotations in the GTF, so no - it doesn't work. Since gene level summarization is pretty simple you could just use something like https://github.com/daler/gffutils to read your GTF, drop the version numbers from the GTF entries, then drop the version numbers from the salmon quant.sf file, and join the two yourself. The summarization from tx->gene is just summing each gene's transcripts' TPM values.
@rob-p: I do think this is a common enough issue that salmon could handle dropping accession.version numbers with an extra option.
@demis001, @mdshw5 -- this seems reasonable and easy enough to do. I just don't want to go too deep into the rabbit hole since tximport is my goto recommendation for this problem in the general case. Also, Mike and I have recently added some nice features there (the ability to import gibbs / bootstrap results) that will make it an even more useful way to interact with and process salmon data downstream.
My question for implementing this feature is what version "split string" should I consider. Is it always '.'? Do some schemes use '_'? Does this string itself have to be a parameter?
@rob-p
I think it would be nice if you write a function that accept any standard gtf/gff3 format from the command line using the -g option and generate a two column file suggested by @mdshw5 that map Transcript to Gene (used internally by salmon).
In my case, I am going to just do that (Accept any gtf from any source, generate a formatted Transcript-Gene map -> pass that to Salmon internally-> get gene count -> pass that to something else).
@demis001
@rob-p I think it's reasonable to support only "accession.version" since that seems common (it's used for ensemble at least), and allowing more user parameters means more work implementing, testing, fixing...
I'm sure you don't need it, but I implemented something similar to allow a user to pass a custom "key function" in one of my python packages: https://github.com/mdshw5/pyfaidx#keyfn. I'm not sure how you would allow custom functions since you're using C++, and you might have to come up with an entire domain-specific language for this, but maybe something exists for this purpose...
@rob-p I don't know what you mean by "accession.version", my initial question was, if you pass " -g Homo_sapiens.GRCh38.87.chr.gtf" to salmon to get a gene level count, it will fail to map Transcript to Gene Name.
Instead of asking the user to pass a two column annotation that map "Transcript to Gene", it would be nice to support standard format from Ensembl gtf file to generate a gene level count from a solmon CLI.
Am I missing something?
@demis001
Hi @demis001,
What I mean is that the standard Ensemble format of transcript_name.transcript_version
, separates the name from the version using a .
. This is, in no way, a standard or universal rule. For example, what if the gtf file is coming from another source (Gencode, RefSeq, UCSC, something else)? We can't safely assume that one can universally just remove the portion of the identifier after the last .
to get the un-versioned name. In this case, the question is, what is the right way to handle this, since users are employing all manner of different sources for their reference transcriptomes and not just Ensembl.
--Rob
@demis001 Salmon's --geneMap | -g
option does take a GTF file (although you're correct that it can also take a two-column text file, and I think accepting both types of files in the same option is a bit confusing). See some context in this old issue. I believe the issue you're seeing is that Ensembl will update transcript (and gene I think) versions like so:
!! this is not real data, just a toy example
ensembl v24
-------------
ENST0000001.1
ENST0000002.1
ensembl v25
-------------
ENST0000001.2
ENST0000002.2
and the GTF file you used to build the salmon index is "ensembl v24", but now you only have "ensembl v25" available during the salmon quant
run.
The most correct thing to do at this point would be to either rebuild the salmon index using "ensembl v25" and rerun salmon quant -g ensembl_v25.gtf
, or get the "ensembl v24" GTF file (re-download from Ensembl website) and pass this to salmon quant -g ensembl_v24.gtf
.
The thing would probably be okay is to allow passing GTF files that do not exactly match the transcript accessions, where "ENST0000001" is the "accession" and ".1" is the "version" - hence "accession.version". This way you can just ignore the version part of the transcript/gene names in the GTF file for the purposes of constructing a tx<>gene map.
@rob-p,
Sorry for a delayed response, I will try and let you know. I was out of office for a week. One more thing, does salmon allow proportional read assignment ( just curious)?
@demis001
Hi @demis001,
Sure; let me know how things work out. Regarding your question, yes, Salmon allows proportional read assignments. It runs (by default) and expectation maximization algorithm to determine the allocations of reads to transcripts—allowing for proportional / fractional allocations—that maximizes the likelihood of the observed set of reads.
Hi Rob,
I have downloaded the cdna and ncrna from ENSEMBL ftp site and merged, got the gtf file from the same site:
http://useast.ensembl.org/info/data/ftp/index.html
Replaced the version issue as follows: forward slash before dot, git removed it
sed -i 's/.[123456789]//g' Homo_sapiens.GRCh38.87.cdna.ncrna.fa`
Indexed it:
salmon index -t Homo_sapiens.GRCh38.87.cdna.ncrna.fa -i Homo_sapiens.GRCh38.cdna.ncrna.quasi.index
Did alignment:
salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.chr.gtf
Tried this too:
salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.gtf
Got this error:
[2017-03-01 14:52:23.285] [jointLog] [warning] couldn't find transcript named [ENST00000615101] in transcript <-> gene map; returning transcript as it's own gene
Did this:(rseqpipe) [ddjima@jima:1013_quant]$ grep "ENST" quant.genes.sf | wc -l 21370 (rseqpipe) [ddjima@jima:1013_quant]$ grep "ENSG" quant.genes.sf | wc -l 56934
Seems don't find gene name for 21370 transcript ... Any idea?
Thank you very much!!!!!
Hey @demis001,
I was wondering if the transcript id is even present in the GTF at the first place or not.
If I understand it correctly then if ENST00000615101
transcript Id is not present in the GTF, there is no way of knowing the transcript to gene map anyway, isn't it?
Yes, all have entry in the .fa file but are missing from .gtf file
grep "ENST00000615101" Homo_sapiens.GRCh38.87.cdna.ncrna.fa ENST00000615101 ncrna chromosome:GRCh38:CHR_HSCHR15_4_CTG8:30951585:30951821:1 gene:ENSG00000276152 gene_biotype:misc_RNA transcript_biotype:misc_RNA gene_symbol:Metazoa_SRP description:Metazoan signal recognition particle RNA [Source:RFAM;Acc:RF00017]
grep "ENST00000615101" Homo_sapiens.GRCh38.87.gtf NA
Did you ever test salmon using ENSEMBL build?
Yes, we have used ENSEMBL data couple of times but mainly with coding DNA/RNA. I guess this particular GTF is not complete, in a way that the GTF does not have non-coding RNA. Can you verify that if any transcript from ncRNA file is present in your GTF?
Hi @demis001,
I think what @k3yavi is asking is a little different. Specifically, he's saying that it seems to be the case that a mapping for this specific transcript name is missing from the .gtf file itself, in which case it isn't possible to assign it to a gene (the information must exist in the gtf to be parsed and used). We use salmon with ENSEMBL regularly, though as Avi states, usually with a complete / matched cdna .gtf
file. Can you post a link to the exact .fa
and .gtf
file you're using and we can take a look?
--Rob
Sure!
wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
Combine the above two files
zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz > Homo_sapiens.GRCh38.87.cdna.ncrna.fa
Get annotation only on known chr
wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.chr.gtf.gz
Get all annotation
wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz
@k3yavi
The annotation contain ncrna:
grep "MIR" Homo_sapiens.GRCh38.87.chr.gtf | head -n 2 1 mirbase gene 17369 17436 . - . gene_id "ENSG00000278267"; gene_version "1"; gene_name "MIR6859-1"; gene_source "mirbase"; gene_biotype "miRNA"; 1 mirbase transcript 17369 17436 . - . gene_id "ENSG00000278267"; gene_version "1"; transcript_id "ENST00000619216"; transcript_version "1"; gene_name "MIR6859-1"; gene_source "mirbase"; gene_biotype "miRNA"; transcript_name "MIR6859-1-201"; transcript_source "mirbase"; transcript_biotype "miRNA"; tag "basic"; transcript_support_level "NA";
Hi @demis001,
Indeed, while the annotation contains mappings for some non-coding transcripts, it doesn't contain them for all such transcripts. For example, the transcript you mention above ENST00000615101
doesn't seem to be present in either gtf
file:
~ ❯❯❯ grep "ENST00000615101" Homo_sapiens.GRCh38.87.gtf
~ ❯❯❯ grep "ENST00000615101" Homo_sapiens.GRCh38.87.chr.gtf
In this case, there is no way to extract a transcript <-> gene mapping for this transcript from the GTF, since the GTF does not actually contain any such mapping. The Ensembl website does seem to list a target gene for this transcript, but that information is absent from the GTF file itself. Unfortunately, there are only three options for handling such cases:
I think the current behavior is the most reasonable, but I'm open to being convinced of other strategies.
--Rob
@rob-p Thank you very much! which salmon option give me this:
Don't include the transcript at all in the gene-level output.
This the correct option in this scenario.
I think setting the default to Don't include the transcript at all in the gene-level output.
is the right choice. If the transcripts aren't exist in the annotation file, there is no need to bother.
Hi @demis001,
I see. There is only one behavior built in (i.e., report the transcript as it's own gene). You can easily filter the gene -level quantification file to get rid of transcripts like this though. The easiest way would be something like:
> cat <(head -1 quant.genes.sf) <(grep "ENSG*" quant.genes.sf) > quant.genes.filtered.sf
That is, simply filter the quant.genes.sf
file for any line that matches ENSG
, leaving the rest behind (and adding the header line).
--Rob
I can do that, but wish if the option exist in salmon... mark for future upgrade.
I can't thank you enough both of you for your time.
Thank you very much for your time! You may close the issue now. @demis001
I'll create another issue with this as a feature request.
Thanks! Rob
@COMBINE-lab
Generate gene level summary count:
Got the following files:
wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz
zcat Homo_sapiens.GRCh38.cdna.all.fa.gz Homo_sapiens.GRCh38.ncrna.fa.gz >Homo_sapiens.GRCh38.87.cdna.ncrna.fa
salmon index -t Homo_sapiens.GRCh38.87.cdna.ncrna.fa -i Homo_sapiens.GRCh38.cdna.ncrna.quasi.index
salmon quant -i /data1/genome/hg38_r87/Homo_sapiens.GRCh38.cdna.ncrna.quasi.index -l ISR -1 1013_S7_R1_001.fastq.gz -2 1013_S7_R2_001.fastq.gz -p 8 -o 1013_quant -g /data1/genome/hg38_r87/Homo_sapiens.GRCh38.87.chr.gtf
[2017-02-22 17:28:03.310] [jointLog] [warning] couldn't find transcript named [ENST00000384587.1] in transcript <-> gene map; returning transcript as it's own gene
I see the fasta file has a transcript name as "ENST00000448914.1", but the gtf has in the form of "ENST00000448914", any idea how salmon resolve this? There is no "." after transcript name in gtf file.
Thanks @demis001