COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
771 stars 161 forks

question on Lexogen QuantSeq quantification - Salmon vs STAR/Bowtie followed by htseq #365

Open tamuanand opened 5 years ago

tamuanand commented 5 years ago

Hi

I have a general question about quantifying QuantSeq data: how does Salmon compare with the alignment methods recommended by Lexogen (STAR/Bowtie followed by htseq to get read counts per gene)? Has anyone compared the two approaches? I would be very interested to know the findings.

I happened to see this issue, which is still marked "Open"; perhaps it should be closed?

Based on the above issue and also this issue, I assume using --noLengthCorrection would be the recommended way to use Salmon for quantifying QuantSeq data - is that right?

In general, I am planning to use Salmon this way:

  1. index the transcriptome
  2. salmon quant -i {input.index} -l A -1 {input.R1} -2 {input.R2} -o {output} --noLengthCorrection --validateMappings --gcBias --seqBias --posBias
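The two steps above can be sketched as a small shell script. It only restates the plan as written and is not a validated recommendation for QuantSeq data. The file and directory names (transcripts.fa, reads_R1.fastq.gz, reads_R2.fastq.gz, salmon_index, quant_out) are hypothetical placeholders, and the script prints each command unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical placeholder paths; substitute your own files.
TXOME="${TXOME:-transcripts.fa}"
INDEX="${INDEX:-salmon_index}"
R1="${R1:-reads_R1.fastq.gz}"
R2="${R2:-reads_R2.fastq.gz}"
OUT="${OUT:-quant_out}"

# Print a command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# Step 1: index the transcriptome.
run salmon index -t "$TXOME" -i "$INDEX"

# Step 2: quantify with the flags proposed in this comment.
run salmon quant -i "$INDEX" -l A -1 "$R1" -2 "$R2" -o "$OUT" \
    --noLengthCorrection --validateMappings --gcBias --seqBias --posBias
```

Running it without RUN=1 is a dry run, which makes it easy to check the exact command lines before committing compute time.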

While using Salmon for quantification, are there any subtleties to be aware of based on the QuantSeq protocol (FWD vs REV)?

Please advise.

Thanks in advance,

k3yavi commented 5 years ago

Hi @tamuanand ,

Thanks for the very interesting question. Personally, I can't comment much on Lexogen QuantSeq quantification; however, the comparison of alignment-based (both STAR/Bowtie2) and alignment-free methods and their impact on RNA-seq quantification is in itself very interesting. In fact, we just released a preprint on exactly this today; you can check it out here.

tamuanand commented 5 years ago

Thanks for sharing the preprint. Is the SA method codebase available via GitHub or via the COMBINE lab site https://combine-lab.github.io/software/ ? I know the preprint has supplementary material and code, but it would be valuable to have something like https://combine-lab.github.io/salmon/getting_started/ and/or https://salmon.readthedocs.io/en/latest/salmon.html

k3yavi commented 5 years ago

Hi @tamuanand, thanks for raising this question. SA is already integrated into salmon, i.e. you just have to re-index using the generateDecoyTranscriptome.sh script from here and run salmon quant as you usually do, with the additional --validateMappings command line flag.
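A rough sketch of that re-indexing workflow follows. The option letters for generateDecoyTranscriptome.sh (-j, -g, -t, -a, -o) and its output names (gentrome.fa, decoys.txt) follow the script's usage text as best understood here and should be checked against the script itself. All input paths are hypothetical placeholders, and commands are only printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical placeholder inputs; substitute your own files.
GENOME="${GENOME:-genome.fa}"
TXOME="${TXOME:-transcripts.fa}"
GTF="${GTF:-annotation.gtf}"
DECOY_DIR="${DECOY_DIR:-decoy_out}"
INDEX="${INDEX:-decoy_index}"
R1="${R1:-reads_R1.fastq.gz}"
R2="${R2:-reads_R2.fastq.gz}"
OUT="${OUT:-quant_out}"
THREADS="${THREADS:-4}"

# Print a command; execute it only when RUN=1 is set in the environment.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# 1. Build the decoy-aware transcriptome (assumed to write gentrome.fa and decoys.txt).
run bash generateDecoyTranscriptome.sh -j "$THREADS" -g "$GENOME" -t "$TXOME" -a "$GTF" -o "$DECOY_DIR"

# 2. Index the combined transcriptome plus decoys.
run salmon index -t "$DECOY_DIR/gentrome.fa" -d "$DECOY_DIR/decoys.txt" -i "$INDEX"

# 3. Quantify as usual, adding --validateMappings.
run salmon quant -i "$INDEX" -l A -1 "$R1" -2 "$R2" --validateMappings -o "$OUT"
```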

rob-p commented 5 years ago

Hi @tamuanand,

Thanks @k3yavi for pointing out the major idea. Just to fill in some more details: the implementation of SA is, as Avi mentions, part of the mainline salmon code now (in the develop and master branches). We link, in the README, to some pre-constructed decoy-aware transcriptomes, but you can build your own for any organism where you have the transcriptome, the genome, and an annotation, using the script Avi linked to. There are a few ways to enable selective alignment, and the details are listed with the relevant flags in the release notes (we will be updating the documentation shortly with more detailed examples as well). Specifically, you can pass salmon the --validateMappings flag, which turns on selective alignment with some reasonable default parameters. You can, instead, pass the flag --mimicBT2, which is a meta-flag that enables selective alignment and turns on a few other things that make the alignments more similar to the Bowtie2 parameters we discuss in the paper (e.g. it disallows orphan alignments). Finally, there is the --mimicStrictBT2 flag, which mimics Bowtie2 parameters that disallow indels; however, we generally don't recommend this flag unless you have a particular reason for using it. For any of these, once you've built a decoy-aware index, you need not do anything else special during quantification. We'll ping back here with more details once we have more examples in place, etc.
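To make the three options concrete, here is a minimal sketch that maps a mode name to the corresponding flag and prints the resulting salmon quant command without running it. The mode names (default, bt2, strict), the sa_flag helper, and all paths are hypothetical and exist only for this illustration; the three flags themselves are the ones named above.

```shell
#!/bin/sh
# Map a hypothetical mode name to one of the selective-alignment flags.
sa_flag() {
    case "$1" in
        default) echo "--validateMappings" ;;  # selective alignment, default parameters
        bt2)     echo "--mimicBT2" ;;          # meta-flag, closer to the Bowtie2 settings
        strict)  echo "--mimicStrictBT2" ;;    # disallows indels; generally not recommended
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Hypothetical placeholder paths; the index is assumed to be decoy-aware.
MODE="${MODE:-default}"
INDEX="${INDEX:-decoy_index}"
R1="${R1:-reads_R1.fastq.gz}"
R2="${R2:-reads_R2.fastq.gz}"
OUT="${OUT:-quant_out}"

FLAG=$(sa_flag "$MODE") || exit 1

# Print the command for inspection; paste it into a shell to run it.
echo "salmon quant -i $INDEX -l A -1 $R1 -2 $R2 $FLAG -o $OUT"
```

Only one of the three flags is passed per run; the index and the rest of the command line stay the same.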

tamuanand commented 5 years ago

Hi @k3yavi and @rob-p,

Thanks for the detailed info and pointing me to the links on your github page.

One suggestion regarding semantics (from the README page): you might want to rethink whether stating "80% homology" is correct. I know it is prevalent in the literature, but from a bioinformatics/computational biology standpoint "80% homology" is wrong. I think you mean similarity/identity when you refer to homology, as written here on the README page:

to align transcriptome to an exon masked genome, with 80% homology and extracts the mapped genomic interval.

Check these out: https://www.ncbi.nlm.nih.gov/books/NBK20255/#A23 https://twitter.com/MatthewMoscou/status/866227138575429633 http://bytesizebio.net/2009/07/15/distant-homology-and-being-a-little-pregnant/

Thanks once again,

rob-p commented 5 years ago

Hi @tamuanand,

Thanks for the suggestion. You're right, of course, and we should change the wording in that README. The cause of the sequence similarity is not always known and, frankly, not important for our particular application. We adopted this term as shorthand given its common use, and also because the version of MashMap used to compute these sequence-similar regions was introduced in the paper A fast adaptive algorithm for computing whole-genome homology maps. In the preprint itself, we're generally careful to simply refer to these as sequence-similar regions ;).

tamuanand commented 5 years ago

Hi @rob-p

Thanks for getting back. I know there are lots of such instances in the literature with wording like "50% homology"; that's why I shared the chapter from Eugene Koonin's book and the other references/quotes from Walter Fitch. Neither of us can change what has already been published; however, when we write something ourselves, we can change the paradigm and represent things correctly.

Also, the preprint has similar wording that you might want to reconsider:

To obtain homologous sequences within a reference, we map the spliced transcript sequences against a version of the genome where all exon segments are hard-masked (i.e. replaced with N). We perform this mapping using MashMap 20, with segment size 500 and homology 80%.

Probably change the first instance, 'homologous', to 'identical', and 'homology 80%' to 'identity 80%'.
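For readers unfamiliar with the step being quoted, the masking-and-mapping procedure can be sketched roughly as below. This is an assumption-laden illustration, not the preprint's actual pipeline: the input paths, the exons_to_bed helper, and the intermediate file names are hypothetical. MashMap's threshold option is itself named --pi (percent identity), which fits the wording suggested above.

```shell
#!/bin/sh
# Hypothetical placeholder inputs; substitute your own files.
GENOME="${GENOME:-genome.fa}"
GTF="${GTF:-annotation.gtf}"
TXOME="${TXOME:-transcripts.fa}"

# Convert GTF exon records (1-based, inclusive) to BED intervals (0-based, half-open).
exons_to_bed() {
    awk -F '\t' -v OFS='\t' '$3 == "exon" { print $1, $4 - 1, $5 }'
}

if command -v bedtools >/dev/null 2>&1 && command -v mashmap >/dev/null 2>&1 \
    && [ -f "$GENOME" ] && [ -f "$GTF" ] && [ -f "$TXOME" ]; then
    exons_to_bed < "$GTF" > exons.bed
    # Hard-mask exons: bedtools maskfasta replaces masked bases with N by default.
    bedtools maskfasta -fi "$GENOME" -bed exons.bed -fo genome.exon_masked.fa
    # Map transcripts to the masked genome: segment length 500, minimum identity 80%.
    mashmap -r genome.exon_masked.fa -q "$TXOME" -s 500 --pi 80 -o mashmap.out
else
    echo "bedtools, mashmap, or input files not found; nothing was run" >&2
fi
```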

And I do not want to digress from the main issue or take the sheen away from the great work from your group on the paper.

rob-p commented 5 years ago

@tamuanand,

Thank you for pointing out the relevant literature, and I definitely appreciate your clarity on this issue. Also, I completely agree with your suggested re-wordings in the manuscript, as they correct the mistaken terminology and make the overall intent even more clear. We will be sure to address this when we revise the pre-print. Thanks again!

tamuanand commented 5 years ago

Hello @rob-p

While I have the luxury of catching your attention, I am going to be sneaky :) and refer you back to my original question on this thread - would like to know your response.

To summarize, these are my questions:

  1. What salmon quant command line options would you recommend for QuantSeq data? I realize you introduced 'noLengthCorrection' specifically for QuantSeq, as mentioned in Issue108 and Issue177.
  2. Likewise, what would the command line be if I chose to take the SA approach (build gentrome.fa with decoys)?

Suggestion: once the dust has settled after the publication of this new paper, you should include these command line suggestions for QuantSeq in your tutorial/readthedocs/README sections.

Thanks in advance

mtassia commented 4 years ago

Hello @rob-p

I want to ping the question above. I'm sitting on a heap of new QuantSeq data and wanted to know which commands are recommended for such data. Was this ever resolved?

Cheers!