Open tamuanand opened 5 years ago
Hi @tamuanand,
I think these are very important questions, and thanks for raising the issue. As you mention, in the preprint we describe two different modes of Selective Alignment: A) SA: the mashmap- and bedtools-based pipeline, which follows the old SalmonTools-based pipeline. B) SAF: the built-in salmon pipeline that consumes the genome and follows this pipeline.
The distinction between the two comes from how the decoy sequences are actually generated. To answer your questions point by point:
1.) That's correct: the SAF-based pipeline follows the tutorial mentioned in B above and uses the full genome as decoys.
2.) That's correct: if a user wants to run the SA method, they should follow the mashmap-based tutorial A. This might be useful in situations where the full index is too big to fit into the machine's memory.
3.) That's also correct: if you don't provide decoys (-d), you can still run salmon on the transcriptome. We have just enabled the --validateMappings option by default, and it is also used in transcriptome-only mode; currently there is no option to disable it.
4.) That's also correct: we have dropped quasi-mapping support from the latest version. If you need to run quasi-mapping, we have released 0.15 as a last version for the archive.
5 & 6) Very good question. The short answer is that your default pipeline with VBEM is the recommended way. We had to use the additional flags --mimicBT2 and --useEM while comparing the methods in the preprint: RSEM can only do EM, and as we were comparing against Bowtie2 we had to mimic it with stricter requirements for a fair comparison. We expect the performance to be better with VB-based optimization and without --mimicBT2.
@rob-p Feel free to add if I missed something.
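For readers who want to see what "uses the full genome as decoys" (point 1) can look like in practice, here is a minimal sketch. It assumes the approach in the linked SAF tutorial: list the genome sequence names as decoys, then append the genome after the transcripts. The file names (txome.fa, genome.fa, gentrome.fa, decoys.txt, saf_index) and the toy sequences are hypothetical placeholders, and the final salmon command is only printed, not run.

```shell
#!/bin/sh
set -eu

# Toy inputs so the sketch runs anywhere; substitute your real FASTA files.
printf '>tx1 gene=A\nACGTACGT\n>tx2 gene=B\nGGCCGGCC\n' > txome.fa
printf '>chr1 toy\nACGTACGTTTGGCCGGCC\n>chr2 toy\nTTTTAAAACCCC\n' > genome.fa

# 1. Decoy list: every genome sequence name, without ">" or any trailing description.
grep '^>' genome.fa | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# 2. Combined reference: transcripts first, then the genome sequences used as decoys.
cat txome.fa genome.fa > gentrome.fa

# 3. Index command (printed only; run it yourself once salmon >= 1.0 is installed).
echo "salmon index -t gentrome.fa -d decoys.txt -i saf_index"
```

With real data the inputs are usually gzipped and much larger, so expect the memory cost discussed in this thread; options such as thread count are left out of the sketch on purpose.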
Hi @k3yavi
Can you elaborate on 3 and 4 with command line usage examples? I feel you misunderstood my Question 3.
My question 3 was "how to do salmon index if I do not have a genome" with salmon/v1.0.
If my understanding is right (based on your response 4, about dropping quasi-mapping support with salmon/v1.0), I believe you cannot use salmon/v1.0 to do something like the command below for my Question 3 (salmon index in the absence of a genome):
salmon index -t txome_fasta -i txome_index
Other questions:
Thanks in advance to both @k3yavi and @rob-p
Right, in short, salmon index -t txome_fasta -i txome_index should work, and both versions of salmon (v0.15 and v1.0) are available on bioconda; check here. You may want to try a forced update of conda.
I think the confusion is that you are treating the concept of Selective Alignment as the same thing as aligning to the transcriptome with decoys (which can be genome- or mashmap-based). Although they are related methods, the concept of Selective Alignment predates the idea of decoy-based alignment. Check out this paper from our lab, where we discuss how selectively aligning difficult reads to just the transcriptome itself can result in improved quantification estimates compared to quasi- or pseudo-alignment.
To summarize:
In version 1.0:
A) SA: the mashmap- and bedtools-based pipeline, which follows the old SalmonTools-based pipeline.
B) SAF: the built-in salmon pipeline that consumes the genome and follows this pipeline.
C) If you don't provide any decoys, salmon will do Selective Alignment on just the transcriptome. The release notes you quoted just mean that you cannot disable this feature, i.e. you cannot fall back to quasi-mapping (in quasi-mapping there is no alignment of the reads at all).
In version 0.15.0:
You cannot provide decoys, and the transcriptome-based mapping performed in this version is quasi-mapping, i.e. no alignment of reads.
Hope it helps.
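The v1.0 summary above comes down to whether a decoy list is passed at indexing time. A small sketch of that choice follows, assuming a hypothetical helper named index_cmd (not part of salmon) that only prints the command it would run. The -t, -i and -d options are the ones already mentioned in this thread; the file names are placeholders.

```shell
#!/bin/sh
set -eu

# Print the salmon index command for a reference FASTA.
# $1 = reference FASTA, $2 = output index directory, $3 = optional decoy name list.
index_cmd() {
    if [ -n "${3:-}" ]; then
        # Cases A and B: decoy-aware index (mashmap-derived decoys or full genome).
        echo "salmon index -t $1 -i $2 -d $3"
    else
        # Case C: transcriptome only; reads are still selectively aligned in v1.0.
        echo "salmon index -t $1 -i $2"
    fi
}

index_cmd txome_fasta txome_index
index_cmd gentrome.fa saf_index decoys.txt
```

The only difference between the SA and SAF cases in this sketch is where the decoy sequences and decoys.txt come from, which matches the distinction drawn earlier in the thread.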
Hi @tamuanand,
Thank you for the detailed questions! Let me elaborate a bit on a few of @k3yavi's answers.
1&2) Yes; if you want to use SAF, you no longer need mashmap, as what you are essentially doing is treating the entire genome as a "decoy". As @k3yavi alludes, SA is still useful when you need to run in a very memory-constrained environment. With the new pufferfish-based index, the transcriptome plus mashmap2 decoys is considerably smaller than the transcriptome alone was in earlier versions of salmon (<= 0.15.0). However, depending on the organism, indexing the entire genome as decoy, even though it yields the best accuracy, does require a bit more memory, as specified in the release notes for the 0.99 betas and 1.0.0.
3) Yes; it is still possible to use salmon index without any decoy sequence. In this case, one can expect results similar to what you would get by aligning to the target transcriptome using Bowtie2. You perform the indexing by simply not providing any --decoy flag to the index command, and all of the records in the target fasta will then be treated as valid and quantifiable targets. Of course, for reasons detailed in the pre-print --- the high sensitivity of both Bowtie2 and selective-alignment --- we recommend including either mashmap-derived decoys or the organism's genome as a decoy whenever possible.
4) Related to @k3yavi's response and my elaboration above: we have dropped quasi-mapping from 1.0.0 (though something akin to it may return in the future if there is sufficient demand and if the shortcomings described in the manuscript can be overcome). However, as I mention in part 3 above, this doesn't mean it's not possible to use v1.0.0 without an explicit decoy sequence. The --decoy flag of the indexing command is optional, not required. We will update the documentation to make this more explicit. However, as @k3yavi points out, it is true that if you wish to use quasi-mapping and selective-alignment against the full genome on the same machine, you will need both versions, as quasi-mapping is supported only with the RapMap-based index, while indexing something on the scale of the genome without the pufferfish-based index has tremendous memory requirements (and is not recommended).
5 & 6) To re-iterate @k3yavi's answer --- the extra flags used in the pre-print were only for the purpose of holding as many variables fixed as possible when comparing different approaches. We continue to recommend the VBEM over the EM; it seems to perform better by the measures available to us, and such improvements have also been documented in other work. The main effect of --mimicBT2 is to discard orphan alignments for the purposes of quantification. This is a stricter requirement than the default behavior of allowing orphans if there is no satisfactory alignment of both ends of a fragment. However, there is no obvious reason why it is better behavior than accounting for these orphan fragments (when appropriately adjusting the conditional probability given their distance from the transcript boundaries, as salmon does).
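To make the contrast in 5 & 6 concrete, here is a sketch of the two salmon quant invocations side by side. It assumes paired-end reads and automatic library type detection (-l A). The index name, read files and output directories are hypothetical placeholders, and the commands are only printed rather than executed.

```shell
#!/bin/sh
set -eu

# Hypothetical file names; replace with your own index and reads.
INDEX=saf_index
R1=reads_1.fastq.gz
R2=reads_2.fastq.gz

# Recommended defaults: VBEM optimization, orphan alignments allowed.
default_cmd="salmon quant -i $INDEX -l A -1 $R1 -2 $R2 -o quant_default"

# Preprint-style comparison run: the same command plus the two extra flags.
preprint_cmd="salmon quant -i $INDEX -l A -1 $R1 -2 $R2 --mimicBT2 --useEM -o quant_preprint"

echo "$default_cmd"
echo "$preprint_cmd"
```

The second form is only useful when reproducing a like-for-like comparison against a Bowtie2 plus RSEM pipeline; for ordinary quantification the first form reflects the recommendation given above.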
@rob-p and @k3yavi
Thanks for your answers.
Suggestion: It would be great if the "Getting Started with Salmon" document were updated to reflect all the possible scenarios you list above.
It might also be better to take that document down until it is updated, or to redirect it to something more pertinent.
@rob-p @k3yavi
With the release of salmon/v1.0, do you have any recommendations for the salmon quant command line for QuantSeq?
Many users (me included) would like to hear from you on the question posted here.
@k3yavi @rob-p Thanks for the work on the new Salmon indexing methods described in this preprint: https://www.biorxiv.org/content/10.1101/657874v2
I have some questions on index building with salmon/v1.0, as I am confused after following the documentation at https://salmon.readthedocs.io/en/latest/salmon.html. Let me know if my understanding is correct.
Is this how to create SAF indices: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ ? With these steps, I assume I do not have to separately download the mashmap and bedtools software.
If one has to use the SA method, does one still use the generateDecoyTranscriptome.sh method as listed here: https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh (which requires the gff file and the mashmap and bedtools software)?
SA and SAF both require a genome. Can I still use salmon index on the transcriptome file without using genome files? Based on the release notes quoted (copy/pasted) below, I am worried about the phrase "mapping without selective alignment is disabled".
Salmon v1.0 release notes state:
--mimicBT2 and --useEM for SA and SAF quantification methods. Is this the recommendation while using SA and SAF methods? From salmon v0.14.1 with the SA method, I have all along used the default VBEM and --validateMappings based on info in the Salmon ReadTheDocs.
--mimicBT2 and --useEM for SA and SAF methods so that you can compare to RSEM. Have you used the defaults for salmon quant for SA and SAF? In other words, how do SA/SAF results compare against the other methods listed in the preprint when you do not use the --mimicBT2 and --useEM options?
Thanks in advance,