COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

Question on building Salmon indices with salmon/v1.0 #442

Open tamuanand opened 4 years ago

tamuanand commented 4 years ago

@k3yavi @rob-p Thanks for the work on the new Salmon indexing methods described in this preprint: https://www.biorxiv.org/content/10.1101/657874v2

Some questions on salmon index building with salmon/v1.0 (as I am confused following the documentation at https://salmon.readthedocs.io/en/latest/salmon.html) - let me know if my understanding is correct

  1. Is this how to create SAF indices - https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ - with these steps, I assume I do not have to separately download mashmap and bedtools software.

  2. If one has to use the SA method, does one still use the generateDecoyTranscriptome.sh method listed here - https://github.com/COMBINE-lab/SalmonTools/blob/master/scripts/generateDecoyTranscriptome.sh (which requires the gff file, mashmap and bedtools software)?

  3. SA and SAF both require a genome. Can I still use salmon index on the transcriptome file without using genome files? Based on the release notes quoted (copy/pasted) below, I am worried about the phrase "mapping without selective alignment is disabled".

Salmon v1.0 release notes state:

Changes since v0.14.1: In this release of salmon, selective-alignment is enabled by default (and, in fact, mapping without selective-alignment is disabled). We may explore, in the future, ways to allow disabling selective-alignment under the new mapping approach, but at this point, it is always enabled.

  4. Page 18 of your preprint pdf states that you used "salmon v0.15.0 for quasi-mapping" - so I am assuming I have to keep 2 versions of salmon on my system if I have to do both quasi-mapping and SA/SAF?
  5. Page 19 of your preprint pdf states you used --mimicBT2 and --useEM for the SA and SAF quantification methods. Is this the recommendation when using the SA and SAF methods? With salmon v0.14.1 and the SA method, I have all along used the default VBEM and --validateMappings, based on this info in the Salmon ReadTheDocs:

Enables selective alignment of the sequencing reads when mapping them to the transcriptome. This can improve both the sensitivity and specificity of mapping and, as a result, can improve quantification accuracy

  6. Related to Q5 above - I assume you used --mimicBT2 and --useEM for the SA and SAF methods so that you could compare to RSEM. Have you used the defaults for salmon quant with SA and SAF? In other words, how do the SA/SAF results compare against the other methods listed in the preprint when you do not use the --mimicBT2 and --useEM options?

Thanks in advance,

k3yavi commented 4 years ago

Hi @tamuanand ,

I think these are very important questions, and thanks for raising the issue. As you mention, in the preprint we put out two different modes of Selective Alignment:

A) SA: The mashmap- and bedtools-based pipeline, which follows the old SalmonTools-based pipeline.

B) SAF: The built-in salmon pipeline that consumes the genome and follows this pipeline.

The distinction between the two comes from how the decoy sequences are actually generated. To answer your questions point by point:

1) That's correct; the SAF-based pipeline follows the tutorial mentioned in B above and uses the full genome as decoys.

2) That's correct; if a user wants to run the SA method, they should follow the mashmap-based tutorial A. This might be useful in situations where the index is too big to fit into the machine's memory.

3) That's also correct; if you don't provide decoys (-d), you can still run salmon on the transcriptome. We have just enabled the --validateMappings option by default, which is also used in transcriptome-only mode; currently there is no option to disable it.

4) That's also correct; we have dropped quasi-mapping support from the latest version. If you need to run quasi-mapping, we have released 0.15 as a last version into the archive.

5 & 6) Very good question. The short answer is that your default pipeline with VBEM is the recommended way. We had to use the additional flags --mimicBT2 and --useEM when comparing the methods in the preprint: RSEM can only do EM, and as we were comparing against Bowtie2 we had to mimic it with stricter requirements for a fair comparison. We expect the performance to be better with VB-based optimization and without --mimicBT2.
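
For readers who want to see what the SAF inputs look like, here is a minimal sketch of the preparation step, assuming uncompressed FASTA files. The function name `prepare_saf_inputs`, the toy sequences, and every file name are hypothetical and only for illustration. The two steps (collect the genome record names as decoys, then concatenate transcripts followed by the genome) mirror the tutorial linked in question 1; the salmon call itself is left as a comment so the sketch runs without salmon installed.

```shell
#!/usr/bin/env bash
# Sketch: build the two inputs a genome-as-decoy (SAF) index needs.
# prepare_saf_inputs <transcripts.fa> <genome.fa> <out_dir>   (hypothetical helper)
prepare_saf_inputs() {
  local txome="$1" genome="$2" out="$3"
  mkdir -p "$out"
  # Decoy names: first space-delimited token of each genome header, without '>'.
  grep '^>' "$genome" | cut -d ' ' -f 1 | sed 's/^>//' > "$out/decoys.txt"
  # Transcripts first, decoy (genome) records after them.
  cat "$txome" "$genome" > "$out/gentrome.fa"
}

# Toy data so the sketch is runnable on its own.
demo_dir="$(mktemp -d)"
printf '>tx1 gene=A\nACGTACGT\n>tx2 gene=B\nTTGGCCAA\n' > "$demo_dir/txome.fa"
printf '>chr1 toy chromosome\nACGTACGTTTGGCCAAGGCC\n>chr2\nGGGGCCCCAAAATTTT\n' > "$demo_dir/genome.fa"

prepare_saf_inputs "$demo_dir/txome.fa" "$demo_dir/genome.fa" "$demo_dir/saf"
cat "$demo_dir/saf/decoys.txt"   # one decoy name per line: chr1, chr2

# With salmon installed, the index would then be built along these lines:
#   salmon index -t "$demo_dir/saf/gentrome.fa" -d "$demo_dir/saf/decoys.txt" -i "$demo_dir/saf_index"
```

The ordering matters in this sketch: the decoy records are appended after the transcripts, and `decoys.txt` lists only the names of the genome records.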

@rob-p Feel free to add if I missed something.

tamuanand commented 4 years ago

Hi @k3yavi

Can you elaborate on 3 and 4 with command line usage examples? I feel you misunderstood my Question 3

My question 3 was "how to do salmon index if I do not have a genome" with salmon/v1.0

If my understanding is right (based on your response 4, about dropping quasi-mapping support in salmon/v1.0), I believe you cannot use salmon/v1.0 to do something like the command below for my Question 3 (salmon index in the absence of a genome): salmon index -t txome_fasta -i txome_index

Other questions:

  1. I don't believe bioconda has salmon/v1.0 - I checked on 01-Nov-2019 (around 7am Eastern)
  2. Is salmon v0.15.0 available via bioconda - when I tried (same time as above) updating salmon via bioconda channel on my conda env it still pointed me to 0.14.1

Thanks in advance to both @k3yavi and @rob-p

k3yavi commented 4 years ago

Right. In short, salmon index -t txome_fasta -i txome_index should work, and both versions of salmon (v0.15 and v1.0) are available on bioconda; check here. You may want to try a forced update of conda.
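
One way to keep the two versions side by side is a separate conda environment per version. The sketch below only prints the commands rather than running them. It assumes both builds are published on bioconda as stated above; the helper `env_cmd` and the environment names are hypothetical.

```shell
#!/usr/bin/env bash
# Dry run: print a conda command that would create one pinned salmon environment.
# env_cmd <env_name> <salmon_version>   (hypothetical helper)
env_cmd() {
  echo "conda create -y -n $1 -c conda-forge -c bioconda salmon=$2"
}

env_cmd salmon-0.15 0.15.0   # quasi-mapping
env_cmd salmon-1.0 1.0.0     # selective alignment (SA / SAF / transcriptome only)
```

After creating the environments, `conda activate salmon-0.15` or `conda activate salmon-1.0` would select the version to use for a given run.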

I think the confusion is that you are treating the concept of Selective Alignment as the same as aligning to the transcriptome with decoys (which can be genome- or mashmap-based). Although they are related methods, the concept of Selective Alignment predates the idea of decoy-based alignment. Check out this paper from our lab, where we discuss how selectively aligning difficult reads to just the transcriptome itself can result in improved quantification estimates compared to quasi-mapping or pseudo-alignment.

To summarize:

In version 1.0:

A) SA: The mashmap- and bedtools-based pipeline, which follows the old SalmonTools-based pipeline.

B) SAF: The built-in salmon pipeline that consumes the genome and follows this pipeline.

C) If you don't provide any decoys, salmon will do Selective Alignment on just the transcriptome. The release notes you quoted just mean that you cannot disable this feature, i.e. you cannot fall back to quasi-mapping (in quasi-mapping there is no alignment of the reads at all).

In version 0.15.0:

You cannot provide decoys, and the transcriptome-based mapping performed in this version is quasi-mapping, i.e. no alignment of reads.

Hope it helps .
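
The index scenarios summarized above can be sketched as a dry run that prints the corresponding `salmon index` command instead of executing it. The helper `index_cmd` and all file names are placeholders. The only salmon options used are `-t`, `-i`, and `-d` (the decoy-names file), so treat this as an illustration of the difference between the cases, not as a recommended invocation.

```shell
#!/usr/bin/env bash
# Dry run: print the salmon v1.0 index command for a given scenario.
# index_cmd <target_fasta> <index_dir> [decoy_names_file]   (hypothetical helper)
index_cmd() {
  local fasta="$1" index_dir="$2" decoys="${3:-}"
  if [ -n "$decoys" ]; then
    # SA or SAF: the fasta holds transcripts followed by decoy sequences,
    # and the decoy file lists the names of the decoy records.
    echo "salmon index -t $fasta -d $decoys -i $index_dir"
  else
    # No decoys: selective alignment against the transcriptome only.
    echo "salmon index -t $fasta -i $index_dir"
  fi
}

index_cmd txome_fasta txome_index             # case C: transcriptome only
index_cmd gentrome.fa saf_index decoys.txt    # cases A/B: with decoys
```

In this sketch SA and SAF differ only in where the decoy records come from (mashmap-derived regions versus the whole genome); the shape of the index command is the same.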

rob-p commented 4 years ago

Hi @tamuanand,

Thank you for the detailed questions! Let me elaborate a bit on a few of @k3yavi's answers.

1&2) Yes; if you want to use SAF, you no longer need mashmap, as what you are essentially doing is treating the entire genome as a "decoy". As @k3yavi alludes, SA is still useful when you need to run in a very memory-constrained environment. After adopting the new pufferfish-based index, the size of the transcriptome plus mashmap 2 decoys becomes considerably smaller than the previous size of the transcriptome alone in earlier versions of salmon (<= 0.15.0). However, depending on the organism, indexing the entire genome as a decoy, even though it yields the best accuracy, does require a bit more memory, as specified in the release notes for the 0.99 betas and 1.0.0.

3) Yes; it is still possible to use salmon index without any decoy sequence. In this case, one can expect results similar to what you would get by aligning to the target transcriptome using Bowtie2. You perform indexing by simply not providing any --decoys flag to the index command, and all of the records in the target fasta will then be treated as valid and quantifiable targets. Of course, for reasons detailed in the pre-print (the high sensitivity of both Bowtie2 and selective-alignment), we recommend including either mashmap-derived decoys or the organism's genome as a decoy whenever possible.

4) Related to @k3yavi's response and my elaboration above: we have dropped quasi-mapping from 1.0.0 (though something akin to it may return in the future if there is sufficient demand and if the shortcomings described in the manuscript can be overcome). However, as I mention in part 3 above, this doesn't mean it's not possible to use v1.0.0 without an explicit decoy sequence. The --decoys flag of the indexing command is optional, not required. We will update the documentation to make this more explicit. However, as @k3yavi points out, it is true that if you wish to use both quasi-mapping and selective-alignment against the full genome on the same machine, you will need both versions: quasi-mapping is supported only with the RapMap-based index, while indexing something on the scale of the genome without the pufferfish-based index has tremendous memory requirements (and is not recommended).

5 & 6) To re-iterate @k3yavi's answer --- the extra flags used in the pre-print were only for the purpose of holding as many variables fixed as possible when comparing different approaches. It continues to be recommended to use the VBEM over the EM; it seems to perform better with respect to the ways in which we can measure and such improvements have also been documented in other work. The main effect of --mimicBT2 is to discard orphan alignments for the purposes of quantification. This is a more strict requirement than the default behavior of allowing orphans if there is no satisfactory alignment of both ends of a fragment. However, there is no obvious reason why it is better behavior than accounting for these orphan fragments (when appropriately adjusting the conditional probability given their distance from the transcript boundaries, as salmon does).
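
To make the difference between the recommended defaults and the preprint settings concrete, here is a small dry-run sketch that prints the two `salmon quant` invocations. The helper `quant_cmd`, the index directory, and the read file names are hypothetical; `-l A` (automatic library type detection) is a choice made for this illustration and is not taken from the thread.

```shell
#!/usr/bin/env bash
# Dry run: print a paired-end salmon quant command, with optional extra flags.
# quant_cmd <index_dir> <reads_1> <reads_2> <out_dir> [extra flags...]   (hypothetical helper)
quant_cmd() {
  local index_dir="$1" r1="$2" r2="$3" out="$4"
  shift 4
  local cmd="salmon quant -i $index_dir -l A -1 $r1 -2 $r2 -o $out"
  # Append any extra flags exactly as given.
  if [ "$#" -gt 0 ]; then cmd="$cmd $*"; fi
  echo "$cmd"
}

# Defaults (VBEM, orphan alignments allowed): the recommended way per the answers above.
quant_cmd saf_index reads_1.fq.gz reads_2.fq.gz quant_default

# Preprint comparison settings: EM instead of VBEM, Bowtie2-like strictness.
quant_cmd saf_index reads_1.fq.gz reads_2.fq.gz quant_preprint --mimicBT2 --useEM
```

The second command is only relevant when reproducing the comparison against RSEM and Bowtie2; for routine quantification the first form reflects the advice given in this thread.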

tamuanand commented 4 years ago

@rob-p and @k3yavi

Thanks for your answers.

Suggestion: It will be great if the getting started with Salmon document is updated to reflect all possible scenarios you list out above.

It might also be better to pull off that document till it gets updated OR if it is redirected to something more pertinent.

tamuanand commented 4 years ago

@rob-p @k3yavi

With the release of salmon/v1.0, do you have any recommendations for the salmon quant command line for QuantSeq?

Many users (me included) would like to hear from you to the question posted here