Open rmurray2 opened 3 years ago
Hi @rmurray2,
Thanks again for the detailed question (I answered them in reverse order, so that's why I'm saying "again" here). There are a few things going on that could be leading to differences. They are, in the order I think they will have an effect on the result:
--outFilterType BySJout --alignSJoverhangMin 8 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --eadFilesCommand zcat --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan IndelSoftclipSingleend
; note that last parameter that I will come back to later. Also, the paper referenced above also describes a new capability present in recent versions of salmon that allow it to index the entire genome (as well as the transcriptome) to have the former act as a decoy. This allows avoiding what might otherwise be spurious mappings that result when one considers only the transcriptome as a source of mapping. There are a number of ways to proceed on this front, but this is a good place to first check for discrepancy (and the paper gives a good overview of the relative tradeoffs and merits of different alignment approaches).Salmon and RSEM use related but distinct optimization algorithms by default. RSEM uses the EM algorithm, and salmon uses the variational Bayesian EM algorithm. The latter tends to induce more sparse solutions. This is simply because they are optimizing slightly different objectives. It is very difficult to say in general if one is "better" than the other in a blanket way, but there is previous literature to support that the VBEM may be more accurate. However, while RSEM only implements the EM algorithm, salmon actually implements and provides a switch to use either. So, if you want to test the effect of this difference, you can run salmon with the --useEM
algorithm. This will tell salmon to use the "classic" EM algorithm and will eliminate this source of variation.
As with the other question you asked, there may be a small discrepancy depending on when enforcement of a stranded library kicks in under salmon's A
library type. You can eliminate that variable by simply providing -l SF
to match the library type being used with RSEM.
Coming back to the IndelSoftclipSingleend
parameter I mentioned in the first point; RSEM disallows indels in the alignments that it quantifies. This means that to produce RSEM-compatible input, STAR must not align reads that contain indels. While this won't generally have a big effect for many transcripts, it can certainly affect the abundance estimates for transcripts in your sample where the sample you are quantifying has (indel) variation with respect to the reference annotation. We touch upon that a bit as well in the paper I mentioned above.
Finally, and likely the smallest source of potential differences, is that there are other implementation details that differ between salmon and RSEM (e.g. exactly how the fragment length distribution is used to compute the effective transcript length, exactly how the alignment score of a read is used to ascertain conditional fragment assignment probabilities, etc.). Even when you set the parameters to make the methods as concordant as possible, you will still observe some differences. These will generally be much smaller and much more constrained than the differences you see in your plots, but they nonetheless exist since these are different methods.
These are the biggest potential sources that I can currently imagine for the differences you are seeing. I would recommend exploring them in approximately this order. If you run into any issues or have any questions as you investigate, please don't hesitate to follow up here or reach out.
I recently quantified a set of RNA-seq samples using both Salmon and RSEM to eventually do differential isoform analysis. When the overlap of significant isoforms was rather low -- even though the only difference was whether the TPM values were from Salmon or RSEM -- I started comparing the TPM values and saw something strange.
After taking the log2 of the TPM values, I generated these plots. TPM values for salmon are from
quant.sf
files and for RSEM,*.isoforms.results
files.The there three very consistent commonalities between these plots:
And the details of the runs:
RSEM v1.3.2 commands:
indexing:
quant:
salmon v1.4.0 commands:
indexing:
used entire genome as decoy sequence, decoy file made following instructions from: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ (except that latest M25 release was used).
salmon index -t gentrome.fa.gz -d decoys.txt -i salmon_index --keepDuplicates --gencode -p 12
quant:
Any idea what could be going on here?