Salmon and RSEM TPM discrepancy

I recently quantified a set of RNA-seq samples using both Salmon and RSEM to eventually do differential isoform analysis. When the overlap of significant isoforms was rather low -- even though the only difference was whether the TPM values were from Salmon or RSEM -- I started comparing the TPM values and saw something strange.

After taking the log2 of the TPM values, I generated these plots. TPM values for salmon are from quant.sf files and for RSEM, *.isoforms.results files.

The there three very consistent commonalities between these plots:

There seems to be a large number of transcripts with zero expression in one but not the other.
There seems to be a set of isoforms that is more highly expressed in Salmon across all plots.
The bulk of reads that are close to the diagonal line are mostly concentrated on the RSEM side

And the details of the runs:

RSEM v1.3.2 commands:

indexing:

rsem-prepare-reference --gtf gencode.vM25.annotation.gtf --star --star-path /appl/soft/bio/star/gcc_9.1.0/2.7.2a/bin -p 8 --star-sjdboverhang 83 GRCm38.primary_assembly.genome.fa rsem_indx_gencode/rsem_ref

quant:

rsem-calculate-expression -p 8 --strandedness forward --star --star-path /appl/soft/bio/star/gcc_9.1.0/2.7.2a/bin --star-gzipped-read-file input_file.fq.gz rsem_indx_gencode/rsem_ref output/path

salmon v1.4.0 commands:

indexing:

used entire genome as decoy sequence, decoy file made following instructions from: https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/ (except that latest M25 release was used).
salmon index -t gentrome.fa.gz -d decoys.txt -i salmon_index --keepDuplicates --gencode -p 12

quant:

salmon quant -i salmon_index -l A -r /path/to/input.fq.gz -g gencode.vM25.annotation.gtf -p 4 --validateMappings -o path/to/output

Any idea what could be going on here?

Hi @rmurray2,

Thanks again for the detailed question (I answered them in reverse order, so that's why I'm saying "again" here). There are a few things going on that could be leading to differences. They are, in the order I think they will have an effect on the result:

You are using RSEM in a mode that is mapping the reads to the entire genome (using STAR) and then projecting the resulting alignments to the transcriptome. You are using salmon in a way that is performing selective alignment against the transcriptome only. We have recently published a paper discussing in detail the effect that some of these choices can have on transcript and gene-level abundance estimation. In general, if you don't include the genome as a mapping target, depending on your sample, there may be certain reads that are assigned to the transcriptome even though they have a better alignment to some other genomic location. This is independent of e.g. salmon and RSEM, and you'd observe the same thing if you ran RSEM using e.g. Bowtie2 as the aligner aligning against the transcriptome. Luckily, you can control this source of variation. Salmon, like RSEM, can accept alignments to the transcriptome produced by STAR. If you want to see how big of an effect this is having in your sample, you can align reads to the genome using STAR (and project them to the transcriptome) to produce a BAM file that salmon can quantify. You can check RSEM's script to see exactly how it invokes STAR, but the parameters are something like --outFilterType BySJout --alignSJoverhangMin 8 --outFilterMultimapNmax 20 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --eadFilesCommand zcat --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD --quantTranscriptomeBan IndelSoftclipSingleend; note that last parameter that I will come back to later. Also, the paper referenced above also describes a new capability present in recent versions of salmon that allow it to index the entire genome (as well as the transcriptome) to have the former act as a decoy. This allows avoiding what might otherwise be spurious mappings that result when one considers only the transcriptome as a source of mapping. There are a number of ways to proceed on this front, but this is a good place to first check for discrepancy (and the paper gives a good overview of the relative tradeoffs and merits of different alignment approaches).

Salmon and RSEM use related but distinct optimization algorithms by default. RSEM uses the EM algorithm, and salmon uses the variational Bayesian EM algorithm. The latter tends to induce more sparse solutions. This is simply because they are optimizing slightly different objectives. It is very difficult to say in general if one is "better" than the other in a blanket way, but there is previous literature to support that the VBEM may be more accurate. However, while RSEM only implements the EM algorithm, salmon actually implements and provides a switch to use either. So, if you want to test the effect of this difference, you can run salmon with the --useEM algorithm. This will tell salmon to use the "classic" EM algorithm and will eliminate this source of variation.
As with the other question you asked, there may be a small discrepancy depending on when enforcement of a stranded library kicks in under salmon's A library type. You can eliminate that variable by simply providing -l SF to match the library type being used with RSEM.
Coming back to the IndelSoftclipSingleend parameter I mentioned in the first point; RSEM disallows indels in the alignments that it quantifies. This means that to produce RSEM-compatible input, STAR must not align reads that contain indels. While this won't generally have a big effect for many transcripts, it can certainly affect the abundance estimates for transcripts in your sample where the sample you are quantifying has (indel) variation with respect to the reference annotation. We touch upon that a bit as well in the paper I mentioned above.
Finally, and likely the smallest source of potential differences, is that there are other implementation details that differ between salmon and RSEM (e.g. exactly how the fragment length distribution is used to compute the effective transcript length, exactly how the alignment score of a read is used to ascertain conditional fragment assignment probabilities, etc.). Even when you set the parameters to make the methods as concordant as possible, you will still observe some differences. These will generally be much smaller and much more constrained than the differences you see in your plots, but they nonetheless exist since these are different methods.

These are the biggest potential sources that I can currently imagine for the differences you are seeing. I would recommend exploring them in approximately this order. If you run into any issues or have any questions as you investigate, please don't hesitate to follow up here or reach out.

COMBINE-lab / salmon

Salmon and RSEM TPM discrepancy #612