Difference in quantification using alignment-based mode

reganhayward commented 3 years ago

We are using Salmon to help allocate multimapped reads in dual RNA-seq data. What this means is we combine a host and pathogen genome (bacterial genome + host transcriptome) and also use a combined annotation file as well. Basically, we can then find out which reads uniquely map to each gene/transcript from both organisms.

The issue we are having is when using Salmon in alignment-based mode (using STAR), we are observing quite substantial variations in how ambiguous (multimapping) reads are being assigned.

We use the same quant command/params and the same input files - and we run the same data twice in different output locations. The correlations between quant files are generally really good ~ 99.9x (spearman).

However, the number of allocated reads to a small proportion of transcripts varies quite a lot.

For example:

ENST00000383869.1 goes from 7699.89 reads to 4871.19 between runs
ENST00000374675.7 goes from 1326.02 reads to 0 between runs

We initially noticed this when examining differentially expressed genes downstream, which were a bit unusual.

These two transcripts in particular have a high number of ambiguous counts (generally >1000) when examining the ambig_info.tsv file (the number of ambig reads does remain the same though)

I tried additional params and runs including: Accounting for biases: --seqBias --gcBias --posBias Including bootstrap information: --numBootstraps 100 --bootstrapReproject

Both params reduced the variation between runs slightly, but not enough to reduce the disparity.

We don't see this difference when running in selective alignment mode

We are using v.1.3.0 and single end reads.

STAR --runThreadN 30 --runMode genomeGenerate --genomeDir index/ --genomeFastaFiles $fasta --sjdbGTFfile $gff --sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang $sjdbOverhang

STAR --runThreadN 30 --genomeDir . --sjdbGTFfile $gff $readFilesCommand --readFilesIn $reads --outSAMtype BAM Unsorted --outSAMunmapped "Within" --outSAMattributes "Standard" --outFileNamePrefix $sample_name/$sample_name --sjdbGTFfeatureExon quant --sjdbGTFtagExonParentTranscript parent --quantMode TranscriptomeSAM --quantTranscriptomeBan "Singleend" --outFilterMultimapNmax 999 --outFilterType "BySJout" --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --winAnchorMultimapNmax 999

salmon index -p 30 -t gentrome.fa -i transcripts_index --decoys decoys.txt -k 21
salmon quant -p 30 -t host_pathogen_transcriptome.fa -l A -a bam_file.bam --incompatPrior 0.0 -o output_file

rob-p commented 3 years ago

Hi @reganhayward,

Thank you for the detailed report. It's interesting that this happens when running with STAR but not when running with selective alignment. However, salmon will attempt to solve the optimization problem with the alignments it is given, regardless of if those come from STAR or from it's built-in selective alignment. While I would generally expect these to be similar, the alignment algorithms are different; see e.g. the differences between SA/SAF & STAR here.

Nonetheless, it is possible that for a small subset of transcripts, the probabilistic allocations are so ambiguous, that you get large swings in the resulting quantification estimates based on tiny variations in where the optimization starts (which is, itself, stochastic due to the asynchronous nature of salmon's online inference phase).

One way we can test this hypothesis is as follows. You can run salmon with --numGibbsSamples 100 and -d. This will tell salmon to perform posterior Gibbs sampling (--numGibbsSamples 100) and to dump the range-factorized equivalence classes used for offline quantification (-d). The Gibbs sampling files will contain the traces for the transcripts in question over the various iterations of the sampling procedure. Transcripts where there is a tremendous amount of ambiguity will tend to have highly anti-correlated posterior samples, and similarly, if you were to consider the abundance output of these transcripts as a group, there would be a large reduction in inferential relative variance. In fact, we wrote a whole paper on this topic. Consider this example from that paper:

The x-axis is samples from the Gibbs chains, and the y-values denote the estimated number of reads assigned to both transcripts in each sample. The green line at the top is what you get if you sum the abundances of these two transcripts. The main point is that the inferential relative variance (adjusted ratio of the variance over the mean) is much smaller for the sum of these transcripts than for either individually. This is strong evidence that they are inherently uncertain given the read evidence and alignments used for quantification. The tool described in that paper, called terminus, is a tool for automatically finding such groups of transcripts.

Anyway, once you have the Gibbs samples in hand, we can walk you though how to do some assessment of these transcripts (tagging @hiraksarkar here since he's most likely to have access to scripts that will let us look at the posterior samples from individual transcripts). Similarly, if you can provide the quantification directory, we can help examine this too. If this is the case, that the posterior distributions are highly anti-correlated, it is likely that the ambiguity you are seeing is simply inherent given the alignments salmon is being provided. If you have the quantification folder resulting from the same sample using selective alignment, we could compare and contrast the two.

At that point, there are a few options depending on how deeply you want to dive. You could try to see how STAR and selective alignment are mapping differently to these transcripts. One potential difference is that STAR is a lot more happy to softclip reads, which selective alignment won't do by default (you can test the effect with the --softclipOverhangs to allow selective alignment to softclip reads that hang off the transcript end or --softclip to allow softclips anywhere). Note that selective alignment may still be a bit more conservative than STAR about softclips simply because of the nature of the scoring function it uses. This might give you a sense if one of these alignment methodologies is more consistent with your expectations in this case. Another option is to consider doing a grouping with terminus. This will reduce the set of "genes" that you can call as DE, because it will be happy to group together transcripts from different genes. However, it should help considerably in eliminating DE from highly-uncertain point estimates. Finally, you might consider performing DE with swish (cc @mikelove as he might have some input here) rather than DESeq2 (though we've typically been using swish at the transcript level rather than the gene level). Unlike DESeq2, swish will explicitly take into account the inferential uncertainty in the abundance estimates, using the Gibbs samples produced by salmon. This will allow it to avoid spurious DE calls that might otherwise occur when you have highly uncertain transcripts that, by chance, end up being assigned very different abundances in different samples / over different runs.

Sorry for the information dump, but I wanted to lay out what might be going on, how to assess it, and what some potential solutions might be. If you dive in to start investigating this, feel free to reach out in this issue along the way if you get stuck or have follow-up questions.

mikelove commented 3 years ago

Re: Swish, it's fairly similar to running DESeq2 or other Bioc methods:

y <- tximeta(coldata) # import data
y <- scaleInfReps(y) # scales estimated counts
y <- labelKeep(y) # low count filtering
set.seed(1)
y <- swish(y, x="condition") # DTE or DGE

reganhayward commented 3 years ago

Hi @rob-p and @hiraksarkar,

I've attached a link to the output from 4 runs as requested - same input files and params as previously mentioned, plus adding --numGibbsSamples 100 and -d I wasn't able to include the bootstrap folder due to the size. I can provide a separate link if required.

Alignment-based with STAR - 2 runs Selective alignment - 2 runs

A couple of transcripts we focused on were ENST00000374675.7 and ENST00000428189.5

For example:

--Selective alignment--
Transcript      NumReads_Run1   NumReads_Run2
ENST00000374675.7   1500.354    1500.354
ENST00000428189.5   441.447     441.447

--Alignment based--
Transcript      NumReads_Run1   NumReads_Run2
ENST00000374675.7   1501.394    0
ENST00000428189.5   866.566     0

If you can let me know what next steps you'd recommend to check where the ambiguity is coming from

Thanks!

rob-p commented 3 years ago

Thanks @reganhayward. I'm pinging @hiraksarkar here as well as he can help us dig into this. I do think it will be really useful to have the bootstrap (in this case Gibbs) folders for the runs, so we can load that data up and see what the posterior traces look like for these transcripts. If you can throw that up somewhere, then we can grab it as well.

hiraksarkar commented 3 years ago

@rob-p thanks for tagging me. I missed the primary discussion and went over it. @reganhayward , thanks for posting this interesting phenomenon. Checking for ambiguity is data driven, there are many sources of ambiguity,

Exon sharing (when both the transcripts share a splicing even and thereby share an exon.)
They originate from genes that ae paralogs of each other.
They share sequence of read length without an obvious clue (might stem some evolutionary event in the past.)

Looking into the ensemble database ENST00000374675.7 originates from ENSG00000112473 and ENST00000428189.5 belongs to ENSG00000229470.5. The second one comes from an alternate assembly according to ensemble from the same chromosome 6. There is a possibility that there is exon sharing. But I can not be sure before doing a blast etc.

There is a way from salmon to track this down. If you use -d option for dumping equivalence classes then, we can check for equivalence classes where these transcripts occur. If they co-occure in an equivalence class then we could be sure that it is due to read sharing. If you can provide me the equivalence class file (should be a smaller one), I can look into it. The gibbs/bootstraps would come handy once we know the equivalence classes, since it will reveal how the uncertain the estimates are for these two transcripts when compared to their equivalence class members.

Best Hirak

rob-p commented 3 years ago

Hi @hiraksarkar ,

The results @reganhayward included above are with -d, so we have the equivalence classes. The gibbs samples were big, so not included, but I’ve asked him to put them somewhere we can get them. With that, I think we can help look into what is causing this.

hiraksarkar commented 3 years ago

Hi @rob-p ,

Great, I missed the -d option before, looking into it now.

hiraksarkar commented 3 years ago

Hi @reganhayward ,

I just did an analysis on the equivalence class you reported. Both of them are connected to other transcripts, connecting to ENST00000374675.7

['ENST00000444757.5',
 'ENST00000374677.8',
 'ENST00000418477.6',
 'ENST00000465516.1',
 'ENST00000486338.5',
 'ENST00000463972.1']

and connecting to ENST00000428189.5

 ['ENST00000383450.3',
 'ENST00000441604.5',
 'ENST00000458592.1',
 'ENST00000417834.5',
 'ENST00000376621.7',
 'ENST00000462708.1',
 'ENST00000465483.1',
 'ENST00000433809.1',
 'ENST00000456550.1',
 'ENST00000450423.1',
 'ENST00000464231.1',
 'ENST00000492408.1',
 'ENST00000487166.1']

For STAR alignment they are,

 ['ENST00000374677.8',
 'ENST00000463972.1',
 'ENST00000423043.6',
 'ENST00000418477.6',
 'ENST00000486338.5',
 'ENST00000465516.1',
 'ENST00000474514.1',
 'ENST00000416369.6',
 'ENST00000441854.6',
 'ENST00000477713.5',
 'ENST00000497713.1',
 'ENST00000486222.1',
 'ENST00000443773.6',
 'ENST00000456261.6',
 'ENST00000487655.1',
 'ENST00000493496.5',
 'ENST00000460361.1',
 'ENST00000441953.6',
 'ENST00000431735.6',
 'ENST00000462670.1',
 'ENST00000491795.5',
 'ENST00000468378.1',
 'ENST00000383214.8',
 'ENST00000383213.8',
 'ENST00000486373.5',
 'ENST00000491702.1',
 'ENST00000478834.1',
 'ENST00000444757.5',
 'ENST00000429042.5',
 'ENST00000454420.5',
 'ENST00000425069.5',
 'ENST00000414757.5',
 'ENST00000414916.5']

and, the other transcript is connected to

 ['ENST00000376621.7',
 'ENST00000487166.1',
 'ENST00000383450.3',
 'ENST00000497332.1',
 'ENST00000441604.5',
 'ENST00000481420.1',
 'ENST00000487687.1',
 'ENST00000454829.5',
 'ENST00000490227.1',
 'ENST00000437917.5',
 'ENST00000481380.1',
 'ENST00000383596.6',
 'ENST00000488456.1',
 'ENST00000417834.5',
 'ENST00000486264.1',
 'ENST00000462708.1',
 'ENST00000478748.1',
 'ENST00000465483.1',
 'ENST00000478986.1',
 'ENST00000480572.1',
 'ENST00000497917.1',
 'ENST00000469494.1',
 'ENST00000489631.1',
 'ENST00000433809.1',
 'ENST00000456550.1',
 'ENST00000450423.1',
 'ENST00000435788.1',
 'ENST00000416639.1',
 'ENST00000443235.1',
 'ENST00000458592.1',
 'ENST00000430236.1',
 'ENST00000464231.1',
 'ENST00000496960.1',
 'ENST00000492408.1',
 'ENST00000479883.1',
 'ENST00000467241.1',
 'ENST00000483987.1',
 'ENST00000495838.1',
 'ENST00000467550.1']

Clearly, in the alignment-based is connected to a lot of other transcripts connected, so their different behaviors is expected. I think this makes the solution so unstable that EM assigns all the reads to one rather than distributing them to other members. We need to look at the actual bootstrap/gibbs to have more insight.

I would also like to add, as @rob-p suggested previously, this is a classic example where EM algorithm is not that reliable, b/c of uncertainty and terminus might be the best answer.

Here is the script for constructing the graph from the equivalence class file.

Best Hirak

reganhayward commented 3 years ago

Thanks for helping to look into this @hiraksarkar That seems like a logical outcome re the higher number of connected transcripts - interesting

As requested - I've added in the bootstrap folder - here are the new links: Selective alignment Alignment based - STAR

reganhayward commented 3 years ago

Hey guys - sorry to ask, but have you had any progress looking into this?

COMBINE-lab / salmon

Difference in quantification using alignment-based mode #600