COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
762 stars 160 forks source link

odd case of unexpressed transcript in some samples getting too many counts... #133

Open jdrnevich opened 7 years ago

jdrnevich commented 7 years ago

Hello,

I've been comparing Salmon (v 0.8.2) counts with those of traditional STAR/featureCounts and found one bizarre example. The unique aligned counts and the Salmon counts (rounded) for my 3 sets of 3 samples are:

        C_1 C_2 C_3 pl_C_4 pl_C_5 pl_C_6 CpG_7 CpG_8 CpG_9
STAR     0   1   1     70     31     41     2     0     0
Salmon 147 139   1     80     41     47     2    76    95

From the STAR counts, it looks like this gene is only expressed in the middle group of 3 samples, But Salmon has given 4/6 other samples large numbers of counts. There is only one transcript for this gene, which produces RING finger protein (large gene family). The estimated lengths for the transcript are:

  C_1     C_2    C_3   pl_C_4  pl_C_5  pl_C_6   CpG_7   CpG_8   CpG_9 
250.0   250.0 17385.6 17381.2 17308.9 17356.8 17415.0   250.0   250.0

So those 4/6 with large numbers of counts have very short estimated gene lengths, which I'm guessing is just shared sequence with other RING finger protein genes. The two samples that had only 1 and 2 unique reads must have had them in a place that led Salmon to estimate the proper long transcript length, and to prevent the EM algorithm from assigning multiply mapping reads to the shared 250 bp. The combination of ~100 reads plus very short length leads to huge, incorrect abundances for these 4 samples! This is the only example I can find in the whole data set, so it must be a weird combination of the transcript only being expressed at a low-level in one group plus a large gene family. Is there anyway this could be avoided by your algorithm? Thanks, Jenny

rob-p commented 7 years ago

Hi Jenny,

Thanks for this detailed report. I'd be interested in taking a look at some of the data if it is public or possible to share for the purposes of testing. It is true that if a read is equally explicable by a short and long transcript (and if there is not a lot of other read evidence of the long transcript), the inference algorithm will prefer to assign the read to the shorter isoform, since this will increase the overall likelihood of the observed sequenced fragments. Of course, one could always run salmon's Gibbs sampler --numGibbsSamples or bootstrap sampler --numBootstraps to determine the variance in these point estimates. I'm on travel for the next couple of days, but will be happy to look into this more deeply when I return.

Best, Rob

DarioS commented 4 months ago

I am also seeing this. The most highly estimated and shortest isoform of KIT has no supporting read but a TPM of 40.

image

Using DRAGEN RNA version 4.2.4 which I think wraps Salmon. In any of your BAM files?