COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
772 stars 162 forks source link

Mapping rate discrepancy by library type #779

Open jonahcullen opened 2 years ago

jonahcullen commented 2 years ago

Hello, before writing further, I want to say that I have read through previous related open/closed issues but could not find a suitable answer.

The problem: rather different mapping rates depending on library type.

I am working with 150+ RNA-seq samples from two dozen tissues. I generated a novel, non-redunant transcriptome from these samples (which also includes the known annotation from ensembl) using ryuto. Following filtering of likely artifacts, predicting non-coding, and lowly expressed transcripts (identified via salmon v1.8.0), a final transcriptome is generated. The selective alignment index (with decoys) is then used for a second and final round of quantification as

salmon quant \
    -p {threads} \
    --seqBias \
    --gcBias \
    -i {input.sal_idx} \
    -l A \
    --numGibbsSamples 100 \
    --dumpEq \
    -1 {r1} \
     -2 {r2} \
    -o {params.out_dir}

This is a rather heterogeneous collection of RNA-seq data with ~30% polyA selected and unstranded and ~70% rRNA-depleted and stranded so I expect some interesting results along the way. Included in this collection are a handful of samples with both library preps and this is where I first noticed the problem above. For example, one sample has a polyA mapping rate of 84.01% and a ribo-dep of 46.67% (portion of each _metainfo.json below). Both of these however had STAR uniquely mapped reads >90%. Overall, the polyA samples have salmon mapping rates between ~75-90% while the ribo-dep at ~35-60%. I checked for rRNA contamination with RSeQC for a couple of the ribo-dep samples but did not observe anything significant to explain the decrease.

I also want to note that these libraries were generated years apart and sequenced at different facilities so I am quite sure that is introducing a fair amount of variance. I guess I still I would have expected more similar mapping rates but apologies if this is the most obvious likely culprit!

My questions are

  1. What could be explaining this difference on the whole or between individuals with both preps?
  2. Are mapping rates between 35-60% generally considered acceptable? There are at least 10+ million reads mapped for every sample.

polyA.meta_info.json

    "num_processed": 29574448,
    "num_mapped": 24847168,
    "num_decoy_fragments": 876840,
    "num_dovetail_fragments": 186218,
    "num_fragments_filtered_vm": 1479709,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 8017093,
    "percent_mapped": 84.01566108689501,

ribo-dep.meta_info.json

    "num_processed": 19428407,
    "num_mapped": 9067403,
    "num_decoy_fragments": 2027057,
    "num_dovetail_fragments": 120937,
    "num_fragments_filtered_vm": 2303744,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 4366133,
    "percent_mapped": 46.67085160404556,

Thank you very much for your time and consideration!

gpertea commented 1 year ago

In my experience it is normal to get a much lower transcriptome mapping rate for rRNA-depleted samples vs polyA-selected samples.

I'm getting ~21% mapping rate (using Gencode 41 transcripts) on human brain RNAseq samples sequenced several years ago using rRNA-depletion protocol (older Illumina Ribo Zero kits). I was initially shocked (being used to seeing >90% mapping rates from HISAT2/STAR for these samples) but it turns out this is normal for this kind of samples, in this context. HISAT2 reports 96% mapping rate on the same samples, but QC metrics (rnaseqc) for these HISAT2 alignments (using the same Gencode annotation) show a 65% intronic rate and a %23.5 exonic rate (the rest being intergenic etc). So the exonic rate is getting close to what Salmon is showing (and what it measures), thus I suppose it makes sense to see such a low mapping rate for Salmon on these samples. (kallisto also reports ~21% pseudoaligned percentage on the same samples)

I am only a bit disappointed that when I use --validateMappings with decoy sequences (whole genome) added, the mapping rate goes down to about 16.7% -- as some reads map better to the decoys in that case (partially intronic reads etc.), but I also see a higher number of fragments entirely discarded because of alignment score (higher num_fragments_filtered_vm and much higher num_alignments_below_threshold_for_mapped_fragments_vm).

rob-p commented 1 year ago

Hi @gpertea,

One thought about losing reads to the decoys in this case is that the default strategy is to assign as decoy any read that maps strictly better to the decoy than the target (transcriptome). So, even if e.g. the intron describes a single extra base, then the read gets assigned as decoy rather than transcriptomic. This is actually something that is easy to customize the behavior of (i.e. to add some "slack" so that reads have to map better to the decoy by some threshold before being assigned to the decoy).

Out of curiosity, what is salmon's mapping rate on this sample without the decoy? When you add the decoy, how many reads are assigned to the decoy sequence? Not that I necessarily suspect anything awry, but we'd be interested in taking a look at such samples anyway; what's the accession for the one you mention above?

Thanks! Rob

gpertea commented 1 year ago

Thank you @rob-p for commenting, I presumed that could be the case for reads from unspliced pre-mRNAs that are even extending a small fraction into the introns (hence better scoring on the decoys). The 2 FASTQ files for one of the samples I was describing above can be found as R4171*.fastq.gz at this globus link: http://research.libd.org/globus/jhpce_bsp2-dlpfc/index.html. I used just the main chromosomes with Gencode v41 annotation (slightly "curated" to remove read-through and "retained intron" annotated transcripts).

I am attaching 3 meta_info.json outputs for the 3 ways I ran salmon on this sample:

It would be great to be able to use Salmon's "wicked fast" mapping engine to estimate intronic and intergenic reads at the same time, so I'm considering to make better use of the writeMappings output for that purpose, by preparing the decoys in a specific way (extracting intronic and intergenic sequences as distinctively labeled decoys and count the mappings to each label from Salmon's SAM output -- would that work?)

I am wondering, due to pre-mRNAs found in rRNA-depletion (ribo-zero) samples, it might be better to artifically add the unspliced transcripts into the mix along with the "reference" annotation transcripts, so they also get quantified during the EM-guided probabilistic distribution of reads across this mix of pre-mRNAs + mature RNAs in each locus.. What do you think?

pschumann4 commented 8 months ago

Hello, I have a similar question on this topic. I have a single-end sequenced forward stranded library. When I quantified to the Salmon indexed transcriptome using --libType SF, I got a mapping rate of 35.8655%, but when I changed the --libType to "A", I got a mapping rate of 58.7005%. The resulting estimated counts were identical, however.

What might explain this?

Also, I should note, Salmon correctly determined that the libType should be SF when using its automatic feature.

Thanks!