COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
776 stars 164 forks source link

Low Mapping Rate #814

Open umutcakir opened 1 year ago

umutcakir commented 1 year ago

Dear community,

I have 4 lane files for each sample. I want to run Alevin on these samples. Alevin runs without any error, but the mapping rate is too low (about 5%). I tried different k-mers, different versions of transcriptome datasets, and all protocols (--dropseq / --chromium / --chromiumV3), but my mapping rate is still too low.

Let me explain my workflow. These are my samples:

Sample1_S1_L001_I1_001.fastq.gz  Sample1_S1_L002_R1_001.fastq.gz  Sample1_S1_L003_R2_001.fastq.gz
Sample1_S1_L001_R1_001.fastq.gz  Sample1_S1_L002_R2_001.fastq.gz  Sample1_S1_L004_I1_001.fastq.gz
Sample1_S1_L001_R2_001.fastq.gz  Sample1_S1_L003_I1_001.fastq.gz  Sample1_S1_L004_R1_001.fastq.gz
Sample1_S1_L002_I1_001.fastq.gz  Sample1_S1_L003_R1_001.fastq.gz  Sample1_S1_L004_R2_001.fastq.gz

I generated single I1, R1 and R2 files by concatenating fastq.gz files across lanes; for instance, Sample1_S1_L001_R1_001.fastq.gz + Sample1_S1_L002_R1_001.fastq.gz + Sample1_S1_L003_R1_001.fastq.gz + Sample1_S1_L004_R1_001.fastq.gz -> Single_S1_L001_R1_001.fastq.gz

I indexed my transcriptome by salmon, and I used Alevin using the indexed transcriptome. I used the following command:

salmon alevin -l ISR \
-1 "Single_S1_L001_R1_001.fastq.gz" \
-2 "Single_S1_L001_R2_001.fastq.gz"  \
--chromiumV3  \
-i $my_index \
-p 48 \
-o outputfolder \
--tgMap $my_tgmap \
--expectCells 10000 \
--forceCells 10000

However, my mapping rate is only ~5%. I used cell ranger on the same data and the same transcriptome; the mapping rate is ~95%. Is there any bug on Alevin? How can I improve the mapping rate?

Thank you!

Alevin log:
[2022-12-03 15:43:12.004] [alevinLog] [info] Found 117342 transcripts(+0 decoys, +35 short and +0 duplicate names in the index)
[2022-12-03 15:43:12.072] [alevinLog] [info] Filled with 117377 txp to gene entries
[2022-12-03 15:43:12.083] [alevinLog] [info] Found all transcripts to gene mappings
[2022-12-03 15:43:12.097] [alevinLog] [info] Processing barcodes files (if Present)

[2022-12-03 15:58:16.214] [alevinLog] [info] Done barcode density calculation.
[2022-12-03 15:58:16.214] [alevinLog] [info] # Barcodes Used: [32m639154757[0m / [31m639154757[0m.
[2022-12-03 15:58:20.629] [alevinLog] [info] Total [32m5461[0m(has [32m1000[0m low confidence) barcodes
[2022-12-03 15:58:21.139] [alevinLog] [info] Done True Barcode Sampling
[2022-12-03 15:58:23.597] [alevinLog] [warning] Total 56.2839% reads will be thrown away because of noisy Cellular barcodes.
[2022-12-03 15:58:23.987] [alevinLog] [info] Done populating Z matrix
[2022-12-03 15:58:24.080] [alevinLog] [info] Total 188093 CB got sequence corrected
[2022-12-03 15:58:24.122] [alevinLog] [info] Done indexing Barcodes
[2022-12-03 15:58:24.122] [alevinLog] [info] Total Unique barcodes found: 16409283
[2022-12-03 15:58:24.122] [alevinLog] [info] Used Barcodes except Whitelist: 187465
[2022-12-03 15:58:24.389] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2022-12-03 15:58:24.389] [alevinLog] [info] parsing read library format
[2022-12-03 16:17:47.714] [alevinLog] [info] Starting optimizer

[2022-12-03 16:17:47.868] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2022-12-03 16:17:47.868] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 17192510.00 UMI after deduplicating.
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 1333800 BiDirected Edges.
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 181036 UniDirected Edges.
[2022-12-03 16:18:01.432] [alevinLog] [info] Clearing EqMap; Might take some time.
[2022-12-03 16:18:08.522] [alevinLog] [info] Starting white listing of 5460 cells
[2022-12-03 16:18:08.522] [alevinLog] [info] Starting to make feature Matrix
[2022-12-03 16:18:08.535] [alevinLog] [info] Done making feature Matrix
[2022-12-03 16:18:09.324] [alevinLog] [info] Finished white listing
[2022-12-03 16:18:09.513] [alevinLog] [info] Finished optimizer

Salmon_quant log:
[2022-12-03 15:43:11.767] [jointLog] [info] setting maxHashResizeThreads to 48
[2022-12-03 15:43:11.767] [jointLog] [info] Fragment incompatibility prior below threshold. 
Incompatible fragments will be ignored.
[2022-12-03 15:43:11.767] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2022-12-03 15:43:11.767] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2022-12-03 15:43:11.767] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes.
[2022-12-03 15:43:11.767] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2022-12-03 15:43:11.767] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2022-12-03 15:58:24.389] [jointLog] [info] There is 1 library.
[2022-12-03 15:58:24.435] [jointLog] [info] Loading pufferfish index
[2022-12-03 15:58:24.454] [jointLog] [info] Loading dense pufferfish index.
[2022-12-03 15:58:26.905] [jointLog] [info] done
[2022-12-03 15:58:26.973] [jointLog] [info] Index contained 117,377 targets
[2022-12-03 15:58:27.002] [jointLog] [info] Number of decoys : 0
[2022-12-03 16:17:44.963] [jointLog] [info] Computed 194,611 rich equivalence classes for further processing
[2022-12-03 16:17:44.963] [jointLog] [info] Counted 35,761,228 total reads in the equivalence classes
[2022-12-03 16:17:44.964] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2022-12-03 16:17:44.964] [jointLog] [warning] Found 9264 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2022-12-03 16:17:44.964] [jointLog] [info] Mapping rate = 5.59508%

[2022-12-03 16:17:44.964] [jointLog] [info] finished quantifyLibrary()
rob-p commented 1 year ago

Ping @k3yavi here. One thing to try @umutcakir would be to see if you get a similar mapping rate outputting a RAD file for processing with alevin-fry. For example if you ran:

salmon alevin -l ISR \
-1 "Single_S1_L001_R1_001.fastq.gz" \
-2 "Single_S1_L001_R2_001.fastq.gz"  \
--chromiumV3  \
--sketch \
-i $my_index \
-p 48 \
-o outputfolder \

what happens to the mapping rate then? Also, as long as you are careful to pass the files in the same relative order, you can directly pass multiple input read files to the salmon alevin command as a space separated list; you don't have to concatenate them manually.

--Rob