Low Mapping Rate - Githubissues

Dear community,

I have 4 lane files for each sample. I want to run Alevin on these samples. Alevin runs without any error, but the mapping rate is too low (about 5%). I tried different k-mers, different versions of transcriptome datasets, and all protocols (--dropseq / --chromium / --chromiumV3), but my mapping rate is still too low.

Let me explain my workflow. These are my samples:

Sample1_S1_L001_I1_001.fastq.gz  Sample1_S1_L002_R1_001.fastq.gz  Sample1_S1_L003_R2_001.fastq.gz
Sample1_S1_L001_R1_001.fastq.gz  Sample1_S1_L002_R2_001.fastq.gz  Sample1_S1_L004_I1_001.fastq.gz
Sample1_S1_L001_R2_001.fastq.gz  Sample1_S1_L003_I1_001.fastq.gz  Sample1_S1_L004_R1_001.fastq.gz
Sample1_S1_L002_I1_001.fastq.gz  Sample1_S1_L003_R1_001.fastq.gz  Sample1_S1_L004_R2_001.fastq.gz

I generated single I1, R1 and R2 files by concatenating fastq.gz files across lanes; for instance, Sample1_S1_L001_R1_001.fastq.gz + Sample1_S1_L002_R1_001.fastq.gz + Sample1_S1_L003_R1_001.fastq.gz + Sample1_S1_L004_R1_001.fastq.gz -> Single_S1_L001_R1_001.fastq.gz

I indexed my transcriptome by salmon, and I used Alevin using the indexed transcriptome. I used the following command:

salmon alevin -l ISR \
-1 "Single_S1_L001_R1_001.fastq.gz" \
-2 "Single_S1_L001_R2_001.fastq.gz"  \
--chromiumV3  \
-i $my_index \
-p 48 \
-o outputfolder \
--tgMap $my_tgmap \
--expectCells 10000 \
--forceCells 10000

However, my mapping rate is only ~5%. I used cell ranger on the same data and the same transcriptome; the mapping rate is ~95%. Is there any bug on Alevin? How can I improve the mapping rate?

Thank you!

Alevin log:
[2022-12-03 15:43:12.004] [alevinLog] [info] Found 117342 transcripts(+0 decoys, +35 short and +0 duplicate names in the index)
[2022-12-03 15:43:12.072] [alevinLog] [info] Filled with 117377 txp to gene entries
[2022-12-03 15:43:12.083] [alevinLog] [info] Found all transcripts to gene mappings
[2022-12-03 15:43:12.097] [alevinLog] [info] Processing barcodes files (if Present)

[2022-12-03 15:58:16.214] [alevinLog] [info] Done barcode density calculation.
[2022-12-03 15:58:16.214] [alevinLog] [info] # Barcodes Used: [32m639154757[0m / [31m639154757[0m.
[2022-12-03 15:58:20.629] [alevinLog] [info] Total [32m5461[0m(has [32m1000[0m low confidence) barcodes
[2022-12-03 15:58:21.139] [alevinLog] [info] Done True Barcode Sampling
[2022-12-03 15:58:23.597] [alevinLog] [warning] Total 56.2839% reads will be thrown away because of noisy Cellular barcodes.
[2022-12-03 15:58:23.987] [alevinLog] [info] Done populating Z matrix
[2022-12-03 15:58:24.080] [alevinLog] [info] Total 188093 CB got sequence corrected
[2022-12-03 15:58:24.122] [alevinLog] [info] Done indexing Barcodes
[2022-12-03 15:58:24.122] [alevinLog] [info] Total Unique barcodes found: 16409283
[2022-12-03 15:58:24.122] [alevinLog] [info] Used Barcodes except Whitelist: 187465
[2022-12-03 15:58:24.389] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2022-12-03 15:58:24.389] [alevinLog] [info] parsing read library format
[2022-12-03 16:17:47.714] [alevinLog] [info] Starting optimizer

[2022-12-03 16:17:47.868] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2022-12-03 16:17:47.868] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 17192510.00 UMI after deduplicating.
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 1333800 BiDirected Edges.
[2022-12-03 16:18:01.362] [alevinLog] [info] Total 181036 UniDirected Edges.
[2022-12-03 16:18:01.432] [alevinLog] [info] Clearing EqMap; Might take some time.
[2022-12-03 16:18:08.522] [alevinLog] [info] Starting white listing of 5460 cells
[2022-12-03 16:18:08.522] [alevinLog] [info] Starting to make feature Matrix
[2022-12-03 16:18:08.535] [alevinLog] [info] Done making feature Matrix
[2022-12-03 16:18:09.324] [alevinLog] [info] Finished white listing
[2022-12-03 16:18:09.513] [alevinLog] [info] Finished optimizer

Salmon_quant log:
[2022-12-03 15:43:11.767] [jointLog] [info] setting maxHashResizeThreads to 48
[2022-12-03 15:43:11.767] [jointLog] [info] Fragment incompatibility prior below threshold. 
Incompatible fragments will be ignored.
[2022-12-03 15:43:11.767] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.
[2022-12-03 15:43:11.767] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2022-12-03 15:43:11.767] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter.  Disabling range-factorized equivalence classes.
[2022-12-03 15:43:11.767] [jointLog] [info] Setting consensusSlack to selective-alignment default of 0.35.
[2022-12-03 15:43:11.767] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin
Using default value of 0.6 for consensusSlack in Alevin
[2022-12-03 15:58:24.389] [jointLog] [info] There is 1 library.
[2022-12-03 15:58:24.435] [jointLog] [info] Loading pufferfish index
[2022-12-03 15:58:24.454] [jointLog] [info] Loading dense pufferfish index.
[2022-12-03 15:58:26.905] [jointLog] [info] done
[2022-12-03 15:58:26.973] [jointLog] [info] Index contained 117,377 targets
[2022-12-03 15:58:27.002] [jointLog] [info] Number of decoys : 0
[2022-12-03 16:17:44.963] [jointLog] [info] Computed 194,611 rich equivalence classes for further processing
[2022-12-03 16:17:44.963] [jointLog] [info] Counted 35,761,228 total reads in the equivalence classes
[2022-12-03 16:17:44.964] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0
[2022-12-03 16:17:44.964] [jointLog] [warning] Found 9264 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2022-12-03 16:17:44.964] [jointLog] [info] Mapping rate = 5.59508%

[2022-12-03 16:17:44.964] [jointLog] [info] finished quantifyLibrary()

COMBINE-lab / salmon

Low Mapping Rate #814