Closed crazyhottommy closed 5 years ago
not sure what's the best way to share the 34G fastqs for your test.
I think I see the issue, and we might not actually need the data.
The issue being the list of CBs in whitelist.txt
is too many. It seems the data is very noisy and knee
finding algorithm is failing. I'd suggest to use --dumpFeatures
flag with alevin (you can also do --noQuant
to stop alevin before mapping). This flag will generate raw_cb_frequency.txt
file i.e. the frequency of all the observed CB and making the histogram will help visually find if there is possible knee in the data, generally it is in a real experiment.
re: the mapping rate, in the alevin log you'll observe there were 50% of the CB which were thrown away if you use forceCells 3000
w/o whitelist, basically alevin is taking top 3000
cells into consideration and throwing everything away. While the rise in mapping rate when externally provided with whitelist is actually by allowing more CB to go through. The two options forceCells
and whitelist
wan't intended to be use simultaneously because if provided with external whitelist alevin assumes the user is confident with the set of CBs, in your case this list is huge (~700k). Are you using the 10x whitelist from their website? If yes, then it's not what alevin expect with --whitelist
option.
One work around here would be to look at the CB frequency histogram and making a file with only the CB sequences which you thing are above knee. Providing that file as the --whitelist
is the intended use case for alevin.
Hope this makes sense.
yes, I am using the 10x whitelist downloaded from their website. Thanks for clarifying it. so should I not use --whitelist at all and let alevin determine what BCs to use? 50% of reads throwing away seems to be too much. what percentage do you observe? Thanks! without whitelist and use --expectCells 3000 gives me error] Can't find right Boundary. I should do --dumpFeatures and --noQuant to get the whitelist first?
Hi @crazyhottommy ,
Reading a bit in https://github.com/COMBINE-lab/salmon/issues/245 and https://github.com/COMBINE-lab/salmon/issues/284 would definitely help understand more about the pipeline.
summary: it's possible the 50% mapping read is due to alevin throwing away CB. In this case, we normally suggest to try running alevin in the following order:
1.) --expectCells X
: alevin will look for local knee threshold near to top X barcodes. Alevin can still here based on the CB frequency distribution.
2.) --forceCells X
: run alevin with --noQuant --dumpFeature
mode and extract the frequency histogram of the CBs present in the histogram. Try to manually found knee in the descending sorted CB frequency histogram and figure out X. alevin will use top X barcodes as specified by the user.
3.) If there is other tools like cellranger
already been run on the data, then alevin can directly consume their predicted CB sequence using --whitelist
option. This is different file, not the 727k file from 10x, and is different for each experiment .
Thanks for the links. closing now.
Hi @k3yavi ,
I'm using alevin to process 10X V3 data and encountered similar problem with this issue.
I've tried run the pipeline using default whitelisting by alevin, --whitelist barcode.txt
which from cellranger v.3.1.0 run (including 7938 barcodes), --expectCells 10000
, and --expectCells 30000
.
But no matter how I change the parameter, the log shows that there are always about 50% percent reads has been thrown away, and the mapping rate was between 18.7%-19.1%.
the salmon version is salmon 1.4.0
the reference genome is sequenced by ourselves, and it's a plant.
my reads layout is paired end 150bp,
R1: @A00582:424:HJYLGDSXY:3:1101:1090:1000 1:N:0:ACCGGCTC TAACCAGGTCGAGTGAGTATTTAAGGCGCGCGGCGCACCAACGCACTCCCAACAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA + FFFFFFFFFFFFFFFFFFFFFFFFFFFF,,:,,FF:,,,::FF,,:F,,,,,,F:,,,:,::FF::::::,FFF:F:FF:FFFFFFF::FF::FF,F:F:FF:F,FFFF,:FF,FFFFF:,FF:::FF:FFF:FF:FF:FFFFFFFFFF: R2: @A00582:424:HJYLGDSXY:3:1101:1090:1000 2:N:0:ACCGGCTC NCCTAGAAGCAGCCACCCTTGAAAGAGTGCGTAATAGCTCACTGATCGAGCGCTCTTGCGCCGAAGATGAACGGGGCTAAGCGATCTGCCGAAGCTGTGGGATGTAAAAATACATCGGTAGGGGAGCGTTCCGCCTTAGAGAGAAGCCTC +
FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF::FFFFFFFFFF:FFFFFFFFFFFFF:FFFFFF:FFFFFFF:
Here is the logs
salmon alevin -l ISR -1 ../clean/sample_S1_L001_R1_001.fastq -2 ../clean/sample_S1_L001_R2_001.fastq --chromiumV3 -i ../../dna/00.ref_genome/sample/salmon_index_allmRNA -p 40 -o test_alevin_allmRNA --tgMap ../../dna/00.ref_genome/sample/alltxp2gene.tsv
[2021-01-25 16:22:09.565] [alevinLog] [info] Found 43030 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2021-01-25 16:22:09.615] [alevinLog] [info] Filled with 43030 txp to gene entries [2021-01-25 16:22:09.620] [alevinLog] [info] Found all transcripts to gene mappings [2021-01-25 16:22:09.631] [alevinLog] [info] Processing barcodes files (if Present) [2021-01-25 16:26:35.067] [alevinLog] [info] Done barcode density calculation. [2021-01-25 16:26:35.067] [alevinLog] [info] # Barcodes Used: 188934609 / 188934609. [2021-01-25 16:26:42.979] [alevinLog] [info] Knee found left boundary at 21 [2021-01-25 16:27:05.707] [alevinLog] [warning] Gauss Prediction 4969 Too far from knee prediction skipping it [2021-01-25 16:27:05.707] [alevinLog] [info] Learned InvCov: 556.394 normfactor: 9159.58 [2021-01-25 16:27:05.707] [alevinLog] [info] Total 222(has 201 low confidence) barcodes [2021-01-25 16:27:06.573] [alevinLog] [info] Done True Barcode Sampling [2021-01-25 16:27:07.383] [alevinLog] [warning] Total 96.7029% reads will be thrown away because of noisy Cellular barcodes. [2021-01-25 16:27:07.412] [alevinLog] [info] Done populating Z matrix [2021-01-25 16:27:07.414] [alevinLog] [info] Total 3667 CB got sequence corrected [2021-01-25 16:27:07.414] [alevinLog] [info] Done indexing Barcodes [2021-01-25 16:27:07.414] [alevinLog] [info] Total Unique barcodes found: 3896665 [2021-01-25 16:27:07.414] [alevinLog] [info] Used Barcodes except Whitelist: 3667 [2021-01-25 16:27:07.498] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify [2021-01-25 16:27:07.498] [alevinLog] [info] parsing read library format [2021-01-25 16:30:54.542] [alevinLog] [info] Starting optimizer [2021-01-25 16:30:54.782] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2021-01-25 16:30:54.782] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2021-01-25 16:30:55.950] [alevinLog] [info] Total 1350278.00 UMI after deduplicating. [2021-01-25 16:30:55.950] [alevinLog] [info] Total 30909 BiDirected Edges. [2021-01-25 16:30:55.950] [alevinLog] [info] Total 8817 UniDirected Edges. [2021-01-25 16:30:55.969] [alevinLog] [info] Clearing EqMap; Might take some time. [2021-01-25 16:30:56.294] [alevinLog] [warning] Num High confidence barcodes too less 20 < 90.Can't performing whitelisting; Skipping [2021-01-25 16:30:56.297] [alevinLog] [info] Finished optimizer
--exceptCells 7000
[2021-01-21 09:24:45.891] [alevinLog] [info] Found 43030 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2021-01-21 09:24:45.942] [alevinLog] [info] Filled with 43030 txp to gene entries [2021-01-21 09:24:45.947] [alevinLog] [info] Found all transcripts to gene mappings [2021-01-21 09:24:45.967] [alevinLog] [info] Processing barcodes files (if Present) [2021-01-21 09:33:35.885] [alevinLog] [info] Done barcode density calculation. [2021-01-21 09:33:35.885] [alevinLog] [info] # Barcodes Used: 188934609 / 188934609. [2021-01-21 09:33:37.337] [alevinLog] [info] Total 10016(has 1000 low confidence) barcodes [2021-01-21 09:33:38.202] [alevinLog] [info] Done True Barcode Sampling [2021-01-21 09:33:39.137] [alevinLog] [warning] Total 52.0343% reads will be thrown away because of noisy Cellular barcodes. [2021-01-21 09:33:39.960] [alevinLog] [info] Done populating Z matrix [2021-01-21 09:33:39.989] [alevinLog] [info] Total 34923 CB got sequence corrected [2021-01-21 09:33:39.994] [alevinLog] [info] Done indexing Barcodes [2021-01-21 09:33:39.994] [alevinLog] [info] Total Unique barcodes found: 3896665 [2021-01-21 09:33:39.994] [alevinLog] [info] Used Barcodes except Whitelist: 34503 [2021-01-21 09:33:40.718] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify [2021-01-21 09:33:40.718] [alevinLog] [info] parsing read library format [2021-01-21 09:48:11.430] [alevinLog] [info] Starting optimizer [2021-01-21 09:48:12.160] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2021-01-21 09:48:12.160] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2021-01-21 09:48:36.288] [alevinLog] [info] Total 19031525.00 UMI after deduplicating. [2021-01-21 09:48:36.288] [alevinLog] [info] Total 454402 BiDirected Edges. [2021-01-21 09:48:36.288] [alevinLog] [info] Total 113688 UniDirected Edges. [2021-01-21 09:48:36.288] [alevinLog] [warning] Skipped 44 barcodes due to No mapped read [2021-01-21 09:48:36.307] [alevinLog] [info] Clearing EqMap; Might take some time. [2021-01-21 09:48:41.314] [alevinLog] [info] Starting white listing of 9971 cells [2021-01-21 09:48:41.314] [alevinLog] [info] Starting to make feature Matrix [2021-01-21 09:48:41.337] [alevinLog] [info] Done making feature Matrix [2021-01-21 09:48:41.557] [alevinLog] [info] Finished white listing [2021-01-21 09:48:41.580] [alevinLog] [info] Finished optimizer
{ "total_reads": 188934609, "reads_with_N": 0, "noisy_cb_reads": 98310747, "noisy_umi_reads": 16600, "used_reads": 90607262, "mapping_rate": 18.89108045842464, "reads_in_eqclasses": 35691789, "total_cbs": 3896665, "used_cbs": 44518, "initial_whitelist": 9015, "low_conf_cbs": 1000, "num_features": 5, "no_read_mapping_cbs": 44, "final_num_cbs": 6765, "deduplicated_umis": 19031525, "mean_umis_per_cell": 2813, "mean_genes_per_cell": 1315 }
--exceptCells 30000
... [2021-01-23 11:07:52.910] [alevinLog] [info] Done barcode density calculation. [2021-01-23 11:07:52.910] [alevinLog] [info] # Barcodes Used: 188934609 / 188934609. [2021-01-23 11:07:54.387] [alevinLog] [info] Total 12507(has 995 low confidence) barcodes [2021-01-23 11:07:55.251] [alevinLog] [info] Done True Barcode Sampling [2021-01-23 11:07:56.200] [alevinLog] [info] Total 49.0191% reads will be thrown away because of noisy Cellular barcodes. [2021-01-23 11:07:57.144] [alevinLog] [info] Done populating Z matrix [2021-01-23 11:07:57.172] [alevinLog] [info] Total 35787 CB got sequence corrected [2021-01-23 11:07:57.177] [alevinLog] [info] Done indexing Barcodes [2021-01-23 11:07:57.177] [alevinLog] [info] Total Unique barcodes found: 3896665 [2021-01-23 11:07:57.177] [alevinLog] [info] Used Barcodes except Whitelist: 35219 [2021-01-23 11:07:57.360] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify ....
{ "total_reads": 188934609, "reads_with_N": 0, "noisy_cb_reads": 92614076, "noisy_umi_reads": 17028, "used_reads": 96303505, "mapping_rate": 19.451325087824434, "reads_in_eqclasses": 36750285, "total_cbs": 3896665, "used_cbs": 47725, "initial_whitelist": 11511, "low_conf_cbs": 995, "num_features": 5, "no_read_mapping_cbs": 70, "final_num_cbs": 8324, "deduplicated_umis": 19613485, "mean_umis_per_cell": 2356, "mean_genes_per_cell": 1120 }
By the way, the cellranger result shows that reads map to Transcriptome is low, but reads mapped to Genome is 85%
Reads Mapped to Genome | 85.2% |
---|---|
Reads Mapped Confidently to Genome | 45.8% |
Reads Mapped Confidently to Intergenic Regions | 11.0% |
Reads Mapped Confidently to Intronic Regions | 4.2% |
Reads Mapped Confidently to Exonic Regions | 30.6% |
Reads Mapped Confidently to Transcriptome | 25.3% |
Reads Mapped Antisense to Gene | 0.9% |
Estimated Number of Cells | 7,938 |
---|---|
Fraction Reads in Cells | 73.1% |
Mean Reads per Cell | 23,801 |
Median Genes per Cell | 1,076 |
Total Genes Detected | 17,492 |
Median UMI Counts per Cell | 2,155 |
Best wishes, Matthew
Hi,
I am using Alevin to process some 10x v2 data.
when not providing whitelist the mapping rate is low 20%, and many reads are discarded because of the noisy barcodes.
not providing whitelist, forceCells 3000
no whitelist, expectCells 3000
whitelist forceCells 3000
The mapping rate was boost to 37%, but now the forceCells and expectCells seems not work, alevin still processed over 400,000 cells.