COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0

[Alevin] alevin quantification produces a lot fewer cells than cellranger count #396

Closed: ChelseaCHENX closed this issue 4 years ago

ChelseaCHENX commented 5 years ago

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)? alevin

Describe the bug
alevin quantification produces a lot fewer cells than cellranger count

To Reproduce
I ran alevin and cellranger count on the same sample and compared the expression-matrix shape from alevin with the HTML report from cellranger count. The former has far fewer columns (i.e., cells) than the estimated cell number reported by the latter.

Command for alevin

/zfs1/alee/nolan/apps/salmon/salmon-0.12.0_linux_x86_64/bin/salmon alevin -l ISR \
    -i /zfs1/alee/nolan/apps/salmon/salmon-index/GRCh38-82-31-kmer-v0.12.0/GRCh38-82_index \
    -1 [...]/*R1_001.fastq.gz \
    -2 [...]/*R2_001.fastq.gz \
    --chromiumV3 -p 12 \
    -o [output folder] \
    --tgMap /zfs1/alee/nolan/apps/salmon/tx2gene_GRCh38-82.tsv

Command for cellranger count

cellranger count --id=AL9 --sample=AL9 \
    --transcriptome=/zfs1/alee/nolan/apps/cellranger/refdata-cellranger-GRCh38-3.0.0 \
    --fastqs=/bgfs/alee/10Xseq/1907_10X/mkfastq/HCN3JDSXX/outs/fastq_path \
    --expect-cells=8000

For three samples, the alevin cell numbers are 1192, 4947, and 3414, while the corresponding cellranger outputs are 5150, 7618, and 6404.

I wonder what the problem is and how to deal with this large inconsistency. I suppose it could be a barcode-demultiplexing problem, since I did not use a whitelist, but I'm not sure.

Hope to get some help - appreciate it!

k3yavi commented 5 years ago

Hi @ChelseaCHENX ,

Thanks for raising the issue. If you can share the alevin log (say, for the 1192-cell sample), we can comment on the details in much more depth. If you ask me to guess, though, I believe alevin's initial whitelisting is predicting far fewer cells than it should. If you check the alevin log, it reports what percentage of reads are thrown away because of noisy cellular barcodes (CBs); if that number is >20%, chances are the "knee" estimate is indeed shooting up and cutting off too many cells. The way to get better estimates from there is to give alevin a ballpark number of cells (just as you give cellranger --expect-cells=8000, you can pass alevin --expectCells 8000). If you still get a lot of noisy CB predictions after that, you can force alevin to use a fixed number of cells with the --forceCells option.

This earlier issue might help you understand the details of the pipeline: https://github.com/COMBINE-lab/salmon/issues/362. Hope it helps!
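
For concreteness, here is a minimal sketch of the adjusted invocation: the same alevin command as in the issue description (elided paths left as-is), assuming salmon is on the PATH, with only the suggested flag appended.

salmon alevin -l ISR \
    -i /zfs1/alee/nolan/apps/salmon/salmon-index/GRCh38-82-31-kmer-v0.12.0/GRCh38-82_index \
    -1 [...]/*R1_001.fastq.gz -2 [...]/*R2_001.fastq.gz \
    --chromiumV3 -p 12 -o [output folder] \
    --tgMap /zfs1/alee/nolan/apps/salmon/tx2gene_GRCh38-82.tsv \
    --expectCells 8000   # ballpark cell count, mirroring cellranger's --expect-cells=8000
# If the noisy-CB fraction stays high even with --expectCells, --forceCells 8000
# (or another value) pins the number of cells used, as mentioned above.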

ChelseaCHENX commented 5 years ago

Hey Avi, thanks for the quick reply! Here is the salmon_quant.log file:

[2019-07-09 09:07:39.153] [jointLog] [info] Fragment incompatibility prior below threshold.  Incompatible fragments will be ignored.
[2019-07-09 09:07:39.153] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65
[2019-07-09 09:07:39.153] [jointLog] [info] Usage of --validateMappings implies use of range factorization. rangeFactorizationBins is being set to 4
[2019-07-09 09:07:39.153] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 1. Setting consensusSlack to 1.
[2019-07-09 09:07:39.153] [jointLog] [info] Using default value of 0.8 for minScoreFraction in Alevin
[2019-07-09 09:17:08.128] [jointLog] [info] There is 1 library.
[2019-07-09 09:17:08.180] [jointLog] [info] Loading Quasi index
[2019-07-09 09:17:08.180] [jointLog] [info] Loading 32-bit quasi index
[2019-07-09 09:17:14.970] [jointLog] [info] done
[2019-07-09 09:17:14.970] [jointLog] [info] Index contained 197,787 targets
[2019-07-09 10:02:20.484] [jointLog] [info] Computed 251,090 rich equivalence classes for further processing
[2019-07-09 10:02:20.484] [jointLog] [info] Counted 348,673,166 total reads in the equivalence classes 
[2019-07-09 10:02:20.485] [jointLog] [warning] Found 1893 reads with `N` in the UMI sequence and ignored the reads.
Please report on github if this number is too large
[2019-07-09 10:02:20.485] [jointLog] [info] Mapping rate = 39.7151%

[2019-07-09 10:02:20.485] [jointLog] [info] finished quantifyLibrary()
k3yavi commented 5 years ago

Sorry for the confusion; what I meant was the alevin log. It should be inside the alevin folder of your output directory, with the name alevin.log.

ChelseaCHENX commented 5 years ago

Got it - here it is:

[2019-07-09 09:07:39.162] [alevinLog] [info] Processing barcodes files (if Present) 

[2019-07-09 09:16:59.454] [alevinLog] [info] Done barcode density calculation.
[2019-07-09 09:16:59.454] [alevinLog] [info] # Barcodes Used: 877102495 / 877935734.
[2019-07-09 09:17:06.234] [alevinLog] [info] Knee found left boundary at  4375 
[2019-07-09 09:17:07.572] [alevinLog] [info] Gauss Corrected Boundary at  795 
[2019-07-09 09:17:07.572] [alevinLog] [info] Learned InvCov: 173.265 normfactor: 1097.45
[2019-07-09 09:17:07.597] [alevinLog] [info] Total 41.2673% reads will be thrown away because of noisy Cellular barcodes.
[2019-07-09 09:17:07.597] [alevinLog] [info] Total 1192(has 397 low confidence) barcodes
[2019-07-09 09:17:07.765] [alevinLog] [info] Done True Barcode Sampling
[2019-07-09 09:17:08.039] [alevinLog] [info] Done populating Z matrix
[2019-07-09 09:17:08.067] [alevinLog] [info] Done indexing Barcodes
[2019-07-09 09:17:08.067] [alevinLog] [info] Total Unique barcodes found: 7881525
[2019-07-09 09:17:08.067] [alevinLog] [info] Used Barcodes except Whitelist: 84951
[2019-07-09 09:17:08.128] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2019-07-09 09:17:08.128] [alevinLog] [info] parsing read library format
[2019-07-09 10:02:26.992] [alevinLog] [info] Starting optimizer

[2019-07-09 10:13:56.661] [alevinLog] [info] Total 99488568.00 UMI after deduplicating.
[2019-07-09 10:13:56.701] [alevinLog] [info] Clearing EqMap; Might take some time.
[2019-07-09 10:14:11.020] [alevinLog] [info] Starting Import of the gene count matrix of size 1192x60053.
[2019-07-09 10:14:11.286] [alevinLog] [info] Done initializing the empty matrix.
[2019-07-09 10:14:13.421] [alevinLog] [info] Done Importing gene count matrix for dimension 1192x60053
[2019-07-09 10:14:13.622] [alevinLog] [info] Starting white listing
[2019-07-09 10:14:13.627] [alevinLog] [info] Done importing order of barcodes "quants_mat_rows.txt" file.
[2019-07-09 10:14:13.627] [alevinLog] [info] Total 1192 barcodes found
[2019-07-09 10:14:13.627] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting
[2019-07-09 10:14:13.627] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting
[2019-07-09 10:14:13.627] [alevinLog] [info] Starting to make feature Matrix
[2019-07-09 10:14:13.885] [alevinLog] [info] Done making regular featues
[2019-07-09 10:14:13.885] [alevinLog] [info] Done making feature Matrix
[2019-07-09 10:14:13.891] [alevinLog] [info] Finished white listing
[2019-07-09 10:14:13.909] [alevinLog] [info] Finished optimizer

Indeed, the fraction of CBs thrown away is huge. I might need to try your --expectCells/--forceCells options, but I'm not sure how this parameter influences the final output, i.e., how to select a feasible number; 8000 might not be optimal, strictly speaking.

Thanks! Chelsea

k3yavi commented 5 years ago

Yep, check this line in the log:

Total 41.2673% reads will be thrown away because of noisy Cellular barcodes.

I think the CB frequency distribution is most probably bimodal. I'd suggest trying the --expectCells 8000 command-line flag along with the regular command you are using above, and checking whether the number in the log comes down to ~15%.
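
As a quick sanity check after the re-run, the relevant line can be pulled from alevin.log (located as described above, in the alevin folder of the output directory; the output-folder placeholder is from the original command):

grep "thrown away" "[output folder]/alevin/alevin.log"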

ChelseaCHENX commented 5 years ago

Sure - will keep you posted - thanks!

ChelseaCHENX commented 5 years ago

Hey Avi,

I have re-run alevin with the --expectCells 8000 flag added. The new numbers of cells detected are 3655, 5604, and 4374, with 13%, 30%, and 15.7% of reads thrown away, respectively. That is better than the first trial (1192, 4947, 3414) but still fewer than the cellranger output (5150, 7618, 6404).

I wonder:

  1. whether I should set --expectCells higher, even though that would presumably result in more low-confidence calls;
  2. following from 1, if I simply try to get more cells, will the expression matrix (and further analysis) be inaccurate or otherwise affected? (Downstream filtering of low-quality cells based on the number of features detected, etc., would be performed anyway.)
  3. what could be the reason these two algorithms output such different total cell numbers (precision of cell calling?)

Thanks! Chelsea

k3yavi commented 5 years ago

Hi @ChelseaCHENX ,

Thanks for confirming the counts of predicted cells. As much as I'd love to give you an exact answer, in reality, with the current settings, this needs a little more exploratory analysis. Whitelisting is traditionally done by making some greedy choices, and it generally results in different numbers of predicted cells across tools, so an exact answer is hard to come by. For example, if you run alevin with --dumpFeatures and plot the frequencies of the CBs, as dumped in raw_cb_frequency.txt, you will observe a monotonically non-increasing function. Different tools try to find the "knee" in that distribution, and so does alevin in its first round of whitelisting. Cellranger, at least in my understanding, takes the top X% (I think it's 10) of the value supplied through --expect-cells as high-confidence barcodes and then quantifies every CB whose frequency is greater than the lowest frequency among those high-confidence barcodes. To counter the greediness of the CB calling, the method we suggest for alevin uses a naive-Bayes-based approach that learns not only from CB frequency but from various other features. There are also other methods, such as emptyDrops, which you can try for more fine-grained whitelisting after quantification, using the alevin quants.
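
A minimal sketch of that exploratory step, assuming raw_cb_frequency.txt is written into the alevin folder of the output directory when --dumpFeatures is passed and contains one barcode plus its read count per line (both the location and the layout are assumptions here):

# Re-run the alevin command above with --dumpFeatures added, then inspect the
# per-barcode frequencies (sorting makes the result independent of file order):
sort -k2,2 -nr "[output folder]/alevin/raw_cb_frequency.txt" | head -n 20
# Rank-vs-frequency table for a log-log "knee" plot:
sort -k2,2 -nr "[output folder]/alevin/raw_cb_frequency.txt" \
    | awk '{print NR "\t" $2}' > cb_rank_vs_freq.tsv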

Having said that, if you use expectCells with a bigger value, alevin will start to include more and more cells. However, the frequency of each newly included high-confidence CB drops exponentially with each iteration, and even when a new CB gets merged into a high-confidence barcode, its chance of affecting the quantification also drops. In summary, if you are sure your experiment has more cells, it's ideal to increase the value; otherwise, I think, with an increased expectCells value the quants can potentially be affected, but most probably not by a lot.

Hope it helps!

ChelseaCHENX commented 5 years ago

Hi Avi, thanks for your detailed explanation!

From my understanding: alevin pre-selects high-quality cells based on (1) CB frequency, by finding the knee point (in the initial whitelisting), and (2) other features (in the final/intelligent whitelisting), while cellranger count performs only something like step (1), tied to the --expect-cells number, using an alternative method without knee-point estimation.

Based on the above, the cells newly included by increasing --expectCells are also the ones more likely to be filtered out in later steps by criteria such as a minimum number of features/reads detected per cell. But such filtering may not be desirable if one is also interested in cells with small transcriptomes, such as TILs. I will try some downstream filtering to see how many good cells I can get.

Yeah it helps - thanks!

k3yavi commented 4 years ago

I hope this issue is resolved; closing due to inactivity. Feel free to reopen in case you have any other questions.