Parameter settings for ECCITE-seq like approaches

COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

https://combine-lab.github.io/salmon

GNU General Public License v3.0

768 stars 161 forks source link

Parameter settings for ECCITE-seq like approaches #531

Open rfarouni opened 4 years ago

rfarouni commented 4 years ago

Hi, I would like quantify guide-RNAs (based on 5'-tagged scRNAseq 10X feature barcoding) using Alevin. Read 1 is 26bps long (16 CB +10 UMI) and Read 2 is 58bps long (19 constant region + 21 guide sequence). Now, when I use the following settings

salmon alevin -l ISR --barcodeLength 16 --umiLength 10 --end 5 --featureStart 19 --featureLength 21

I get this error

Transcript to Gene Map File not provided.

However, when I use the following instead

salmon alevin -l ISR --citeseq --featureStart 19 --featureLength 21

It works but since --citeseq assumes --umiLength=12, I get the following output

`[2020-06-03 13:53:30.298] [alevinLog] [info] set CITE-seq minScoreFraction parameter to : 0.797619

[2020-06-03 13:53:30.298] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-03 13:53:30.298] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-03 13:53:30.298] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-03 13:53:30.304] [alevinLog] [info] Processing barcodes files (if Present)

processed 52 Million barcodes

[2020-06-03 13:54:43.733] [alevinLog] [info] Done barcode density calculation. [2020-06-03 13:54:43.733] [alevinLog] [info] # Barcodes Used: 52200250 / 52200250. [2020-06-03 13:54:43.826] [alevinLog] [info] Forcing to use 100000 cells [2020-06-03 13:54:43.964] [alevinLog] [info] Throwing 49909 barcodes with < 10 reads [2020-06-03 13:54:43.984] [alevinLog] [info] Total 50092(has 201 low confidence) barcodes [2020-06-03 13:54:44.191] [alevinLog] [info] Done True Barcode Sampling [2020-06-03 13:54:44.285] [alevinLog] [info] Total 1.70493% reads will be thrown away because of noisy Cellular barcodes. [2020-06-03 13:54:45.790] [alevinLog] [info] Done populating Z matrix [2020-06-03 13:54:45.790] [alevinLog] [info] Total 0 CB got sequence corrected [2020-06-03 13:54:45.790] [alevinLog] [info] Done indexing Barcodes [2020-06-03 13:54:45.790] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-03 13:54:45.790] [alevinLog] [info] Used Barcodes except Whitelist: 0 [2020-06-03 13:54:46.493] [jointLog] [info] There is 1 library.

[2020-06-03 13:54:46.551] [jointLog] [info] Loading pufferfish index [2020-06-03 13:54:46.551] [jointLog] [info] Loading dense pufferfish index. [2020-06-03 13:54:46.552] [jointLog] [info] done [2020-06-03 13:54:46.552] [jointLog] [info] Index contained 64 targets [2020-06-03 13:54:46.552] [jointLog] [info] Number of decoys : 0

[2020-06-03 13:54:46.493] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

processed 52 Million fragmentsvinLog] [info] parsing read library format hits: 0, hits per frag: 0

[2020-06-03 13:55:42.905] [alevinLog] [info] Starting optimizer

[2020-06-03 13:55:42.931] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-03 13:55:42.931] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-03 13:55:42.933] [alevinLog] [info] Total 0.00 UMI after deduplicating. [2020-06-03 13:55:42.933] [alevinLog] [info] Total 0 BiDirected Edges. [2020-06-03 13:55:42.933] [alevinLog] [info] Total 0 UniDirected Edges. [2020-06-03 13:55:42.933] [alevinLog] [warning] Skipped 50091 barcodes due to No mapped read [2020-06-03 13:55:42.934] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-03 13:55:42.940] [alevinLog] [warning] Num Low confidence barcodes too less 1 < 200.Can't performing whitelisting; Skipping [2020-06-03 13:55:42.940] [alevinLog] [info] Finished optimizer `

I also tried

salmon alevin -l ISR --chromium --featureStart 19 --featureLength 21 --tgMap guide_to_gene.tsv

But I get the following output

` [2020-06-03 13:47:17.330] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-03 13:47:17.330] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-03 13:47:17.330] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-03 13:47:17.336] [alevinLog] [info] Processing barcodes files (if Present)

processed 52 Million barcodes

[2020-06-03 13:48:30.047] [alevinLog] [info] Done barcode density calculation. [2020-06-03 13:48:30.047] [alevinLog] [info] # Barcodes Used: 52200250 / 52200250. [2020-06-03 13:48:33.285] [alevinLog] [info] Knee found left boundary at 1174 [2020-06-03 13:48:34.501] [alevinLog] [info] Gauss Corrected Boundary at 148 [2020-06-03 13:48:34.501] [alevinLog] [info] Learned InvCov: 985.935 normfactor: 763.254 [2020-06-03 13:48:34.501] [alevinLog] [info] Total 349(has 201 low confidence) barcodes [2020-06-03 13:48:35.369] [alevinLog] [info] Done True Barcode Sampling [2020-06-03 13:48:35.441] [alevinLog] [warning] Total 73.3629% reads will be thrown away because of noisy Cellular barcodes. [2020-06-03 13:48:35.454] [alevinLog] [info] Done populating Z matrix [2020-06-03 13:48:35.455] [alevinLog] [info] Total 4286 CB got sequence corrected [2020-06-03 13:48:35.455] [alevinLog] [info] Done indexing Barcodes [2020-06-03 13:48:35.455] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-03 13:48:35.455] [alevinLog] [info] Used Barcodes except Whitelist: 4282 [2020-06-03 13:48:35.558] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify ... processed 52 Million fragments hits: 0, hits per frag: 0

[2020-06-03 13:49:37.892] [jointLog] [info] Computed 0 rich equivalence classes for further processing [2020-06-03 13:49:37.892] [jointLog] [info] Counted 0 total reads in the equivalence classes [2020-06-03 13:49:37.893] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0 [2020-06-03 13:49:37.893] [jointLog] [warning] Found 370 reads with N in the UMI sequence and ignored the reads. Please report on github if this number is too large [2020-06-03 13:49:37.893] [jointLog] [info] Mapping rate = 0%

[2020-06-03 13:49:37.893] [jointLog] [info] finished quantifyLibrary() [2020-06-03 13:49:37.899] [alevinLog] [info] Starting optimizer

[2020-06-03 13:49:38.613] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-03 13:49:38.613] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-03 13:49:38.614] [alevinLog] [info] Total 0.00 UMI after deduplicating. [2020-06-03 13:49:38.614] [alevinLog] [info] Total 0 BiDirected Edges. [2020-06-03 13:49:38.614] [alevinLog] [info] Total 0 UniDirected Edges. [2020-06-03 13:49:38.614] [alevinLog] [warning] Skipped 348 barcodes due to No mapped read [2020-06-03 13:49:38.614] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-03 13:49:38.620] [alevinLog] [warning] Num Low confidence barcodes too less 1 < 200.Can't performing whitelisting; Skipping [2020-06-03 13:49:38.620] [alevinLog] [info] Finished optimizer Floating point exception (core dumped) `

Any suggestions on how to get this working are highly appreciated!

Thanks

k3yavi commented 4 years ago

Hi @rfarouni ,

Thanks a lot for raising the issue. It looks like a corner case with the custom barcode length and I'd have to push a hot-fix for it. Basically, it's failing in the initial sanity check stage where it assumes we can provide only one single-cell protocol type. Give me like half an hour to make the changes and I'll push the fix to the develop. If you can compile salmon from source that's great, otherwise I can also forward a linux portable binary.

k3yavi commented 4 years ago

Oh one more thing, is it possible to share a few hundred reads for your experiment, just for some unit testing on my side?

k3yavi commented 4 years ago

https://github.com/COMBINE-lab/salmon/commit/d6abefd699d8c4e34ade5f01ea7e4686d28c05ff this should fix it .

k3yavi commented 4 years ago

Seems to work on some of the test at my end, let me know if its still a problem. The command to use would be

salmon alevin -l ISR --citeseq --barcodeLength 16 --umiLength 10 --end 5 --featureStart 19 --featureLength 21

One thing to note, since it's a 5' protocol, you might have to change -lISR to -lISF since the 5` protocol expects the single-cell reads from the forward strand, unlike 3' where we expect the reads from reverse. It should not be a problem for the guide/feature barcodes though.

rfarouni commented 4 years ago

Thanks @k3yavi !If you can forward me a Linux portable binary that would be great. Whenever I try to compile something on my computer, I fail half of the time . I have Ubuntu 18.04. I will ask permission to share with you part of the data and get back to you. Also, does Alevin use 10x cell barcode whitelist internally to correct barcodes? And do you recommend using the --naiveEqclass option when there are only 64 guide sequences as features?

k3yavi commented 4 years ago

Hi @rfarouni ,

Please use this link to download a linux compatible binary, the fix will be available by default with the next release .

Also, does Alevin use 10x cell barcode whitelist internally to correct barcodes?

In our experiments, we find that, in expectation, the 10x generated experiments are clean enough that we don't need the 10x whitelisted barcode to be explicitly specified or used.

And do you recommend using the --naiveEqclass only 64 guide sequences as features ?

That's a very good question. Basically the answer lies in how complicated the UMI graph network is. Experiment with the antibody derived barcodes (ADT) with 20 protein panel, generally, doesn't need the --naiveEqclass mode UMI deduplication, unless the experiment is super deeply sequenced. However, for super low diversity like 4-8 barcodes e.g. for HTO like sample barcodes, the graphical network becomes exponentially hard to solve and significantly increases the running time for alevin.

In general, I'd recommend if you expect very low diversity in the number of barcodes in your experiment, use --naiveEqclass otherwise prefer avoiding it. Generally, the experiment with low diversity barcodes results in such a highly dense count matrix that a few error in UMI deduplication won't matter and you can tradeoff extra long running time with reasonable under/over UMI deduplicated counts.

In short, 64 guide sequences are relatively high diversity and I'd advise skipping --naiveEqclass in your command line argument.

Hope it helps !

rfarouni commented 4 years ago

Hi Avi,

Thanks for the detailed reply. I was able to run it (see logs below), but I had to use ISR, not ISF to get it to work. I also had to add these two settings as well --freqThreshold 1 --lowRegionMinNumBarcodes 100.

I am not sure why the ISF option didn't work, but probably it has something to do with the guide sequences I was provided. At any rate, I have a few other questions I hope you can help me answer.

Why does increasing --maxNumBarcodes to 200000 results in no barcodes getting corrected? (See log for Run 2 below). What is the rationale for the current default of 100000?
For the downstream analysis of such data, I usually work with both the read and UMI counts, but quants_mat.gz only contains the UMI counts. Can Alevin a produce a matrix of read counts as well. It would be a great feature to add. For now, what is easiest way to get the cell x feature matrix of read counts if I use the --dumpEq or --dumpBfh flags? Can tximport be used for this or do I need to use the Python parser first?

I will be sending you some reads from the experiments for unit testing shortly. Thanks!

Run 1: salmon alevin -l ISR --citeseq --barcodeLength 16 --umiLength 10 --end 5 --featureStart 19 --featureLength 21 --maxNumBarcodes 100000 --freqThreshold 1 --lowRegionMinNumBarcodes 100

20-06-04 12:24:47.610] [alevinLog] [info] set CITE-seq minScoreFraction parameter to : 0.797619 [2020-06-04 12:24:47.610] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-04 12:24:47.610] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-04 12:24:47.610] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-04 12:24:47.616] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-04 12:26:04.322] [alevinLog] [info] Done barcode density calculation. [2020-06-04 12:26:04.322] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-04 12:26:04.435] [alevinLog] [info] Forcing to use 100000 cells [2020-06-04 12:26:05.171] [alevinLog] [info] Throwing 0 barcodes with < 1 reads [2020-06-04 12:26:05.872] [alevinLog] [info] Total [32m95377[0m(has [32m101[0m low confidence) barcodes [2020-06-04 12:26:06.746] [alevinLog] [info] Done True Barcode Sampling [2020-06-04 12:26:06.880] [alevinLog] [info] Total 1.2299% reads will be thrown away because of noisy Cellular barcodes. [2020-06-04 12:26:10.886] [alevinLog] [info] Done populating Z matrix [2020-06-04 12:26:10.924] [alevinLog] [info] Total 118774 CB got sequence corrected [2020-06-04 12:26:10.936] [alevinLog] [info] Done indexing Barcodes [2020-06-04 12:26:10.936] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-04 12:26:10.936] [alevinLog] [info] Used Barcodes except Whitelist: 88156 [2020-06-04 12:26:11.113] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-04 12:26:11.113] [alevinLog] [info] parsing read library format [2020-06-04 12:27:21.373] [alevinLog] [info] Starting optimizer

[2020-06-04 12:27:22.086] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-04 12:27:22.086] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-04 12:27:22.409] [alevinLog] [info] Total 23937.00 UMI after deduplicating. [2020-06-04 12:27:22.409] [alevinLog] [info] Total 91 BiDirected Edges. [2020-06-04 12:27:22.409] [alevinLog] [info] Total 82 UniDirected Edges. [2020-06-04 12:27:22.409] [alevinLog] [warning] Skipped 82268 barcodes due to No mapped read [2020-06-04 12:27:22.412] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-04 12:27:22.418] [alevinLog] [warning] Num Low confidence barcodes too less 1 < 100.Can't performing whitelisting; Skipping [2020-06-04 12:27:22.418] [alevinLog] [info] Finished optimizer

Run 2: salmon alevin -l ISR --citeseq --barcodeLength 16 --umiLength 10 --end 5 --featureStart 19 --featureLength 21 --maxNumBarcodes 200000 --freqThreshold 1 --lowRegionMinNumBarcodes 100

[2020-06-04 12:40:45.455] [alevinLog] [info] set CITE-seq minScoreFraction parameter to : 0.797619 [2020-06-04 12:40:45.456] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-04 12:40:45.456] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-04 12:40:45.456] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-04 12:40:45.461] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-04 12:42:01.202] [alevinLog] [info] Done barcode density calculation. [2020-06-04 12:42:01.202] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-04 12:42:01.300] [alevinLog] [info] Forcing to use 200000 cells [2020-06-04 12:42:02.037] [alevinLog] [info] Throwing 0 barcodes with < 1 reads [2020-06-04 12:42:02.738] [alevinLog] [info] Total [32m197328[0m(has [32m101[0m low confidence) barcodes [2020-06-04 12:42:03.656] [alevinLog] [info] Done True Barcode Sampling [2020-06-04 12:42:03.830] [alevinLog] [info] Total 0.780192% reads will be thrown away because of noisy Cellular barcodes. [2020-06-04 12:42:13.353] [alevinLog] [info] Done populating Z matrix [2020-06-04 12:42:13.353] [alevinLog] [info] Total 0 CB got sequence corrected [2020-06-04 12:42:13.353] [alevinLog] [info] Done indexing Barcodes [2020-06-04 12:42:13.353] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-04 12:42:13.353] [alevinLog] [info] Used Barcodes except Whitelist: 0 [2020-06-04 12:42:13.555] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-04 12:42:13.556] [alevinLog] [info] parsing read library format [2020-06-04 12:43:22.789] [alevinLog] [info] Starting optimizer

[2020-06-04 12:43:23.499] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-04 12:43:23.499] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-04 12:43:23.835] [alevinLog] [info] Total 24009.00 UMI after deduplicating. [2020-06-04 12:43:23.835] [alevinLog] [info] Total 89 BiDirected Edges. [2020-06-04 12:43:23.835] [alevinLog] [info] Total 82 UniDirected Edges. [2020-06-04 12:43:23.835] [alevinLog] [warning] Skipped 184123 barcodes due to No mapped read [2020-06-04 12:43:23.840] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-04 12:43:23.846] [alevinLog] [warning] Num Low confidence barcodes too less 1 < 100.Can't performing whitelisting; Skipping [2020-06-04 12:43:23.846] [alevinLog] [info] Finished optimizer

rob-p commented 4 years ago

Hi @rfarouni,

I'll let @k3yavi do the heavy lifting on your questions above, but I was curious the purpose for which you might want to use a read count matrix? Is there some particular type of analysis where you would be interested in the non-de-duplicated fragment counts for each gene?

Thanks! Rob

rfarouni commented 4 years ago

Hi Rob,

The joint distribution of the read and UMI counts can contain important information. The majority of observations (CB + guide combination) lie along a well defined experiment-specific mean trend whose slope is given by the coverage ( ratio of reads to UMIs). Also the same regularity can be observed when aggregating across the cell barcodes. See figure below. The points below the black horizontal line are cells with less than 100 reads

At the guide level, it would look like this

In general, I often find myself needing to work with read counts. For example, the read counts can be used to estimate the hopping rate and detect hopped reads in multiplexed scRNAseq data as we show in this recent paper https://www.nature.com/articles/s41467-020-16522-z

Rick,

rob-p commented 4 years ago

Thanks for the detailed answer, Rick!

I just saw that paper pop up yesterday and it was on my reading list :). Internally, we have access to the number of occurrences of each UMI, gene pair within each barcode, so I do not think it would be too difficult to to provide read counts (optionally) along with deduplicated counts (though @k3yavi would be best equipped to say how easy or difficult this would be from the implementation perspective).

Best, Rob

rfarouni commented 4 years ago

You are welcome!

k3yavi commented 4 years ago

Hi @rfarouni ,

Thanks for the detailed answer.

I am not sure why the ISF option didn't work, but probably it has something to do with the guide sequences I was provided. At any rate, I have a few other questions I hope you can help me answer.

I'm not sure about this, it's possible if the guide sequences were already reverse-complemented then the above behavior would makes sense. I am a little less familiar with the guideRNA based ECCITE-seq data, although the mRNA library should be 5' and the sequence does come from forward strand but do we expect the guide RNA to be on the forward strand as well ? Unclear . I'll ask around at nygc and would let you know.

Why does increasing --maxNumBarcodes to 200000 results in no barcodes getting corrected? (See log for Run 2 below). What is the rationale for the current default of 100000?

That's again a great question. In short single-cell world is expanding rapidly and alevin was initially designed to work with 10x 3' data and some of the restriction are outdated with combinatorial indexing based multiplexed experiments. To be honest, 100k was just a random high enough number that was put down to throw away the obvious junk data. Having said that, you would notice that in both the logs you attached a significant fraction of barcodes are thrown away i.e., Skipped 82268 barcodes due to No mapped read, which is like ~82% of the 100k barcodes. Even if you include the 200k almost everything was thrown away, [warning] Skipped 184123 barcodes due to No mapped read. Although your point is important one why things are not getting sequence corrected with 200k, unfortunately I might have to do some more testing on that front to give more precise answer but in your case I'd advise keeping the default 100k bound, unless you are doing combinatorially-indexed data .

For the downstream analysis of such data, I usually work with both the read and UMI counts, but quants_mat.gz only contains the UMI counts. Can Alevin a produce a matrix of read counts as well. It would be a great feature to add. For now, what is easiest way to get the cell x feature matrix of read counts if I use the --dumpEq or --dumpBfh flags? Can tximport be used for this or do I need to use the Python parser first?

Congratulations on the awesome paper :). We were actually discussing yesterday about your paper and potentially modifying alevin to include model for correcting index-hoping, although it's still in discussion phase. To answer your question, thanks for the feature request, I can add that feature on the weekend if it's urgent. However, you can also generate that with the current version using the --dumpBfh flag and bfh.txt file. Unfortunately, since it's the first use case where we need the read-count matrix, we haven't included the support in tximport yet. However, like you said one can use the read_bfh function to generate the cell-v-gene read count matrix in python. It does adds one additional layer of complexity though.

Let me know your thoughts.

rfarouni commented 4 years ago

Hi Avi,

Yes I just asked and the guide sequences were reverse complemented.

I was looking through the results and comparing it with the output of another alignment software. I noticed that there are substantially fewer UMI per guide (in cell) throughout ( see figures for comparison).

Also, the number of UMIs per cell barcode is consistently lower and there is around 796 barcodes that are not found in the 10X whitelist, the majority of which tend to have 1 UMI count only. Here is tally, where the TRUE column indicates the barcode is found in the whitelist. The row names indicate the total number of UMIs

It would be great if you can implement the index hopping correction in Alevin. The software we have works fine if the number of samples is not too large. If had known how to code in C++, I would have implemented part of the code more efficiently using Rcpp. Please let me know if you ever decide to add this feature to Salmon. I am more than happy to help.

Rick,

k3yavi commented 4 years ago

Hi @rfarouni ,

Is it possible to visualize the above two plots on the same scale ? Regarding the few cells not from 10x whitelist, I should have been more clear last time. Basically, what I meant earlier when I said that 10x data is clean is that we do observe some cells from the non whitelist file but they have very few UMI and we discard them anyway.

I am guessing here your motivation is a bit different i.e. considering very low confidence (even with 1 UMI) barcodes, while generally we discard anything below 10 as noise.

Thanks a lot for offering to help with index-hopping idea. I agree, it'd be great to include the model in the alevin framework. Currently I just got the gist of your paper, let us go through the paper in a bit more detail and we'll get back to you as soon as we have some free cycles for the integration.

rfarouni commented 4 years ago

Hi Avi,

Yes once I remove cell barcodes with 1 UMI counts I am left only with 12 or so. For the vast majority of uses I don't think that is an issue. It seems to me that the default setting discards barcodes with less than 10 reads, not UMIs. To make the plot more comparable I discarded barcodes with 1 UMI only in both datasets and observations with low coverage in the existing data. Here is the plot

The histogram is in the log10 scale so it might not be readily apparent but there is large difference in total UMI counts: 14k (alevin) vs 126K (Existing),

I will inspect this further to see what can give rise to such a large difference

Thanks!

k3yavi commented 4 years ago

Oh wow 14k v 126k is indeed a big difference, is it possible to share the Alevin log for your run ? From the logs you attached it's not clear what's the mapping rate. May I also ask to look at another log file inside the logs folder, called salmon_quant.log. that would have more information regarding the mapping rate.

rfarouni commented 4 years ago

Here you go

[2020-06-04 17:55:11.700] [alevinLog] [info] set CITE-seq minScoreFraction parameter to : 0.797619 [2020-06-04 17:55:11.700] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-04 17:55:11.700] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-04 17:55:11.700] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-04 17:55:11.706] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-04 17:56:25.135] [alevinLog] [info] Done barcode density calculation. [2020-06-04 17:56:25.135] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-04 17:56:25.229] [alevinLog] [info] Forcing to use 100000 cells [2020-06-04 17:56:25.368] [alevinLog] [info] Throwing 0 barcodes with < 1 reads [2020-06-04 17:56:25.388] [alevinLog] [info] Total [32m95377[0m(has [32m11[0m low confidence) barcodes [2020-06-04 17:56:25.577] [alevinLog] [info] Done True Barcode Sampling [2020-06-04 17:56:25.698] [alevinLog] [info] Total 1.2299% reads will be thrown away because of noisy Cellular barcodes. [2020-06-04 17:56:29.508] [alevinLog] [info] Done populating Z matrix [2020-06-04 17:56:29.545] [alevinLog] [info] Total 118774 CB got sequence corrected [2020-06-04 17:56:29.557] [alevinLog] [info] Done indexing Barcodes [2020-06-04 17:56:29.557] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-04 17:56:29.557] [alevinLog] [info] Used Barcodes except Whitelist: 88156 [2020-06-04 17:56:30.294] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-04 17:56:30.294] [alevinLog] [info] parsing read library format [2020-06-04 17:57:36.339] [alevinLog] [info] Starting optimizer

[2020-06-04 17:57:37.051] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-04 17:57:37.051] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-04 17:57:37.338] [alevinLog] [info] Total 23937.00 UMI after deduplicating. [2020-06-04 17:57:37.338] [alevinLog] [info] Total 91 BiDirected Edges. [2020-06-04 17:57:37.338] [alevinLog] [info] Total 82 UniDirected Edges. [2020-06-04 17:57:37.338] [alevinLog] [warning] Skipped 82268 barcodes due to No mapped read [2020-06-04 17:57:37.341] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-04 17:57:37.348] [alevinLog] [warning] Num Low confidence barcodes too less 1 < 10.Can't performing whitelisting; Skipping [2020-06-04 17:57:37.348] [alevinLog] [info] Finished optimizer

salmon_quant.log

[2020-06-04 17:55:11.700] [jointLog] [info] setting maxHashResizeThreads to 7 [2020-06-04 17:55:11.700] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored. [2020-06-04 17:55:11.700] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation. [2020-06-04 17:55:11.700] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65 [2020-06-04 17:55:11.700] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter. Disabling range-factorized equivalence classes. [2020-06-04 17:55:11.700] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35. [2020-06-04 17:55:11.700] [jointLog] [info] Using default value of 0.797619 for minScoreFraction in Alevin Using default value of 0.6 for consensusSlack in Alevin [2020-06-04 17:56:30.294] [jointLog] [info] There is 1 library. [2020-06-04 17:56:30.355] [jointLog] [info] Loading pufferfish index [2020-06-04 17:56:30.355] [jointLog] [info] Loading dense pufferfish index. [2020-06-04 17:56:30.355] [jointLog] [info] done [2020-06-04 17:56:30.355] [jointLog] [info] Index contained 64 targets [2020-06-04 17:56:30.355] [jointLog] [info] Number of decoys : 0 [2020-06-04 17:57:36.305] [jointLog] [info] Computed 64 rich equivalence classes for further processing [2020-06-04 17:57:36.305] [jointLog] [info] Counted 39,818 total reads in the equivalence classes [2020-06-04 17:57:36.305] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0 [2020-06-04 17:57:36.305] [jointLog] [warning] Found 1354 reads with N in the UMI sequence and ignored the reads. Please report on github if this number is too large [2020-06-04 17:57:36.305] [jointLog] [info] Mapping rate = 0.0762793%

[2020-06-04 17:57:36.305] [jointLog] [info] finished quantifyLibrary()

k3yavi commented 4 years ago

Thanks ! Aha, so indeed the mapping rate is super low, that explains it. Mapping rate = 0.0762793%

k3yavi commented 4 years ago

Lemme work with the reads you forwarded, is it possible to share the guide sequence as well ? Otherwise I won't be able to check the mapping rate.

k3yavi commented 4 years ago

I wonder if the max 1-edit distance restriction is too stringent for 21 length barcodes. One important flag to play with is the --minScoreFraction. The basic rule to set that is define here. The gist being say if we wan't max k-edit we allow all the reads above the following threshold score (as in the log ):

[2020-06-04 17:55:11.700] [alevinLog] [info] set CITE-seq minScoreFraction parameter to : 0.797619

i.e. we use the equation (max_score + edit_cost) - 0.5) / max_score where max_score = 2 length of barcode = 2 21 = 42, and edit_cost= min( k * (mismatch - match), k * (go + ge - match); mismatch penalty = -4 match = 2 go gap open penalty = -4 ge gap extend penalty = -2.

For k=1, we had edit_cost = 8 leading to automatic setting of minScoreFraction of 0.797619. we have looked at 15 length barcodes, but it's possible longer barcodes might have more sequencing error. Let's try allowing more edits i.e. k=2, by setting --minScoreFraction 0.607 and see if it improves the mapping rate.

rfarouni commented 4 years ago

I will be trying your suggestion out. I might be able to share with you a toy dataset with a fewer number of guides. I will update you as soon as I get it.

k3yavi commented 4 years ago

Thanks @rfarouni ! A small dataset with few thousand reads would be great to have, the one I currently had was too few to test things on.

rfarouni commented 4 years ago

With --minScoreFraction 0.607 I get a way much better mapping rate. I wonder if there is way to determine the optimal value empirically?

[2020-06-05 13:08:20.990] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-05 13:08:20.990] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-05 13:08:20.990] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-05 13:08:20.996] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-05 13:09:43.478] [alevinLog] [info] Done barcode density calculation. [2020-06-05 13:09:43.478] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-05 13:09:43.576] [alevinLog] [info] Forcing to use 100000 cells [2020-06-05 13:09:43.653] [alevinLog] [info] Throwing 0 barcodes with < 1 reads [2020-06-05 13:09:43.673] [alevinLog] [info] Total [32m95377[0m(has [32m11[0m low confidence) barcodes [2020-06-05 13:09:43.875] [alevinLog] [info] Done True Barcode Sampling [2020-06-05 13:09:44.027] [alevinLog] [info] Total 1.2299% reads will be thrown away because of noisy Cellular barcodes. [2020-06-05 13:09:48.338] [alevinLog] [info] Done populating Z matrix [2020-06-05 13:09:48.376] [alevinLog] [info] Total 118774 CB got sequence corrected [2020-06-05 13:09:48.389] [alevinLog] [info] Done indexing Barcodes [2020-06-05 13:09:48.389] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-05 13:09:48.389] [alevinLog] [info] Used Barcodes except Whitelist: 88156 [2020-06-05 13:09:49.130] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-05 13:09:49.132] [alevinLog] [info] parsing read library format [2020-06-05 13:11:01.670] [alevinLog] [info] Starting optimizer

[2020-06-05 13:11:02.377] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-05 13:11:02.377] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-05 13:11:04.408] [alevinLog] [info] Total 322945.00 UMI after deduplicating. [2020-06-05 13:11:04.408] [alevinLog] [info] Total 15972 BiDirected Edges. [2020-06-05 13:11:04.408] [alevinLog] [info] Total 176951 UniDirected Edges. [2020-06-05 13:11:04.408] [alevinLog] [warning] Skipped 12046 barcodes due to No mapped read [2020-06-05 13:11:04.415] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-05 13:11:04.455] [alevinLog] [warning] Num Low confidence barcodes too less 8 < 10.Can't performing whitelisting; Skipping [2020-06-05 13:11:04.455] [alevinLog] [info] Finished optimizer

rfarouni commented 4 years ago

But now there are a lot of barcodes that are not in the whitelist

Also with the default setting of --freqThreshold, no CB correction gets done

[2020-06-05 13:37:57.300] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-05 13:37:57.300] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-05 13:37:57.300] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-05 13:37:57.306] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-05 13:39:18.516] [alevinLog] [info] Done barcode density calculation. [2020-06-05 13:39:18.516] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-05 13:39:18.623] [alevinLog] [info] Forcing to use 100000 cells [2020-06-05 13:39:19.364] [alevinLog] [info] Throwing 49909 barcodes with < 10 reads [2020-06-05 13:39:20.065] [alevinLog] [info] Total [32m50092[0m(has [32m201[0m low confidence) barcodes [2020-06-05 13:39:20.928] [alevinLog] [info] Done True Barcode Sampling [2020-06-05 13:39:21.057] [alevinLog] [info] Total 1.70493% reads will be thrown away because of noisy Cellular barcodes. [2020-06-05 13:39:23.175] [alevinLog] [info] Done populating Z matrix [2020-06-05 13:39:23.175] [alevinLog] [info] Total 0 CB got sequence corrected [2020-06-05 13:39:23.175] [alevinLog] [info] Done indexing Barcodes [2020-06-05 13:39:23.175] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-05 13:39:23.175] [alevinLog] [info] Used Barcodes except Whitelist: 0 [2020-06-05 13:39:23.278] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-05 13:39:23.278] [alevinLog] [info] parsing read library format [2020-06-05 13:40:35.769] [alevinLog] [info] Starting optimizer

[2020-06-05 13:40:36.476] [alevinLog] [warning] mrna file not provided; using is 1 less feature for whitelisting [2020-06-05 13:40:36.476] [alevinLog] [warning] rrna file not provided; using is 1 less feature for whitelisting [2020-06-05 13:40:37.933] [alevinLog] [info] Total 227279.00 UMI after deduplicating. [2020-06-05 13:40:37.933] [alevinLog] [info] Total 14712 BiDirected Edges. [2020-06-05 13:40:37.933] [alevinLog] [info] Total 173086 UniDirected Edges. [2020-06-05 13:40:37.933] [alevinLog] [warning] Skipped 5326 barcodes due to No mapped read [2020-06-05 13:40:37.936] [alevinLog] [info] Clearing EqMap; Might take some time. [2020-06-05 13:40:37.962] [alevinLog] [warning] Num Low confidence barcodes too less 165 < 200.Can't performing whitelisting; Skipping [2020-06-05 13:40:37.962] [alevinLog] [info] Finished optimizer

k3yavi commented 4 years ago

Thanks @rfarouni for the updates.

With --minScoreFraction 0.607 I get a way much better mapping rate. I wonder if there is way to determine the optimal value empirically?

Glad to hear that, may I ask what percent of the reads are mapping now ? It's not clear from the alevin logs you shared but I think the total number of deduplicated UMIs are similar to your baseline experiment. I think defining an optimal empirical threshold is a great idea but the issue is that 21 length barcodes are kind of in the middle i.e. a tad longer than the regular barcodes and somewhat smaller than a full read. The full read alignment process indeed allows more erroneous reads to map but 21 is a bit too short to work with. @rob-p might have more thoughts on this one.

But now there are a lot of barcodes that are not in the whitelist

Thanks again for checking this, it is indeed concerning. However, as I was mentioning earlier in a regular single-cell experiment we end up throwing away almost all of these very low frequency count cellular barcodes. I'd say even 45 reads CBs are most probably a noise and will be filtered away, because only a fraction of the reads will map and after deduplication it'll result in significantly low count in 1 cellular barcode.

Also with the default setting of --freqThreshold, no CB correction gets done

I can check why is this happening, let me know once you have a toy dataset to play with.

rfarouni commented 4 years ago

Mapping rate = 73.2315%. The row numbers indicate UMI counts. Why would a cell barcode with 45 UMIs be considered noise in this context? If we are dealing with expression data, then I can see why, but not when we have so few features.

k3yavi commented 4 years ago

Yes, absolutely, above I meant in scRNA-seq context, my apologies if it was not clear. Here, you are right we might have to think of ways to provide whitelist cellular barcode. One another thing you can try is providing the full 10x expected whitelisted cellular barcodes to alevin through --whitelist command.

rfarouni commented 4 years ago

I see. I will try providing the whitelist and see what happens. Once I get my hands on the toy dataset, I will share it with you as well :)

rfarouni commented 4 years ago

When I add the whitelist using --whitelist command I get the following error

[2020-06-09 09:01:07.401] [alevinLog] [error] Wrong whitelist provided Please check https://salmon.readthedocs.io/en/develop/alevin.html#whitelist

k3yavi commented 4 years ago

Oh man, too many sanity checks over the years, can you just remove one cellular barcode from the full list and try again?

Basically, many people have confused this flag by providing the full 10x whitelist without knowing the consequences, that's why the warning. Here our use case is specific and it should not matter.

rob-p commented 4 years ago

@k3yavi , we should add like a —force-barcodes flag for this.

rfarouni commented 4 years ago

Thanks! Worked with a Mapping rate = 73.4157%. See log below. However, I only get half the number of mapped reads per cell-feature. I still need to examine the existing alignment to understand why

[2020-06-09 12:31:05.494] [alevinLog] [info] Found 64 transcripts(+0 decoys, +0 short and +0 duplicate names in the index) [2020-06-09 12:31:05.494] [alevinLog] [info] Filled with 64 txp to gene entries [2020-06-09 12:31:05.494] [alevinLog] [info] Found all transcripts to gene mappings [2020-06-09 12:31:05.499] [alevinLog] [info] Processing barcodes files (if Present)

[2020-06-09 12:32:20.000] [alevinLog] [info] Done barcode density calculation. [2020-06-09 12:32:20.000] [alevinLog] [info] # Barcodes Used: [32m52200250[0m / [31m52200250[0m. [2020-06-09 12:32:20.285] [alevinLog] [info] Done importing white-list Barcodes [2020-06-09 12:32:20.423] [alevinLog] [warning] Skipping 672237 Barcodes as no read was mapped [2020-06-09 12:32:20.578] [alevinLog] [info] Total 65042 white-listed Barcodes [2020-06-09 12:32:20.578] [alevinLog] [info] Sorting and dumping raw barcodes [2020-06-09 12:32:21.060] [alevinLog] [info] Total 5.06742% reads will be thrown away because of noisy Cellular barcodes. [2020-06-09 12:32:23.856] [alevinLog] [info] Done populating Z matrix [2020-06-09 12:32:23.882] [alevinLog] [info] Total 79207 CB got sequence corrected

[2020-06-09 12:32:23.893] [alevinLog] [info] Done indexing Barcodes [2020-06-09 12:32:23.893] [alevinLog] [info] Total Unique barcodes found: 604589 [2020-06-09 12:32:23.893] [alevinLog] [info] Used Barcodes except Whitelist: 71340 [2020-06-09 12:32:24.004] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify

[2020-06-09 12:32:24.004] [alevinLog] [info] parsing read library format [2020-06-09 12:33:33.719] [alevinLog] [info] Starting optimizer

[2020-06-09 12:33:35.712] [alevinLog] [info] Total 161852.00 UMI after deduplicating. [2020-06-09 12:33:35.712] [alevinLog] [info] Total 14936 BiDirected Edges. [2020-06-09 12:33:35.712] [alevinLog] [info] Total 177402 UniDirected Edges. [2020-06-09 12:33:35.712] [alevinLog] [warning] Skipped 12422 barcodes due to No mapped read [2020-06-09 12:33:35.719] [alevinLog] [info] Finished optimizer

rfarouni commented 4 years ago

By the way, does alevin try to correct barcodes that are not in the whitelist but are 1 edit distance from the barcodes in the whitelist?

Rick,

k3yavi commented 4 years ago

Yes, if we provide a whitelist externally then Alevin will try and correct barcodes not in the whitelist and are 1-edit distance from them.