Best way to quantify all CBs with alevin to use with SoupX

COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

https://combine-lab.github.io/salmon

GNU General Public License v3.0


Best way to quantify all CBs with alevin to use with SoupX #538

Open deevdevil88 opened 4 years ago

deevdevil88 commented 4 years ago

Hi! While this has been discussed in detail in #379, there have been many releases of alevin since then, so I am a bit confused. If I want to generate a quant matrix of all CBs, including those in the range of 1-10 reads, for use with SoupX, how do I do this in the most streamlined way?

Will using --freqThreshold 0 --maxNumBarcodes 4294967295 do the trick, or do I also need to use --keepCBFraction 1.0?

Or do I need to do as suggested by @alexvpickering in the issue above: "Run alevin with standard options, then parse raw_cb_frequency.txt for a sample of 1-10UMI CBs and using them as input to --whitelist option for additional run of alevin with --freqThreshold 0 --maxNumBarcodes 4294967295"?

Thanks, Devika
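
For readers following the quoted two-pass suggestion, here is a minimal Python sketch of the "parse raw_cb_frequency.txt" step. It assumes the file has two whitespace-separated columns (barcode, then count), and that the count is what you want to threshold on. The function names, the sample size, and the example barcodes are hypothetical and are not part of alevin.

```python
import random


def parse_cb_frequencies(lines):
    """Parse 'barcode<whitespace>count' lines into a dict (assumed file layout)."""
    freqs = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        freqs[fields[0]] = int(fields[1])
    return freqs


def sample_low_count_barcodes(freqs, low=1, high=10, n=1000, seed=0):
    """Return up to n barcodes whose count lies in [low, high], sorted."""
    candidates = sorted(cb for cb, count in freqs.items() if low <= count <= high)
    if len(candidates) <= n:
        return candidates
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return sorted(rng.sample(candidates, n))


# Toy input standing in for the contents of raw_cb_frequency.txt.
example_lines = ["AAAC\t5000", "AAAG\t7", "AAAT\t1", "AACC\t10", "AACG\t11"]
freqs = parse_cb_frequencies(example_lines)
low_cbs = sample_low_count_barcodes(freqs)
print("\n".join(low_cbs))  # one barcode per line, e.g. to save as a whitelist file
```

With a real run you would read the lines from the file with open(...) and write the selected barcodes, one per line, to a file passed to --whitelist.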

deevdevil88 commented 4 years ago

Hi! We compared the adjusted counts from SoupX when the soup profile was estimated from CBs with 1-100 UMIs versus just 10-100 UMIs per cell, and we actually don't see much difference (R=0.99). So I don't need the --freqThreshold 0 argument, just all barcodes, which I get with --keepCBFraction 1.0. So I was wondering: if I estimate the number of CBs in the soup after removing the CBs in the 0-10 range, and then use that as the value for --maxNumBarcodes, should I then get the CBs I need for SoupX?

k3yavi commented 4 years ago

Hi @deevdevil88 ,

Thanks for your interest in alevin. --keepCBFraction 1.0 should ideally be enough for a general use case to quantify most of the non-junk cellular barcodes; however, you can tweak things using other flags.

--freqThreshold: defaults to 10; it is basically a filter that tosses away CBs with < 10 reads. Change it according to your needs.

--maxNumBarcodes: defaults to 100k barcodes. If, after filtering with the freqThreshold value, there are still more than 100k CBs, alevin quantifies only the top 100k CBs. In a typical use case it's a very liberal threshold and more barcodes aren't needed; however, feel free to change it to any big number in case you want more barcodes, although the usefulness of that can be argued.

Hope it helps!
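
The interaction of the two flags described above can be sketched in a few lines of Python. This only illustrates the behaviour as described in this comment (drop CBs below the read threshold, then keep the most frequent ones up to the cap). It is not alevin's actual implementation, and the function name and toy counts are made up.

```python
def select_barcodes(freqs, freq_threshold=10, max_num_barcodes=100_000):
    """Mimic the described filter: threshold on reads, then keep the top N."""
    # Step 1: toss away barcodes with fewer than freq_threshold reads.
    kept = [(cb, reads) for cb, reads in freqs.items() if reads >= freq_threshold]
    # Step 2: rank by read count (ties broken by barcode for determinism).
    kept.sort(key=lambda item: (-item[1], item[0]))
    # Step 3: cap the number of barcodes that go on to quantification.
    return [cb for cb, _ in kept[:max_num_barcodes]]


toy_freqs = {"A": 500, "B": 12, "C": 10, "D": 9, "E": 1}

default_run = select_barcodes(toy_freqs)                       # D and E are dropped
no_threshold = select_barcodes(toy_freqs, freq_threshold=0)    # everything is kept
capped = select_barcodes(toy_freqs, max_num_barcodes=2)        # only the top two
print(default_run, no_threshold, capped)
```

Counting how many barcodes survive the threshold in a raw frequency table, as in len(select_barcodes(...)), is one way to pick a value for --maxNumBarcodes that does not truncate the list.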

alexvpickering commented 4 years ago

Hi @deevdevil88,

The challenges I faced with this issue made me switch over to kallisto, which has some nice advantages in terms of speed. I didn't see any obvious effects on quality for my samples, although I did have to re-implement some of the auto-detection that alevin and salmon do for you.

I personally observed some strange behaviour with SoupX: visually apparent differences in gene expression between samples that, at the time, I felt were artefacts of the adjustment by SoupX. I eventually rolled my own strategy, where I omitted ambient outlier genes from differential expression. Ambient outliers were defined by taking droplets with UMI counts <10 and using a boxplot in R to define outliers. The osca.bioconductor.org recommendations ended up being very similar. They also describe some of the pitfalls of adjusting counts.

Best of luck! Always appreciative of all the great work and responsiveness of @k3yavi and the team!
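
The ambient-outlier idea above can be sketched roughly as follows. This is one reading of the description, not the code that was actually used. It assumes that per-gene counts are summed over droplets with fewer than 10 UMIs and that a gene is an outlier when its total lies above the upper boxplot fence (Q3 + 1.5 x IQR). Python's statistics.quantiles also computes quartiles slightly differently from R's boxplot hinges. All names and the toy data are hypothetical.

```python
from statistics import quantiles


def ambient_outlier_genes(droplets, max_umis=10, whisker=1.5):
    """Flag genes whose summed count in near-empty droplets is a high outlier.

    droplets: list of dicts mapping gene name -> UMI count for one barcode.
    """
    totals = {}
    for counts in droplets:
        if sum(counts.values()) < max_umis:  # keep only low-UMI (likely empty) droplets
            for gene, n in counts.items():
                totals[gene] = totals.get(gene, 0) + n
    if len(totals) < 2:
        return []  # not enough genes to compute quartiles
    q1, _, q3 = quantiles(list(totals.values()), n=4)
    upper_fence = q3 + whisker * (q3 - q1)
    return sorted(gene for gene, n in totals.items() if n > upper_fence)


toy_droplets = [
    {"AMB": 4, "G1": 1, "G3": 1},
    {"AMB": 4, "G2": 1, "G4": 1, "G6": 1},
    {"AMB": 4, "G3": 1, "G4": 1, "G5": 2, "G6": 1},
    {"G6": 1},
    {"AMB": 100, "G1": 500},  # a real cell: too many UMIs, so it is ignored
]
outliers = ambient_outlier_genes(toy_droplets)
print(outliers)
```

Genes returned by such a function would then be left out of differential expression rather than having their counts adjusted.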

deevdevil88 commented 4 years ago

Thanks for the replies @alexvpickering @k3yavi. As we have a very high ambient RNA and doublet rate, we find that some of our key cell type genes for different neurotransmitters and glial expression are everywhere, and this is to some extent affecting our clustering. While removing the doublets with 3 methods and taking a union of at least 2 to get rid of cells helped, we still need to do something about the background. So far the results seem OK, but we haven't finished all our downstream analyses on the adjusted data to be sure that SoupX isn't doing something weird to the data.

Thanks for the suggestions Avi. As I am no longer worried that I need CBs with UMIs in the range of 1-10 for SoupX, I am using the default --freqThreshold of 10, only playing with --maxNumBarcodes, and using --keepCBFraction 1.

rob-p commented 3 years ago

@deevdevil88,

As an update to this, you can now use the alevin -> alevin-fry pipeline to quantify with different strategies for filtering. If you're using a technology with an external permit list (like 10x Chromium), you can also recover and quantify unfiltered cells, as of version 0.2.0, using the --unfiltered-pl flag.

Best, Rob
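
As a rough illustration of where the --unfiltered-pl flag fits, the sketch below builds (but does not run) an argument list for the permit-list step. Only --unfiltered-pl comes from the comment above. The generate-permit-list subcommand and the --input, --expected-ori and --output-dir options are given as I understand the alevin-fry CLI and should be checked against alevin-fry generate-permit-list --help for your version. The helper name and all paths are hypothetical.

```python
def permit_list_command(rad_dir, out_dir, unfiltered_pl, expected_ori="fw"):
    """Build an argv list for the permit-list step with an external permit list.

    Option names other than --unfiltered-pl are assumptions about the CLI.
    """
    return [
        "alevin-fry", "generate-permit-list",
        "--input", rad_dir,                # directory with the mapping output (assumed option)
        "--expected-ori", expected_ori,    # expected read orientation (assumed option)
        "--output-dir", out_dir,           # where permit-list results go (assumed option)
        "--unfiltered-pl", unfiltered_pl,  # external permit list, e.g. the 10x barcode list
    ]


# Hypothetical paths, for illustration only.
cmd = permit_list_command("sample_map", "sample_quant", "10x_barcodes.txt")
print(" ".join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once the options have been verified against the installed version.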

deevdevil88 commented 3 years ago

@rob-p that's awesome. Thanks for pointing this out! I shall try alevin-fry out.

rob-p commented 3 years ago

@deevdevil88,

Great! Please do let us know if you have any questions or run into any issues when testing it out :).