COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon

Alevin : Normalization of samples with uneven sequencing depth / Batch Correction. #305

Open · juugii opened 5 years ago

juugii commented 5 years ago

Dear salmon team and @k3yavi,

I am a happy user of alevin on scRNA-seq data (10x). In some circumstances, I have to compare samples with very uneven sequencing depths (less than 100k reads per cell vs 1M reads per cell), which naturally produces a huge "batch effect", since some transcripts are impossible to detect at the lower sequencing depth.

For 10x datasets, cellranger offers to compensate for this kind of difference across datasets by subsampling, and I must say it works very well. I tried this strategy before running alevin, using python scripts to subsample the fastq files prior to any alevin quasi-mapping. However, all my attempts failed: all downstream analyses showed that this strategy didn't correct the batch effect at all with alevin (visualization on dimension reduction/UMAP shows a huge difference between batches, while the cellranger subsampling method clearly corrects it).

An important point is that, by default, the cellranger method corrects according to the number of mapped reads per sample. So ideally, I would have to evaluate the difference in terms of mapped reads to compute a subsampling coefficient (fraction of reads to keep). Which alevin metric should I use to estimate such a coefficient?

Considering the probabilistic approach of alevin's transcript quantification, I am wondering if there could be a better way to normalize the samples to account for this kind of issue. Also, any option to subsample the reads during alevin quantification (keeping each read with a certain probability) would be of great help in my case.

Thank you for your work and your time, Best, juugii

k3yavi commented 5 years ago

Hi @juugii, thanks a lot for starting an interesting discussion related to Alevin. If I may, I'd like to ask a couple of questions before going deeper into the issue.

juugii commented 5 years ago

Hi @k3yavi, many thanks for your prompt answer, once again.

> When you say you tried subsampling the Fastq, did you sample randomly across the full Fastq or choose the top X reads?

Yes, I did perform a random subsampling, i.e., keeping each read with probability p while reading the fastq files, p being the subsampling coefficient I mentioned (p ∈ [0, 1]). An implementation of this approach as an option during the transcript quantification would be great. I can provide you with the simple python script I use for the subsampling, but I am not sure it is the proper way to subsample for alevin quantification.
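A minimal sketch of such a subsampling approach (the file names and the helper function below are illustrative placeholders, not the actual script from the thread) could look like this:

```python
# Hypothetical sketch: keep each read pair with probability p while
# streaming two paired gzipped FASTQ files (4 lines per record).
import gzip
import random

def subsample_fastq_pair(r1_in, r2_in, r1_out, r2_out, p, seed=42):
    """Keep each read pair with probability p in [0, 1]."""
    rng = random.Random(seed)
    with gzip.open(r1_in, "rt") as f1, gzip.open(r2_in, "rt") as f2, \
         gzip.open(r1_out, "wt") as o1, gzip.open(r2_out, "wt") as o2:
        while True:
            rec1 = [f1.readline() for _ in range(4)]
            rec2 = [f2.readline() for _ in range(4)]
            if not rec1[0]:  # end of file
                break
            if rng.random() < p:
                o1.writelines(rec1)
                o2.writelines(rec2)

# placeholder file names
subsample_fastq_pair("R1.fastq.gz", "R2.fastq.gz",
                     "R1.sub.fastq.gz", "R2.sub.fastq.gz", p=0.1)
```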

> When you say cellranger subsampling, do you mean the cellranger aggregate pipeline?

Yes, sorry for not stating it clearly. I did use the cellranger aggregate function indeed, which by default subsamples the higher-depth libraries according to their number of mapped reads, if I understand it correctly.

> Use Alevin w/o any modification to the fastq on both of your samples to generate the gene count matrices.

I already did that; in downstream analyses I see a batch effect related to the sequencing depth.

> That's why we recommend using the Seurat package downstream of the Alevin quantified matrices.

I have some experience with downstream analyses in Seurat, Pagoda, Scater, scanpy and a few other tools, and I am aware of batch correction methods like CCA or MNN. But that is not what I am looking for here. I tried both CCA and MNN, but I lose some important information in the resulting eigenspaces or corrected matrix. I believe the proper way to correct my batch effect is to simply fix the difference between my two libraries, i.e. the sequencing depth in this case.

As I explained in my first message, cellranger aggregate (subsampling based on the number of mapped reads) works very well in my case, correcting the effect without any loss or modification of the genes important to our scientific question; CCA and MNN do not. I would like to be able to do the same from the alevin quantifications. So I am looking for a proper way to apply a correction before/during/after the alevin quantification, similar to what cellranger does with STAR. Alternatively, could a subsampling covariate be added to the probabilistic quantification model of alevin (if I understand it well), such that this kind of discrepancy between samples would be corrected?

I did look at the mappedUMI file:

[screenshot of the mappedUMI file]

So one option you would recommend is to simply compute the subsampling coefficient as a median ratio between samples? I am expecting quite uneven distributions/variances in the mappedUMI counts between samples (partly due to a huge difference in proliferation that occurs in a fraction of cells in only one sample). Still, cellranger aggregate corrects it nicely.
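As a hypothetical sketch of that median-ratio idea (assuming the mappedUMI data can be parsed as a two-column barcode/count table; the file names below are placeholders):

```python
# Hypothetical sketch: derive a subsampling coefficient from per-cell
# mapped-UMI counts of two samples (file names and layout are assumed).
import numpy as np

# column 1 assumed to hold the per-cell mapped UMI count
mapped_deep = np.loadtxt("deep_sample_mapped_umi.txt", usecols=1)
mapped_shallow = np.loadtxt("shallow_sample_mapped_umi.txt", usecols=1)

# fraction of reads to keep in the deeper sample so the medians match
p = np.median(mapped_shallow) / np.median(mapped_deep)
print(f"subsample the deeper library keeping p = {p:.3f} of its reads")
```

Note that because of PCR duplication and UMI collapsing, the read-to-UMI relationship is not linear, so a coefficient estimated on UMIs is only an approximation of the corresponding read fraction.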

Thanks again for your prompt answer and comments.

k3yavi commented 5 years ago

Thanks @juugii for the detailed response and for clarifying my doubts. It surely is possible to lose some information when projecting the data onto a lower-dimensional space, since neither CCA nor MNN is lossless. I do see your point and would like to understand more about the depth normalization problem you are facing across experiments.

I tried to look up the algorithm cellranger uses to aggregate the counts; as far as I can tell, it subsamples reads from the higher-depth libraries until all samples have roughly the same mean number of confidently mapped reads per cell.
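A rough sketch of that kind of depth equalization (my reading of the aggr behavior with assumed read and cell totals, not cellranger's actual code):

```python
# Rough sketch of cellranger-aggr-style depth equalization (assumed numbers):
# compute each sample's mean mapped reads per cell, then downsample reads
# so every sample matches the shallowest one.
mapped_reads = {"sampleA": 250_000_000, "sampleB": 40_000_000}  # assumed totals
n_cells = {"sampleA": 5_000, "sampleB": 4_000}                  # assumed cell counts

reads_per_cell = {s: mapped_reads[s] / n_cells[s] for s in mapped_reads}
target = min(reads_per_cell.values())

# fraction of reads to keep per sample so all means match the target
keep_fraction = {s: target / rpc for s, rpc in reads_per_cell.items()}
print(keep_fraction)  # {'sampleA': 0.2, 'sampleB': 1.0}
```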

Quoting your text from above:

> Alternatively, could a subsampling covariate be added to the probabilistic quantification model of alevin

I think we can definitely work on correcting the subsampling bias in the probabilistic model of Alevin, but we need a little more understanding of what's going on with cellranger and why your way of aggregation is not working as intended. Unfortunately, I think we have to dig deeper into the cellranger codebase to really understand that, and we might need a subset of the relevant data to replicate your issue and work on improving it. Also, I am wondering: how are you checking the batch effect and comparing the Alevin and Cellranger aggregations? Just a guess: the tSNE plots of the two samples are probably separating out very clearly. If you can share the figures and the analysis (maybe a notebook) with us, we can investigate more thoroughly.

k3yavi commented 5 years ago

Hi @juugii ,

I tried to simulate the coverage/depth bias by subsampling a fraction of the data and then correcting it using a very trivial approach, which seems to work fine, at least for the dataset I simulated. It would be great if you could also test the same on your dataset and let us know how it looks for your use case. The gist of the notebook can be found here.
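The gist itself is not reproduced in this thread; one trivial correction in this spirit (not necessarily the notebook's exact approach) is to binomially thin the deeper sample's count matrix down to the shallower sample's median depth:

```python
# One trivial depth correction (illustrative, not the gist's exact method):
# binomially thin the deeper sample's cell x gene count matrix so its
# median total count per cell matches the shallower sample's.
import numpy as np

rng = np.random.default_rng(0)

def thin_counts(counts, p):
    """Downsample an integer count matrix, keeping each count w.p. p."""
    return rng.binomial(counts.astype(int), p)

deep = rng.poisson(5.0, size=(100, 2000))     # stand-in for the deep matrix
shallow = rng.poisson(1.0, size=(100, 2000))  # stand-in for the shallow one

p = np.median(shallow.sum(axis=1)) / np.median(deep.sum(axis=1))
deep_corrected = thin_counts(deep, p)
print(np.median(deep_corrected.sum(axis=1)), np.median(shallow.sum(axis=1)))
```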

juugii commented 5 years ago

Thanks a lot, I'll try it.

juugii commented 5 years ago

Thanks a lot @k3yavi for the code. It doesn't seem to work; I don't think it normalizes in a proper way: genes that are highly expressed (RPS/L, ACTB, B2M, etc.) are much more duplicated during PCR than lowly expressed genes. So there are lots of reads, but not proportionally more counts after quantification (and summarization at the UMI level). If one sequences any sample deeply, this affects both the cells and the genes very unevenly, depending on the transcriptional activity of the cell (cycling...) and the expression level of a given gene. Subsampling the fastq files before any quantification would therefore be fairer, what do you think?
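A toy simulation of the saturation effect described above (assumed molecule and read counts, for intuition only): doubling the reads does not double the distinct UMIs observed, and highly expressed genes saturate fastest, which is why thinning counts and thinning reads are not equivalent.

```python
# Toy illustration of UMI saturation: reads are drawn from a fixed pool of
# molecules (UMIs); as depth grows, extra reads increasingly hit
# already-seen UMIs, especially for highly expressed genes.
import numpy as np

rng = np.random.default_rng(1)

def umis_seen(n_molecules, n_reads):
    """Distinct UMIs observed when n_reads are sampled uniformly."""
    return len(np.unique(rng.integers(0, n_molecules, size=n_reads)))

for gene, n_mol in [("ACTB-like", 1_000), ("rare gene", 20)]:
    for n_reads in (100, 1_000, 10_000):
        print(gene, n_reads, "reads ->", umis_seen(n_mol, n_reads), "UMIs")
```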

You can also see the notes on: https://github.com/theislab/scanpy/pull/340