Open juugii opened 5 years ago
Hi @juugii, thanks a lot for starting an interesting discussion related to Alevin. It does sound really interesting. If I may, I'd like to ask a couple of questions before going deeper into yours.
re: subsampling the Fastq:
When you say you tried subsampling the Fastq, did you sample randomly across the full Fastq or choose the top X reads? One can imagine that if you simply sample the top X reads and the Fastq has not been generated in random order, then a certain subset of CBs will get all the reads and you'd still observe the coverage bias in them. I'd suggest trying to subsample randomly across the full Fastq, if you haven't already.

re: subsampling coefficient:
If you are looking for the per-CB mapping rate for your sample, that would be very easy to calculate. Getting one number for the full sample might be a little tricky, since the mapping rate can have large variance across the sample, but it would be an interesting plot to generate; do let us know how it looks in your case.
If you run Alevin with the --dumpFeatures flag, alevin will generate a file featureDump.txt, whose first column is the per-CB mapping rate, i.e. #mapped reads / #raw reads. If you want absolute values for per-CB raw reads and mapped reads, they should be in the files filtered_cb_frequency.txt and mappedUMI.txt respectively.

re: cellranger subsampling:
Correct me if I am wrong: when you say cellranger subsampling, do you mean the cellranger aggregate pipeline? It's possible you are talking about some other step which I am not aware of, but if it's aggregate, then I think it happens downstream of all the quantification. Indeed, coverage-bias correction is an important part of the aggregation step, but in general it's not the only one, and that's why we recommend using the Seurat package downstream of the Alevin-quantified matrices. We will be more than happy to write a tutorial on how to perform batch correction downstream of Alevin, but in summary the following steps would be the gist of the process.
Use Alevin without any modification to the fastq on both of your samples to generate the gene count matrices. (We have made a major upgrade to Alevin. We'd recommend using v0.12.0-alpha for now; we are planning to make an official release before the end of this week. Currently you can use the pre-release; unfortunately, it's not available on conda yet.)

Hi @k3yavi, many thanks for your prompt answer, once again.
When you say you try subsampling the Fastq, did you sample randomly across the full Fastq or chose the top X reads.
Yes, I did perform a random subsampling, i.e. keeping each read with probability p while reading the fastq files, p being the subsampling coefficient I mentioned (p ∈ [0, 1]). An implementation of this approach as an option during the transcript quantification would be great. I can provide you with the simple python script I use for the subsampling, but I am not sure it is the proper way to subsample for alevin quantification.
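A minimal version of such a subsampler might look like the following. This is a sketch, not the actual script in question: it assumes plain (uncompressed) 4-line FASTQ records and keeps R1/R2 in sync by drawing one random number per read pair.

```python
import random
from itertools import islice

def read_records(handle):
    """Yield 4-line FASTQ records (header, sequence, '+', quality)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        yield record

def subsample_pairs(r1_in, r2_in, r1_out, r2_out, p, seed=42):
    """Keep each read pair with probability p (0 <= p <= 1).

    One random draw per pair keeps the barcode/UMI read and the cDNA
    read in sync; a fixed seed makes the subsampling reproducible.
    Returns the number of pairs kept.
    """
    rng = random.Random(seed)
    kept = 0
    for rec1, rec2 in zip(read_records(r1_in), read_records(r2_in)):
        if rng.random() < p:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept
```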
when you say cellranger subsampling, do you mean the cellranger aggregate pipeline?
Yes, sorry for not stating it clearly. I did indeed use the cellranger aggregate function, which by default subsamples the libraries with higher sequencing depth depending on the amount of mapped reads, if I understand it well.
Use Alevin w/o any modification to the fastq on both of your sample to generate the gene count matrices.
I already did that; in the downstream analyses I have a batch-effect issue related to the sequencing depth.
that's why we recommend using the Seurat package downstream of the Alevin quantified matrices
I have some experience with downstream analyses in Seurat, Pagoda, Scater, scanpy and a few other tools, and I am aware of batch-correction methods like CCA or MNN. But that is not what I am looking for here. I tried both CCA and MNN, but I lose some important information in the resulting eigenspaces or corrected matrix. I believe the proper way to correct my batch effect is to simply fix the actual difference between my two libraries, i.e. the sequencing depth in this case.
As I explained in my first message, cellranger aggregate (subsampling based on the amount of mapped reads) works very well in my case, correcting the effect without any loss or modification of the genes important to our scientific question, unlike CCA or MNN. I would like to be able to do the same from the alevin quantifications. So I am looking for a proper way to apply a correction before/during/after the alevin quantification, similar to what cellranger does with STAR. Alternatively, could a subsampling covariate be added to the probabilistic quantification model of alevin (if I understand it well), such that this kind of discrepancy between samples would be corrected?
I did look at the mappedUMI file:
So an option you would recommend is to simply compute the subsampling coefficient as a median ratio between samples? I am expecting quite uneven distributions/variances in the mappedUMI between samples (partly due to a huge difference in terms of proliferation that occurs in a fraction of cells in only one sample). Still, cellranger aggregate corrects it nicely.
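As a sketch of that median-ratio idea (my own assumption about how one might do it, not an alevin feature): take the per-CB mapped-read counts of the two samples, e.g. from mappedUMI.txt, and use the ratio of medians as the fraction of reads to keep in the deeper library.

```python
import statistics

def subsampling_coefficient(counts_a, counts_b):
    """Ratio of the smaller median to the larger one.

    counts_a / counts_b: per-CB mapped-read counts for each sample.
    Subsample the deeper library's reads by the returned fraction
    and leave the shallower library untouched.
    """
    med_a = statistics.median(counts_a)
    med_b = statistics.median(counts_b)
    return min(med_a, med_b) / max(med_a, med_b)
```

Note that a median is robust to the skew from a highly proliferative subpopulation in one sample, but it will not make the two per-cell depth distributions match in shape, only in central tendency.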
Thanks again for your prompt answer and comments.
Thanks @juugii for the detailed response and for clarifying my doubts. It surely is possible to lose some information when projecting the data onto a lower-dimensional space, since neither CCA nor MNN is lossless. I do see your point and would like to understand more about the depth-normalization problem you are facing with different experiments.
I tried to find the algorithm cellranger uses to aggregate the counts, and it looks like the following:
As I was saying earlier, the aggregation step happens downstream of the count/quantification of the gene count matrix, at least in the cellranger pipeline.
When doing large studies involving multiple biological samples (or multiple libraries or replicates of the same sample), it is best to run cellranger count on each of the libraries individually, and then pool the results using cellranger aggr
It looks like they have three different modes to normalize the libraries:
There are three normalization modes:
mapped: (default) Subsample reads from higher-depth libraries until they all have an equal number of confidently mapped reads per cell.
raw: Subsample reads from higher-depth libraries until they all have an equal number of total (i.e. raw, mapping-independent) reads per cell.
none: Do not normalize at all.
Although it's not clear what they mean by an equal number of confidently mapped reads per cell: does it mean the median reads per cell? As you tried to show in the plot above, the distribution can be very uneven. But the part that troubles me more is that once the count information has been generated, the read-level information is lost, since the reads have been deduplicated; so how do they use the read counts to normalize? Unless that information is dumped too; it's not clear.
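For what it's worth, a naive reading of the "mapped" mode could be sketched like this. Whether cellranger equalizes on the mean, the median, or something else per cell is exactly the open question above, so the mean used here is purely an assumption:

```python
def mapped_mode_fractions(mapped_per_cell):
    """Given {library: [confidently mapped reads per cell, ...]},
    return per-library read-keep fractions that equalize the per-cell
    average on the shallowest library.

    Assumes 'equal number of mapped reads per cell' means the mean;
    the cellranger docs don't spell out which statistic is used.
    """
    averages = {lib: sum(v) / len(v) for lib, v in mapped_per_cell.items()}
    target = min(averages.values())
    return {lib: target / avg for lib, avg in averages.items()}
```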
Quoting your text from above:
Alternatively, could a subsampling covariate be added to the probabilistic quantification model of alevin
I think we can definitely work on correcting the subsampling bias in the probabilistic model of Alevin, but we might need a little more understanding of what's going on with cellranger and why your way of aggregation is not working as intended. Unfortunately, I think we have to dig deeper into the cellranger codebase to really understand that and, if possible, we might need some subset of the relevant data to replicate your issue and work on improving it. Also, I am wondering: how are you checking the batch effect and comparing the Alevin and Cellranger aggregations? Just a guess: the tSNE plots of the two samples are separating out very clearly? If possible, share the figures and the analysis (maybe a notebook) with us, and we can help investigate more thoroughly.
Hi @juugii ,
I tried to simulate the coverage/depth bias by subsampling a fraction of the data and then correcting it using a very trivial approach, which seems to be working fine, at least for the dataset I simulated. It would be great if you could also test the same on your dataset and let us know how it looks for your use case. The gist of the notebook can be found here.
Thanks a lot, I'll try it.
Thanks a lot @k3yavi for the code. It doesn't seem to work; I don't think it normalizes in a proper way: genes that are highly expressed (RPS/L, ACTB, B2M, etc.) are much more duplicated during the PCR than lowly expressed genes. So lots of reads, but far fewer counts after quantification (and summarization at the UMI level). Hence, sequencing any sample deeply affects both cells and genes very unevenly, depending on the transcriptional activity of the cell (cycling...) and the expression level of a given gene. So subsampling the fastq files before any quantification would be fairer; what do you think?
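This PCR-duplication argument can be checked with a tiny simulation: subsample reads (not counts) and see how many distinct UMIs survive. A highly duplicated gene (many reads per UMI) keeps nearly all of its UMI counts under read-level subsampling, while a lowly expressed gene (one read per UMI) loses counts roughly in proportion to the subsampling, so scaling counts uniformly cannot reproduce the effect. This is just an illustrative toy, not any real data:

```python
import random

def umis_surviving(read_umis, p, seed=0):
    """Subsample reads with probability p, then count distinct UMIs left.

    read_umis: one UMI label per sequenced read, so repeated labels
    represent PCR duplicates of the same molecule.
    """
    rng = random.Random(seed)
    return len({u for u in read_umis if rng.random() < p})

# Toy example: 10 UMIs with 100 PCR duplicates each (a highly expressed
# gene) vs. 50 singleton UMIs (a lowly expressed gene).
high_dup = [f"u{i}" for i in range(10) for _ in range(100)]
low_dup = [f"v{i}" for i in range(50)]
```

At p = 0.5 the duplicated gene will almost surely retain all 10 UMIs, while the singleton gene retains only about half, which is exactly the uneven behavior described above.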
You can also see the notes on: https://github.com/theislab/scanpy/pull/340
Dear salmon team and @k3yavi,
I am a happy user of alevin on scRNAseq data (10x). In some circumstances, I have to compare samples with very uneven sequencing depth (less than 100k reads per cell vs. 1M reads per cell), which naturally produces a huge "batch effect", some transcripts being impossible to detect at the lower sequencing depth.
For 10x datasets, cellranger offers to compensate for this kind of difference across datasets by subsampling, and I should say it works very well. I tried this strategy before running alevin, using python scripts to subsample the fastq files prior to any alevin quasi-mapping. However, all my attempts failed: all downstream analyses showed that this strategy didn't correct the batch effect at all with alevin (visualization after dimension reduction/UMAP shows a huge difference between batches, while the cellranger subsampling method clearly corrects it).
An important point is that, by default, the cellranger method corrects according to the amount of mapped reads between samples. So ideally, I should evaluate the difference in terms of mapped reads to compute a subsampling coefficient (the fraction of reads to keep). Which alevin metrics should I use to estimate such a coefficient?
Considering the probabilistic approach of alevin's transcript quantification, I am wondering if there could be a better way to normalize the samples to account for this kind of issue? Also, any option to subsample the reads (taking a read into account with a certain probability) during alevin quantification would be of great help in my case.
Thank you for your work and your time, Best, juugii