broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License
295 stars 54 forks source link

SPLiTseq #226

Open ybaeus opened 1 year ago

ybaeus commented 1 year ago

Can I apply this package to SPLitSeq output from https://github.com/yjzhang/split-seq-pipeline/tree/master?

sjfleming commented 1 year ago

Hi @dpwlek1997 , yes you can definitely use CellBender whenever you have a count matrix. And it looks like their output file format is a DGE matrix, so it might work directly as an input. CellBender can take MTX-formatted matrices. You might be able to use the DGE_unfiltered output folder as the input to CellBender. CellBender does expect CellRanger formatting conventions for the MTX file format, so you might have to tweak a few things. For example, if you input a directory to CellBender like this

cellbender remove-background \
    --cuda \
    --input DGE_unfiltered \
    ...

where DGE_unfiltered is the folder with your outputs, then CellBender will expect to see three files there, named just this way:

DGE_unfiltered
├── barcodes.tsv.gz
├── features.tsv.gz
└── matrix.mtx.gz

so you may have to rename some files.

ybaeus commented 1 year ago

@sjfleming Thanks for your replying! I was able to successfully run the cellbender. But I am having questions about the computational method. Since the cellbender uses empty droplet information (droplet size factor, droplet capture efficiency factor) to compute the total distribution, and SPLiTseq doesn't use any drop-based sequencing method, does this impact the performance of cellbender?

sjfleming commented 1 year ago

Ah, maybe I was a little bit too quick to reply. Let me think about that, it is an interesting question. Reading the SPLiT-seq paper more carefully, I think I see that, in SPLiT-seq, the concept of a "droplet" is replaced by the cell itself. Somehow the cell membrane is permeabilized, and this allows barcode oligos and PCR reagents to enter the cells. Presumably it also allows in/out ambient RNA.

The "droplet efficiency" parameter in cellbender is really all about the efficiency of reverse transcription, which we model as being "droplet-specific". In the case of SPLiT-seq, I think modeling a "cell-specific" reverse transcription efficiency is also probably appropriate. (What if one cell is very permeable, allowing in lots of reverse transcriptase, and another cells is not as permeable? That kind of thing.) The part I'm not sure about is the "empty droplet" part. In cellbender we need to look at empty droplets to infer the ambient RNA makeup. In SPLiT-seq, the concept of "empty droplets" is gone. I suppose at the end of the protocol, you lyse all the cells and perform PCR in bulk to come up with the final library. Then you do your sequencing and you get a huge number of reads. Each read has a barcode which is made of 4 parts. In the paper they say

Four rounds of combinatorial barcoding can yield 21million+ barcode combinations.

But you don't have that many cells... so I am guessing the reads fall into a few categories:

  1. Reads with barcode sequences that are "popular". Popular barcodes are probably inferred to belong to be molecules that came from cells.
  2. Reads with barcode sequences that are "unpopular".

The analysis of SPLiT-seq probably throws out reads that fall into class (2)? In droplet-based sequencing, these reads would probably be the empty droplets... because one way that a read could get an "unpopular" barcode would be by being an ambient RNA. It would be randomly put into a well (with the aqueous solution) and it would get a barcode each round, but it would not always be with the same cell, and so it could end up with a pretty random barcode.

What I'm wondering is... what happens to the data from the "unpopular" barcodes in the SPLiT-seq analysis pipeline? Because I think that's where we want to look for ambient RNA.

sjfleming commented 1 year ago

Also, if you make a "UMI curve" for SPLiT-seq data, what does it look like? Do all the "barcodes" correspond to "cells"? And they all have a large number of UMI counts? Is there any equivalent of "empty droplets" in the count matrix? If the count matrix does include reads from category (2), then maybe we have what we need.

ybaeus commented 1 year ago

"Somehow the cell membrane is permeabilized, allowing barcode oligos and PCR reagents to enter the cells. Presumably it also provides in/out ambient RNA..... " Thanks for the insights. I might be able to use 'unpopular' reads as unfiltered data. I just clearly noticed why my data has so many cells than I was expecting. One fun fact I found out is that nCount x nFeature scatter plots show a split cluster with low counts + high features and I may interpret it as the average of all cells that lysed in solution.

image image

I think it is a great idea to make UMI curve for SPLiTseq data. I haven't, but I will get back to you with it! (might take a week cause i am working on multiple projects simultaneously.) Thanks for discussing it with me!

ayyildizd commented 1 year ago

Hi, A very nice discussion and to bring it up, I want to share my experience with running cellbender on split-seq data. I did the preprocessing using STARsolo with --soloType CB_UMI_Complex setting. So providing positions of all 3 cell barcodes, UMI and whitelists of those 3 round barcodes. In the end STARsolo does cell filtering, so I assume the ones identified as not cells can be used like equivalent of empty droplets in count data. Then I run Cell Bender checking the UMI curves, estimating total droplets and low-count-thresholds to be used. Total number of expected cells were known because this was a published data. After running the cell bender, I did plot UMI curves, marking priors (black horizontal lines) and total droplets included (red vertical line) as you see an example below. The problem is that none of these UMI curves has a plateau as we used to see in droplet based techniques. And that would be a part where cell bender struggles currently right? I would be very happy to be included this discussion about running Cell Bender on Split-seq data as this is currently one of the datasets we are working on.

image

sjfleming commented 1 year ago

Hi @ayyildizd , thanks so much for adding to this discussion. Since I haven't worked with any Split-seq data, this is great to see.

So you say that cellbender did run okay on this kind of sample when you included those "empty droplets"? It looks like you did get an output, if the blue dots on your plot represent cellbender's cell calls. (Or are the blue dots the STARsolo cell calls?)

It does look like the UMI curve is a bit different from what I'd expect from droplet-based data. I could imagine that cellbender might struggle to identify cells (though if the above plot is the cellbender output, it looks like it did fine).

As for the noise model... I will have to think about this. CellBender's noise model with "ambient RNA" and "chimeric barcode swapping" was created with droplet-based experiments in mind, where these concepts really make physical sense and explain what we see in the data. However, when it comes to the mathematical model, it ends up being the case that the "ambient RNA" is like a constant noise source for every cell, and the "barcode swapping" is like a noise source proportional to the total counts in the cell. So this is kind of a flexible model, with one noise source proportional to total counts, and one noise source constant. It may very well be the case that, even if the phenomenology of noise is a little different for Split-seq, this model might still do a pretty good job.

Is there a human-mouse mixture benchmarking experiment for Split-seq somewhere? That might give me a better sense of what the noise in Split-seq really looks like.

ayyildizd commented 1 year ago

Yes, I used raw matrix from starsolo as an input which had all these cells and not cells. Then I did get an output, from celbender. But The blue and grey dots are coming from starsolo filtering, not cell bender. I would say when I look at the priors on UMI curve and the plots cellbender outputs (see below), I am thinking it is not doing a bad job right? I think it might be adapted to this technology indeed. However I am not aware of such benchmarking human-mouse data.

image
sjfleming commented 1 year ago

Great, thanks @ayyildizd ! Yes, the cellbender cell calls in that PDF output there do look quite reasonable to me. It would be hard for cellbender to do a much better job, given that the "empty droplet plateau" is kind of unclear even to us.

ybaeus commented 1 year ago

@ayyildizd Thanks for jumping into the discussion, and your approach is very interesting. I used the output of split-seq-pipeline. So it might be slightly different with the 'filtered' term but let me return after some research.

But in the meantime, just for curiosity, can you also observe the splitted cluster for nFeature vs nCount scatter plot?

And FYI: when I ran the cellbender with the fake total droplets-included number by just adding one, and resulted,,, I know this is not a good approach but just sharing some fun test.

cellbender remove-background \ --input ${INPUT} \ --output ${OUTPUT}/cellbender_test.h5 \ --expected-cells 100000 \ --total-droplets-included 100001 \ --cuda

image