pseudoBulk as a reference

bio-la commented 5 years ago

Hi Aaron, cool to see you're involved in this project! I want to create reference data to use with SingleR, is it ok to generate pseudobulks from an annotated SC dataset? And would I need to TPM the pseudobulks to create the reference or what normalization should I use then? thanks!

LTLA commented 5 years ago

> is it ok to generate pseudobulks from an annotated SC dataset?

That's an interesting question. I would guess that it would still work well, and indeed, SingleR's built-in reference datasets are created from bulk RNA-seq studies. It would definitely be faster than passing in the single-cell profiles directly.

Having said that, knowledge of the single-cell profiles can be powerful if the cell types are not well separated and have irregular shapes/widths in high dimensions. For example, if you have a diffuse cell type and a neighboring tight cell type, cells of the former type may be equidistant to means of both clusters. In such cases, a pseudo-bulk reference would be less effective than using the individual cells, which would capture the "shape" of the cell distribution.
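To make the diffuse-versus-tight scenario concrete, here is a toy sketch in one dimension. The values are made up, and it uses plain absolute distance rather than the correlation-based scoring that SingleR actually performs, so treat it only as an illustration of why per-label means can mislead where individual cells do not.

```python
from statistics import mean

# Hypothetical 1-D "expression" values, not taken from any real dataset.
reference = {
    "tight": [-0.2, -0.1, 0.0, 0.1, 0.2],   # tight cell type, mean 0
    "diffuse": [1.0, 2.0, 3.0, 4.0, 5.0],   # diffuse neighbouring type, mean 3
}

def assign_by_centroid(x, ref):
    """Label x by the closest per-label mean (the pseudo-bulk analogue)."""
    return min(ref, key=lambda lab: abs(x - mean(ref[lab])))

def assign_by_nearest_cell(x, ref):
    """Label x by the closest individual reference cell."""
    return min(ref, key=lambda lab: min(abs(x - v) for v in ref[lab]))

query = 1.2  # a cell sitting at the edge of the diffuse type
centroid_label = assign_by_centroid(query, reference)
nearest_label = assign_by_nearest_cell(query, reference)
print(centroid_label, nearest_label)
```

The query is closer to the mean of the tight type (distance 1.2 versus 1.8), yet its nearest individual reference cell belongs to the diffuse type, which is the information a single pseudo-bulk profile per label throws away.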

I don't know whether this is really an issue in practice. I suspect that most cell types are sufficiently well-separated that this is not a problem, in which case you can just use the pseudo-bulk samples instead of single-cell profiles as the reference. Just pragmatically, it is easier to run SingleR() on thousands of pseudo-bulk samples compared to millions (or billions?) of individual cells. Perhaps @dviraran or @dtm2451 would have more comments here.

> And would I need to TPM the pseudobulks to create the reference or what normalization should I use then?

You'd want to make the ranks comparable. Read counts from full-length protocols scale with gene length whereas UMI counts do not, so if your reference is read-based and you want to compare it to UMI data, I'd suggest TPM'ing or FPKM'ing it first.
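As a rough illustration of why length scaling matters for ranks, here is a minimal TPM sketch for a single sample. The function name `tpm`, the counts, and the gene lengths are all hypothetical; in practice the lengths would come from your annotation.

```python
def tpm(counts, lengths_kb):
    """Convert one sample's read counts to transcripts per million.

    counts and lengths_kb are parallel lists (one entry per gene).
    """
    rates = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    total = sum(rates)
    if total == 0:
        raise ValueError("sample has no counts")
    return [r / total * 1e6 for r in rates]

# Hypothetical read counts and gene lengths (in kb) for three genes.
counts = [100, 150, 300]
lengths_kb = [1.0, 3.0, 6.0]

tpm_values = tpm(counts, lengths_kb)
print(tpm_values)
```

On raw counts the third gene ranks highest, but after dividing by length the first gene does, which is the kind of rank change that would otherwise distort a comparison against UMI data.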

dviraran commented 5 years ago

Really cool that he is involved in this project :)

@LTLA - I agree with everything you wrote. I am a fan of pseudobulk, and have gotten pretty good results with it. Of course, it depends on what the goal is, but in general, it works pretty well.

bio-la commented 5 years ago

Hi both!

> Having said that, knowledge of the single-cell profiles can be powerful if the cell types are not well separated and have irregular shapes/widths in high dimensions. For example, if you have a diffuse cell type and a neighboring tight cell type, cells of the former type may be equidistant to means of both clusters. In such cases, a pseudo-bulk reference would be less effective than using the individual cells, which would capture the "shape" of the cell distribution.

I can tell you feel my pain. The same dataset can have a bit of both worlds, and I am afraid there's no obvious solution; I just have to try different strategies. Even when doing pseudobulks, if N cells are classified as "Cell.type_x" but for some (technical?) reason they split into 2 clusters, I would still have to make 2 pseudobulks instead of one. But what if one of these clusters has N-4 cells and the other has the remaining 4? Do I really believe the tiny one?

Hence I do agree that maybe the solution is to do it cell-wise. I'm testing a couple of things for fun now. Your updates make SingleR capable of handling the single-cell-reference scenario, right?

thanks!

LTLA commented 5 years ago

It is possible to have your cake and eat it too, via k-means clustering.

k-means gets a bad rap for single-cell data, but the truth is that it remains one of the best methods for vector quantization. Check out the comments here. The idea would be to perform k-means clustering within each of your reference cell labels, asking for, say, sqrt(n) clusters for each label where n is the number of cells for that label. This effectively compresses your data for efficient operation of SingleR while still preserving information about the "shape" of the cluster.
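A minimal sketch of that idea, assuming each cell is a list of (log-)expression values and using a plain Lloyd's-algorithm k-means written from scratch. The helper names `kmeans` and `compress_reference` are made up for illustration, and this is not the implementation of any SingleR function.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers

def compress_reference(cells, labels):
    """Replace each label's cells with about sqrt(n) k-means centroids."""
    by_label = {}
    for cell, lab in zip(cells, labels):
        by_label.setdefault(lab, []).append(cell)
    return {
        lab: kmeans(group, max(1, round(math.sqrt(len(group)))))
        for lab, group in by_label.items()
    }

# Hypothetical reference: 16 cells of type "A" and 9 of type "B", 2 genes each.
cells = [[float(i), float(i % 3)] for i in range(16)]
cells += [[100.0 + i, 0.0] for i in range(9)]
labels = ["A"] * 16 + ["B"] * 9

compressed = compress_reference(cells, labels)
print({lab: len(cents) for lab, cents in compressed.items()})
```

Here 25 cells shrink to 7 centroid profiles (4 for "A", 3 for "B"), each still carrying its label, so several profiles per label remain to trace out the spread of the cluster.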

If you try this and find that this works well, we could add a mode to do this automatically.

Oh, and:

> Your updates make SingleR capable of handling the single-cell-reference scenario, right?

Yes, though it will work faster if you get a more focused (smaller) marker set. The default marker detection seems to be too generous, i.e., the default marker set is too large to take advantage of some of the speed-ups. It is not hard to implement a more stringent detection scheme, but I wanted to avoid breaking other people's work by suddenly changing the algorithm.

friedue commented 5 years ago

> You'd want to make the ranks comparable, so if you're working with read data, I'd suggest TPM'ing or FPKM'ing them if you want to compare them to UMI data.

Is there a discussion of the type of aggregation function (sum, mean, median, ...) for generating the pseudo-bulk that you recommend reading? Sum is the best way of getting integers, but it just seems so dependent on the number of cells within a given scRNA-seq sample that happen to represent that cell type.

LTLA commented 5 years ago

Most of the pseudo-bulk literature focuses on differential expression analyses. (Some self-promotion here.) For that, sums have the most easily predictable statistical properties and are most obviously compatible with downstream tools like edgeR and voom. In this case:

It doesn't matter whether you use the sum or the mean here; you should be normalizing the pseudo-bulk reference samples prior to entry into SingleR(), and this would rescale everything anyway.
You may or may not prefer to sum the normalized expression values instead of the raw counts. The "advantage" of this approach is that the pseudo-bulk sample is not dominated by large cells, but the disadvantage is that you increase noise from small cells. (For DE analyses, there are more effects in play that discourage this practice, based on the total RNA content and mean-variance issues.)
I would prefer the mean over the median, as the latter runs the risk of getting filled up with zeroes for low-abundance genes with lots of drop-outs. If you had a gene that was specific for a cell population but was only detected in 49% of cells... well, that's too bad.
The geometric mean isn't really worth considering due to all the zeroes. If you're working in log-space, the mean of log-values is sort of the geometric mean, so it does implicitly get used if you're dealing with log-expression values. But if you're handling counts, there's no reason to use this.
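The first and third points above can be checked on a toy count matrix. The numbers are invented, and `normalize` is a hypothetical library-size scaling used only for illustration, not whatever normalization you end up applying before SingleR().

```python
from math import isclose
from statistics import median

def normalize(profile, scale=1e6):
    """Scale a profile so that its values add up to `scale`."""
    total = sum(profile)
    return [v / total * scale for v in profile]

# Hypothetical counts: rows are cells of one label, columns are genes.
# The second gene is detected in only one of the three cells.
cells = [
    [4, 0, 6],
    [2, 3, 5],
    [6, 0, 4],
]
n_cells = len(cells)

sum_profile = [sum(col) for col in zip(*cells)]
mean_profile = [s / n_cells for s in sum_profile]
median_profile = [median(col) for col in zip(*cells)]

# Sum and mean differ only by a constant factor, which normalization removes.
same_after_norm = all(
    isclose(a, b)
    for a, b in zip(normalize(sum_profile), normalize(mean_profile))
)
print(same_after_norm, mean_profile, median_profile)
```

The sum and mean profiles coincide once normalized, whereas the median of the second gene collapses to zero because fewer than half of the cells detect it, while the mean still keeps a nonzero value.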

LTLA commented 4 years ago

I have now added some instructions on the use of aggregateReference to the vignette.

SingleR-inc / SingleR

pseudoBulk as a reference #3