broadinstitute / CellBender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.
https://cellbender.rtfd.io
BSD 3-Clause "New" or "Revised" License

Significance of differences in PCA plot and average_counts_removed_per_cell metric with different settings #291

Open RvdKwast opened 1 year ago

RvdKwast commented 1 year ago

Hello,

I really like CellBender and have been varying the input variables to try to figure out which settings fit my institute's samples (using v0.3.0 on Windows, so unfortunately no HTML report). The defaults generally do not work well for me, but it starts looking quite good with some adjustments, especially lowering the learning rate to 1e-5 or sometimes even 5e-6. But to make sure I am choosing the correct settings, I'd love your input on the following 2 points:

1) What is better: improving the clustering in the PCA plot or improving the training and cell probability plots? I'm finding that changing one variable, especially learning-rate, can improve the training and cell probability plots while having the opposite effect on the PCA (less distinct clusters). Below is an example with the learning rate at 2e-5 (left) and 1e-5 (right); the right-hand PCA shows less distinct clusters.

[image: PCA plots, learning rate 2e-5 (left) vs 1e-5 (right)]

2) I like the metrics file and figured that the average_counts_removed_per_cell metric should ideally be approximately as high as the empty droplet plateau in the UMI curve. Is this correct? If so, I thought this metric could provide an accessible quantitative measure to identify when the algorithm is removing too much, and help figure out the appropriate FPR after optimizing the other parameters first. Below is an example of 4 different settings I tested for one of my samples, with my interpretation of the empty droplet plateau and the output.

Your input would be very much appreciated!

Best regards, Reggie

230916_Heart_230509_CB values and assesment

230916_Heart_230509_CB pdfs

sjfleming commented 1 year ago

Hi @RvdKwast , thanks for writing in.

  1. What is better: improving the clustering in the PCA plot or improving the training and cell probability plots?

Improving the training and cell probability plots should definitely be the priority.

  2. average_counts_removed_per_cell metric should ideally be approximately as high as the empty droplet plateau in the UMI curve. Is this correct? If so, I thought this metric could provide an accessible quantitative measure to identify when the algorithm is removing too much, and help figure out the appropriate FPR after optimizing the other parameters first.

Your intuition is very good, but I think there is a little bit of extra nuance as well. If the noise model were "ambient only", then the average_counts_removed_per_cell should very closely match the counts in the empty droplets. But the noise model here also contains chimeric barcode swapping noise, which is a type of noise that's proportional to the total number of counts in each droplet. So this kind of noise (if CellBender thinks it plays a role in the dataset) can cause the average_counts_removed_per_cell to be larger than the number of counts in empty droplets.
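To make that concrete, here is a toy sketch of the arithmetic. It is not CellBender's actual noise model; it only assumes that removed counts are the sum of a constant ambient term and a swapping term proportional to each droplet's total counts. The function names, the swap fraction, and the example numbers are all made up for illustration.

```python
from statistics import mean


def expected_removed_counts(total_umis, ambient_counts, swap_fraction):
    """Toy estimate of noise counts in one droplet (illustrative assumption only).

    ambient_counts: counts contributed by ambient RNA, roughly the
                    empty-droplet plateau.
    swap_fraction:  hypothetical fraction of a droplet's total counts
                    attributed to chimeric barcode swapping.
    """
    return ambient_counts + swap_fraction * total_umis


def average_removed_per_cell(cell_umis, ambient_counts, swap_fraction):
    """Average the toy estimate over a list of per-cell UMI totals."""
    return mean(
        expected_removed_counts(n, ambient_counts, swap_fraction)
        for n in cell_umis
    )


# Made-up numbers: three cells and an empty-droplet plateau of 100 UMIs.
cells = [5000, 10000, 15000]
ambient_only = average_removed_per_cell(cells, ambient_counts=100, swap_fraction=0.0)
with_swapping = average_removed_per_cell(cells, ambient_counts=100, swap_fraction=0.01)
print(ambient_only, with_swapping)
```

With a swap fraction of zero the average equals the plateau; any positive swap fraction pushes the average above it, and more so for datasets with larger cells.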

But what you've observed is something very interesting that I cannot say I've seen before. I've not seen a case where the average_counts_removed_per_cell changes like that for different parameter settings. (All those runs used the same --fpr ... so I'd expect them to all be pretty similar...)

What happens to average_counts_removed_per_cell if you use --learning-rate 2e-5 and leave everything else on auto?
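For comparing runs like this, a small helper along the following lines may be handy. It is a minimal sketch that assumes the metrics file is a header-less two-column CSV of metric name and value (an assumption about the layout, so check your own file first). The helper names, the run labels, and every number in the example are hypothetical placeholders, not real results.

```python
import csv
import io


def read_metric(metrics_csv_text, name="average_counts_removed_per_cell"):
    """Return one metric from CSV text, assuming rows of the form `name,value`."""
    for row in csv.reader(io.StringIO(metrics_csv_text)):
        if len(row) >= 2 and row[0].strip() == name:
            return float(row[1])
    raise KeyError(name)


def removed_to_plateau_ratio(metrics_csv_text, empty_plateau_umis):
    """Ratio of average removed counts to the empty-droplet plateau.

    A ratio near 1 would match the "ambient only" expectation; a larger
    ratio could reflect swapping noise or over-removal.
    """
    return read_metric(metrics_csv_text) / empty_plateau_umis


# Placeholder contents standing in for two runs' metrics files.
runs = {
    "lr_2e-5": "average_counts_removed_per_cell,120.0\n",
    "lr_1e-5": "average_counts_removed_per_cell,310.0\n",
}
plateau = 100  # your own estimate of the empty-droplet plateau, in UMIs

for label, text in runs.items():
    print(label, round(removed_to_plateau_ratio(text, plateau), 2))
```

In practice you would read each run's metrics file from disk and pass its text to `read_metric`.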