bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
150 stars 17 forks source link

Markers different in BPCells than in Seurat #45

Closed mihem closed 1 year ago

mihem commented 1 year ago

@bnprks Thank you so much for this awesome package. Using BPCells is worlds faster and worlds more memory efficient, which allows analyzing huge datasets. The integration by @gesmira and others into Seurat is also great.

I struggle with finding the top markers. Unfortunately, it's not possible until now to use the BPCells Matrix in Seurat https://github.com/satijalab/seurat/issues/7516.

However, I saw that you also offered the marker_features function which works directly on the BPCells Matrix and can be easily integrated with the Seurat workflow. It's worlds faster, however the results differ. P values maybe different because of Bonferroni, but log2FC should be the same (or is there a mistake in my code?) and the order of the genes should be the same

I created a reproducible example with parts of my own published dataset.

https://osmzhlab.uni-muenster.de/nfs/bpcells/bpcells_debug.html

Maybe @gesmira can also help? I think this would be a very easy way for Seurat to integrate a very fast FindMarkers alternative that works on BPCell Matrix.

Thanks!

Gesmira commented 1 year ago

Hi @mihem, Thanks for bringing this up! FindMarkers on objects with BPCells matrices should now work. Have you updated to the latest v5 version? Currently, though, it doesn't return p-values for each marker. This is due to our use of limma::rankSumTestWithCorrelation which fails when you run rank() on the BPCells RenameDims matrix row. Ben, is there a similar function for calculating rank in the BPCells code base? Also, FindAllMarkers runs a filter for p-values, so it returns no genes since all p-values are currently NA.

We could also think about using this function you mention @mihem, but at first we want to ensure FindMarkers is able to return the same results on the same data for the same method whether the data is a BPCells or dgCMatrix.

bnprks commented 1 year ago

Thanks for the reproducible example here @mihem! A couple notes, mostly regarding your calculation of adjusted p-values and log2FC values based on the BPCells outputs

  1. I don't think the p-values are mismatched here, as if you look at p_val_raw from BPCells and p_val from Seurat they appear to match up for the genes you show. I saw you did p value correction with the "BH" method, whereas I think that Seurat uses bonferroni for p-value correction. BPCells is tested against the builtin wilcox.test() and presto for correctness, so I am fairly confident the raw p-values are correct.
  2. Calculating log2FC is not always a well-defined formula, hence why BPCells doesn't return it directly. The main issue comes down to whether the input data is already log-transformed, which BPCells has no way of knowing. Taking group mean and then log-transforming gives different results from doing the log-transform first. Without matching the exact transformations Seurat is using to calculate group means, we can't expect to calculate matching log2FC values exactly.

As for @Gesmira's questions, BPCells does have a (currently-unexported) rank_transform function (documented here), which returns offset ranks to preserve sparsity, so I'm not sure if that would work properly with limma::rankSumTestWithCorrelation or not. marker_features should be able to return the equivalent p-values at any rate, and Seurat could probably continue to calculate a custom FoldChange separately from the p-value results

mihem commented 1 year ago

Hi Gesmira and Ben,

thanks for your quick response and all your great work!

Here's the updated version:

https://osmzhlab.uni-muenster.de/nfs/bpcells/bpcells_debug.html

@Gesmira I am sorry, you were right, I didn't update for a few weeks and didn't include a session Info.

I updated to the most github version, and then it works, or well as you say log2FC work (and they are the same as when calculcated with the sparse matrix!), but no p values (which are essential imo). Also it's still worlds slower than BPCells::marker_features (even calculating one cluster is several times slower than calculating all cluster with BPCells::marker_features).

@bnprks sorry, you are right, unadjusted p values are the same, and you are right I used BH correction (because that's what it said in the tutorial). Since we are essentially double dipping https://arxiv.org/abs/2207.00554, we should probably be stricter and use Bonferroni (or even better using countsplit, but that's out of scope here). The difference we can still see, is probably because in the BPcells version it's corrected for ALL markers, and in the Seurat version only for the one in cluster 13 (above the threshold). Which version would you think is correct?

Concerning logFC, there is a lot of debate whether it's correctly done in Seurat https://github.com/satijalab/seurat/issues/6654 https://github.com/satijalab/seurat/issues/6701

@bnprks so the correct way seems to be first log, then mean. Is that the way BPCells does it?

foreground_mean - background_mean)/log(2))

based on the data (lognormalized) slot?

If yes, then a Seurat user could just use BPcells function (since BPCells is a dependency anyway) .. or as you say @gesmira integrate the rank_transform function into Seuratv5 and use FoldChange (double checking it's first log then mean). But it would be great if this awesome performance of BPcells::marker_genes could be used.

Thanks!

bnprks commented 1 year ago

Answering briefly since I think our discussion might be starting to drift from discussing bugs to discussing opinions on analysis strategies

  1. Doing p-val correction on a per-cluster basis makes sense to me, and I don't have a strong opinion on BH vs Bonferroni -- honestly I mostly consider p-values in single cell marker analysis like arbitrary scores since counting each cell as an independent sample is already questionable from a statistics perspective.
  2. For log2FC, BPCells certainly doesn't have an official way which is why log2FC is not a column returned by marker_features(). If you are starting from log-transformed data doing the mean after the log seems easiest to compute -- it's a way I describe in the docs. My personal preference would be to calculate means prior to log-transformation if possible, though I'll admit I haven't thought very deeply about the issue.

I'm happy to answer any further questions about what BPCells actually calculates in marker_features(). For discussions of the best methodology to do analysis, I've mostly tried to avoid making those decisions in BPCells so downstream users can do what they like best.

mihem commented 1 year ago

Thanks.

Sorry, don't want to talk about analysis strategies.

  1. Okay
  2. Well, but if the way you describe it in the docs (and how i did it in the above example) is mean of the log (which is also the correct way it seems https://rpubs.com/hrlai/meanlog_logmean) I don't understand the difference to Seurat's logFC. But maybe that's more a question for @Gesmira than for you. But nonetheless BPCells::marker_features (sorry not marker_genes) is an enrichment for the single cell community because of it's performance.
bnprks commented 1 year ago

I'm glad the performance is handy for you mihem -- I'll also mention the presto library which predates BPCells and uses much the same strategy to do fast Wilcoxon rank-sum tests with similarly good performance.

Thanks for the references about mean-log ordering -- I'll have to think if there's a good way I can add a note to the docs to point towards helpful resources for end-users. I'll likely never be able to return a logFC column directly since BPCells doesn't have the metadata access to know if the data has already been log-normalized, but at least I can point people in the right direction for how to calculate it themselves.

mihem commented 1 year ago

Thanks.

Yes, presto has been my prefered choice for identifying markers the last few years. But presto does not work with BPCells Matrix (also show that in my reprex above), so great that you created marker_features :+1:

Sorry, I still don't really get why marker_features give different logFC results from Seurat::Find_Markers. Can you explain that?

This is now mostly again concerning SeuratTeam i guess @Gesmira I found this very handy as a end user: https://github.com/satijalab/seurat-wrappers/blob/master/R/presto.R : Seurat wrapper that used the performance of presto::wilcoxauc and produced the same output as FindMarkers. Because otherwise identifying markers in all clusters takes hours. Of course just running feature_markers followed by

....
mutate(log2fc = (foreground_mean - background_mean)/log(2)) |>
mutate(p_val_adj = p.adjust(p_val_raw, method = "bonferroni")) 

would be also a great and super easy solution if it were mathematically correct and optimally gave the same as the Seurat FindMarkers.

@samuel-marsh maybe also interested in that since scCustomize also relies on presto -> so will not work with Seuratv5 and BPCellsMatrix.

samuel-marsh commented 1 year ago

Hi,

Just chiming in here on the scCustomize portion. scCustomize doesn't depend on presto (although it is compatible with presto outputs). The only functions which interact with DE results are Extract_Top_Markers and Add_Pct_Diff. Both are customizable via specified parameters to define columns in marker data.frame to be used. So I think scCustomize should be unaffected by any issues here.

I use presto as example in the vignettes here but that is just example of non-Seurat DE data.frame. So long as you can supply a column name to rank markers by, gene_column (if present; and gene_rownames_to_column = TRUE if not) then you should be good to go.

Best, Sam

Gesmira commented 1 year ago

Hi all, Thanks for your patience. I've just pushed a fix to the Seurat5 branch that implements the BPCells marker_features function to be called when users run FindMarkers or FindAllMarkers on objects with BPCells matrices. You can restart your R session and install the new version like so:

remotes::install_github("satijalab/seurat", "seurat5")

In terms of FoldChange, our implementation of the BPCells marker_features ends up leading to the same fold change results as if you ran FindMarkers on an object with a default matrix type. Although we recognize there is some debate here, we are currently maintaining the way we do FoldChange so as to be consistent with our prior code.

However, our team is working on a new implementation of FindMarkers that should be faster for all data types ideally by the Seurat 5 release.

mihem commented 1 year ago

Great. Thanks Gesmira and Ben, I think from a user perspective this is ideal!

@Gesmira Unfortunately I cannot run my previous example because of issues with FindVariableFeatures in version Seurat_4.9.9.9065 I posted a new issue in Seurat https://github.com/satijalab/seurat/issues/7850

Gesmira commented 1 year ago

Fixed now!

mihem commented 1 year ago

Agree. And yes great speed up by implementing Ben's marker_genes and same results: image

image