bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

How to implement rank function ties.method = "first" #91

Open Liripo opened 5 months ago

Liripo commented 5 months ago

I see BPCells:::rank_transform(mat = object, axis = axis) and that the method is average, is it possible to implement first?

example colRanks(mtx,ties.method = "first")

bnprks commented 5 months ago

Hi @Liripo, thanks for your question! Overall, yes this is something that would be possible. First, I think it might be better for me to understand what you want to do next with the matrix such that ties.method = "first" is required for you.

Currently, rank_transform() is different from colRanks in an important way: the ranks in rank_transform() are offset such that values that are 0 in the input matrix get a rank of 0. So, for example an input column of c(-5,0,0,1) would get ranks of c(-1.5, 0, 0, 1.5) instead of c(1, 2.5, 2.5, 4) as you might expect from colRanks(ties.method="average").

The reason for this offset is because most operations in BPCells perform much faster for sparse matrices (e.g. matrices with many zeros), so it is advantageous to keep zeros in the input to rank_transform() zero in the output of rank_transform(). This is possible for the method average, because all zeros get the same rank, so we can just shift each column by the appropriate amount.

However, this would no longer be possible if we used the method first for ties, since every zero entry would get a different rank. Depending on what you did next with the matrix, it's possible BPCells would still provide useful memory savings, but it would probably be much slower than the existing approach of averaging ties and applying an offset.

What did you have in mind to do with ties.method = "first"?

Liripo commented 5 months ago

The main purpose of using the ties.method = "first" method is to use the gene rank matrix to calculate the AUC value.

bnprks commented 5 months ago

Ah, if you're interested in AUC calculations I think that can be accomplished with just a slight modification of the existing wilcoxon code in marker_features(). Looking this up on wikipedia, it seems that there's a simple relationship between the test statistic in the wilcoxon test and the AUC value. I'm not sure that using ties.method = "first" is the standard way to calculate AUC here -- looking at e.g. the presto package they just use the same conversion formula as listed in that wikipedia link on top of their default mean based ranking function.

Would it work for you if I just added an option to get the AUC directly? I think that would both be simpler to implement and I think better address your specific need

Liripo commented 5 months ago

I understand that there seems to be a difference. I was hoping to get the AUC values for genes set, similar to the AUCells package.

bnprks commented 5 months ago

Hi @Liripo, sorry for the delay on this. I've just added a pair of functions that I think will help you a lot if you want to implement something like this in BPCells: apply_by_row() and apply_by_col(). These let you write an R function that will get a row (or column) of data at a time and return a summary value.

For your use case, your custom function would get all the data for an individual cell, then calculate+return the AUCell score(s) for that cell. You'd get a list of the scores with one entry per cell at the end.

I still think ties.method = "average" probably makes sense when implementing AUCell, as using average is equivalent to calculating the expected rank across many different shuffles of the data (and it's totally valid to calculate AUC off of the average method). The part that will make AUCell calculation tricky is having the cutoff for just the top 5% of genes in each cell, particularly when ties cause some entries to be ambiguously above/below the cutoff. (In those cases, I think adding the ranks with appropriate fractional weights should be fine)

At any rate, I don't think I have the bandwidth to dive in to a full AUCell implementation right now (though it looks like a quite interesting method!). I'm hopeful that apply_by_col() will give you the tools you need to calculate it the way you think is best (including doing rank with ties.method = "first" if you want). And if you get a working implementation I'd be happy to link others to your code so they can try it out, or even consider helping with a full C++ implementation modeled off your R code for faster performance.

If you have any questions about working with rank_transform() or apply_by_col() I'd be happy to answer or provide some examples. There's a couple basic usage examples in the test code here