bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

add `apply` function for `IterableMatrix` #104

Closed Yunuuuu closed 4 months ago

Yunuuuu commented 4 months ago

This just provide the equivalent of base::apply (strictly speaking, the same with DelayedArray::apply, since we don't support multiple MARGIN). I'm not sure if we should define a S4 generic, which is the same with DelayedArray package https://github.com/Bioconductor/DelayedArray/blob/d2636c5e4a05ec4a8331aac24c791291f5949d80/R/DelayedArray-utils.R#L724.

Yunuuuu commented 4 months ago

apply_by_row and apply_by_col often do not produce the expected results, as observed by users attempting to use apply(mat, 1, FUN) and apply(mat, 2, FUN). It would be more beneficial to define the functions according to what the user expects.

bnprks commented 4 months ago

Hi @Yunuuuu, thanks for looking at this.

First, a clarification question: you mention apply_by_col not producing the expected results. Do you mean there's a bug (i.e. not behaving according to the documented behavior) or just that users might be surprised that it doesn't work identically to the base::apply function?

Second, would this functionality need to live in BPCells or would it work to live in your BPCellsArray package?

Similarly to how rank_transform works differently in BPCells compared to colRanks for efficiency reasons, apply_by_col is different from apply to take advantage of the sparsity of typical single cell matrices.

I just ran a couple performance tests on this 20k cell 10x matrix, and found that for calculating colMeans, your suggested apply was about 8x slower than apply_by_col, and apply_by_col itself was about 2x slower than the BPCells colMeans() function.

Code example ```R # The input file has gzip compression, so we get rid of that mat <- open_matrix_10x_hdf5("20k_PBMC_3p_HT_nextgem_Chromium_X_filtered_feature_bc_matrix.h5") |> write_matrix_dir(tempfile()) t1 <- system.time({r1 <- apply(mat, 1, mean)}) # user system elapsed # 44.210 3.220 47.417 t2 <- system.time({r2 <- apply(mat, 2, mean)}) # user system elapsed # 28.610 0.111 28.723 t3 <- system.time({r3 <- apply_by_col(mat, function(val, row, col) {sum(val)/nrow(mat)}) |> unlist() |> rlang::set_names(colnames(mat))}) # user system elapsed # 3.274 0.067 3.339 t4 <- system.time({r4 <- colMeans(mat)}) # user system elapsed # 1.549 0.061 1.610 all.equal(r2, r3) #TRUE all.equal(r2, r4) #TRUE ```

For most of the BPCells design, I'm taking the strategy of leaving out functionality if it can't be implemented efficiently within BPCells so that users can count on all the available functions running quickly. I think implementing apply in a compatible way might be an example where an efficient implementation is not possible, but I'd be happy if a more compatibility-focused library like BPCellsArray provided an apply generic. Would that be an option for you?

EDIT: I know CRAN at least doesn't like people implementing generics on top of other package's types. Not sure if this is the same for wherever you're planning on submitting your package, but I'd be happy to give written approval for this case if that would be helpful getting your package submission approved.

Yunuuuu commented 4 months ago

Thank you for the performance test. Regarding your first question, when I mentioned that apply_by_col does not produce the expected results, I meant that it might not behave identically to the base::apply function, rather than there being a bug in its implementation. Users might be surprised by this difference in behavior as they expect similar functionality across different packages.

As for the functionality, it can either be implemented within BPCells or in the BPCellsArray package or both. Indeed, I have implemented it in BPCellsArray. If the functionality could be made accessible to BPCells users, it would be a valuable addition. This way, users can benefit from both performance and compatibility.

This commit is solely aimed at enhancing the functionality of the BPCells package and is not related to any other matter. I have implemented a new subclass of the DelayedArray, so it won't have any negative impact on BPCellArray. However, if this does not align with the design philosophy of BPCells, where functions that cannot be efficiently implemented are omitted, we can consider closing this pull request. Thank you for providing the details.

bnprks commented 4 months ago

Hi @Yunuuuu, unfortunately I think I'll have to close this pull request and leave apply out of the core BPCells library for now. This is a situation where I think it's best for BPCells to prioritize performance over ease-of-use. One smaller change I'll try to do in the coming week will be to update the docs for apply_by_col to mention that users who want a more standard apply implementation can check out your BPCellsArray package.

Thanks for your work both on providing more ease-of-use focused functions in BPCellsArray and the many helpful issues and pull requests you've contributed to this repo. I appreciate the contributions you're providing even though this one doesn't fit the priority tradeoffs to go into BPCells.

bnprks commented 4 months ago

Hi @Yunuuuu, just wanted to mention that the apply_by_col docs have been updated to link to BPCellsArray in the "see also" section, which I hope will help users that prefer the more standard apply interface implementation you've written.

Yunuuuu commented 3 months ago

Thanks!