Closed Yunuuuu closed 4 months ago
apply_by_row
and apply_by_col
often do not produce the expected results, as observed by users attempting to use apply(mat, 1, FUN)
and apply(mat, 2, FUN)
. It would be more beneficial to define the functions according to what the user expects.
Hi @Yunuuuu, thanks for looking at this.
First, a clarification question: you mention apply_by_col
not producing the expected results. Do you mean there's a bug (i.e. not behaving according to the documented behavior) or just that users might be surprised that it doesn't work identically to the base::apply
function?
Second, would this functionality need to live in BPCells or would it work to live in your BPCellsArray package?
Similarly to how rank_transform
works differently in BPCells compared to colRanks
for efficiency reasons, apply_by_col
is different from apply
to take advantage of the sparsity of typical single cell matrices.
I just ran a couple performance tests on this 20k cell 10x matrix, and found that for calculating colMeans, your suggested apply was about 8x slower than apply_by_col
, and apply_by_col
itself was about 2x slower than the BPCells colMeans()
function.
For most of the BPCells design, I'm taking the strategy of leaving out functionality if it can't be implemented efficiently within BPCells so that users can count on all the available functions running quickly. I think implementing apply
in a compatible way might be an example where an efficient implementation is not possible, but I'd be happy if a more compatibility-focused library like BPCellsArray provided an apply
generic. Would that be an option for you?
EDIT: I know CRAN at least doesn't like people implementing generics on top of other package's types. Not sure if this is the same for wherever you're planning on submitting your package, but I'd be happy to give written approval for this case if that would be helpful getting your package submission approved.
Thank you for the performance test. Regarding your first question, when I mentioned that apply_by_col
does not produce the expected results, I meant that it might not behave identically to the base::apply
function, rather than there being a bug in its implementation. Users might be surprised by this difference in behavior as they expect similar functionality across different packages.
As for the functionality, it can either be implemented within BPCells
or in the BPCellsArray
package or both
. Indeed, I have implemented it in BPCellsArray
. If the functionality could be made accessible to BPCells
users, it would be a valuable addition. This way, users can benefit from both performance and compatibility.
This commit is solely aimed at enhancing the functionality of the BPCells
package and is not related to any other matter. I have implemented a new subclass of the DelayedArray
, so it won't have any negative impact on BPCellArray
. However, if this does not align with the design philosophy of BPCells
, where functions that cannot be efficiently implemented are omitted, we can consider closing this pull request. Thank you for providing the details.
Hi @Yunuuuu, unfortunately I think I'll have to close this pull request and leave apply
out of the core BPCells library for now. This is a situation where I think it's best for BPCells to prioritize performance over ease-of-use. One smaller change I'll try to do in the coming week will be to update the docs for apply_by_col
to mention that users who want a more standard apply
implementation can check out your BPCellsArray package.
Thanks for your work both on providing more ease-of-use focused functions in BPCellsArray and the many helpful issues and pull requests you've contributed to this repo. I appreciate the contributions you're providing even though this one doesn't fit the priority tradeoffs to go into BPCells.
Hi @Yunuuuu, just wanted to mention that the apply_by_col docs have been updated to link to BPCellsArray in the "see also" section, which I hope will help users that prefer the more standard apply
interface implementation you've written.
Thanks!
This just provide the equivalent of
base::apply
(strictly speaking, the same with DelayedArray::apply, since we don't support multipleMARGIN
). I'm not sure if we should define a S4 generic, which is the same with DelayedArray package https://github.com/Bioconductor/DelayedArray/blob/d2636c5e4a05ec4a8331aac24c791291f5949d80/R/DelayedArray-utils.R#L724.