Handling current_block_id and effective_grid for tooling on top of blockApply

LTLA commented 3 years ago

Yes, I know I really shouldn't be tangling with either of these two variables, but bear with me here...

beachmat implements the colBlockApply() function, which was originally a wrapper around blockApply with grid= just set to colAutoGrid(x) so that I didn't have to keep on typing it all out. (Same for rowBlockApply.) Over time, though, it evolved to add some additional features that were useful to me:

Avoid the overhead of a *gCMatrix to SparseArraySeed back to *gCMatrix conversion when dealing with functions that were capable of taking *gCMatrix inputs.
Split up in-memory matrices (specifically, gCMatrix and ordinary matrices) prior to calling BiocParallel functions, to avoid the cost of serializing the entire matrix when only a fragment is used in each worker.
Avoid the overhead of block processing altogether when the matrix is in memory and no parallelization is requested, in which case we can just apply FUN on the full matrix directly.

To do this, sometimes I would pass FUN to blockApply(), and other times I would apply FUN directly to the matrix or its split-up fragments. This worked pretty well, provided I added the grid ID attributes to the matrix prior to calling FUN manually. However, this is no longer the case with the changes I requested from #69. Oops.

I can mimic the creation of current_block_id and effective_grid so that it gets found by effectiveGrid() and friends, but this makes my code pretty fragile to any changes you make in the effectiveGrid() discovery mechanism. So I wonder whether it would be possible to expose a setter mechanism for my use case.

Of course, I'd be happy to push some of my changes to blockApply() itself, and then colBlockApply() could revert to being the wrapper that it used to be. However, some of those changes are a bit opinionated, e.g., it assumes that FUN is capable of taking *gCMatrix inputs because that's also what beachmat v3 supports.

hpages commented 3 years ago

So what are you asking exactly? What setter mechanism?

LTLA commented 3 years ago

Hypothetically, it would be something like this:

setGridParams <- function(grid, bid, envir=parent.frame()) {
    assign("effective_grid", value=grid, envir=envir)
    assign("current_block_id", value=bid, envir=envir)
}

Which would then be used inside blockApply like:

    setGridParams(grid, bid)
    FUN(block, ...)

which would enable a currentViewport() call in FUN to pick up those same parameters. Then I could use the same mechanism in colBlockApply() to mimic a FUN call inside blockApply().

There is also a mild bonus of making it easier for developers to debug things occurring at a specific viewport, because they can just call setGridParams() followed by FUN in their debugging session, rather than having to run through all previous blocks with blockApply().

hpages commented 3 years ago

Done (makes my own code cleaner anyway).

It's called set_grid_context().

they can just call setGridParams() followed by FUN in their debugging session, rather than having to run through all previous blocks with blockApply()

mmh.. only if they know what n to use in parent.frame(n). But even if they manage to do that, this won't change where the block they're currently seeing is coming from, or where the next block they'll see will be coming from, because this won't alter the "real" bid.

LTLA commented 3 years ago

Excellent, thanks. Yes, I figured that a hypothetical debugging process would look like:

grid <- defaultAutoGrid(x)
bid <- 50L # or whatever is the problematic chunk.
block <- read_block(x, grid[[bid]])
set_grid_context(grid, bid)
FUN(block)

... which would be an improvement over having to actually process the previous 49 blocks.

hpages commented 3 years ago

Before we had set_grid_context(), you would have set .current_block_id manually for the same result. There was never the need to actually process the previous 49 blocks. I agree that having set_grid_context() makes this easier/cleaner though... but only if you know about it.

mmh... sounds like I should probably document set_grid_context().

LTLA commented 3 years ago

Even from this discussion of a hypothetical example, I think we can see some value in the set_grid_context() function already - I had thought that the variable name was current_block_id (without the dot).

hpages commented 3 years ago

Documented.

Extra bonus:

> currentBlockId()
Error in currentBlockId() : 
  Grid context not found for the current block. Are we in a blockApply(),
  viewportApply(), or blockReduce() loop?

  Note that currentBlockId() can only be called from **within** the
  callback function of a blockApply() or viewportApply() loop ('FUN'
  argument), or from the callback functions of a blockReduce() loop
  ('FUN' and 'BREAKIF' arguments).

  If you need to be able to test/debug your callback function 'FUN' (or
  'BREAKIF') as a standalone function, call set_grid_context() to set an
  arbitrary context **right before** calling the callback function.

I know I know, nobody reads these things...

Bioconductor / DelayedArray

Handling current_block_id and effective_grid for tooling on top of blockApply #71