bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells
Other
166 stars 17 forks source link

fix lossing dimnames of `[<-` method #69

Closed Yunuuuu closed 9 months ago

Yunuuuu commented 9 months ago

fix https://github.com/bnprks/BPCells/issues/67

Yunuuuu commented 9 months ago

Another issue related to this, [ for ColBindMatrices or RowBindMatrices will also lossing dimnames. Because rbind2 and cbind2 removed dimnames: https://github.com/bnprks/BPCells/blob/0d5652425096153e3121f656e48e05e33333a1e3/R/matrix.R#L1046 when subsetting only one matrix, this just return the matrix in the matrix_list slot (dimnames removed) without restoring back the dimnames. https://github.com/bnprks/BPCells/blob/0d5652425096153e3121f656e48e05e33333a1e3/R/matrix.R#L1111

I have tried to fix by restoring dimnames but it will cause other issues from my test, .

setMethod("[", "RowBindMatrices", function(x, i, j, ...) {
  if (missing(x)) stop("x is missing in matrix selection")
  # Handle transpose via recursive call
  if (x@transpose) {
    return(t(t(x)[rlang::maybe_missing(j), rlang::maybe_missing(i)]))
  }

  i <- split_selection_index(i, nrow(x), rownames(x))
  j <- split_selection_index(j, ncol(x), colnames(x))

  # If we're just reordering rows/cols, do a standard matrix selection
  if (rlang::is_missing(i$subset) && rlang::is_missing(j$subset)) {
    return(callNextMethod(x, unsplit_selection(i), unsplit_selection(j)))
  }

  # if the length of our row selection is 0, do a standard matrix selection
  if (!rlang::is_missing(i$subset) && length(i$subset) == 0) {
    return(callNextMethod(x, unsplit_selection(i), unsplit_selection(j)))
  }

  x <- selection_fix_dims(x, rlang::maybe_missing(i$subset), rlang::maybe_missing(j$subset))

  last_row <- 0L
  new_mats <- list()
  new_dimnames <- list()
  rownms <- rownames(x)
  colnms <- colnames(x)
  for (mat in x@matrix_list) {
    row_start <- last_row + 1L
    row_end <- last_row + nrow(mat)

    if (!rlang::is_missing(i$subset)) {
      local_i <- i$subset[i$subset >= row_start & i$subset <= row_end] - last_row
      mat <- mat[local_i,]
    }
    if (!rlang::is_missing(j$subset)) {
      # Only pass through the subset operation to a lower-level, not the shuffle
      mat <- mat[,j$subset]
    }
    if (nrow(mat) > 0) {
      # rbind2 will remove dimnames in matrix_list, 
      # we should restore it if return a single matrix
      new_dimnames <- c(new_dimnames, 
        list(list(rownms[row_start:row_end], colnms))
      )
      new_mats <- c(new_mats, mat)
    }

    last_row <- row_end
  }
  if (length(new_mats) > 1) {
    x@matrix_list <- new_mats
  } else if(length(new_mats) == 1) {
    x <- new_mats[[1]]
    if (matrix_is_transform(x) && !is(x, "RenameDims")) {
      x@dimnames<- new_dimnames[[1L]]
    }
  } else {
    stop("Subset RowBindMatrix error: got 0-length matrix_list after subsetting (please report this BPCells bug)")
  }
  if (!rlang::is_missing(i$reorder)) {
    x <- x[i$reorder,]
  }
  if (!rlang::is_missing(j$reorder)) {
    x <- x[,j$reorder]
  }
  return(x)
})
Yunuuuu commented 9 months ago

By removing lines in: https://github.com/bnprks/BPCells/blob/0d5652425096153e3121f656e48e05e33333a1e3/R/matrix.R#L1046

[ for ColBindMatrices or RowBindMatrices can work fine, any reasons to remove dimnames?

If it's okay to removing these lines, I'll make a pull request

Yunuuuu commented 9 months ago

The same for [ method of MatrixSubset object. wrapMatrix removing dimnames https://github.com/bnprks/BPCells/blob/0d5652425096153e3121f656e48e05e33333a1e3/R/matrix.R#L30

when subsetting a MatrixSubset, it reusing the matrix in the matrix slot (dimnames have been removed) https://github.com/bnprks/BPCells/blob/0d5652425096153e3121f656e48e05e33333a1e3/R/matrix.R#L763

bnprks commented 9 months ago

Hi @Yunuuuu, I think I've just changed my mind on how to solve this and will go for some changes that I'll be pushing shortly to main rather than your suggested solution. All your code worked great, but I think I'd prefer to avoid being quite so fancy with the S4 generic dispatches since I find them a bit awkward to step through when I'm debugging. I'm sorry to have pushed a somewhat messy merge into your branch here and then decided to use a different solution. Feel free to do whatever force pushes etc. you need to get it cleaned up. And thanks again for having identified and clearly documented the issue in the first place -- I'll give you a shout-out in the changelog

To answer your question about setting dimnames to null in wrapMatrix -- the reason is that I've found when saving+loading objects from RDS the copies of dimnames can take up substantial space and memory in the resulting object. This wouldn't be an issue normally since R is pretty smart about sharing storage between identical objects but that seems to be messed up by an rds round-trip. For very large matrices the dimnames can end up requiring a large amount of memory so I wanted to avoid that.

In the long run, I'm hoping to stop storing dimnames in the S4 object and instead implement a dimnames generic that can load dimnames from disk when needed. In the mean time, this does result in bugs popping up that I wouldn't otherwise have to worry about as the cost of avoiding memory duplication issues.