bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

Writing two matrices to the same path causes wrong results #80

Open Yunuuuu opened 3 months ago

Yunuuuu commented 3 months ago
library(BPCells)
library(Matrix)
mock_matrix <- function(ngenes, ncells) {
    cell.means <- 2^stats::runif(ngenes, 2, 10)
    cell.disp <- 100 / cell.means + 0.5
    cell.data <- matrix(stats::rnbinom(ngenes * ncells,
        mu = cell.means,
        size = 1 / cell.disp
    ), ncol = ncells)
    rownames(cell.data) <- sprintf("Gene_%s", formatC(seq_len(ngenes),
        width = 4, flag = 0
    ))
    colnames(cell.data) <- sprintf("Cell_%s", formatC(seq_len(ncells),
        width = 3, flag = 0
    ))
    cell.data
}
tmpdir <- tempdir()
mat <- mock_matrix(30, 20)
path <- normalizePath(tempfile(tmpdir = tmpdir), mustWork = FALSE)
iterable_dgc <- methods::as(methods::as(mat, "dgCMatrix"), "IterableMatrix")
obj1 <- BPCells::write_matrix_10x_hdf5(
    mat = BPCells::convert_matrix_type(iterable_dgc, "uint32_t"),
    path = path
)
file.remove(path)
#> [1] TRUE
obj2 <- BPCells::write_matrix_10x_hdf5(
    mat = iterable_dgc, path = path,
    type = "auto"
)
testthat::expect_equal(as.matrix(obj1 + 1), as.matrix(iterable_dgc) + 1L)
#> Warning: Converting to a dense matrix may use excessive memory
#> This message is displayed once every 8 hours.
#> Error: as.matrix(obj1 + 1) not equal to as.matrix(iterable_dgc) + 1L.
#> 529/600 mismatches (average diff: 5.32e+08)
#> [1]  1.00e+00 -    2 == -1.00e+00
#> [2]  1.07e+09 -  189 ==  1.07e+09
#> [3]  1.00e+00 - 2171 == -2.17e+03
#> [4]  1.08e+09 -   15 ==  1.08e+09
#> [5]  1.00e+00 -  116 == -1.15e+02
#> [6]  1.08e+09 -   53 ==  1.08e+09
#> [7]  1.00e+00 -  136 == -1.35e+02
#> [8]  1.08e+09 -   24 ==  1.08e+09
#> [10] 1.00e+00 -  301 == -3.00e+02
#> ...

Created on 2024-03-13 with reprex v2.0.2

Yunuuuu commented 3 months ago

There should be a protective mechanism to prevent the matrix path from being removed while it is in use

bnprks commented 3 months ago

In cases like this, where the user has explicitly deleted or overwritten the underlying file, I'm not sure what BPCells can easily do. BPCells will at least throw an error by default if you try to overwrite an existing path. But in a disk-backed system we are to some extent relying on users not to separately delete or overwrite the files they are using.
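
The default behavior described above (erroring rather than silently overwriting an existing path) can be seen with a minimal sketch. This is illustrative only; the exact error message is not asserted:

```r
library(BPCells)
library(Matrix)

m <- methods::as(matrix(stats::rpois(20, 5), nrow = 4), "dgCMatrix")
path <- tempfile()

obj <- BPCells::write_matrix_dir(mat = m, dir = path)

# Writing to the same directory again without overwrite = TRUE is
# expected to error rather than silently clobber the existing data.
res <- tryCatch(
  BPCells::write_matrix_dir(mat = m, dir = path),
  error = function(e) conditionMessage(e)
)
is.character(res) # an error message was captured, not a matrix
```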

I'm open to ideas for how to provide more protection for users. But I haven't thought of a solution worth the trade-offs yet. Do you have something specific in mind to propose?

Yunuuuu commented 3 months ago

Is it possible to lock the file when BPCells opens it? Something like this: https://github.com/grimbough/rhdf5/blob/devel/R/h5fileLocking.R

Yunuuuu commented 3 months ago

I have tested this with the filelock package, but it uses an advisory lock (people can still delete the file without any message)
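
To illustrate the limitation: an advisory lock only guards against other *cooperating* processes acquiring the same lock; it does nothing to stop plain file deletion. A minimal sketch, assuming the filelock package is installed:

```r
library(filelock)

data_path <- tempfile()
writeLines("matrix data", data_path)

# Acquire an advisory lock on a sidecar lock file
lck <- filelock::lock(paste0(data_path, ".lock"))

# The lock is held, yet any process that does not check it
# can still delete the data file without any message:
file.remove(data_path) # succeeds
file.exists(data_path) # FALSE

filelock::unlock(lck)
```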

Yunuuuu commented 3 months ago
mock_matrix <- function(ngenes, ncells) {
    cell.means <- 2^stats::runif(ngenes, 2, 10)
    cell.disp <- 100 / cell.means + 0.5
    cell.data <- matrix(stats::rnbinom(ngenes * ncells,
        mu = cell.means,
        size = 1 / cell.disp
    ), ncol = ncells)
    rownames(cell.data) <- sprintf("Gene_%s", formatC(seq_len(ngenes),
        width = 4, flag = 0
    ))
    colnames(cell.data) <- sprintf("Cell_%s", formatC(seq_len(ncells),
        width = 3, flag = 0
    ))
    cell.data
}
mat1 <- mock_matrix(30, 20)
mat2 <- mock_matrix(30, 20)
path <- tempfile()
obj1 <- BPCells::write_matrix_dir(
    mat = methods::as(mat1, "dgCMatrix"),
    dir = path
)
#> Warning: Matrix compression performs poorly with non-integers.
#> • Consider calling convert_matrix_type if a compressed integer matrix is intended.
#> This message is displayed once every 8 hours.
obj2 <- BPCells::write_matrix_dir(
    mat = methods::as(mat2, "dgCMatrix"),
    dir = path, overwrite = TRUE
)
identical(mat1, mat2)
#> [1] FALSE
identical(obj1, obj2)
#> [1] TRUE
identical(as.matrix(obj1), as.matrix(obj2))
#> Warning: Converting to a dense matrix may use excessive memory
#> This message is displayed once every 8 hours.
#> [1] TRUE

Created on 2024-03-14 with reprex v2.0.2

Yunuuuu commented 3 months ago

Some concerns for this:

  1. if overwrite=TRUE, we should first check that the path is not already bound to another BPCells matrix object
  2. if overwrite=FALSE, we should check that the path or directory is empty, with no files in it

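
Point 2 could be implemented as a small pre-flight check in base R. The helper name below is hypothetical, not BPCells API:

```r
# Hypothetical helper: refuse to write into a non-empty directory
# unless overwrite = TRUE is passed explicitly.
assert_dir_writable <- function(dir, overwrite = FALSE) {
  if (!dir.exists(dir)) {
    return(invisible(TRUE)) # nothing there yet, safe to create
  }
  contents <- list.files(dir, all.files = TRUE, no.. = TRUE)
  if (length(contents) > 0 && !overwrite) {
    stop("Directory '", dir, "' is not empty; ",
         "pass overwrite = TRUE to replace its contents.")
  }
  invisible(TRUE)
}

d <- tempfile()
dir.create(d)
assert_dir_writable(d)             # empty directory: OK
writeLines("x", file.path(d, "f"))
try(assert_dir_writable(d))        # non-empty without overwrite: errors
```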
bnprks commented 3 months ago

Starting off, I don't think I'd consider this a bug in BPCells. Yes, it is possible to overwrite files in ways that will cause BPCells to not get the data it expects, but this is true of most software that reads files. I do think adding more safety checks is a possible area to consider, so long as they only block unsafe behavior and don't interfere with normal/intended usage.

I've described some concerns about various options below. I'm skeptical a solution that addresses these is possible, and if we can't identify one I'll probably close this issue as a feature that's not feasible for us to implement currently.

Point 1 overwrite=TRUE:

Do you know of a simple way to track whether a path is in use by another BPCells object? I don't want to accidentally block people from reusing a path whose variable has been deleted but not yet garbage collected. And I'm not sure there's a way in R to track whether a variable is still valid in any scope without relying on the garbage collector.
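
A hypothetical sketch (not BPCells API) of an in-session path registry shows exactly the weakness described: deregistration has to rely on a finalizer, which only runs at garbage collection, not when the variable is removed with rm():

```r
# Registry of paths currently claimed by objects in this session
.open_paths <- new.env(parent = emptyenv())

# Build the finalizer in its own small environment so the closure
# does not keep a strong reference to the registered object itself
# (which would prevent it from ever being collected).
make_finalizer <- function(path) {
  force(path)
  function(e) rm(list = path, envir = .open_paths)
}

register_path <- function(obj, path) {
  if (exists(path, envir = .open_paths)) {
    stop("Path '", path, "' is already in use by another object.")
  }
  assign(path, TRUE, envir = .open_paths)
  reg.finalizer(obj, make_finalizer(path))
  obj
}

obj <- register_path(new.env(), "mat_dir")
rm(obj)                                # registration still lingers here...
gc()                                   # ...until the finalizer runs
exists("mat_dir", envir = .open_paths) # cleared only after collection
```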

Even if we could pull this off, it wouldn't solve the issue of someone running >1 R session simultaneously on their computer.

For file locking, I'm similarly hesitant, since we'd need to be extremely reliable about making sure files get unlocked to avoid creating a hassle for users (objects that have been removed but not garbage collected could easily cause issues here).

Fundamentally, I'm not sure it will be possible to save users from themselves if they intentionally overwrite their files and then try to read back the old data.

Point 2 overwrite=FALSE:

I think this is basically the existing behavior of BPCells. If you find differences from that, I'm happy to discuss.