Open Yunuuuu opened 8 months ago
There should be a protective method to avoid the removing of matrix path
In cases like this where the user has explicitly deleted or overwritten the underlying file, I'm not sure what BPCells can easily do. BPCells will by default throw an error if you try to overwrite an existing path at least. But in a disk-backed system we're a bit relying on the users to not separately delete/overwrite the files they are using.
I'm open to ideas for how to provide more protection for users. But I haven't thought of a solution worth the trade-offs yet. Do you have something specific in mind to propose?
is it possible to lock file when BPCells open it? Something like this https://github.com/grimbough/rhdf5/blob/devel/R/h5fileLocking.R
I have tested by using filelock package, but it use a advisory Lock (people still can delete it without any message)
mock_matrix <- function(ngenes, ncells) {
cell.means <- 2^stats::runif(ngenes, 2, 10)
cell.disp <- 100 / cell.means + 0.5
cell.data <- matrix(stats::rnbinom(ngenes * ncells,
mu = cell.means,
size = 1 / cell.disp
), ncol = ncells)
rownames(cell.data) <- sprintf("Gene_%s", formatC(seq_len(ngenes),
width = 4, flag = 0
))
colnames(cell.data) <- sprintf("Cell_%s", formatC(seq_len(ncells),
width = 3, flag = 0
))
cell.data
}
mat1 <- mock_matrix(30, 20)
mat2 <- mock_matrix(30, 20)
path <- tempfile()
obj1 <- BPCells::write_matrix_dir(
mat = methods::as(mat1, "dgCMatrix"),
dir = path
)
#> Warning: Matrix compression performs poorly with non-integers.
#> • Consider calling convert_matrix_type if a compressed integer matrix is intended.
#> This message is displayed once every 8 hours.
obj2 <- BPCells::write_matrix_dir(
mat = methods::as(mat2, "dgCMatrix"),
dir = path, overwrite = TRUE
)
identical(mat1, mat2)
#> [1] FALSE
identical(obj1, obj2)
#> [1] TRUE
identical(as.matrix(obj1), as.matrix(obj2))
#> Warning: Converting to a dense matrix may use excessive memory
#> This message is displayed once every 8 hours.
#> [1] TRUE
Created on 2024-03-14 with reprex v2.0.2
Some concerns for this:
overwrite=TRUE
, we should first check the path used has not been bound with other BPCells matrix objectoverwrite=FALSE
, we should check the path or directory is empty, no files were in itStarting off, I don't think I'd consider this a bug in BPCells. Yes, it is possible to overwrite files in ways that will cause BPCells to not get the data it expects, but this is true of most software that reads files. I do think adding more safety checks is a possible area to consider, so long as they only block unsafe behavior and don't interfere with normal/intended usage.
I've described some concerns about various options below. I'm skeptical a solution that addresses these is possible, and if we can't identify one I'll probably close this issue as a feature that's not feasible for us to implement currently.
overwrite=TRUE
:Do you know of a simple way to track if a path has is in use by another BPCells object? I don't want to accidentally block people from using a path of a variable that has been deleted but not garbage collected. And I'm not sure if there's a way in R to track if a variable is still valid in any scope without having to rely on the garbage collector.
Even if we could pull this off, it wouldn't solve the issue of someone running >1 R session simultaneously on their computer.
For file locking, I'm similarly hesitant since we'd need to be extremely reliable in making sure files got unlocked to avoid making a hassle for users. (objects which have been removed but not garbage collected could easily cause issues here)
Fundamentally, I'm not sure it will be possible to save users from themselves if they intentionally overwrite their files then try to read back the old data.
overwrite=FALSE
I think this is basically the existing behavior of BPCells. If you find differences from that, happy to discuss
Created on 2024-03-13 with reprex v2.0.2