ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/

mosaic::sample can't handle dgCMatrix ? #793

Closed akhst7 closed 2 years ago

akhst7 commented 2 years ago

Hi,

I want to sample this dgCMatrix object:

str(g.adj)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:10174164] 851 1247 1353 2898 3266 4100 5772 8483 9148 11190 ...
  ..@ p       : int [1:99193] 0 77 184 232 248 327 395 486 567 654 ...
  ..@ Dim     : int [1:2] 99192 99192
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ x       : num [1:10174164] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ factors : list()

I run mosaic's sample() as follows:

mini.adj<-mosaic::sample(g.adj, size = 100, replace = F, orig.ids = F)

str(mini.adj)
 num [1:100] 0 0 0 0 0 0 0 0 0 0 ...

sample() is not randomly sampling the matrix. Am I missing something here?

nicholasjhorton commented 2 years ago

You are correct that the current behavior isn't appropriate (it should throw an error if given a dgCMatrix object).

If it were to sample, would it be from the i's, the p's, or the Dim's? Any thoughts or pointers would be welcome.

akhst7 commented 2 years ago

@nicholasjhorton, actually it is more complicated than I initially thought. I attempted to convert the dgCMatrix to a dense matrix and run mosaic's sample() on that, but this did not work because the dgCMatrix object is extremely large.

If I understand correctly, i and p encode the row and column positions, respectively (i holds the 0-based row index of each non-zero entry, and p holds the column pointers), and Dim is just the dimensions of the dgCMatrix. x holds the non-zero values but carries no positional info by itself. I have to think about this a bit.

Thanks.
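For reference, the slot layout described above can be inspected on a small matrix. This is only an illustrative sketch: the 3 x 3 matrix `m` and the helper vector `cols` are made up for the example and are not part of the original report.

```r
library(Matrix)

# Hypothetical 3 x 3 sparse matrix with non-zeros at (1,1), (3,1) and (2,3).
m <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 3), x = c(10, 20, 30),
                  dims = c(3, 3))

m@i  # 0-based row index of each non-zero entry, stored column by column
m@p  # column pointers: column k owns entries p[k] + 1 through p[k + 1]
m@x  # the non-zero values, in the same order as m@i

# Recover the 1-based column of every non-zero entry from the pointers.
cols <- rep(seq_len(ncol(m)), times = diff(m@p))
cbind(row = m@i + 1L, col = cols, value = m@x)
```

So the positional information is split between `i` (rows) and `p` (columns, in compressed form), while `x` lines up entry by entry with `i`.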

rpruim commented 2 years ago

@akhst7 : mosaic::sample() is generic and does have a method for objects of class matrix (it samples rows of the matrix), but there is no method for an object of class dgCMatrix, so at best you should get whatever base::sample() does in that situation. Based on minimal testing, that seems to be what is happening. Since base::sample() doesn't generate an error here, we don't want to either.

I'd suggest handling your situation by manually sampling the row indices (if that's how you mean to be sampling from your matrix). That's basically all mosaic::sample() does for a matrix. Here's the core of the function:

    n <- nrow(x)
    ids <- base::sample(n, size, replace = replace, prob = prob)
    data <- x[ids, , drop = FALSE]
    names(data) <- names(x)

If data <- x [ ids, , drop=FALSE] does what you want with a dgCMatrix, then you are all set.
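As a rough check of that suggestion, here is a minimal sketch of row sampling applied directly to a sparse matrix. The matrix `g` is a small random stand-in for `g.adj` (an assumption for illustration, not the reporter's data), and `mini` is a hypothetical name for the result.

```r
library(Matrix)

set.seed(1)  # only to make the sketch reproducible

# Hypothetical stand-in for g.adj: a 1000 x 1000 sparse matrix, about 1% non-zero.
g <- rsparsematrix(1000, 1000, density = 0.01)

# Sample 100 row indices without replacement, then keep only those rows.
ids <- base::sample(nrow(g), size = 100, replace = FALSE)
mini <- g[ids, , drop = FALSE]

class(mini)  # a sparse Matrix class, not a dense numeric vector
dim(mini)    # 100 sampled rows, all 1000 columns
```

Because the subsetting stays inside the Matrix package, no dense copy of the full matrix is created. If the goal is instead a square sub-adjacency matrix on the sampled vertices, `g[ids, ids]` may be closer to what is wanted, though that depends on the intended analysis.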

If that's all it takes, we could fold this into mosaic. But it isn't really the sweet spot for the mosaic package. We only have a dependency on the Matrix package to avoid awkward coexistence problems.