Switching from SparseMatrix to NaMatrix

hpages commented 1 month ago

@csoneson @mbstadler Let's use this issue to discuss the migration from SparseMatrix to NaMatrix.

Some background

NaArray and NaMatrix objects are still under development in the SparseArray package. They are similar to SparseArray objects but the background value is NA instead of zero. I started to work on these objects at the beginning of August before going on vacation, had to put them on the back burner for most of August, and just resumed my work on these objects yesterday with the implementation of a bunch of statistical methods. See https://github.com/Bioconductor/SparseArray/commit/185dc371b7c2418e268d9e63a5a65e3deaadb949

Some important features are still missing e.g. there's still no row summarization methods like rowSums(), rowVars() etc... only the col*() methods for now. I'll add the row*() methods later today (this will be in SparseArray 1.5.33). The package also needs a short vignette dedicated to NaArray and NaMatrix objects.

Note that the nz*() functions (nzcount(), nzwhich(), and nzvals()) that are available for SparseArray objects are replaced with nna*() functions nnacount(), nnawhich(), and nnavals() (nnavals() not ready yet). This is because the nz prefix -- which stands for "nonzero" -- is not adequate for NaArray objects, so in the new names it's replaced with the nna prefix which stands for "non-na". I'm not a big fan of the double negation here ("non not available") so if you can think of something better please let me know:

library(SparseArray)

# Construct an empty NaMatrix object:
nam <- NaArray(dim=c(5, 12))
nam
# <5 x 12 NaMatrix> of type "logical" [nnacount=0 (0%)]:
#       [,1]  [,2]  [,3] ... [,11] [,12]
# [1,]    NA    NA    NA   .    NA    NA
# [2,]    NA    NA    NA   .    NA    NA
# [3,]    NA    NA    NA   .    NA    NA
# [4,]    NA    NA    NA   .    NA    NA
# [5,]    NA    NA    NA   .    NA    NA

nnacount(nam)
# [1] 0

nnawhich(nam)
# integer(0)

Two easy ways to construct an NaMatrix or NaArray object are:

coerce an ordinary matrix or array with as(, "NaArray")
or construct an empty object like the above and add the non-na values to it with something like:

midx <- cbind(c(1:5, 5:1), 2:11) nam[midx] <- runif(10)

nnacount(nam)

[1] 10

nnawhich(nam)

[1] 6 12 18 24 30 35 39 43 47 51

nnawhich(nam, arr.ind=TRUE)

[,1] [,2]

[1,] 1 2

[2,] 2 3

[3,] 3 4

[4,] 4 5

[5,] 5 6

[6,] 5 7

[7,] 4 8

[8,] 3 9

[9,] 2 10

[10,] 1 11

In particular, please note that coercion back and forth between dgCMatrix and NaMatrix is not supported because such coercion doesn't really make sense as it would need to replace 0's with NA's, and vice-versa, in order to preserve the sparse representation. I don't think this would be a great behavior. So to go back and forth between dgCMatrix and NaMatrix, one will need to do something like this:

# Go from dgCMatrix object 'dgcm' to NaMatrix object 'nam':
nam <- NaArray(dim=dim(dgcm))
nam[nzwhich(dgcm)] <- nzvals(dgcm)

# Go from NaMatrix object 'nam' to dgCMatrix object 'dgcm' (NOT READY YET!):
nnaidx <- nnawhich(nam, arr.ind=TRUE)
dgcm <- sparseMatrix(i=nnaidx[ , 1], j=nnaidx[ , 2], x=nam[nnaidx], dims=dim(nam))

This might go in dedicated functions.

Anyways, you should not need the intermediate dgCMatrix representation anymore in footprintR::readModkitExtract() if you replace the SparseMatrix object with an NaMatrix object.

The footprintR use case

Right now the readModkitExtract() function in footprintR constructs a SparseMatrix object where non-available values are represented with zeros, and zeros are represented with the value 0.02. This is achieved thru the combination of this line and these lines.

An NaMatrix object would be more natural.

To construct the NaMatrix, replace lines 219-225:

modmat[[nm]] <- SparseArray::SparseArray(Matrix::sparseMatrix(
    i = GenomicRanges::match(gposL[[nm]][idx], gpos),
    j = match(x$read_id[idx], readL[[nm]]),
    x = x$mod_prob[idx],
    dims = c(length(gpos), length(readL[[nm]])),
    dimnames = list(NULL, paste0(nm, "-", readL[[nm]]))
))

with the following lines:

namat <- NaArray(dim = c(length(gpos), length(readL[[nm]])), 
                 dimnames = list(NULL, paste0(nm, "-", readL[[nm]])),
                 type = "double")
i <- GenomicRanges::match(gposL[[nm]][idx], gpos)
j <- match(x$read_id[idx], readL[[nm]])
namat[cbind(i, j)] <- x$mod_prob[idx]
modmat[[nm]] <- namat

Also this line (line 166) would need to be removed:

# set zeros to small value for unmodified positions
# (dorado omits base modification probabilities less than 0.05)
tmp$mod_prob[tmp$call_code == "-" & tmp$mod_prob == 0] <- 0.02

And this line (line 218):

# only record observed values
idx <- which(x$mod_prob != 0)

would need to be replaced with:

idx <- which(x$call_code == "-" | x$mod_prob != 0)

The reason for this replacement is to preserve the original behavior which drops rows for which call_code is not "-" and mod_prob is zero.

Finally imports would need to be adjusted e.g. the following line at the top of the R/readModkitExtract.R file:

#' @importFrom SparseArray SparseArray

would need to be replaced with:

#' @importFrom SparseArray NaArray

and the #' @importFrom Matrix sparseMatrix line would need to be removed. Also the doc of the function would need to be modified to reflect these changes.

Feedback welcome. Dont hesitate to use the issue tracker in the SparseArray repo to report bugs or request features.

Let me know if you have questions.

csoneson commented 1 month ago

Thanks a lot @hpages, this is really great! We have started the transition in the use_NAmatrix branch. Many things already work nicely, including the construction of the SummarizedExperiment objects with the DataFrame of NaArrays as the assay. As you mentioned, having row operations would be great, and we're also currently getting an error when attempting to apply logical operations (e.g. here). But already very exciting! We'll keep you posted here about our progress.

hpages commented 1 month ago

rowSums(<NaArray>), <NaArray> >= <some value>, and <NaArray> != <some value> added to SparseArray >= 1.5.34

So with this latest version you should be able to do rowSums(y >= 0.5) and rowSums(y != 0) on your NaMatrix object y. Note that you probably want to use na.rm=TRUE to keep the (implicit) NAs out of the operation.

Let me know what other row*() methods you'll need, if any (I only implemented rowSums() for now). It'll help me prioritize.

csoneson commented 1 month ago

Very cool, thanks! This seems to work as expected. I think from our side, if we can make a wish for the next step 🙂, it would be nnavals (which you mention above is on the todo list) - I just ran through the unit tests and that seems to be the missing piece.

hpages commented 2 weeks ago

nnavals() implemented in SparseArray 1.5.39. Let me know if there are other things you need for footprintR.

csoneson commented 2 weeks ago

Thank you! 🤩

fmicompbio / footprintR

Switching from SparseMatrix to NaMatrix #7

Some background

[1] 10

[1] 6 12 18 24 30 35 39 43 47 51

[,1] [,2]

[1,] 1 2

[2,] 2 3

[3,] 3 4

[4,] 4 5

[5,] 5 6

[6,] 5 7

[7,] 4 8

[8,] 3 9

[9,] 2 10

[10,] 1 11

The footprintR use case