Aggregating functions and NAs

LTLA / scuttle

Clone of the Bioconductor repository for the scuttle package.

https://bioconductor.org/packages/devel/bioc/html/scuttle.html

9 stars 7 forks source link

Aggregating functions and NAs #33

Open TuomasBorman opened 2 days ago

TuomasBorman commented 2 days ago

Hi,

Functions like mean(), sum(), and others typically include an na.rm parameter, which allows you to control whether NA values should be excluded from calculations. However, it appears that aggregation functions in scuttle do not provide an option for excluding NA values. While one could convert NA values to 0, this approach is not appropriate for calculating the mean.

Could an na.rm option be added to handle this case?


library(scuttle)
sce <- mockSCE()[1:5, 1:2]
ids <- c("A", "B", "A", "C", "D")
assay(sce)[1, ] <- NA

sumCountsAcrossFeatures(assay(sce), ids)
  Cell_001 Cell_002
A       NA       NA
B      111       98
C      260      233
D       64       70

summarizeAssayByGroup(sce, c("A", "A")) |> assay()
              A
Gene_0001    NA
Gene_0002 104.5
Gene_0003   5.0
Gene_0004 246.5
Gene_0005  67.0

LTLA commented 1 day ago

Hm... it's very unusual to have NA in a single-cell count matrix. In fact, I don't think I've ever seen an NA in any assay matrix for any technology I've worked on.

I suppose it's possible to add an na.rm= option, but it's not a scenario that I've ever encountered, so I'm afraid it's not a high priority for me at the moment. If you want to give it a crack, I'm happy to consider a PR. Otherwise, you might consider some more expedient solutions:

Just filter out rows with NAs, if they are going to be all-NA rows.
Use DelayedArray::rowsum, which does have a na.rm= option.

TuomasBorman commented 1 day ago

NAs might occur, for instance, when datasets are merged. I think na.rm option would be good addition. I can create a PR, I will come back to you this week

LTLA commented 9 hours ago

From the discussions in #34, the best solution for now is probably to use rowsum():

y <- Matrix::rsparsematrix(10000, 5, density=0.1)
z <- sample(LETTERS[1:5], nrow(y), replace=TRUE)
y[1] <- NA

y2 <- DelayedArray::DelayedArray(y) # avoid copy of data on is.na(), type coercion.
is.okay <- !is.na(y2)
DelayedArray::type(is.okay) <- "integer"
DelayedArray::rowsum(y, z, na.rm=TRUE) / DelayedArray::rowsum(is.okay, z)

Make sure you use DelayedArray::rowsum instead of base::rowsum, as the latter won't be able to handle other matrix types.