drisso / SingleCellExperiment

Clone of the Bioconductor repository for the SingleCellExperiment package, see https://bioconductor.org/packages/devel/bioc/html/SingleCellExperiment.html for the official development version.
65 stars 18 forks source link

cbind() hangs if assays don't match #44

Open nathancfox opened 4 years ago

nathancfox commented 4 years ago

When I try to use SingleCellExperiment::cbind(sce_obj_1, sce_obj_2), the method hangs indefinitely if the assays don't match. In my case, I was trying to merge two sets of cells. Each object had a single assay. I made a mistake and left the counts assay in the first object and the logcounts assay in the second object. Both objects only had 1 assay, and all the metadata dataframes were compatible. The method ran for several hours before I cancelled it. When I realized my mistake, and reformatted so that both had the counts assay, then it ran in seconds. I also tried just renaming the incorrect assay, and that also fixed the problem. It would be nice if the SingleCellExperiment::cbind method caught this error and quit with a warning, instead of hanging indefinitely. When you're working with large objects, sometimes it is believable that it might take a few minutes to merge, and it was unclear whether or not the method was hung. Thank you!

Here is my sessionInfo(). I'm running on the IRKernel in Jupyter Lab,

image

LTLA commented 4 years ago

Sounds like a SummarizedExperiment problem rather than a SCE problem, we don't do any special handling of the assays other than to call down to the SE method for combining things.

se_obj_1 <- as(sce_obj_1, "SummarizedExperiment")
se_obj_2 <- as(sce_obj_2, "SummarizedExperiment")

# Temporary hack for Bioconductor/SummarizedExperiment#29
rownames(se_obj_1) <- rownames(sce_obj_1)
rownames(se_obj_2) <- rownames(sce_obj_2)

# Does this still hang?
cbind(se_obj_1, se_obj_2)

I don't have any familiarity with whatever the IRKernel is, so it would nice to see what happens on a standard R installation. Some information of the dimensions of your two objects would also be useful.

drisso commented 4 years ago

I've tried on a vanilla R session the following and it throws an error as expected.

library(SingleCellExperiment)
example(SingleCellExperiment)
sce1 = sce
assays(sce1) = assays(sce1)[1]
sce2 = sce
assays(sce2) = assays(sce2)[2]
cbind(sce1, sce2)

Error in .bind_Assays_objects(objects, along.cols = TRUE) : 
  assays must have the same names()

So, I agree with @LTLA that it would be nice to see if you get the same behavior on a standard R installation and/or if my example above hangs on your R.

nathancfox commented 4 years ago

Thank you both so much for your help!

@LTLA When I try your solution in my working environment, it does not change anything. It still hangs.

@drisso When I try your code in my working environment, it throws an error as expected. Same if I try your code in a standard R installation.

When I try my code in a standard R installation, it throws the correct error, and Then hangs until I pass a CTRL-C kill signal. @LTLA 's fix does not change this on my objects in a standard R install.

I'm just noticing that the rownames don't line up between my two objects. However, I tried fixing this and it doesn't fix my hung process problem, even with @LTLA 's test. Another wrinkle is that I don't attach any packages, I only load them into namespaces because I don't like not knowing which method comes from which package or which methods have been overloaded. However, I tried this fresh in a new environment where I loaded packages with library() like normal, and it didn't help.

Here is the print output of my two objects:

> sce_obj_1
class: SingleCellExperiment
dim: 39483 3477
metadata(3): Samples Samples Samples
assays(1): counts
rownames(39483): GRMZM2G059865 GRMZM5G888250 ... GRMZM2G372364
  GRMZM2G525788
rowData names(2): ID Symbol
colnames(3477): AAACCCAAGCTTTGTG_4 AAACCCAAGGGAGGCA_4 ...
  TTTGATCGTTACCCTC_6 TTTGGTTCAGCTCTGG_6
colData names(7): Sample Barcode ... batchclust metaclust
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

> sce_obj_2
class: SingleCellExperiment
dim: 39483 9271
metadata(7): log.exprs.offset log.exprs.offset ... log.exprs.offset
  log.exprs.offset
assays(1): logcounts
rownames(39483): GRMZM2G059865 GRMZM5G888250 ... Zmcle7 Zmcle25
rowData names(2): ID Symbol
colnames(9271): AAACCTGCAAACTGCT_1 AAACCTGGTTACGTCA_1 ...
  TTTGTCACAGGGAGAG_3 TTTGTCACAGTATCTG_3
colData names(7): Sample Barcode ... batchclust metaclust
reducedDimNames(0):
spikeNames(0):
altExpNames(0):

This makes me think that it's a bug being caused by a combination of

  1. The specific characteristics of my objects
  2. The fact that I'm working in Jupyter Lab with the IRkernel

because the example(SingleCellExperiment) object throws the error and stops correctly in both my environment and in a standard environment, but my objects throw two different forms of the error in my environment and a standard environment. And that kind of bug is definitely not your problem to solve, so please don't feel obligated to solve this! I know debugging by suggestion in someone else's environment is hell, so I don't mind at all if it's a quirk of my particular situation that I have to be aware of! Especially because it's not stopping me from using your package.

LTLA commented 4 years ago

That's barely even 10000 cells, which is certainly not what I'd consider to be "large".

I will note, though, that historically the traceback for the error has been generated by deparsing these large objects. As in, if you printed the traceback, you'd get a faceful of:

new("SingleCellExperiment", int_elementMetadata = new("DFrame",
    rownames = NULL, nrows = 200L, listData = structure(list(), .Names = character(0)),
    elementType = "ANY", elementMetadata = NULL, metadata = list()),
    int_colData = new("DFrame", rownames = NULL, nrows = 100L,
        listData = list(reducedDims = new("DFrame", rownames = NULL,
            nrows = 100L, listData = list(PCA = structure(c(0.925475550582632,
            0.449077717959881, 0.855550958309323, 0.413164169760421,
            0.201481711119413, 0.553348598768935, 0.157802519388497,
            0.929640536662191, 0.206698389025405, 0.499209743458778,
            0.238557304022834, 0.0674089007079601, 0.927704845322296,
            0.343449579318985, 0.904032409889624, 0.5190003963653,
            0.822366764070466, 0.0936046475544572, 0.196957593550906,
            0.238983205752447, 0.0639121262356639, 0.914775134529918,
            0.315773038426414, 0.942449612077326, 0.369616641430184,
            0.66291938116774, 0.788995310198516, 0.0332865277305245,
            0.973804891575128, 0.36193488468416, 0.313562945229933,
            0.727034271229059, 0.13324950938113, 0.914922809693962,
            0.405528278788552, 0.968635136727244, 0.356382932746783,
            0.677552652079612, 0.332781387493014, 0.917663701577112,
            0.150005814153701, 0.392979303607717, 0.51302184863016,
# etc.

For large objects, this takes a long time, simply because you need to pack the entire object into a string and print it to the console. As of R 4.0.0, this should no longer be the case:

As of R 4.0.0 .Traceback contains the calls as language objects, whereas previously these calls were deparsed.

nathancfox commented 4 years ago

Oh that's an interesting idea. When it gets released, I'll try replicating this issue and see what happens. Thanks very much for the help!