GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
377 stars 133 forks source link

Error with addIterativeLSI with GeneExpressionMatrix for Multiome Data with Multiple Samples #1325

Closed jmmuncie closed 2 years ago

jmmuncie commented 2 years ago

Hi all,

I am attempting to use ArchR to analyze combined scATAC-seq and scRNA-seq generated with the 10x Multiome kit. I have 8 different samples and have successfully created Arrow files and added the GeneExpressionMatrix containing the RNAseq data. As I continue through the pipeline, I am also able to perform dimensionality reduction on the ATAC data using addIterativeLSI with useMatrix = "TileMatrix". However, when I attempt to perform dimensionality reduction on the RNA data using addIterativeLSI with useMatrix = "GeneExpressionMatrix", I get an error, as described below.

I run:

proj_Mef2c <- addIterativeLSI( ArchRProj = proj_Mef2c, clusterParams = list( resolution = 0.2, sampleCells = 10000, n.start = 10 ), saveIterations = FALSE, useMatrix = "GeneExpressionMatrix", depthCol = "Gex_nUMI", varFeatures = 2500, firstSelection = "Var", binarize = FALSE, name = "LSI_RNA", logFile = createLogFile(name = "addIterativeLSI", logDir = "~/Desktop/Mef2c_ArchR_working/Mef2c_aggr_v11_depthnorm_02142022_outs/ArchRLogs") )

And I get the following error:

Error in t(t(dfVars) (ns - 1)) + t(t(dfMeans^2) ns) : non-conformable arrays

I am unable to reproduce the error using the tutorial dataset.

There doesn't appear to be much useful info in the log, but I've attached it.

ArchR-addIterativeLSI-6a092dd5ba4d-Date-2022-03-08_Time-18-08-33.log

rcorces commented 2 years ago

Hi @jmmuncie! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
Before we help you, you must respond to the following questions unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Can you recapitulate your error using the tutorial code and dataset? If so, provide a reproducible example. 3. Did you post your log file? If not, add it now.

rcorces commented 2 years ago

@jmmuncie - Can you re-test this after creating your GeneExpressionMatrix using the updated code referenced in https://github.com/GreenleafLab/ArchR/issues/507#issuecomment-1063139673. I dont think its related but would be a nice rule out. If you get the same problem again, can you provide the output of traceback()?

jmmuncie commented 2 years ago

Hi Ryan - the updated code referenced in #507 did fix import10xFeatureMatrix with my multiple samples, but as you suspected, did not fix this issue. Here is the output of traceback():

14: is.data.frame(x)
13: rowSums(t(t(dfVars) * (ns - 1)) + t(t(dfMeans^2) * ns))
12: .combineVariances(meanx, varx, ns)
11: DataFrame(.combineVariances(meanx, varx, ns))
10: DataFrame(..., check.names = FALSE)
9: cbind(deparse.level, ...)
8: cbind(featureDF[[x]], DataFrame(.combineVariances(meanx, varx, 
       ns)))
7: FUN(X[[i]], ...)
6: lapply(...)
5: .safelapply(seq_along(featureDF), function(x) {
       o <- h5closeAll()
       seqx <- names(featureDF)[x]
       meanx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))
       varx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))
       for (y in seq_along(ArrowFiles)) {
           if (useLog2) {
               meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowMeansLog2"))
               varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowVarsLog2"))
           }
           else {
               meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowMeans"))
               varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowVars"))
           }
       }
       cbind(featureDF[[x]], DataFrame(.combineVariances(meanx, 
           varx, ns)))
   }, threads = threads)
4: Reduce("rbind", .)
3: .safelapply(seq_along(featureDF), function(x) {
       o <- h5closeAll()
       seqx <- names(featureDF)[x]
       meanx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))
       varx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))
       for (y in seq_along(ArrowFiles)) {
           if (useLog2) {
               meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowMeansLog2"))
               varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowVarsLog2"))
           }
           else {
               meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowMeans"))
               varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, 
                   "/", seqx, "/rowVars"))
           }
       }
       cbind(featureDF[[x]], DataFrame(.combineVariances(meanx, 
           varx, ns)))
   }, threads = threads) %>% Reduce("rbind", .)
2: .getRowVars(ArrowFiles = ArrowFiles, useMatrix = useMatrix, seqnames = chrToRun, 
       useLog2 = TRUE)
1: addIterativeLSI(ArchRProj = proj_Mef2c, clusterParams = list(resolution = 0.2, 
       sampleCells = 10000, n.start = 10), saveIterations = FALSE, 
       useMatrix = "GeneExpressionMatrix", depthCol = "Gex_nUMI", 
       varFeatures = 2500, firstSelection = "var", binarize = FALSE, 
       name = "LSI_RNA", excludeChr = c("chrM", "chrY", "chrJH584303_chrY_random"))
rcorces commented 2 years ago

This one is complicated. Try the following code where ArchRProj is your proj_Mef2c object. It should give you the same error as before. If it does, try to figure out where in the safelapply loop it is getting hung up (which x) and then try to figure out what is being stored in meanx , varx, and ns at that point.

ArrowFiles <- getSampleColData(ArchRProj)[,"ArrowFiles"]
useMatrix <- "GeneExpressionMatrix"
seqnames <- ArchR:::.availableSeqnames(ArrowFiles, subGroup = useMatrix)
useLog2 <- TRUE

featureDF <- ArchR:::.getFeatureDF(ArrowFiles, useMatrix)

  if(!is.null(seqnames)){
    featureDF <- featureDF[BiocGenerics::which(featureDF$seqnames %bcin% seqnames),]
  }

  rownames(featureDF) <- paste0("f", seq_len(nrow(featureDF)))
  fnames <- rownames(featureDF)

  featureDF <- S4Vectors::split(featureDF, as.character(featureDF$seqnames))

  ns <- lapply(seq_along(ArrowFiles), function(y){
    length(ArchR:::.availableCells(ArrowFiles[y], useMatrix))
  }) %>% unlist

  #Compute RowVars
  summaryDF <- ArchR:::.safelapply(seq_along(featureDF), function(x){

    o <- h5closeAll()
    seqx <- names(featureDF)[x]
    meanx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))
    varx <- matrix(NA, ncol = length(ArrowFiles), nrow = nrow(featureDF[[x]]))

    for(y in seq_along(ArrowFiles)){

      if(useLog2){
        meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, "/", seqx, "/rowMeansLog2"))
        varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, "/", seqx, "/rowVarsLog2")) 
      }else{
        meanx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, "/", seqx, "/rowMeans"))
        varx[, y] <- h5read(ArrowFiles[y], paste0(useMatrix, "/", seqx, "/rowVars"))
      }

    }

    cbind(featureDF[[x]], DataFrame(ArchR:::.combineVariances(meanx, varx, ns)))

  }, threads = threads) %>% Reduce("rbind", .)
jmmuncie commented 2 years ago

Okay so when I first tried to run this I got the following error:

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'Reduce': object 'threads' not found

So I set threads <- getArchRThreads() prior to running, and tried again. Now I'm getting the the following error:

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'Reduce':
Error Found Iteration 1 :
[1] "Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : \n object '.combineVariances' not found\n"
<simpleError in get(name, envir = asNamespace(pkg), inherits = FALSE): object '.combineVariances' not found>
Error Found Iteration 2 :
[1] "Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : \n object '.combineVariances' not found\n"
<simpleError in get(name, envir = asNamespace(pkg), inherits = FALSE): object '.combineVariances' not found>
Error Found Iteration 3 :
[1] "Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : \n object '.combineVariances' not found\n"
<simpleError in get(name, envir = asNamespace(pkg), inherits = FALSE): object '.combineVariances' not found>
Error Found Iteration 4 :
[1] "Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : \n object '.combineVariances' not found\n"
<simpleE
In addition: Warning message:
In mclapply(..., mc.cores = threads, mc.preschedule = preschedule) :
39 function calls resulted in an error

ns looks like this:

> ns
[1]  6726  9810  3955  5306 12242  8126 14678 12797

meanx and varx haven't been created. I've attached a text file with my session info.

session_info.txt

jmmuncie commented 2 years ago

Okay I think I've figured out what is going wrong. Some of my chromosomes only have a single feature associated with them. Here's an example:

featureDF[20]:

SplitDataFrameList of length 1
$`chrF6-eGFP`
DataFrame with 1 row and 6 columns
         seqnames     idx   start     end    name  strand
            <Rle> <array> <array> <array> <array> <array>
f30330 chrF6-eGFP       1       0       1 F6-eGFP       3

This causes varx to be a vector of numbers, rather than a matrix. For the example above varx:

[1] 0.004314953 0.003472101 0.002994184 0.003979531 0.003213603 0.003543938 0.004044938 0.002073981

So, for these cases summedVars <- rowSums(t(t(varx) * (ns - 1)) + t(t(meanx^2) * ns)) fails with the non-comformable arrays error.

If I ensure varx is a matrix of the proper orientation by adding varx <- t(matrix(varx)) then summedVars <- rowSums(t(t(varx) * (ns - 1)) + t(t(meanx^2) * ns)) works!

Is it possible to create an update that includes a check for varx being a matrix and a forced conversion in these cases where a chromosome only has a single feature and varx is initially created as a vector?

rcorces commented 2 years ago

thanks for digging into this. I think the correct solution would be to ensure that the dimensions of the matrix dont get dropped and converted to a vector. This commit https://github.com/GreenleafLab/ArchR/commit/9a238eed04f95ce98e4fd8c2a7af3e9dd31c4870 on the new dev_LSI-singleFeature branch should fix the problem.

Can you check by installing this branch, restarting your R session, reloading ArchR, and testing. devtools::install_github("GreenleafLab/ArchR", ref="dev_LSI-singleFeature", repos = BiocManager::repositories())

jmmuncie commented 2 years ago

Hi Ryan - thanks for sticking with me on this. The dev_LSI-singleFeature branch didn't work for me. I got the error

Error in meanx[, y, drop = FALSE] <- h5read(ArrowFiles[y], paste0(useMatrix,  : 
  incorrect number of subscripts

I think the drop = FALSE doesn't work and actually isn't needed in lines 966, 967, 969, and 970. When I step through the source code, my meanx and varx matrices are created properly at this step, even in the cases of only a single row in featureDF[].

I think the error occurs at line 925 in the .combineVariances function.

If I change this from vx <- dfVars[x, ] to vx <- dfVars[x, , drop = FALSE] then it seems to work for all of my cases.

rcorces commented 2 years ago

Ah. Got it. Thanks for the correction. I've made that change in a new branch called dev_LSI-singleFeature2 via https://github.com/GreenleafLab/ArchR/commit/d28fa1d61fd3a3444a01721785b6f32bce8cbf48

Could you test that branch and just make sure it works? I know its exactly the same code change you've already tested.

Assuming its all good, I'll merge it into release_1.0.2`

jmmuncie commented 2 years ago

Yup, that branch works! addIterativeLSI() with GeneExpressionMatrix now runs to completion for my dataset.

Thanks again!

rcorces commented 2 years ago

Merged into release_1.0.2 via https://github.com/GreenleafLab/ArchR/pull/1333 thanks @jmmuncie for figuring this out!

Aya-Balbaa-72 commented 1 month ago

Hi all, I am having an issue with addIterativeLSI , using TileMatrix , I get this error:

ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-34f754ef9709-Date-2024-08-12_Time-22-18-58.603256.log If there is an issue, please report to github with logFile! Error in .availableSeqnames(ArrowFiles, subGroup = useMatrix) : Not All Seqnames Identical!

Any suggestions please how to fix that? Thank you!

jmmuncie commented 1 month ago

It's hard to know without being able to see your data, but it sounds like the seqnames do not match between the ArrowFiles in your project. Did you create all the ArrowFiles using the same reference genome? Or perhaps in one of your ArrowFiles there were no reads mapped to a chromosome in your reference genome and that chromosome was dropped from that ArrowFile and exists in the others. Anyways, I would try to examine the seqnames in each of your ArrowFiles one at a time to find out what's causing the error.