GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
388 stars 140 forks source link

Bug in addGeneExpressionMatrix using Mouse Data #2000

Open Nahuck opened 1 year ago

Nahuck commented 1 year ago

I tried the most updated 1.0.3 and Developer version.

When using the addGeneExpressionMatrix MitoRatio and RiboRatio are not calculated when using mouse data. It seems the issue is within the function with the grep getting "^MT" "^RP", which in mouse genomes are lowercase.

When I use human data MitoRatio and RiboRatio are calculated and is in the ArchR project.

Not sure if it is within this section of the function:

Get QC Info

assay(seRNA) <- Matrix::Matrix(assay(seRNA), sparse=TRUE) nUMI <- Matrix::colSums(assay(seRNA)) mb <- assay(seRNA) mb@x[mb@x > 0] <- 1 nGenes <- Matrix::colSums(mb) rm(mb) MitoRatio <- Matrix::colSums(assay(seRNA)[grep("^MT", rownames(assay(seRNA))),]) / nUMI RiboRatio <- Matrix::colSums(assay(seRNA)[grep("^RP", rownames(assay(seRNA))),]) / nUMI qcInfo <- DataFrame(nUMI = nUMI, nGenes = nGenes, MitoRatio = MitoRatio, RiboRatio = RiboRatio) colnames(qcInfo) <- paste0("Gex_", colnames(qcInfo))

ArchR-addGeneExpressionMatrix-c17e97195af2f-Date-2023-07-28_Time-00-07-33.log

rcorces commented 1 year ago

Hi @Nahuck! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
It is worth noting that there are very few actual bugs in ArchR. If you are getting an error, it is probably something specific to your dataset, usage, or computational environment, all of which are extremely challenging to troubleshoot. As such, we require reproducible examples (preferably using the tutorial dataset) from users who want assistance. If you cannot reproduce your error, we will not be able to help. Before going through the work of making a reproducible example, search the previous Issues, Discussions, function definitions, or the ArchR manual and you will likely find the answers you are looking for. If your post does not contain a reproducible example, it is unlikely to receive a response.
In addition to a reproducible example, you must do the following things before we help you, unless your original post already contained this information: 1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved? 2. Did you post your log file? If not, add it now. 3.__ Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.

Nahuck commented 1 year ago

Hi,

I have look through all of the issues and have not seen anything related to this issue.

Nahuck commented 1 year ago

Not sure if there is a separate function, but when I updated the grep in the function it added the Mito and Ribo Ratios in the ArchR Project.

I will also note I used the developer version to add features to align mouse mitochondrial genes.

Get the ranges for all genes, including mitochondrial genes.

features <- genes(EnsDb.Mmusculus.v79::EnsDb.Mmusculus.v79)

Import scRNA data

seRNA <- import10xFeatureMatrix( input = c(paste(sample, "/filtered_feature_bc_matrix.h5", sep="")), names = c(sample), features = features )

Updated function below with pulling internal functions from ArchR:

addGeneExpressionMatrix_mouse <- function( input = NULL, seRNA = NULL, chromSizes = getChromSizes(input), excludeChr = c("chrM", "chrY"), scaleTo = 10000, verbose = TRUE, threads = getArchRThreads(), parallelParam = NULL, strictMatch = FALSE, force = TRUE, logFile = createLogFile("addGeneExpressionMatrix_mouse") ){

ArchR:::.validInput(input = input, name = "input", valid = c("ArchRProj", "character")) ArchR:::.validInput(input = seRNA, name = "seRNA", valid = c("SummarizedExperiment")) ArchR:::.validInput(input = chromSizes, name = "chromSizes", valid = c("granges")) ArchR:::.validInput(input = excludeChr, name = "excludeChr", valid = c("character", "null")) ArchR:::.validInput(input = scaleTo, name = "scaleTo", valid = c("numeric")) ArchR:::.validInput(input = verbose, name = "verbose", valid = c("boolean")) ArchR:::.validInput(input = threads, name = "threads", valid = c("integer")) ArchR:::.validInput(input = parallelParam, name = "parallelParam", valid = c("parallelparam", "null")) ArchR:::.validInput(input = strictMatch, name = "strictMatch", valid = c("boolean")) ArchR:::.validInput(input = force, name = "force", valid = c("boolean")) ArchR:::.validInput(input = logFile, name = "logFile", valid = c("character"))

if(inherits(input, "ArchRProject")){ ArrowFiles <- getArrowFiles(input) allCells <- rownames(getCellColData(input)) outDir <- getOutputDirectory(input) if(is.null(chromSizes)){ chromSizes <- getChromSizes(input) } }else if(inherits(input, "character")){ outDir <- "" ArrowFiles <- input allCells <- NULL if(is.null(chromSizes)){ chromSizes <- getChromSizes() } }else{ stop("Error Unrecognized Input!") } if(!all(file.exists(ArrowFiles))){ stop("Error Input Arrow Files do not all exist!") }

ArchR:::.startLogging(logFile = logFile) ArchR:::.logThis(mget(names(formals()),sys.frame(sys.nframe())), "addGeneExpressionMatrix Input-Parameters", logFile = logFile)

cellsInArrows <- unlist(lapply(ArrowFiles, ArchR:::.availableCells), use.names=FALSE) if(!is.null(allCells)){ cellsInArrows <- allCells }

overlap <- sum(cellsInArrows %in% colnames(seRNA)) / length(cellsInArrows) ArchR:::.logMessage("Overlap w/ scATAC = ", round(overlap,3), logFile = logFile, verbose = TRUE)

if(overlap == 0){ stop("No overlapping cell names found between ArrowFiles and seRNA object! Cell names in ArrowFiles must match colnames in seRNA!") } else if(overlap != 1) { if(strictMatch){ stop("Error! 'strictMatch = TRUE' and not all cells in input are represented in the provided gene expression seRNA. To proceed, please subset your ArchRProject using the subsetArchRProject() function to contain only cells present in seRNA or set 'strictMatch = FALSE'.") } else { ArchR:::.logMessage("Warning! Not all cells in input exist in seRNA! This may cause downstream issues with functions that require information from all cells. For example, addIterativeLSI() will not work on this GeneExpressionMatrix! To remove these mis-matched cells, subset your ArchRProject using the subsetArchRProject() function to contain only cells present in seRNA and set 'strictMatch = TRUE'", logFile = logFile, verbose = TRUE) } }

splitCells <- split(cellsInArrows, stringr::str_split(cellsInArrows, pattern = "#", simplify=TRUE)[,1]) overlapPerSample <- unlist(lapply(splitCells, function(x) sum(x %in% colnames(seRNA)))) ArchR:::.logMessage("Overlap Per Sample w/ scATAC : ", paste(paste(names(overlapPerSample), round(overlapPerSample,3), sep = "="), collapse=","), logFile = logFile, verbose = TRUE)

Get QC Info

assay(seRNA) <- Matrix::Matrix(assay(seRNA), sparse=TRUE) nUMI <- Matrix::colSums(assay(seRNA)) mb <- assay(seRNA) mb@x[mb@x > 0] <- 1 nGenes <- Matrix::colSums(mb) rm(mb) MitoRatio <- Matrix::colSums(assay(seRNA)[grep("^mt-", rownames(assay(seRNA))),]) / nUMI RiboRatio <- Matrix::colSums(assay(seRNA)[grep("^Rps|^Rpl", rownames(assay(seRNA))),]) / nUMI qcInfo <- DataFrame(nUMI = nUMI, nGenes = nGenes, MitoRatio = MitoRatio, RiboRatio = RiboRatio) colnames(qcInfo) <- paste0("Gex_", colnames(qcInfo))

Filter seRNA

seRNA <- seRNA[BiocGenerics::which(seqnames(seRNA) %bcin% seqnames(chromSizes))] seRNA <- seRNA[BiocGenerics::which(seqnames(seRNA) %bcni% excludeChr)]

Dedup

idxDup <- which(rownames(seRNA) %in% rownames(seRNA[duplicated(rownames(seRNA))])) names(idxDup) <- rownames(seRNA)[idxDup] if(length(idxDup) > 0){ dupOrder <- idxDup[order(Matrix::rowSums(assay(seRNA[idxDup])),decreasing=TRUE)] dupOrder <- dupOrder[!duplicated(names(dupOrder))] seRNA <- seRNA[-as.vector(idxDup[idxDup %ni% dupOrder])] }

Add Index To RNA Ranges

features <- rowRanges(seRNA) features <- sort(sortSeqlevels(features), ignore.strand = TRUE) features <- split(features, seqnames(features)) features <- lapply(features, function(x){ mcols(x)$idx <- seq_along(x) return(x) }) features <- Reduce("c",features) rowData(seRNA)$idx <- features[rownames(seRNA)]$idx

ArchR:::.logThis(qcInfo, "qcInfo", logFile = logFile)

Add args to list

args <- mget(names(formals()), sys.frame(sys.nframe()))#as.list(match.call()) args$ArrowFiles <- ArrowFiles args$allCells <- allCells args$X <- seq_along(ArrowFiles) args$FUN <- ArchR:::.addGeneExpressionMat args$registryDir <- file.path(outDir, "addGeneExpressionMatRegistry") args$qcInfo <- qcInfo args$seRNA <- seRNA

Remove Input from args

args$input <- NULL args$chromSizes <- NULL args$strictMatch <- NULL

Run With Parallel or lapply

outList <- ArchR:::.batchlapply(args)

ArchR:::.endLogging(logFile = logFile)

#Return Output
if(inherits(input, "ArchRProject")){

  qcInfo <- qcInfo[rownames(qcInfo) %in% input$cellNames, ]

  for(i in seq_len(ncol(qcInfo))){
input <- addCellColData(
  ArchRProj = input, 
  data = as.vector(qcInfo[,i]), 
  name = paste0(colnames(qcInfo)[i]), 
  cells = paste0(rownames(qcInfo)), 
  force = force
)
  }

  return(input)

}else{

  return(unlist(outList))

}

}