GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License
375 stars 133 forks source link

.testMarkerSC() failing when mean1 and/or mean2 is NA during call to getMarkerFeatures() #1321

Open rcorces opened 2 years ago

rcorces commented 2 years ago

Discussed in https://github.com/GreenleafLab/ArchR/discussions/1319

Originally posted by **wuv21** March 3, 2022 Hi! Been running into an issue similar to #562 and #483 with the same error output, but I have been running ArchR v1.0.2 (commit [48dccf](https://github.com/GreenleafLab/ArchR/commit/48dccf361bab6d9191a4d9b61e1317acfa71167c)) which was the solution to both those questions. As such, I believed that the error was related to something else. By debugging through the `getMarkerFeatures()` call (see below for my function call): ```r tmp <- getMarkerFeatures( ArchRProj = proj, useMatrix = "MotifMatrix", groupBy = "condensedCluster", testMethod = "wilcoxon", bias = c("TSSEnrichment", "log10(nFrags)"), useGroups = "CD4_TRUE", bgdGroups = "CD4_FALSE", maxCells = 1000, verbose = TRUE, useSeqnames = "z" ) ``` I noticed that one of the rows in `pairwiseDF` (see source below) contained `NA` values for both mean1 and mean2 (small section of data frame shown below). The row "f156" is the only row with this issue. All other rows were ok in the rest of `pairwiseDF`. https://github.com/GreenleafLab/ArchR/blob/48dccf361bab6d9191a4d9b61e1317acfa71167c/R/MarkerFeatures.R#L422-L426 ``` log2Mean log2FC fdr pval mean1 mean2 n auc f155 3.089913 NaN 7.740502e-02 5.694163e-03 0.6838235 -0.2663543 36 0.6898148 f156 0.000000 0.00000000 2.650317e-10 3.046341e-13 NA NA 36 0.0000000 f157 3.705964 0.08330494 9.909369e-01 9.147955e-01 0.3451817 0.3242560 36 0.4922840 ``` Since `mean1` and `mean2` were NA, `idxfilter` contained a vector of both boolean values and a NA that was associated with row "f156". I think the fix below would solve this issue: ```r idxFilter <- rowSums(pairwiseDF[, c("mean1", "mean2")]) != 0 & rowSums(is.na(pairwiseDF[, c("mean1", "mean2")])) == 0 ``` Thanks! Edit: updated my fix; didn't realize Github changed a piece of code into a link. Edit2: updated fix again.
rcorces commented 2 years ago

@wuv21 - I converted this to an Issue post.

I think the most important question is why do mean1 and mean2 get set to NA in the first place? Any chance you can parse that out? I'd rather try to figure out how to prevent the NAs from getting generated than work around their existence.

wuv21 commented 2 years ago

Thanks @rcorces for moving it into issues and for your reply. I dug deeper and found that this particular motif had no matches in my dataset. I did the following:

# load in the matches rds from getPeakAnnotation()
tmp <- readRDS(getPeakAnnotation(proj, "Motif")$Matches)

# below line returns FALSE; i tried other columns corresponding to other motifs and saw that there were TRUE values
any(tmp@assays@data@listData[["matches"]][, 156])

# double checked to make sure that there were no positions either in motif 156
# other motifs had elements in the GRanges objects
tmp <- readRDS(getPeakAnnotation(proj, "Motif")$Positions)

tmp[[156]]
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     score
      <Rle> <IRanges>  <Rle> | <numeric>
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths

As such, .getPartialMatrix() returns back a matrix that has NA values associated with that particular motif when it is called in .testMarkersSC().

https://github.com/GreenleafLab/ArchR/blob/968e4421ce7187a8ac7ea1cf6077412126876d5f/R/MarkerFeatures.R#L346-L353

I think this might just be a rare case of this motif not being visible in this current project/biological context (possibly due to the lower cell counts for this current analysis). This error doesn't occur for me with other projects but I do notice that this motif has the smallest number of peaks associated with it compared to all the other motifs.

rcorces commented 2 years ago

I have temporarily patched this on a new branch (dev_idxFilter) via https://github.com/GreenleafLab/ArchR/commit/f091e426bbe6f78b149ded4f9b49e86ad4f5640f but I think the better fix is to remove these offending rows from the motif matrix entirely. Will try to address this more thoroughly soon.