GreenleafLab / ArchR

ArchR : Analysis of Regulatory Chromatin in R (www.ArchRProject.com)
MIT License

Errors in subsetArchRProject corrupt matrices #592

Closed markphillippebworth closed 3 years ago

markphillippebworth commented 3 years ago

Describe the bug When subsetting, copying errors seem to occur.

To Reproduce Using a large dataset, run through to addDeviationsMatrix. Then try to subset. In this case, the larger dataset took 2-3 days to complete addDeviationsMatrix (10 cores). I can confirm that it completed successfully, and that the matrix appears in the larger dataset when you run getAvailableMatrices(). I didn't try to access it before subsetting, and at this point, it hits an error when I try to access it in the larger one. And for the subsetted project, it doesn't appear at all.

Expected behavior When I need to subset an ArchRProject, I've encountered errors with some of the matrices not being fully copied over, and then portions of (or the entire) project are corrupted and inaccessible. In this case, I can clearly relate the error to the subsetting.

I was subsetting an ArchRProject that had a motif matrix saved. It looked like everything subsetted well, but the motif matrix was inaccessible: getAvailableMatrices didn't show it, even though the original larger project did have it available. When I went to re-run addDeviationsMatrix on the subsetted project, I hit this error, which showed that some arrow files already had deviations calculated but, for some reason, others didn't.

2021-03-05 05:35:54 : B038_FSQCAZ0BFDX-03 (8 of 91) : Deviations for Annotation 1854 of 2065, 378.552 mins elapsed.
2021-03-05 05:50:08 : B038_FSQCAZ0BFDX-03 (8 of 91) : Deviations for Annotation 1957 of 2065, 392.772 mins elapsed.
2021-03-05 06:04:36 : B038_FSQCAZ0BFDX-03 (8 of 91) : Deviations for Annotation 2060 of 2065, 407.24 mins elapsed.
2021-03-05 06:05:58 : Finished Computing Deviations!, 409.295 mins elapsed.
Error in (function (..., threads = 1, preschedule = FALSE) :
  Error Found Iteration 1 :
  [1] "Error in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, : \n Arrow Group already exists! Set force = TRUE to continue!\n"
  <simpleError in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, force = force, logFile = logFile): Arrow Group already exists! Set force = TRUE to continue!>
  Error Found Iteration 2 :
  [1] "Error in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, : \n Arrow Group already exists! Set force = TRUE to continue!\n"
  <simpleError in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, force = force, logFile = logFile): Arrow Group already exists! Set force = TRUE to continue!>
  Error Found Iteration 3 :
  [1] "Error in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, : \n Arrow Group already exists! Set force = TRUE to continue!\n"
  <simpleError in .createArrowGroup(ArrowFile = ArrowFile, group = matrixName, force = force, logFile = logFile): Arrow G
In addition: Warning message:
In mclapply(..., mc.cores = threads, mc.preschedule = preschedule) :
  90 function calls resulted in an error

ArchR-addDeviationsMatrix-16cf1d37e351-Date-2021-03-04_Time-23-14-28.log
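
One way to narrow down which arrow files already contain the matrix and which don't would be to list the top-level groups in each file and compare. The snippet below is only a rough sketch: findMissingGroup is a hypothetical helper (not part of ArchR), and it assumes each matrix is stored as a top-level HDF5 group named after the matrix (e.g. "MotifMatrix"). The toy input stands in for real arrow files.

```r
# Hypothetical helper (not part of ArchR): given a named list mapping each
# Arrow file to the top-level group names it contains, report which files
# have the requested matrix group and which do not.
findMissingGroup <- function(groupsByFile, matrixName = "MotifMatrix") {
  hasGroup <- vapply(
    groupsByFile,
    function(groups) matrixName %in% groups,
    logical(1)
  )
  list(
    present = names(groupsByFile)[hasGroup],
    missing = names(groupsByFile)[!hasGroup]
  )
}

# Toy input standing in for real Arrow files.
toyGroups <- list(
  "A.arrow" = c("Metadata", "Fragments", "TileMatrix", "MotifMatrix"),
  "B.arrow" = c("Metadata", "Fragments", "TileMatrix")
)
res <- findMissingGroup(toyGroups)
print(res$missing)

# With a real project, the list could be built roughly like this
# (assumes the rhdf5 package is installed; not tested on a large project):
# arrows <- getArrowFiles(subProj)
# groupsByFile <- lapply(arrows, function(f) rhdf5::h5ls(f, recursive = FALSE)$name)
# findMissingGroup(groupsByFile, "MotifMatrix")
```

If the "missing" list is non-empty for the subsetted project, that would at least be consistent with the "Arrow Group already exists" errors above coming only from the files where the group was copied over.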

rcorces commented 3 years ago

Sorry that this hasn't been addressed yet. There seem to be lots of sporadic issues surrounding project subsetting, probably related to finicky hdf5 things, and this is a priority for us to address.

Unrelated but out of curiosity, when you say a "large" dataset, how many cells? I'm surprised to hear that something is taking 3 days to run in ArchR.

markphillippebworth commented 3 years ago

In this case, I was running addDeviationsMatrix on around 750K cells across more than 120 arrow files. Is that still what you'd expect? The other steps were much faster by comparison.

My starting dataset is about 1.9M cells.

I very much appreciate that it's a priority for you guys. There's a great deal I'd like to do with subsetCells, but it seems to consistently corrupt my ArchR projects in ways I can't document well, and subsetProject is hit or miss.

rcorces commented 3 years ago

Wow that is a lot of cells. I've never handled that many and I think the only time Jeff has done that is with a synthesized data set for benchmarking and I'm not sure if he ran addDeviationsMatrix() on that dataset.

Yes - the problems with subsetting are sporadic and difficult for us to nail down which is why this has been so problematic.

This one is above my pay grade so you'll have to wait for @jgranja24 to respond to see if he has any input.

jgranja24 commented 3 years ago

Hi @markphillippebworth, sorry for the issues. I haven't tried subsetting a larger project like that, but can you confirm that it works for the test project? There was an issue related to this which I solved in https://github.com/GreenleafLab/ArchR/issues/212. Can you try running the sample code below on a small example to see if you can re-create the bug?


#Latest Release If Needed
#devtools::install_github("GreenleafLab/ArchR", ref="release_1.0.2", repos = BiocManager::repositories())

library(ArchR)

proj <- getTestProject()
# > proj
#            ___      .______        ______  __    __  .______      
#           /   \     |   _  \      /      ||  |  |  | |   _  \     
#          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
#         /  /_\  \   |      /     |  |     |   __   | |      /     
#        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
#       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

# class: ArchRProject 
# outputDirectory: /Users/jeffreygranja/Desktop/PBMCSmall 
# samples(1): PBMCSmall
# sampleColData names(1): ArrowFiles
# cellColData names(18): Sample TSSEnrichment ... ReadsInPeaks FRIP
# numberOfCells(1): 2217
# medianTSS(1): 12.277
# medianFrags(1): 593

getAvailableMatrices(proj)
# > getAvailableMatrices(proj)
# [1] "GeneIntegrationMatrix" "GeneScoreMatrix"       "MotifMatrix"          
# [4] "PeakMatrix"            "TileMatrix"   

#Sample 100 Cells
sampleCells <- getCellNames(proj)[sort(sample(1:nCells(proj), 100))]

#Subset Project
subProj <- subsetArchRProject(proj, cells = sampleCells, outputDirectory = "Random100")
# Copying ArchRProject to new outputDirectory : /Users/jeffreygranja/Desktop/Random100
# Copying Arrow Files...

# Getting ImputeWeights
# No imputeWeights found, returning NULL
# Copying Other Files...
# Copying Other Files (1 of 9): Annotations
# Copying Other Files (2 of 9): Embeddings
# Copying Other Files (3 of 9): GroupBigWigs
# Copying Other Files (4 of 9): GroupCoverages
# Copying Other Files (5 of 9): PBMCSmall
# Copying Other Files (6 of 9): Peak2GeneLinks
# Copying Other Files (7 of 9): PeakCalls
# Copying Other Files (8 of 9): Plots
# Copying Other Files (9 of 9): RNAIntegration
# Saving ArchRProject...
# Loading ArchRProject...
# Successfully loaded ArchRProject!

#                                                    / |
#                                                  /    \
#             .                                  /      |.
#             \\\                              /        |.
#               \\\                          /           `|.
#                 \\\                      /              |.
#                   \                    /                |\
#                   \\#####\           /                  ||
#                 ==###########>      /                   ||
#                  \\##==......\    /                     ||
#             ______ =       =|__ /__                     ||      \\\
#         ,--' ,----`-,__ ___/'  --,-`-===================##========>
#        \               '        ##_______ _____ ,--,__,=##,__   ///
#         ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
#         -,____,---'       \\####\\________________,--\\_##,/
#            ___      .______        ______  __    __  .______      
#           /   \     |   _  \      /      ||  |  |  | |   _  \     
#          /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |    
#         /  /_\  \   |      /     |  |     |   __   | |      /     
#        /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
#       /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|

#Full Project
getAvailableMatrices(proj)
# > getAvailableMatrices(proj)
# [1] "GeneIntegrationMatrix" "GeneScoreMatrix"       "MotifMatrix"          
# [4] "PeakMatrix"            "TileMatrix"

#Sub Project
getAvailableMatrices(subProj)
# > getAvailableMatrices(subProj)
# [1] "GeneIntegrationMatrix" "GeneScoreMatrix"       "MotifMatrix"          
# [4] "PeakMatrix"            "TileMatrix"    

#Full Project All Cells
getMatrixFromProject(proj, "MotifMatrix")
# ArchR logging to : ArchRLogs/ArchR-getMatrixFromProject-9b855a304d5f-Date-2021-03-07_Time-15-56-40.log
# If there is an issue, please report to github with logFile!
# 2021-03-07 15:56:42 : Organizing colData, 0.034 mins elapsed.
# 2021-03-07 15:56:42 : Organizing rowData, 0.034 mins elapsed.
# 2021-03-07 15:56:42 : Organizing rowRanges, 0.034 mins elapsed.
# 2021-03-07 15:56:42 : Organizing Assays (1 of 2), 0.034 mins elapsed.
# 2021-03-07 15:56:42 : Organizing Assays (2 of 2), 0.034 mins elapsed.
# 2021-03-07 15:56:42 : Constructing SummarizedExperiment, 0.034 mins elapsed.
# 2021-03-07 15:56:43 : Finished Matrix Creation, 0.042 mins elapsed.
# class: SummarizedExperiment 
# dim: 870 2217 
# metadata(0):
# assays(2): deviations z
# rownames(870): TFAP2B_1 TFAP2D_2 ... TBX18_869 TBX22_870
# rowData names(2): idx name
# colnames(2217): PBMCSmall#TATCTGTAGACAGCTG-1
#   PBMCSmall#ATGGATCCAGGCAAGT-1 ... PBMCSmall#ACTATTCGTTACGAAA-1
#   PBMCSmall#GTTGGTACACATCATG-1
# colData names(18): BlacklistRatio DoubletEnrichment ... ReadsInPeaks
#   FRIP

#Sub Project 100 Cells
getMatrixFromProject(subProj, "MotifMatrix")
# ArchR logging to : ArchRLogs/ArchR-getMatrixFromProject-9b855af3c079-Date-2021-03-07_Time-15-56-24.log
# If there is an issue, please report to github with logFile!
# 2021-03-07 15:56:25 : Organizing colData, 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Organizing rowData, 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Organizing rowRanges, 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Organizing Assays (1 of 2), 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Organizing Assays (2 of 2), 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Constructing SummarizedExperiment, 0.008 mins elapsed.
# 2021-03-07 15:56:25 : Finished Matrix Creation, 0.015 mins elapsed.
# class: SummarizedExperiment 
# dim: 870 100 
# metadata(0):
# assays(2): deviations z
# rownames(870): TFAP2B_1 TFAP2D_2 ... TBX18_869 TBX22_870
# rowData names(2): idx name
# colnames(100): PBMCSmall#CACATGAAGGCCTAAG-1
#   PBMCSmall#ACATGGTGTAGACGCA-1 ... PBMCSmall#AAGGTTCCAACGAGGT-1
#   PBMCSmall#GAATCTGTCATAGGTC-1
# colData names(18): BlacklistRatio DoubletEnrichment ... ReadsInPeaks
#   FRIP

If this code fails for you, can you update using the devtools command above and retry? Thanks!

Jeff