buenrostrolab / FigR

Functional Inference of Gene Regulation
https://buenrostrolab.github.io/FigR/
MIT License
34 stars 10 forks source link

runGenePeakcorr Memory Issues #44

Open micahtratt opened 3 months ago

micahtratt commented 3 months ago

Hello, I am getting the following error when running the function.

Assuming paired scATAC/scRNA-seq data ..
Matrix object input detectedCentering counts for cells sequentially in groups of size  1000  ..

Computing centered counts for cells:  1  to  1000 ..
Computing centered counts per cell using mean reads in features ..

Computing centered counts for cells:  1001  to  1885 ..
Computing centered counts per cell using mean reads in features ..

Merging results..
Done!
Genes with 0 expression across cells exist ..
Removing these genes prior to running correlations ..
Number of peaks in ATAC data: 10000 
Number of genes in RNA data: 16819 

Num genes overlapping TSS annotation and RNA matrix being considered:  16197 

Taking peak summits from peak windows ..
Finding overlapping peak-gene pairs ..
Found  15230 total gene-peak pairs for given TSS window ..
Number of peak summits that overlap any gene TSS window:  7064 
Number of gene TSS windows that overlap any peak summit:  7862 

Determining background peaks ..
Using  3  iterations ..

Computing gene-peak correlations ..
Running pairs:  1 to 5000 
Running in parallel using  2 cores ..
Computing observed correlations ..
  |===========================================================================| 100%, Elapsed 00:03
Finished!

Time Elapsed:  2.86846804618835 secs 

Computing background correlations ..
  |===========================================================================| 100%, Elapsed 00:01
  |===========================================================================| 100%, Elapsed 00:01
  |===========================================================================| 100%, Elapsed 00:01
Error in FigR::PeakGeneCor(ATAC = ATACmat, RNA = RNAmat, OV = genePeakOv[chunkStarts[i]:chunkEnds[i]],  : 
  One or more of the chunk processes failed unexpectedly (returned NULL) .. Please check to see you have enough cores/m
           emory allocated

> ATAC.se
class: RangedSummarizedExperiment 
dim: 10000 1885 
metadata(0):
assays(2): '' counts
rownames(10000): chr1-633518-634475 chr1-1115753-1116711 ... chrY-19566878-19567675 chrY-26670666-26671555
rowData names(0):
colnames(1885): AAACCAGGTGCGCAAGACAGACCT-1 AAACCGGTCCAGAACGACAGACCT-1 ... TTTGAGGGTCCGTAGC-1 TTTGGTTGTGCCCGAT-1
colData names(0):

> str(RNAmat)
 num [1:18082, 1:1885] -0.2294 -0.5374 -0.1876 -0.2191 -0.0804 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:18082] "SAMD11" "NOC2L" "KLHL17" "PLEKHN1" ...
  ..$ : chr [1:1885] "AAACCAGGTGCGCAAGACAGACCT-1" "AAACCGGTCCAGAACGACAGACCT-1" "AAACGTTCATTTCTTCACAGACCT-1" "AAAGATGCAATGGGAGACAGACCT-1" ...

Whether I run 250 or 3 background peaks, I get this error. I have tried running this in different environments with different number of features (filtering the ATAC peaks from 93,000 to 10,000 and RNA genes from 16,000 to 10,000) for the 1,800 cells. I don't believe the output should be sufficiently large to be causing this error. I have approximately 200GBs of memory and 19 cores available.

Is this an error that others are running into? Or is the output file truly larger than 200GBs?!

Thanks!

vkartha commented 3 months ago

Hi there! Thanks for your interest in using our package. A few things I noticed that I'm wondering if you could look into prior to re-testing:

  1. How were the peaks derived? (even 90k seems a little lower for genome-wide calls). Your SummarizedExperiment object has 2 assays, first doesn't have a name ('') and the second has 'counts'. Can you show a sample of each of the following?:

assay(ATAC.se)[1:5,1:5] assay(ATAC.se','counts')[1:5,1:5] granges(ATAC.se)

You want to make sure these are raw peak x cell counts, and nothing else. Can you show me the original call to the main function?

  1. Your RNA matrix has negative values, which to me seems like it's scaled somehow? We expect as input normalized expression levels, not scaled values (this won't really affect the correlation computation, but just want to make sure nothing strange is happening under the hood because of it).

  2. Which genome build are you specifying here?

This is a strange error given the data size, it shouldn't really have to do with memory (thank you for testing appropropriately via downsampling etc. - I'd say also worth testing on a single core just in case something strange is happening with the parellization that might be harder to debug here). FigR does not save any abnormally large output either at this step.

Let's see if we can debug this based on the above output?

micahtratt commented 3 months ago

Hi,

Thanks for the reply @vkartha ! I adjusted the RNA and ATAC to be raw counts (I believe they were scaled by z-score previously). I am still getting the same memory/core related error and have tried running with a single core to rule out parallelization.

Any information or suggestions you have is appreciated!

`cisCorr <- FigR::runGenePeakcorr(ATAC.se = ATAC.se,
                               RNAmat = RNAmat,
                               genome = "hg38", 
                               nCores = nCores,
                               p.cut = NULL, 
                               n_bg = 1)`
` > assay(ATAC.se)[1:5,1:5]
5 x 5 sparse Matrix of class "dgCMatrix"
                   AAACCAGGTGCGCAAGACAGACCT-1 AAACCGGTCCAGAACGACAGACCT-1 AAACGTTCATTTCTTCACAGACCT-1
chr1-180734-181683                          .                          .                          .
chr1-183935-184770                          .                          .                          .
chr1-191052-191934                          .                          .                          .
chr1-629412-630393                          .                          .                          .
chr1-631285-632180                          .                          .                          .
                   AAAGATGCAATGGGAGACAGACCT-1 AAAGCATGTAACCGCCACAGACCT-1
chr1-180734-181683                          .                          .
chr1-183935-184770                          .                          .
chr1-191052-191934                          .                          .
chr1-629412-630393                          .                          .
chr1-631285-632180                          .                          .
`assay(ATAC.se,'counts')[1:5,1:5]
                   AAACCAGGTGCGCAAGACAGACCT-1 AAACCGGTCCAGAACGACAGACCT-1 AAACGTTCATTTCTTCACAGACCT-1
chr1-180734-181683                          0                          0                          0
chr1-183935-184770                          0                          0                          0
chr1-191052-191934                          0                          0                          0
chr1-629412-630393                          0                          0                          0
chr1-631285-632180                          0                          0                          0
                   AAAGATGCAATGGGAGACAGACCT-1 AAAGCATGTAACCGCCACAGACCT-1
chr1-180734-181683                          0                          0
chr1-183935-184770                          0                          0
chr1-191052-191934                          0                          0
chr1-629412-630393                          0                          0
chr1-631285-632180                          0                          0
`> granges(ATAC.se)
GRanges object with 93434 ranges and 0 metadata columns:
                         seqnames            ranges strand
                            <Rle>         <IRanges>  <Rle>
      chr1-180734-181683     chr1     180734-181683      *
      chr1-183935-184770     chr1     183935-184770      *
      chr1-191052-191934     chr1     191052-191934      *
      chr1-629412-630393     chr1     629412-630393      *
      chr1-631285-632180     chr1     631285-632180      *
                     ...      ...               ...    ...
  chrY-56868535-56869375     chrY 56868535-56869375      *
  chrY-56869529-56870353     chrY 56869529-56870353      *
  chrY-56870707-56871623     chrY 56870707-56871623      *
  chrY-56873514-56874309     chrY 56873514-56874309      *
  chrY-56879634-56880385     chrY 56879634-56880385      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths