ZW-xjtlu / exomePeak2

Peak calling and differential methylation for MeRIP-Seq

failing due to memory / "caught segfault, address (nil), cause 'unknown'" #40

Closed acurry-hyde closed 1 year ago

acurry-hyde commented 1 year ago

Hello - I've been attempting to run exomePeak2 on the hg38 full transcript for chr10 only, with single samples provided for IP and input, control vs. treated. It has been failing consistently due to 1) memory limit issues and 2) now, with increased memory, the error: "caught segfault, address (nil), cause 'unknown'".

I'm using an HPC Linux OS, requesting a single node with 20 threads and 128 GB memory, and a conda-compiled R 4.3 environment. Within R I'm increasing the memory limit to 100 GB using the 'ulimit' package.

1) memory limit issue:

> library(exomePeak2)
> library(DESeq2)
> library(GenomicFeatures)
> library(BSgenome.Hsapiens.UCSC.hg38)
> gc()
           used  (Mb)  gc trigger     (Mb) max used  (Mb)
Ncells  9343582 499.1    19313185   1031.5 11348671 606.1
Vcells 16300605 124.4 13107200000 100000.0 28107829 214.5
> txdb <- makeTxDbFromGFF(
+     file = "/ifs/data/sz3145_gp/genomes/Homo_sapiens/UCSC/hg38/Annotation/gencode.v43.annotation.gtf",
+     format = "gtf",
+     dataSource = "GENCODE GTF",
+     organism = "Homo sapiens")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> exomePeak2(
+     bam_ip = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"
+         ),
+     bam_input = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"
+         ),
+     bam_input_treated = c(   
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"
+         ),
+     bam_ip_treated = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"
+         ),
+     txdb = txdb,
+     genome = "BSgenome.Hsapiens.UCSC.hg38",
+     mode = "full_transcript" 
+     )
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... Killed

2) memory increased, caught segfault error:

> devtools::install_github("krlmlr/ulimit")
> library(ulimit)
> ulimit::memory_limit(100000) 
 soft  hard
1e+05   Inf
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  924928 49.4    1812270 96.8  1328472 71.0
Vcells 2404583 18.4    8388608 64.0  5079774 38.8
> library(exomePeak2)
> library(DESeq2)
> library(GenomicFeatures)
> library(BSgenome.Hsapiens.UCSC.hg38)
> txdb <- makeTxDbFromGFF(
+     file = "/ifs/data/sz3145_gp/genomes/Homo_sapiens/UCSC/hg38/Annotation/gencode.v43.annotation.gtf",
+     format = "gtf",
+     dataSource = "GENCODE GTF",
+     organism = "Homo sapiens")
> exomePeak2(
+     bam_ip = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"
+         ),
+     bam_input = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"
+         ),
+     bam_input_treated = c(   
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"
+         ),
+     bam_ip_treated = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"
+         ),
+     txdb = txdb,
+     genome = "BSgenome.Hsapiens.UCSC.hg38",
+     mode = "full_transcript" 
+     )
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... OK
Fit GC curves with smoothing splines ... OK
Calculate offset matrix for bins ... OK
Detect peaks with GLM ...
 *** caught segfault ***
address (nil), cause 'unknown' 

Traceback:
 1: t(exp(modelMatrix %*% t(betaRes$beta_mat)))
 2: fitNbinomGLMs(objectNZ, betaTol = betaTol, maxit = maxit, useOptim = useOptim,     useQR = useQR, renameCols = renameCols, modelMatrix = modelMatrix,     minmu = minmu)
 3: nbinomWaldTest(dds)
 4: callPeaks(se, txdb, test_method, p_cutoff, exByGene, bin_size,     motif_based)
 5: force(x)
 6: quiet(.)
 7: callPeaks(se, txdb, test_method, p_cutoff, exByGene, bin_size,     motif_based) %>% quiet
 8: diffAnalysis(bam_IP = bam_ip, bam_input = bam_input, bam_IP_treated = bam_ip_treated,     bam_input_treated = bam_input_treated, txdb = txdb, genome = genome,     bin_size = bin_size, step_size = step_size, fragment_length = fragment_length,     strandness = strandness, test_method = test_method, p_cutoff = p_cutoff,     plot_gc = plot_gc, parallel = parallel, motif_based = motif_based,     motif_sequence = "DRACH")
 9: exomePeak2(bam_ip = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"),     bam_input = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"),     bam_input_treated = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"),     bam_ip_treated = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"),     txdb = txdb, genome = "BSgenome.Hsapiens.UCSC.hg38", mode = "full_transcript")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

As we would prefer to run exomePeak2 on the whole BAM file, splitting it into chromosomes was only a workaround; however, even this single chromosome is not running to completion. I've read that exomePeak2 is highly memory intensive, yet the documentation states that only 4 GB of memory is needed for large genomes...

We've successfully run exomePeak2 on exons only, but we need the full-transcript results. Any help/guidance would be greatly appreciated!

ZW-xjtlu commented 1 year ago

Dear exomePeak2 User,

Thank you for highlighting the bug in the current version of exomePeak2. We discovered that the memory overflow issue originated from the removal of the read count filter for bins during last year's significant methodological update. This change was designed to enhance sensitivity, but it unfortunately led to memory problems when the mode is set to 'full_transcript'. We have addressed this issue in our latest update (version 1.9.3), which is now available in the exomePeak2 GitHub repository. Note that results in exon-only mode will not be affected by this upgrade. To update your package, please use remotes::install_github("ZW-xjtlu/exomePeak2").
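For reference, the update and a version check can be done from an R session as follows (a minimal sketch; it assumes the `remotes` package is already installed, and the `1.9.3` version string is taken from the reply above):

```r
# Install the patched exomePeak2 (>= 1.9.3) from GitHub;
# requires the 'remotes' package to be installed first
remotes::install_github("ZW-xjtlu/exomePeak2")

# Confirm the installed version before re-running the full_transcript analysis
packageVersion("exomePeak2")  # should report '1.9.3' or later
```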

We appreciate your continued support and feedback.

Best regards, Zhen Wei