Zilong-Li / vcfppR

The fastest VCF/BCF parser in R https://doi.org/10.1093/bioinformatics/btae049
https://zilong-li.github.io/vcfppR/

Error: cannot allocate vector of size 12.2 Gb #4

Closed Truongphikt closed 1 month ago

Truongphikt commented 1 month ago

Hi, I'm excited to use this tool to evaluate my imputation results. However, I run into this error with my own data, even though the repo's test set runs successfully.

This is my R script:

library(vcfppR)

rawvcf = "<path_to_file>/SAS-Axiom_JAPONICA_chr12_filterd.vcf.gz"
phasedvcf= "<path_to_file>/test_from_prs203/SAS_chr12_extract.vcf.gz"
maf_file = "<path_to_file>/test_from_prs203/12_maf.txt"

res <- vcfcomp(test = rawvcf, truth = phasedvcf,
               stats = "r2", 
               formats = c("GT","GT"))

# Save the plot to a PNG file
png(filename = "data_vcfplot_output.png", width = 800, height = 600)

par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")

# Close the graphics device
dev.off()

png(filename = "data_phasing_output.png", width = 800, height = 600)
res <- vcfcomp(test = rawvcf, truth = phasedvcf,
               stats = "pse",
               return_pse_sites = TRUE)
#> stats F1 or NRC or PSE only uses GT format
vcfplot(res, which=1:2, main = "Phasing switch error", ylab = "HG00673,NA10840")
dev.off()

The error appears at the phasing step (the second vcfcomp call) after a while:

> 
> 
> png(filename = "data_phasing_output.png", width = 800, height = 600)
> res <- vcfcomp(test = rawvcf, truth = phasedvcf,
+                stats = "pse",
+                return_pse_sites = TRUE)
stats F1 or NRC or PSE only uses GT format
Error: cannot allocate vector of size 12.2 Gb
Execution halted

Zilong-Li commented 1 month ago

Thanks. Can you show me the machine information and the output of sessionInfo()? Also, if you have limited memory, you may need to run gc() a few times between vcfcomp() calls to reduce peak RAM.
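To illustrate the gc() suggestion, here is a minimal base-R sketch. A plain numeric vector stands in for a large vcfcomp() result; the object name and its size are assumptions for illustration, not vcfppR output. Note that gc() can only reclaim an object once nothing refers to it any more, hence the rm() first.

```r
# Stand-in for a large result object (1e7 doubles, roughly 76 MB);
# in practice this would be the value returned by vcfcomp().
big_result <- numeric(1e7)
size_mb <- as.numeric(utils::object.size(big_result)) / 1024^2

# ... use `big_result` here (write tables, draw plots) ...

# Drop the last reference, then ask R to collect garbage.
rm(big_result)
mem <- gc()  # returns a matrix summarising current memory usage

print(size_mb)
print(mem)
```

Running rm() and gc() between two heavy calls lowers the memory held when the second call starts; it does not reduce what a single call needs on its own.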

Zilong-Li commented 1 month ago

Another tip for vcftable and vcfcomp is to set info=FALSE, which can save a lot of memory when the VCF has a lot of annotation in the INFO field.

Truongphikt commented 1 month ago

@Zilong-Li Thank you for the quick reply. I use vcfppR v0.4.6, and this is the output of sessionInfo():

> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/envs/vcfppR/lib/libopenblasp-r0.3.27.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vcfppR_0.4.7

loaded via a namespace (and not attached):
[1] compiler_4.2.3   Rcpp_1.0.13      codetools_0.2-20

I also tried info=false on the repo's test set, but I got an error:

> res <- vcfcomp(test = rawvcf, truth = phasedvcf,
+                stats = "r2", region = "chr21:1-5100000",
+                info = false,
+                formats = c("GT","GT"))
Error in tableGT(vcffile, region, samples, "GT", ids, qual, pass, info,  : 
  object 'false' not found
Calls: vcfcomp ... tryCatchList -> tryCatchOne -> <Anonymous> -> vcftable -> tableGT
Execution halted

Zilong-Li commented 1 month ago

R's logical constants are upper case (TRUE and FALSE), so use info=FALSE.
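A small base-R sketch of why the lowercase spelling fails, assuming a fresh session in which no variable named `false` has been defined: R treats `false` as an ordinary symbol and tries to look it up, which produces the "object 'false' not found" error shown above.

```r
# `false` is not a keyword in R; evaluating it looks up a variable
# with that name and raises an error if none is defined.
lower_case_works <- tryCatch({
  eval(quote(false))
  TRUE
}, error = function(e) FALSE)

# TRUE and FALSE are reserved words of type logical.
upper_is_logical <- is.logical(FALSE)

print(c(lower_case_works = lower_case_works,
        upper_is_logical = upper_is_logical))
```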

Truongphikt commented 1 month ago

@Zilong-Li Do you have another solution for the memory errors? When I run the repo's test set on the whole of chr21 (the tutorial uses chr21:1-5100000), the error is still there, even though I granted 150 GB to the process.

Command error:
  Error: cannot allocate vector of size 44.6 Gb
  Execution halted

Zilong-Li commented 1 month ago

Hey, did you read my recommendation on gc()? This is an R memory-management issue that you have to learn to handle. There is indeed another solution that avoids allocating all the memory at once. Remind me if I take too long to write a document on memory-efficient usage.
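One way to avoid allocating everything at once is to process the chromosome in windows through the region argument that vcfcomp already accepts. This is only a sketch and not necessarily the solution the maintainer has in mind. The helper `make_regions`, the window size, the chromosome end, and the `test_vcf` / `truth_vcf` paths are all hypothetical.

```r
# Hypothetical helper (not part of vcfppR): split [start, end] on one
# chromosome into consecutive windows of at most `width` bases and
# return them as "chrom:from-to" strings.
make_regions <- function(chrom, start, end, width) {
  from <- seq(start, end, by = width)
  to <- pmin(from + width - 1, end)
  sprintf("%s:%d-%d", chrom, as.integer(from), as.integer(to))
}

# Window size and chromosome end are illustrative values only.
regions <- make_regions("chr21", 1, 20000000, 5000000)
print(regions)

# Sketch of the loop, left commented out because it needs vcfppR and
# real files (`test_vcf` and `truth_vcf` are placeholder paths):
# for (reg in regions) {
#   res <- vcfppR::vcfcomp(test = test_vcf, truth = truth_vcf,
#                          stats = "r2", region = reg, info = FALSE,
#                          formats = c("GT", "GT"))
#   saveRDS(res, paste0("r2_", gsub("[:-]", "_", reg), ".rds"))
#   rm(res); gc()
# }
```

Per-window results are not the same statistic as one computed over the whole chromosome, so how to combine them is left open here.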

Truongphikt commented 1 month ago

Sorry for not being clear. When I run my process, I call vcfcomp only once, so gc() is irrelevant, isn't it? This is my command:

library(vcfppR)
res <- vcfcomp(test = "./20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr21.recalibrated_variants.vcf.gz",
               truth = "./1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz",
               info = FALSE,
               stats = "r2",
               af = "21_maf.txt",
               bin = seq(0, 0.5, length.out = 20),
               by.sample = FALSE,
               by.variant = TRUE,
               formats = c("GT", "GT"))
write.csv(as.data.frame(res$r2), file = "none_chr21_snp-wise.csv", row.names = TRUE)
# Save the plot to a PNG file
png(filename = "none_chr21_snp-wise.png", width = 800, height = 600)

par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")
dev.off()

Even when I use gc(), another error pops up (also related to memory):

  *** caught segfault ***
  address 0xc, cause 'memory not mapped'
  Error: no more error handlers available (recursive errors?); invoking 'abort' restart
  Execution halted
  Warning message:
  system call failed: Cannot allocate memory 

These are the memory errors I have encountered so far:

[E::bgzf_uncompress] Call to inflateInit2 failed: out of memory
  Error: std::bad_alloc
  Execution halted

Error: cannot allocate vector of size <...> Gb
  Execution halted

Thank you.
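The "cannot allocate vector of size ... Gb" message reports the size of one allocation that R could not satisfy, so a rough estimate of the largest object can help when choosing a region size. The sketch below assumes a dense matrix with one value per variant per sample. How vcfppR stores genotypes internally is not stated in this thread, so the helper `dense_matrix_gib` and the counts are illustrative assumptions only.

```r
# Hypothetical helper: size in GiB of a dense matrix with one value
# per variant per sample, at `bytes_per_value` bytes each
# (4 for an R integer, 8 for a double).
dense_matrix_gib <- function(n_variants, n_samples, bytes_per_value = 4) {
  n_variants * n_samples * bytes_per_value / 1024^3
}

# Illustrative counts only, not taken from the files in this thread.
est <- dense_matrix_gib(n_variants = 1e6, n_samples = 2000)
print(est)
```

Temporary copies made during a computation can push peak usage above such an estimate, so it is best read as a lower bound.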