Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation

backward compatibility of info() function #68

Closed shwong-tw closed 1 year ago

shwong-tw commented 1 year ago

Dear developer,

I would like to apply the info() function to a CollapsedVCF object that was stored previously.

With info() from VariantAnnotation version < 1.34.0 this works well. With version 1.42.1, however, it fails with the following error: "C stack usage 7971012 is too close to the limit". I did not try the versions between 1.34.0 and 1.42.1.

Could you kindly give me some insight into solving this issue?

Thank you very much!

vjcitn commented 1 year ago

Thanks for the report. Please supply the sessionInfo() result after the error, and ensure that BiocManager::valid() is TRUE.

it may take some time to fix

shwong-tw commented 1 year ago

Thanks for the prompt reply. I ran BiocManager::valid() and also sessionInfo(). Please find the results below:

In this environment VariantAnnotation::info() gives error

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.3.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base

other attached packages:
[1] VariantAnnotation_1.42.1 Rsamtools_2.12.0
[3] Biostrings_2.64.1 XVector_0.36.0
[5] SummarizedExperiment_1.26.1 Biobase_2.56.0
[7] GenomicRanges_1.48.0 GenomeInfoDb_1.32.4
[9] IRanges_2.30.1 S4Vectors_0.34.0
[11] MatrixGenerics_1.8.1 matrixStats_0.62.0
[13] BiocGenerics_0.42.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45
[3] prettyunits_1.1.1 png_0.1-7
[5] assertthat_0.2.1 digest_0.6.29
[7] utf8_1.2.2 BiocFileCache_2.4.0
[9] R6_2.5.1 RSQLite_2.2.20
[11] httr_1.4.3 pillar_1.7.0
[13] zlibbioc_1.42.0 rlang_1.0.6
[15] GenomicFeatures_1.48.4 progress_1.2.2
[17] curl_4.3.2 rstudioapi_0.13
[19] blob_1.2.3 Matrix_1.5-1
[21] BiocParallel_1.30.3 stringr_1.4.0
[23] RCurl_1.98-1.8 bit_4.0.4
[25] biomaRt_2.52.0 DelayedArray_0.22.0
[27] rtracklayer_1.56.1 compiler_4.2.0
[29] pkgconfig_2.0.3 tidyselect_1.1.2
[31] KEGGREST_1.36.3 tibble_3.1.7
[33] GenomeInfoDbData_1.2.8 codetools_0.2-18
[35] XML_3.99-0.13 fansi_1.0.3
[37] crayon_1.5.1 dplyr_1.0.10
[39] dbplyr_2.2.1 GenomicAlignments_1.32.1
[41] bitops_1.0-7 rappdirs_0.3.3
[43] grid_4.2.0 lifecycle_1.0.3
[45] DBI_1.1.3 magrittr_2.0.3
[47] cli_3.4.1 stringi_1.7.6
[49] cachem_1.0.6 xml2_1.3.3
[51] ellipsis_0.3.2 filelock_1.0.2
[53] vctrs_0.5.1 generics_0.1.3
[55] rjson_0.2.21 restfulr_0.0.15
[57] tools_4.2.0 bit64_4.0.5
[59] BSgenome_1.64.0 glue_1.6.2
[61] purrr_0.3.4 hms_1.1.2
[63] yaml_2.3.5 parallel_4.2.0
[65] fastmap_1.1.0 AnnotationDbi_1.58.0
[67] BiocManager_1.30.18 memoise_2.0.1
[69] BiocIO_1.6.0

Bioconductor version '3.15'

R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils
[7] datasets methods base

other attached packages:
[1] VariantAnnotation_1.34.0 Rsamtools_2.4.0
[3] Biostrings_2.56.0 XVector_0.30.0
[5] SummarizedExperiment_1.20.0 Biobase_2.50.0
[7] MatrixGenerics_1.2.1 matrixStats_0.61.0
[9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
[11] IRanges_2.24.1 S4Vectors_0.28.1
[13] BiocGenerics_0.36.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.8 lattice_0.20-41
[3] prettyunits_1.1.1 assertthat_0.2.1
[5] utf8_1.2.2 BiocFileCache_1.12.1
[7] R6_2.5.1 RSQLite_2.2.7
[9] httr_1.4.2 pillar_1.6.4
[11] zlibbioc_1.36.0 rlang_0.4.12
[13] GenomicFeatures_1.40.1 progress_1.2.2
[15] curl_4.3.2 rstudioapi_0.13
[17] blob_1.2.2 Matrix_1.3-4
[19] BiocParallel_1.24.1 stringr_1.4.0
[21] RCurl_1.98-1.5 bit_4.0.4
[23] biomaRt_2.44.4 DelayedArray_0.16.3
[25] rtracklayer_1.48.0 compiler_4.0.0
[27] pkgconfig_2.0.3 askpass_1.1
[29] openssl_1.4.6 tidyselect_1.1.1
[31] tibble_3.1.6 GenomeInfoDbData_1.2.4
[33] XML_3.99-0.6 fansi_1.0.2
[35] crayon_1.4.2 dplyr_1.0.7
[37] dbplyr_2.1.1 GenomicAlignments_1.24.0
[39] bitops_1.0-7 rappdirs_0.3.3
[41] grid_4.0.0 lifecycle_1.0.1
[43] DBI_1.1.1 magrittr_2.0.1
[45] stringi_1.7.6 cachem_1.0.6
[47] xml2_1.3.3 ellipsis_0.3.2
[49] vctrs_0.3.8 generics_0.1.1
[51] tools_4.0.0 bit64_4.0.5
[53] BSgenome_1.56.0 glue_1.6.0
[55] purrr_0.3.4 hms_1.1.1
[57] fastmap_1.1.0 AnnotationDbi_1.50.3
[59] BiocManager_1.30.10 memoise_2.0.0

Bioconductor version '3.11'

vjcitn commented 1 year ago

Have you tried the following? 1) Load the saved object without evaluating it. 2) Run newobj = updateObject([oldobj]) and see whether info() works on newobj.

If this updateObject approach does not work, it would be helpful if you could make available the old VCF, or an example that fails with new R.

shwong-tw commented 1 year ago

I always load the RData file in which the object was stored, which hopefully addresses the first suggestion. Renewing the object with updateObject() did not make it work.

Interestingly, I just found that info(test) gives the same error as before, whereas info(test[1:nrow(test),]) works well.

Unfortunately I cannot provide the old VCF because it contains sensitive data, and if I subset it with test2= test[1:5,], then info(test2) works well.

In any case, this already led to a workaround for the issue I encountered. Anyone who encounters this issue can renew the object with test= test[1:nrow(test),] to avoid the error. I'll probably close the issue here. Thank you again for the prompt feedback :)

Have a nice weekend!
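
The workaround described in this comment can be written as a small helper. This is only a sketch: `refresh_by_subsetting` is a hypothetical name, not a function of VariantAnnotation, and the thread reports only that the approach worked for this particular object, not why. It uses `seq_len(nrow(x))` in place of `1:nrow(x)` so that an object with zero rows is handled correctly.

```r
# Hypothetical helper (not part of VariantAnnotation): rebuild a
# two-dimensional object by subsetting it to all of its own rows.
# seq_len() is used instead of 1:nrow(x) so that zero-row input
# yields an empty selection rather than the rows c(1, 0).
refresh_by_subsetting <- function(x) {
  x[seq_len(nrow(x)), ]
}

# Usage with the object from this thread (requires VariantAnnotation):
# library(VariantAnnotation)
# test <- refresh_by_subsetting(test)
# info(test)
```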

LiNk-NY commented 1 year ago

@shwong-tw Have you tried using the updateObject package to update your old instance? It may provide some benefits in that respect. Perhaps Hervé @hpages can comment further. Best, Marcel

vjcitn commented 1 year ago

Glad to hear about the workaround. I will fire up bioc 3.11 and see if I can export a VCF that will demonstrate the problem in current bioc. Then we can be more concrete about a repair.

shwong-tw commented 1 year ago

Hi Marcel, when I tried the BiocGenerics::updateObject function the error remained. I just installed the updateObject package as you suggested, but I don't see a relevant function for updating a VCF object.

Thank you!

LiNk-NY commented 1 year ago

Sorry it wasn't entirely clear to me whether you have an .Rds or .Rda file or an actual object. For the first two options, you can use update_rds_file or update_rda_file in the package, respectively.

shwong-tw commented 1 year ago

Hi Marcel,

I used the save() function to store the intermediate data, so I suppose it is an .Rda file. I just ran updateObject::update_rda_file on this intermediate file and it seems to be doing something:

File test.Rda: load().. ok [10 object(s)]; updateObject(logical, check=FALSE).. no-op; updateObject(factor, check=FALSE).. no-op; updateObject(list, check=FALSE).. object updated; updateObject(list, check=FALSE).. object updated; updateObject(list, check=FALSE).. object updated; updateObject(list, check=FALSE).. object updated; updateObject(list, check=FALSE).. object updated; updateObject(numeric, check=FALSE).. no-op; updateObject(list, check=FALSE).. no-op; updateObject(list, check=FALSE).. object updated; saving file.. OK ==> 1

However, after loading the updated object, info() still gives the same error. Separately, I would suggest warning users that updateObject::update_rda_file overwrites the original file by default.

Thank you and have a nice weekend!

LiNk-NY commented 1 year ago

Hi @shwong-tw Thanks for testing the updateObject::update_rda_file. The point you make may be helpful for the documentation in the package (cc: @hpages). We will get back to you when we have a reproducible example. Best regards, Marcel

hpages commented 1 year ago

@shwong-tw At the root of the problem is that BiocGenerics::updateObject() doesn't seem to be able to fix your old CollapsedVCF instance. Using the updateObject package won't change that, because all this package does is provide some convenience wrappers around BiocGenerics::updateObject().

So we need to understand why BiocGenerics::updateObject() fails to fix your old CollapsedVCF instance. However this is very hard without having access to it. Maybe you can run the following:

library(VariantAnnotation)
load(...)  # or data(...); the loaded object is assumed to be named 'vcf' below

vcf  # try to display the object

class(vcf@info)

vcf <- BiocGenerics::updateObject(vcf, verbose=TRUE)

vcf  # try to display the object again

class(vcf@info)

vcf_info <- info(vcf)
class(vcf_info)
vcf_info

and share the output here? Note that you want to do this with the most recent version of Bioconductor that you have access to (seems like it's BioC 3.15 for you but note that the most current version is BioC 3.16, I suggest that you update your installation ASAP).

I have a feeling that the problem is that vcf@info is an old DataFrame instance that needs to be replaced with a DFrame instance but BiocGenerics::updateObject(vcf) doesn't do that (it ignores the slot).

Thanks!
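
To check the hypothesis above for every slot at once, and not only for vcf@info, something like the following sketch could be used. `slot_classes` is a hypothetical helper written for illustration; it relies only on `slotNames()` and `slot()` from the methods package. Slot names are taken from the currently installed class definition, so a slot that an old instance lacks is reported as NA.

```r
# Hypothetical helper (not part of any Bioconductor package): report the
# class recorded on each slot of an S4 object. A slot that cannot be read
# from the instance is reported as NA instead of raising an error.
slot_classes <- function(x) {
  vapply(
    methods::slotNames(x),
    function(s) {
      tryCatch(class(methods::slot(x, s))[1],
               error = function(e) NA_character_)
    },
    character(1)
  )
}

# Usage with the object from this thread (requires VariantAnnotation):
# slot_classes(vcf)
```

If the hypothesis holds, the entry for the info slot would read "DataFrame" on the old instance where a freshly created object shows "DFrame".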

hpages commented 1 year ago

@shwong-tw

So here's one way to reproduce this (with BioC 3.16):

library(VariantAnnotation)
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
class(vcf@info)
# [1] "DFrame"
# attr(,"package")
# [1] "S4Vectors"
class(vcf@info) <- "DataFrame"
info(vcf)
# Error: C stack usage  7969924 is too close to the limit

See my sessionInfo() below.

FWIW I just added an updateObject() method for VCF objects to VariantAnnotation 1.44.1 (BioC 3.16) and 1.45.1 (BioC 3.17). This method should be able to fix the info and fixed slots of your old CollapsedVCF instances.

These new VariantAnnotation versions should propagate and become available via BiocManager::install() in the next 24-48 hours or so. However you will first need to update your installation to BioC 3.16 to get access to VariantAnnotation 1.44.1.

After you've tried updateObject() on your old CollapsedVCF instance and made sure that everything works as expected with the updated instance, you should save() it again to disk so you don't have to call updateObject() again on it next time you load it.
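
The load, update and save cycle described above could look roughly like this. It is a sketch under assumptions: `resave_updated` is a hypothetical helper, and it writes to a separate output file so that the original is left untouched until the result has been checked. The update step is passed in as a function, so the helper itself needs only base R.

```r
# Hypothetical helper: load every object stored in an .rda file, apply
# 'update_fun' to each one, and save the results under the same names
# to 'out_path'. The input file is never modified.
resave_updated <- function(rda_path, out_path, update_fun) {
  env <- new.env()
  obj_names <- load(rda_path, envir = env)  # names of the restored objects
  for (nm in obj_names) {
    assign(nm, update_fun(env[[nm]]), envir = env)
  }
  save(list = obj_names, file = out_path, envir = env)
  invisible(obj_names)
}

# Usage with the file from this thread (requires BiocGenerics and a
# VariantAnnotation version that provides the new updateObject() method):
# resave_updated("test.Rda", "test_updated.Rda",
#                function(x) BiocGenerics::updateObject(x))
```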

If you have other old serialized S4 instances on your disk, say in the path/to/saved/objects/ folder, you should be able to run updateSerializedObjects("path/to/saved/objects", recursive=TRUE) to update them all. Yes, updateSerializedObjects() is going to perform in-place replacement of the .rda and/or .rds files found in path/to/saved/objects/, but only if the serialized objects stored in those files actually needed to be updated.
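
Because the replacement happens in place, it may be worth copying the files elsewhere first. The sketch below uses only base R; `backup_serialized` and the backup path are hypothetical, and the set of file extensions it matches (.rda, .rds, .RData, case-insensitively) is an assumption, not necessarily the set that updateSerializedObjects() looks for.

```r
# Hypothetical helper: copy all .rda/.rds/.RData files found under
# 'dirpath' (recursively) into 'backup_dir', preserving subfolders.
# Existing backup files are not overwritten.
backup_serialized <- function(dirpath, backup_dir) {
  files <- list.files(dirpath, pattern = "\\.(rda|rds|rdata)$",
                      ignore.case = TRUE, recursive = TRUE)
  targets <- file.path(backup_dir, files)
  for (d in unique(dirname(targets))) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  ok <- file.copy(file.path(dirpath, files), targets, overwrite = FALSE)
  stats::setNames(ok, files)  # TRUE for each file that was copied
}

# Usage, followed by the in-place update quoted above:
# backup_serialized("path/to/saved/objects", "path/to/backup")
# updateObject::updateSerializedObjects("path/to/saved/objects", recursive=TRUE)
```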

Best, H.

sessionInfo():

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VariantAnnotation_1.44.0    Rsamtools_2.14.0           
 [3] Biostrings_2.66.0           XVector_0.38.0             
 [5] SummarizedExperiment_1.28.0 Biobase_2.58.0             
 [7] GenomicRanges_1.50.2        GenomeInfoDb_1.34.9        
 [9] IRanges_2.32.0              S4Vectors_0.36.1           
[11] MatrixGenerics_1.10.0       matrixStats_0.63.0         
[13] BiocGenerics_0.44.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10              lattice_0.20-45          prettyunits_1.1.1       
 [4] png_0.1-8                assertthat_0.2.1         digest_0.6.31           
 [7] utf8_1.2.3               BiocFileCache_2.6.0      R6_2.5.1                
[10] RSQLite_2.2.20           httr_1.4.4               pillar_1.8.1            
[13] zlibbioc_1.44.0          rlang_1.0.6              GenomicFeatures_1.50.4  
[16] progress_1.2.2           curl_5.0.0               blob_1.2.3              
[19] Matrix_1.5-3             BiocParallel_1.32.5      stringr_1.5.0           
[22] RCurl_1.98-1.10          bit_4.0.5                biomaRt_2.54.0          
[25] DelayedArray_0.24.0      rtracklayer_1.58.0       compiler_4.2.2          
[28] pkgconfig_2.0.3          tidyselect_1.2.0         KEGGREST_1.38.0         
[31] tibble_3.1.8             GenomeInfoDbData_1.2.9   codetools_0.2-19        
[34] XML_3.99-0.13            fansi_1.0.4              crayon_1.5.2            
[37] dplyr_1.1.0              dbplyr_2.3.0             GenomicAlignments_1.34.0
[40] bitops_1.0-7             rappdirs_0.3.3           grid_4.2.2              
[43] lifecycle_1.0.3          DBI_1.1.3                magrittr_2.0.3          
[46] cli_3.6.0                stringi_1.7.12           cachem_1.0.6            
[49] xml2_1.3.3               ellipsis_0.3.2           filelock_1.0.2          
[52] vctrs_0.5.2              generics_0.1.3           rjson_0.2.21            
[55] restfulr_0.0.15          tools_4.2.2              bit64_4.0.5             
[58] BSgenome_1.66.2          glue_1.6.2               hms_1.1.2               
[61] yaml_2.3.7               parallel_4.2.2           fastmap_1.1.0           
[64] AnnotationDbi_1.60.0     memoise_2.0.1            BiocIO_1.8.0