Bioconductor / ExperimentHub

Client to access ExperimentHub resources
https://bioconductor.org/packages/ExperimentHub

A resource should not be made available until it's fully downloaded #32

Open hpages opened 1 year ago

hpages commented 1 year ago

Reporting this for ExperimentHub but I suspect that AnnotationHub and BiocFileCache have similar issues.

In session 1 do:

library(ExperimentHub)
hub <- ExperimentHub()
removeResources(hub, "EH1039")
fname <- hub[["EH1039"]]  # this will start downloading the resource to the cache 

While the EH1039 resource is being downloaded in session 1, it's immediately made available in session 2.

In session 2 do:

library(rhdf5)
library(ExperimentHub)
hub <- ExperimentHub()

## Don't wait for the download in session 1 to finish to do this:
fname <- hub[["EH1039"]]
rhdf5::h5ls(fname)
# Error in H5Fopen(file, flags = flags, fapl = fapl, native = native) : 
#   HDF5. File accessibility. Unable to open file.

The classic way around this is to have some kind of lock mechanism that lets other sessions know that the resource is currently being downloaded. The session that starts the download puts the lock on the resource, only if the resource is not already locked. Other sessions trying to access the resource then just wait for the lock to be removed before they return the resource to the user.
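A minimal sketch of such a lock, using the filelock package (already loaded as a BiocFileCache dependency, per the sessionInfo below). The fetch_with_lock and download_resource names are hypothetical, not ExperimentHub API:

```r
library(filelock)

## Hypothetical sketch: serialize downloads of one resource across sessions.
## 'download_resource' and the paths are placeholders, not ExperimentHub API.
fetch_with_lock <- function(resource_id, cache_dir, download_resource) {
    lockfile <- file.path(cache_dir, paste0(resource_id, ".lock"))
    ## Block until no other session holds the lock for this resource.
    lck <- filelock::lock(lockfile, exclusive = TRUE, timeout = Inf)
    on.exit(filelock::unlock(lck), add = TRUE)  # released even on error

    dest <- file.path(cache_dir, resource_id)
    if (!file.exists(dest))      # another session may have finished it already
        download_resource(resource_id, dest)
    dest
}
```

Because the existence check happens while holding the lock, a second session that was waiting will find the file already in place and skip the download, which also rules out two sessions downloading the same resource concurrently.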

This will also have the benefit of preventing 2 sessions from starting the same download. Right now this is possible if for example session 2 does fname <- hub[["EH1039", force=TRUE]] while session 1 is still downloading the resource. I don't know what the exact consequences of this are, but concurrent downloads of the same resource seem like something that should not be permitted.

There's also the question of what happens when the session that started a download dies before the download is complete. Right now it seems that we end up with a corrupted resource in the cache, unless the download was cleanly interrupted at the command line with <CTRL+C> in session 1, in which case ExperimentHub seems able to remove the corrupted resource from the cache. But after a more brutal death (e.g. the user inadvertently kills their RStudio session or the terminal where they were running R at the command line, or the server is rebooted), the resource that ends up in the cache will be corrupted. This can be avoided by making the "download + register the resource in the sqlite db" sequence an atomic operation. Note that this is something that was brought up here last year.
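One way to get that atomicity, sketched here with base R only (the function names are hypothetical): download to a temporary file in the destination directory, and only move it into place and register it once the download has completed. file.rename() is atomic within one filesystem, so a killed session leaves behind at worst an unregistered .part file, never a corrupted registered resource.

```r
## Hypothetical sketch of an atomic "download + register" step.
atomic_fetch <- function(url, dest, register) {
    ## Download to a temp file *in the destination directory* so the
    ## final rename stays on one filesystem (where rename is atomic).
    tmp <- tempfile(tmpdir = dirname(dest), fileext = ".part")
    ok <- FALSE
    on.exit(if (!ok && file.exists(tmp)) unlink(tmp), add = TRUE)

    download.file(url, tmp, mode = "wb")
    if (!file.rename(tmp, dest))
        stop("failed to move downloaded file into place")
    register(dest)   # e.g. insert the row in the sqlite db, last of all
    ok <- TRUE
    dest
}
```

A periodic sweep for stale *.part files would then be enough to clean up after crashed sessions.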

Thanks, H.

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 23.04

Matrix products: default
BLAS:   /home/hpages/R/R-4.3.0/lib/libRblas.so 
LAPACK: /home/hpages/R/R-4.3.0/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] TENxBrainData_1.21.0        SingleCellExperiment_1.23.0
 [3] SummarizedExperiment_1.31.1 Biobase_2.61.0             
 [5] ExperimentHub_2.9.1         AnnotationHub_3.9.2        
 [7] BiocFileCache_2.9.1         dbplyr_2.3.4               
 [9] HDF5Array_1.29.3            rhdf5_2.45.1               
[11] DelayedArray_0.27.10        SparseArray_1.1.12         
[13] S4Arrays_1.1.6              abind_1.4-5                
[15] MatrixGenerics_1.13.1       matrixStats_1.0.0          
[17] Matrix_1.6-1.1              GenomicRanges_1.53.2       
[19] GenomeInfoDb_1.37.6         IRanges_2.35.3             
[21] S4Vectors_0.39.3            BiocGenerics_0.47.0        

loaded via a namespace (and not attached):
 [1] KEGGREST_1.41.4               lattice_0.21-9               
 [3] rhdf5filters_1.13.5           vctrs_0.6.3                  
 [5] tools_4.3.0                   bitops_1.0-7                 
 [7] generics_0.1.3                curl_5.1.0                   
 [9] AnnotationDbi_1.63.2          tibble_3.2.1                 
[11] fansi_1.0.5                   RSQLite_2.3.1                
[13] blob_1.2.4                    pkgconfig_2.0.3              
[15] lifecycle_1.0.3               GenomeInfoDbData_1.2.10      
[17] compiler_4.3.0                Biostrings_2.69.2            
[19] httpuv_1.6.11                 htmltools_0.5.6.1            
[21] RCurl_1.98-1.12               yaml_2.3.7                   
[23] interactiveDisplayBase_1.39.0 pillar_1.9.0                 
[25] later_1.3.1                   crayon_1.5.2                 
[27] ellipsis_0.3.2                cachem_1.0.8                 
[29] mime_0.12                     tidyselect_1.2.0             
[31] digest_0.6.33                 purrr_1.0.2                  
[33] dplyr_1.1.3                   BiocVersion_3.18.0           
[35] fastmap_1.1.1                 grid_4.3.0                   
[37] cli_3.6.1                     magrittr_2.0.3               
[39] utf8_1.2.3                    withr_2.5.1                  
[41] filelock_1.0.2                promises_1.2.1               
[43] rappdirs_0.3.3                bit64_4.0.5                  
[45] XVector_0.41.1                httr_1.4.7                   
[47] bit_4.0.5                     png_0.1-8                    
[49] memoise_2.0.1                 shiny_1.7.5                  
[51] rlang_1.1.1                   Rcpp_1.0.11                  
[53] xtable_1.8-4                  glue_1.6.2                   
[55] DBI_1.1.3                     BiocManager_1.30.22          
[57] R6_2.5.1                      Rhdf5lib_1.23.2              
[59] zlibbioc_1.47.0              
lshep commented 1 year ago

We do have a lock in BFC but apparently that isn't sufficient. We can look into it.

romanzenka commented 1 year ago

I think this is a related but different problem. The problem you describe is about resource acquisition going through the stages "we do not have it" -> "downloading" -> "available". A caching mechanism that is not aware that a resource is already being downloaded will waste downloads, as multiple processes each try and only one (hopefully) wins and gets the cache updated.

The problem with marking a resource as "downloading" is that if the download fails, the resource becomes a deadlock: everyone waits forever for a download that will never finish. So you would have to be extremely careful and install traps so that the resource is unlocked if the downloading process dies before completion. And even that can fail, so you would also have to implement timeouts that clear stale downloads after a period of inactivity. All of this is necessary for a truly robust solution, but I think it is also quite hard to do correctly.
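Worth noting that OS-level advisory file locks (which the filelock package uses) sidestep part of this: the operating system releases them automatically when the holding process dies, so a crashed downloader cannot deadlock waiters forever. A timeout then only needs to guard against downloads that hang rather than die. A sketch, with an arbitrary 5-minute figure:

```r
library(filelock)

## Sketch: wait up to 5 minutes for another session's download to finish.
lck <- filelock::lock("EH1039.lock", timeout = 5 * 60 * 1000)  # milliseconds
if (is.null(lck)) {
    stop("EH1039 is locked by another session and did not become ",
         "available within 5 minutes; try again later")
}
## ... use the resource ...
filelock::unlock(lck)
```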

Lastly, SQLite is not the best mechanism for synchronizing a large number of very active processes. It supports rudimentary locking at the file level, but it is not bulletproof (read more here: https://www.sqlite.org/lockingv3.html ). So if you are after a genuinely robust solution, the package would have to support other, more robust database engines - things you can only get from a database server.
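That said, before reaching for a database server, SQLite can be made considerably more tolerant of concurrent sessions with a busy timeout (retry instead of failing immediately with "database is locked") and WAL journal mode (readers proceed while one writer is active). A sketch using DBI/RSQLite, both already loaded per the sessionInfo above, against a throwaway database file:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

## Retry for up to 5 s when another session holds the write lock,
## instead of failing immediately with "database is locked".
dbGetQuery(con, "PRAGMA busy_timeout = 5000")

## WAL mode lets readers proceed while one writer is active.
mode <- dbGetQuery(con, "PRAGMA journal_mode = WAL")

dbDisconnect(con)
```

These pragmas don't make SQLite a coordination service, but they remove the most common "database is locked" failures under moderate concurrency.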