HenrikBengtsson / future.batchtools

:rocket: R package future.batchtools: A Future API for Parallel and Distributed Processing using batchtools
https://future.batchtools.futureverse.org
84 stars 9 forks source link

Proposed bugfix for batchtools reveals bug in future.batchtools? #74

Open stuvet opened 3 years ago

stuvet commented 3 years ago

I have submitted a simple pull request for a bugfix in batchtools which fuelled some of its behaviour mentioned in #73.

I'm writing here because this proposed bugfix reveals an error in future.batchtools, though based on the future.debug output I don't believe the two are related - just that the batchtools bug previously threw the error first.

Describe the bug When a slurm worker is availble for the job, (& when the batchtools::waitForFile is not required by batchtools::getLog -> batchtools::readLog), everything functions correctly (note the status):

[01:10:21.286] Job Registry
[01:10:21.286]   Backend  : Slurm
[01:10:21.286]   File dir : /home/stuart/pipelines/production_reports/.future/20210605_011015-MndN2h/todo_chains_5cd13a69_900365739
[01:10:21.286]   Work dir : /home/stuart/pipelines/production_reports
[01:10:21.286]   Jobs     : 0
[01:10:21.286]   Seed     : 26422
[01:10:21.286]   Writeable: TRUE
[01:10:21.287] Attaching 1 packages (‘targets’) ...
[01:10:21.294] Attaching 1 packages (‘targets’) ... DONE
[01:10:22.894] batchtools::batchMap()
Adding 1 jobs ...
[01:10:22.919] Created BatchtoolsFuture future #1
Submitting 1 jobs in 1 chunks using cluster functions 'Slurm' ...
[01:10:24.107] Launched future #1
[01:10:24.109] plan(): Setting new future strategy stack:
[01:10:24.109] List of future strategies:
[01:10:24.109] 1. FutureStrategy:
[01:10:24.109]    - args: function (..., envir = parent.frame(), template = "batchtools.slurm.tmpl", workers = 8, resources = list(bucket = "targets-production-reports-209813", walltime = 3600, memory = 2048, ncpus = 1, ntasks = 1, partition = "ph4"))
[01:10:24.109]    - tweaked: TRUE
[01:10:24.109]    - call: future::plan(slurm)
[01:10:24.111] plan(): nbrOfWorkers() = 8
[01:10:24.112] status() for BatchtoolsFuture ...
[01:10:24.377] - status: ‘defined’, ‘running’, ‘started’, ‘submitted’, ‘system’
[01:10:24.377] status() for BatchtoolsFuture ... done

But when a slurm worker needs to be provisioned to run the job (& so the batchtools::waitForFile will also be called), the initial result of the call to future.batchtools::status(future) in future.batchtools::await is incorrect:

[01:43:55.613] Job Registry
[01:43:55.613]   Backend  : Slurm
[01:43:55.613]   File dir : /home/stuart/pipelines/production_reports/.future/20210605_014355-hJknNd/todo_chains_dced720f_1866986666
[01:43:55.613]   Work dir : /home/stuart/pipelines/production_reports
[01:43:55.613]   Jobs     : 0
[01:43:55.613]   Seed     : 23275
[01:43:55.613]   Writeable: TRUE
[01:43:55.614] Attaching 1 packages (‘targets’) ...
[01:43:55.622] Attaching 1 packages (‘targets’) ... DONE
[01:43:57.447] batchtools::batchMap()
Adding 1 jobs ...
[01:43:57.482] Created BatchtoolsFuture future #1
Submitting 1 jobs in 1 chunks using cluster functions 'Slurm' ...
[01:43:58.682] Launched future #1
[01:43:58.683] plan(): Setting new future strategy stack:
[01:43:58.683] List of future strategies:
[01:43:58.683] 1. FutureStrategy:
[01:43:58.683]    - args: function (..., envir = parent.frame(), template = "batchtools.slurm.tmpl", workers = 8, resources = list(bucket = "targets-production-reports-209813", walltime = 3600, memory = 2048, ncpus = 1, ntasks = 1, partition = "ph4"))
[01:43:58.683]    - tweaked: TRUE
[01:43:58.683]    - call: future::plan(slurm)
[01:43:58.684] plan(): nbrOfWorkers() = 8
[01:43:58.685] status() for BatchtoolsFuture ...
[01:43:58.778] - status: ‘defined’, ‘expired’, ‘submitted’
[01:43:58.779] status() for BatchtoolsFuture ... done
[01:43:58.779] status() for BatchtoolsFuture ...
[01:43:58.870] - status: ‘defined’, ‘expired’, ‘submitted’
[01:43:58.871] status() for BatchtoolsFuture ... done
[01:43:58.872] batchtools::waitForJobs() ...
[01:44:03.310] - batchtools::waitForJobs(): FALSE
[01:44:03.310] status() for BatchtoolsFuture ...
[01:44:03.414] - status: ‘defined’, ‘expired’, ‘submitted’
[01:44:03.414] status() for BatchtoolsFuture ... done
[01:44:03.414] - status(): ‘defined’, ‘expired’, ‘submitted’
[01:44:03.414] batchtools::waitForJobs() ... done
[01:44:03.415] Results:
[01:44:03.415] status() for BatchtoolsFuture ...
[01:44:03.508] - status: ‘defined’, ‘expired’, ‘submitted’
[01:44:03.509] status() for BatchtoolsFuture ... done
[01:44:03.509] status() for BatchtoolsFuture ...
[01:44:03.601] - status: ‘defined’, ‘expired’, ‘submitted’
[01:44:03.601] status() for BatchtoolsFuture ... done

At this point, logging inserted into batchtools:::waitForFile begins to appear. No more future.debug messages appear until the log file has been detected (now, after the proposed bugfix) & batchtools::waitForFile exits.

[01:44:47.684] plan(): Setting new future strategy stack:
[01:44:47.684] List of future strategies:
[01:44:47.684] 1. sequential:
[01:44:47.684]    - args: function (..., envir = parent.frame())
[01:44:47.684]    - tweaked: FALSE
[01:44:47.684]    - call: future::plan(future::sequential)
[01:44:47.685] plan(): nbrOfWorkers() = 1
Error : BatchtoolsExpiration: Future ('todo_chains_dced720f') expired (registry path /home/stuart/pipelines/production_reports/.future/20210605_014355-hJknNd/todo_chains_dced720f_1866986666).. No logged output exist.
In addition: There were 17 warnings (use warnings() to see them)
[01:44:47.724] status() for BatchtoolsFuture ...
[01:44:47.829] - status: ‘defined’, ‘running’, ‘started’, ‘submitted’, ‘system’
[01:44:47.829] status() for BatchtoolsFuture ... done

To be clear, the logged output does exist, and continues to be written by the running job after future.batchtools::await flags the job as expired & exits.

Please let me know if there's anything else I can do to help resolve this issue.

Expected behavior future.batchtools::await waits for the running job to exit, even when workers need to be provisioned or batchtools::waitForFile is triggered.

Session information

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /apps/R/4.1.0/lib64/R/lib/libRblas.so
LAPACK: /apps/R/4.1.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] defineDataAccess_0.0.0.9002 forcats_0.5.1               stringr_1.4.0              
 [4] purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                
 [7] tibble_3.1.2                ggplot2_3.3.3               tidyverse_1.3.1            
[10] brms_2.15.0                 Rcpp_1.0.6                  dplyr_1.0.6                
[13] future.batchtools_0.10.0    batchtools_0.9.15           stringi_1.6.2              
[16] future_1.21.0               stantargets_0.0.2.9000      tarchetypes_0.2.0          
[19] targets_0.4.2               aws.s3_0.3.21               googleCloudStorageR_0.6.0  
[22] data.table_1.14.0          

loaded via a namespace (and not attached):
  [1] utf8_1.2.1            R.utils_2.10.1        tidyselect_1.1.1     
  [4] lme4_1.1-27           htmlwidgets_1.5.3     grid_4.1.0           
  [7] aws.signature_0.6.0   munsell_0.5.0         base64url_1.4        
 [10] codetools_0.2-18      nleqslv_3.3.2         DT_0.18              
 [13] miniUI_0.1.1.1        withr_2.4.2           Brobdingnag_1.2-6    
 [16] colorspace_2.0-1      fst_0.9.4             knitr_1.33           
 [19] rstudioapi_0.13       stats4_4.1.0          bayesplot_1.8.0      
 [22] listenv_0.8.0         emmeans_1.6.0         rstan_2.26.1         
 [25] bit64_4.0.5           farver_2.1.0          bridgesampling_1.1-2 
 [28] coda_0.19-4           parallelly_1.25.0     vctrs_0.3.8          
 [31] generics_0.1.0        TH.data_1.0-10        xfun_0.23            
 [34] R6_2.5.0              markdown_1.1          gamm4_0.2-6          
 [37] projpred_2.0.2        arrow_4.0.0.1         cachem_1.0.5         
 [40] assertthat_0.2.1      promises_1.2.0.1      scales_1.1.1         
 [43] multcomp_1.4-17       debugme_1.1.0         gtable_0.3.0         
 [46] globals_0.14.0        processx_3.5.2        sandwich_3.0-1       
 [49] rlang_0.4.11          miniCRAN_0.2.14       splines_4.1.0        
 [52] gargle_1.1.0          broom_0.7.6           brew_1.0-6           
 [55] checkmate_2.0.0       inline_0.3.18         modelr_0.1.8         
 [58] yaml_2.2.1            reshape2_1.4.4        abind_1.4-5          
 [61] threejs_0.3.3         crosstalk_1.1.1       backports_1.2.1      
 [64] httpuv_1.6.1          rsconnect_0.8.17      tensorA_0.36.2       
 [67] tools_4.1.0           feather_0.3.5         ellipsis_0.3.2       
 [70] posterior_0.1.6       ggridges_0.5.3        plyr_1.8.6           
 [73] base64enc_0.1-3       progress_1.2.2        ps_1.6.0             
 [76] prettyunits_1.1.1     openssl_1.4.4         zoo_1.8-9            
 [79] haven_2.4.1           fs_1.5.0              furrr_0.2.2          
 [82] magrittr_2.0.1        reprex_2.0.0          colourpicker_1.1.0   
 [85] mvtnorm_1.1-1         matrixStats_0.58.0    stringfish_0.15.1    
 [88] evaluate_0.14         hms_1.1.0             qs_0.24.1            
 [91] shinyjs_2.0.0         mime_0.10             xtable_1.8-4         
 [94] shinystan_2.5.0       readxl_1.3.1          gridExtra_2.3        
 [97] rstantools_2.1.1      compiler_4.1.0        credentials_1.3.0    
[100] V8_3.4.2              crayon_1.4.1          minqa_1.2.4          
[103] R.oo_1.24.0           StanHeaders_2.26.1    htmltools_0.5.1.1    
[106] mgcv_1.8-35           later_1.2.0           RcppParallel_5.1.4   
[109] lubridate_1.7.10      DBI_1.1.1             RApiSerialize_0.1.0  
[112] dbplyr_2.1.1          MASS_7.3-54           connectapi_0.1.0.9026
[115] rappdirs_0.3.3        boot_1.3-28           sys_3.4              
[118] Matrix_1.3-3          cli_2.5.0             R.methodsS3_1.8.1    
[121] parallel_4.1.0        igraph_1.2.6          pkgconfig_2.0.3      
[124] xml2_1.3.2            dygraphs_1.1.1.6      pins_0.4.5           
[127] estimability_1.3      rvest_1.0.0           distributional_0.2.2 
[130] callr_3.7.0           digest_0.6.27         rmarkdown_2.8        
[133] cellranger_1.1.0      curl_4.3.1            shiny_1.6.0          
[136] gtools_3.8.2          nloptr_1.2.2.2        lifecycle_1.0.0      
[139] nlme_3.1-152          jsonlite_1.7.2        cmdstanr_0.4.0.9000  
[142] askpass_1.1           fansi_0.4.2           pillar_1.6.1         
[145] lattice_0.20-44       loo_2.4.1             fastmap_1.1.0        
[148] httr_1.4.2            pkgbuild_1.2.0        survival_3.2-11      
[151] googleAuthR_1.4.0     glue_1.4.2            xts_0.12.1           
[154] zip_2.1.1             gert_1.3.0            shinythemes_1.2.0    
[157] bit_4.0.4             memoise_2.0.0        
stuvet commented 3 years ago

This seems to be caused by future.batchtools calling batchtools::getStatus, which defaults to 'expired' if is.na(status). It doesn't affect batchtools as it's waiting for the log to appear in waitForFile.

Here's an example output of batchtools::getStatus while the worker is provisioning (& batchtools is in the waitForFile loop):

## result of batcthools::getStatus
Status for 1 jobs at 2021-06-05 09:20:42:
  Submitted    : 1 (100.0%)
  -- Queued    : 0 (  0.0%)
  -- Started   : 0 (  0.0%)
  ---- Running : 0 (  0.0%)
  ---- Done    : 0 (  0.0%)
  ---- Error   : 0 (  0.0%)
  ---- Expired : 1 (100.0%)

## The reg entry during provisioning
   batch.id job.id def.id  submitted started done error mem.used resource.id
1:     5955      1      1 1622885612      NA   NA  <NA>       NA           1
   log.file                            job.hash             job.name status
1:     <NA> jobc59eb7d6a4f24ec81611dd0a592cb9de todo_chains_2e8f2ae4   <NA>

Perhaps an explicit call to batchtools::waitForFile in future.batchtools (linked to an option to set fs.latency from #73?) could be a workaround for this, though I'll take a closer look at batchtools to see if I can adapt the getStatus result for this case - it may be possible to differentiate 'provisioning' from 'expired' if the log.file is populated in the reg entry, though this strategy may cause jobs to run indefinitely if they expire before the log file is produced?

stuvet commented 3 years ago

I've taken a shot at this this problem & have submitted a PR.

With this fix, there seems to be no need to change fs.latency from 65, so #73 isn't necessary.

coatless commented 3 years ago

@stuvet I tried out the development version of batchtools with the proposed change. However, when I/O is lagged on a distributed system like so:

screenshot of cluster load

The job still automatically fails once loaded with:

Error: Log file '%s' for job with id %i not available

I've also attempted a crude hack to set fs.latency via setting up a default registry file at ~/.batchtools.conf.R per batchtools registry vignette with:

cluster.functions = batchtools::makeClusterFunctionsSlurm(
  scheduler.latency = 60, # default is 1
  fs.latency = 120)  # default is 65

However, even with inserting a shim directly into batchtools, jobs still will fail.

stuvet commented 3 years ago

This was my first attempt (before I understood the problem completely) & it didn't solve the problem entirely.

Perhaps this may help - obviously made for users of targets but it was hitting the same problem you are, via dependencies on future.batchtools & batchtools.

Hope it helps you.

https://github.com/ropensci/targets/discussions/570

Also if I remember right there are some tweaks for the mount flags in /etc/fstab that could help files appear faster in heavy I/O scenarios - if the bugfixes don't help perhaps the workers actually can't see the log files before they timeout. I'm no expert here. I did take a long look at the flags here & changed some. I'll try to find the slurm-specific documentation & I'll update with the flags that worked for me.

EDIT: on second thoughts is it possible the spike in I/O you're seeing actually reflects/is associated with recruitment of new worker machines? If so, that's exactly what I was seeing without the I/o - I was scaling up from 0 & seeing this reliably.