HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org
946 stars 82 forks source link

plan(multicore) : <FutureError: Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 71). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive.> #673

Closed Ash2023qqq closed 1 year ago

Ash2023qqq commented 1 year ago

Hi HenrikBengtsson

I am using future's "multicore" for my application's asynchronous call. It is an model that processes the text and classifies into different labels using keras and tensorflow libraries with help of reticulate package. As the model works on linux based system, I am using "multicore" process. The model starts with multicore process and ends with plan(sequential) to kill the forked processes.

This application basically does the modelling and post the messages to azure eventhub. When the application is triggered once at a time, there is no error observed from the future set. But when the application is triggered multiple times back to back, few transactions are succeeding as expected and few are failing and few are getting lost too. Transaction's failure/success messages are received at the eventhub side and no evidence of the lost messages at eventhub side.

Describe the bug Scenario 1 : When multiple transactions are triggered next to next : following logs are given

[11:31:33.707] requestCore(): workers = 12
[11:31:33.707] **Poll #1 (0): usedCores() = 12, workers = 12
Warning in mccollect(jobs = jobs, wait = TRUE) :
  1 parallel job did not deliver a result
[11:31:33.923] Attempting to find a working pid_exists_*() function ...
Call: tools::pskill(1, signal = 0L)
[1] TRUE
Call: tools::pskill(1, signal = 0L)
[1] TRUE**
[11:31:33.987] - success
[11:31:33.988] Attempting to find a working pid_exists_*() function ... done
[11:31:33.989] Detected a FutureError while FutureRegistry collecting results:
[11:31:33.989] **<FutureError: Failed to retrieve the result of MulticoreFuture (<none>) from the forked worker (on localhost; PID 71). Post-mortem diagnostic: No process exists with this PID, i.e. the forked localhost worker is no longer alive.>**
[11:31:33.989]
[11:31:33.989]
[11:31:33.989] DEBUG: BEGIN TROUBLESHOOTING HELP
[11:31:33.989] Future involved:
[11:31:33.989] MulticoreFuture:
[11:31:33.989] Label: ‘<none>’
[11:31:33.989] Expression:
[11:31:33.989] dataRead(lifecycle_name)
[11:31:33.989] Lazy evaluation: FALSE
[11:31:33.989] Asynchronous evaluation: TRUE
[11:31:33.989] Local evaluation: TRUE
[11:31:33.989] Environment: config = list(7)
[11:31:33.989] Capture standard output: TRUE
[11:31:33.989] Capture condition classes: ‘condition’
[11:31:33.989] Globals: <none>
[11:31:33.989] Packages: <none>
[11:31:33.989] L'Ecuyer-CMRG RNG seed: <none> (seed = FALSE)
[11:31:33.989] Resolved: TRUE
[11:31:33.989] Value: <not collected>
[11:31:33.989] Conditions captured: <none>
[11:31:33.989] Early signaling: FALSE
[11:31:33.989] Owner process: befb967f-9055-1d8c-ef52-4eb1e2ec7287
[11:31:33.989] Class: ‘MulticoreFuture’, ‘MultiprocessFuture’, ‘Future’, ‘environment’
[11:31:33.989]
[11:31:33.989] DEBUG: END TROUBLESHOOTING HELP
[11:31:34.088] MulticoreFuture started
[11:31:34.089] plan(): Setting new future strategy stack:
[11:31:34.090] List of future strategies:
[11:31:34.090] 1. sequential:
[11:31:34.090]    - args: function (..., envir = parent.frame())
[11:31:34.090]    - tweaked: FALSE
[11:31:34.090]    - call: future::plan(sequential)
[11:31:34.091] plan(): Setting new future strategy stack:
[11:31:34.092] plan(): nbrOfWorkers() = 1
[11:31:34.092] List of future strategies:
[11:31:34.092] 1. sequential:
[11:31:34.092]    - args: function (..., envir = parent.frame())
[11:31:34.092]    - tweaked: FALSE
[11:31:34.092]    - call: NULL
[11:31:34.094] plan(): nbrOfWorkers() = 1
[11:31:38.088] plan(): Setting new future strategy stack:
[11:31:38.089] List of future strategies:
[11:31:38.089] 1. multicore:
[11:31:38.089]    - args: function (..., envir = parent.frame(), workers = c(system = 12L))
[11:31:38.089]    - tweaked: TRUE
[11:31:38.089]    - call: future::plan(multicore, workers = availableCores())
[11:31:38.090] plan(): nbrOfWorkers() = 12
WARNING:tensorflow:SavedModel saved prior to TF 2.5 detected when loading Keras model. Please ensure that you are saving the model with model.save() or tf.keras.models.save_model(), *NOT* tf.saved_model.save(). To confirm, there should be a file named "keras_metadata.pb" in the SavedModel directory.
2023-03-29 11:31:39.315281: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-11-openjdk-amd64/lib/server
2023-03-29 11:31:39.315322: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-29 11:31:39.315362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-compliance-bbfc7dfb4-4fmpq): /proc/driver/nvidia/version does not exist
2023-03-29 11:31:39.315921: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[11:31:58.687] plan(): Setting new future strategy stack:
[11:31:58.688] List of future strategies:
[11:31:58.688] 1. multicore:
[11:31:58.688]    - args: function (..., envir = parent.frame(), workers = c(system = 12L))
[11:31:58.688]    - tweaked: TRUE
[11:31:58.688]    - call: future::plan(multicore, workers = availableCores())
[11:31:58.689] plan(): nbrOfWorkers() = 12
2023-03-29 11:32:28.937788: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2023-03-29 11:32:28.938480: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2494140000 Hz
[11:34:39.435] plan(): Setting new future strategy stack:
[11:34:39.435] List of future strategies:
[11:34:39.435] 1. multicore:
[11:34:39.435]    - args: function (..., envir = parent.frame(), workers = c(system = 12L))
[11:34:39.435]    - tweaked: TRUE
[11:34:39.435]    - call: future::plan(multicore, workers = availableCores())
[11:34:39.436] plan(): nbrOfWorkers() = 12

eventhub message when the above future error occurs :

{"id":"122333","cdrExecId":"434343","studyId":"aaa","apiName":"bbb","tenantId":"555","responseStatus":"FAILED","responseMessage":"Data is not processed into otlr_tbl. There is some abruption in model processing : **Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0\n"**,"payloadDetails":{"totalRecordCount":"NULL","outlierTableNm":"NULL","filters":[{"name":"NULL","dataType":"NULL","value":"NULL"}]}}

These errors does not correlate to my src,

When the triggered transactions are lost in the future call, event hub is not having any messages at all (it doesnot reach eventhub).

Scenario 2 : When multisession used,

[12:15:20.825] MultisessionFuture started
[12:15:20.826] plan(): Setting new future strategy stack:
[12:15:20.826] List of future strategies:
[12:15:20.826] 1. sequential:
[12:15:20.826]    - args: function (..., envir = parent.frame())
[12:15:20.826]    - tweaked: FALSE
[12:15:20.826]    - call: future::plan(sequential)
[12:15:20.857] plan(): nbrOfWorkers() = 1
2023-03-29 12:15:29.157125: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-11-openjdk-amd64/lib/server:/usr/local/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-11-openjdk-amd64/lib/server
2023-03-29 12:15:29.157159: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:SavedModel saved prior to TF 2.5 detected when loading Keras model. Please ensure that you are saving the model with model.save() or tf.keras.models.save_model(), *NOT* tf.saved_model.save(). To confirm, there should be a file named "keras_metadata.pb" in the SavedModel directory.
2023-03-29 12:15:30.848230: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-11-openjdk-amd64/lib/server:/usr/local/lib/R/lib:/usr/local/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/java-11-openjdk-amd64/lib/server
2023-03-29 12:15:30.848269: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2023-03-29 12:15:30.848305: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-compliance-c57bf979d-g5zjn): /proc/driver/nvidia/version does not exist
2023-03-29 12:15:30.848731: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

event message produced for above trigger was :

{"id":"221133","cdrExecId":"223311","studyId":"aaa","apiName":"bbb","tenantId":"555","responseStatus":"FAILED","responseMessage":"Data is not processed into otlr_tbl. There is some abruption in model processing : Error in x[[2]]: subscript out of bounds\n","payloadDetails":{"totalRecordCount":"NULL","outlierTableNm":"NULL","filters":[{"name":"NULL","dataType":"NULL","value":"NULL"}]}}

Reproduce example

print("message to DB for reading data") library(future) future::plan(multicore, workers = availableCores()) print("printing availableCores") print(availableCores()) print("ps cpu count") print(ps::ps_cpu_count())

options(future.debug = TRUE) options(future.globals.maxSize = 768*1024^2) ## setting up global vars memory size-can change according i/p & o/p limit options(future.seed = TRUE) print("future started - data read calling") source("dbReadData.R", local = T) future(dataRead(lifecycle_name), lazy = FALSE, globals = TRUE) future::plan(sequential) ## to kill the multicore forked processes

Expected behavior

When only one transaction is triggered at a time : following logs are given as expected

[14:11:03.637] requestCore(): workers = 12
[14:11:03.775] MulticoreFuture started
[14:11:03.778] plan(): Setting new future strategy stack:
[14:11:03.779] List of future strategies:
[14:11:03.779] 1. sequential:
[14:11:03.779]    - args: function (..., envir = parent.frame())
[14:11:03.779]    - tweaked: FALSE
[14:11:03.779]    - call: future::plan(sequential)
[14:11:03.783] plan(): Setting new future strategy stack:
[14:11:03.785] List of future strategies:
[14:11:03.785] 1. sequential:
[14:11:03.785]    - args: function (..., envir = parent.frame())
[14:11:03.785]    - tweaked: FALSE
[14:11:03.785]    - call: NULL
[14:11:03.789] plan(): nbrOfWorkers() = 1
[14:11:03.820] plan(): nbrOfWorkers() = 1

event hub messages during success scenario :-

{"id":"2222","cdrExecId":"4444","studyId":"aaa","apiName":"bbb","tenantId":"555","responseStatus":"success","responseMessage":"Data successfully processed into otlr_tbl","payloadDetails":{"totalRecordCount":2161,"outlierTableNm":"qqq","filters":[{"name":"metric_dt","dataType":"timestamp","value":"2023-03-29"}]}}

Session information R Session Info :-

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] Rfast_2.0.1            RcppZiggurat_0.1.6     Rcpp_1.0.6
 [4] tensorflow_2.5.0       keras_2.4.0            RecordLinkage_0.4-12.1
 [7] ff_4.0.4               bit_4.0.3              RSQLite_2.2.3
[10] stopwords_2.2          hunspell_3.0.1         udpipe_0.8.5
[13] tidytext_0.3.0         tidyr_1.1.2            dplyr_1.0.4
[16] reticulate_1.16        odbc_1.3.2             rJava_0.9-13
[19] DBI_1.1.1              future_1.21.0          stringr_1.4.0
[22] data.table_1.13.6      stringi_1.5.3          rjson_0.2.20
[25] glue_1.4.2             tryCatchLog_1.2.1      futile.logger_1.4.3
[28] plumber_1.0.0

loaded via a namespace (and not attached):
 [1] bit64_4.0.4          jsonlite_1.7.2       splines_4.0.3
 [4] prodlim_2019.11.13   assertthat_0.2.1     blob_1.2.1
 [7] globals_0.14.0       ipred_0.9-9          pillar_1.4.7
[10] lattice_0.20-41      digest_0.6.27        promises_1.1.1
[13] httpuv_1.5.5         Matrix_1.2-18        pkgconfig_2.0.3
[16] listenv_0.8.0        xtable_1.8-4         purrr_0.3.4
[19] whisker_0.4          later_1.1.0.1        lava_1.6.8.1
[22] tibble_3.0.6         generics_0.1.0       ellipsis_0.3.1
[25] swagger_3.33.1       cachem_1.0.3         nnet_7.3-14
[28] ada_2.0-5            survival_3.2-7       magrittr_2.0.1
[31] crayon_1.4.1         memoise_2.0.0        ps_1.5.0
[34] tokenizers_0.2.1     janeaustenr_0.1.5    parallelly_1.23.0
[37] SnowballC_0.7.0      MASS_7.3-53          class_7.3-17
[40] tools_4.0.3          hms_1.0.0            formatR_1.7
[43] lifecycle_0.2.0      lambda.r_1.2.4       compiler_4.0.3
[46] e1071_1.7-4          evd_2.3-3            rlang_0.4.10
[49] grid_4.0.3           webutils_1.1         rstudioapi_0.13
[52] rappdirs_0.3.1       base64enc_0.1-3      codetools_0.2-16
[55] R6_2.5.0             tfruns_1.4           zeallot_0.1.0
[58] fastmap_1.1.0        futile.options_1.0.1 vctrs_0.3.6
[61] rpart_4.1-15         tidyselect_1.1.0
python:         /usr/bin/python3
libpython:      /usr/lib/python3.8/config-3.8-x86_64-linux-gnu/libpython3.8.so
pythonhome:     //usr://usr
version:        3.8.10 (default, Mar 13 2023, 10:26:41)  [GCC 9.4.0]
numpy:          /usr/local/lib/python3.8/dist-packages/numpy
numpy_version:  1.19.5

python versions found:
 /usr/bin/python3
 /usr/local/bin/python