after some advice from Øyvind i found the poly template and used a script to generate lots of data, and ran esmda locally on my machine. no problem, went from iteration 0 to iteration 3.
on ert main prepare a big case
cd tests/performance_tests
python3 performance_utils.py
copy the case over to a local ert checkout, to have control over the files and add them to git; we assume to be in the ert dir:
cp /tmp/[TEMP_DIR]/poly test-data/big-poly-case
all config modifications are done in poly.ert, see branch for details
checked out above setup, changed queue system to LSF, and started by running
ert es_mda --realizations 0-199 --target-case default_%d --weights '4,2,1' poly.ert
(i prefer putty over the gui)
this could be connected to https://github.com/equinor/ert/issues/4708
branch on my fork for reproducing the issue: https://github.com/valentin-krasontovitsch/ert/tree/test-200-realizations ping me here or on slack in case you wonder if the latest changes have been pushed!
so after realizing that the local reproduction was due to moving the laptop and changing the wifi controller's ip address, we were kind of back to square one, but learned a lot about the websocket communication in ert.
now we are trying to reproduce with the drogon model: we copied the latest release of drogon [INTERNAL LINK], grabbing the tar archive from the repo and unpacking it locally on an rgs node. we're gonna run the drogon_ahm.ert config file, using the ESMDA algorithm.
to increase the number of realizations, we had to:
- set NUM_REALIZATIONS 300 in the config file
- modify ../input/distributions/design_matrix_ahm.xlsx, adding ca. 200 more rows by repeating (SENSNAME, SENSCASE) exactly (constant value in each row) - a sketch for duplicating the rows programmatically follows after the run commands below
finally, just run (for command line execution):
ert es_mda --realizations 0-299 --target-case default_%d --weights '4,2,1' drogon_ahm.ert
or start the gui and do the thing (esmda default options) there.
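as a side note, the extra design matrix rows could also be generated with a small throwaway script like the one below. this is our own helper, not part of the original setup; the path and the assumption that simply repeating the last data row is sufficient are ours.

# sketch: duplicate the last data row of the AHM design matrix ~200 times
# (throwaway helper; assumes openpyxl is available and that repeating the
# last row verbatim is what the setup needs)
from openpyxl import load_workbook

PATH = "../input/distributions/design_matrix_ahm.xlsx"
EXTRA_ROWS = 200

wb = load_workbook(PATH)
ws = wb.active
last_row = [cell.value for cell in ws[ws.max_row]]
for _ in range(EXTRA_ROWS):
    ws.append(last_row)
wb.save(PATH)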
what we see at the end of the ert logs (ert-... in the logs folder next to the config file), in this case from drogon:
2023-05-22 10:15:04,987 - res.job_queue - LegacyEnsemble - DEBUG - Set DROGON-298(298) to JOB_QUEUE_FAILED
2023-05-22 10:15:04,998 - res.job_queue - LegacyEnsemble - DEBUG - Set DROGON-299(299) to JOB_QUEUE_FAILED
2023-05-22 10:17:20,990 - root - MainThread - DEBUG - Resetting environment variable ERT_SITE_CONFIG
2023-05-22 10:17:20,991 - root - MainThread - DEBUG - Deleting temporary directory for site-config
2023-05-22 10:17:20,992 - ert.__main__ - MainThread - ERROR - Simulation failed! All realizations failed!
Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Realization: 122 failed after reaching max submit with:
Realization: 145 failed after reaching max submit with:
Realization: 161 failed after reaching max submit with:
Realization: 25 failed after reaching max submit with:
Realization: 41 failed after reaching max submit with:
unexpected exception in queue
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/__main__.py", line 602, in main
args.func(args, context.plugin_manager)
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/cli/main.py", line 169, in run_cli
raise ErtCliError(model.getFailMessage())
ert.cli.main.ErtCliError: Simulation failed! All realizations failed!
Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Realization: 122 failed after reaching max submit with:
Realization: 145 failed after reaching max submit with:
Realization: 161 failed after reaching max submit with:
Realization: 25 failed after reaching max submit with:
Realization: 41 failed after reaching max submit with:
unexpected exception in queue
2023-05-22 10:17:20,998 - root - MainThread - INFO - Peak memory use: 1490276 kB
this happens after trying to kill all (pending / running / waiting / i dunno) jobs, while the original exception typically occurs a bit earlier, like so:
2023-05-22 10:14:46,928 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - ERROR - Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 390, in _publish_changes
async with connect(
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
2023-05-22 10:14:46,950 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - ERROR - unexpected exception in queue
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 454, in execute_queue_via_websockets
await JobQueue._publish_changes(
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 390, in _publish_changes
async with connect(
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
this comes after retrying the connection 10 times, with a certain backoff, as programmed by us into our queue component.
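roughly, that retry logic amounts to something like the following sketch - not the actual ert queue code; the function name, retry count and backoff schedule here are illustrative:

# illustrative sketch of "retry connect with backoff" around a websocket publish;
# not the actual ert implementation, names and numbers are made up
import asyncio
from websockets.client import connect


async def publish_with_retries(uri, messages, retries=10, backoff=5):
    for attempt in range(retries):
        try:
            async with connect(uri, open_timeout=60) as ws:
                for msg in messages:
                    await ws.send(msg)
                return
        except (asyncio.TimeoutError, OSError):
            # wait a bit longer before giving the server another chance
            await asyncio.sleep(backoff * (attempt + 1))
    raise RuntimeError("unable to publish changes after retries")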
checking the logs of the ensemble evaluator (ee) around the timestamps where the job queue was not able to connect to the ee websockets (ws) server, we see a pattern of the ee's "queue" running full, or at least getting busy, like so (the result of running grep -e 'processed' ee-log...):
2023-05-22 10:09:22,075 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 15 events in 0.000029s. 0 left in queue
2023-05-22 10:09:31,989 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 22 events in 0.000033s. 0 left in queue
2023-05-22 10:09:44,877 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 996 events in 0.000561s. 0 left in queue
2023-05-22 10:09:53,879 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 39 events in 0.000040s. 0 left in queue
2023-05-22 10:10:05,196 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 167 events in 0.000113s. 0 left in queue
2023-05-22 10:10:16,641 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 30 events in 0.000021s. 0 left in queue
2023-05-22 10:10:24,468 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004492s. 1224 left in queue
2023-05-22 10:10:41,075 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000685s. 9404 left in queue
2023-05-22 10:10:52,382 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000919s. 8780 left in queue
2023-05-22 10:11:00,005 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.014857s. 9318 left in queue
2023-05-22 10:11:20,321 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004537s. 12905 left in queue
2023-05-22 10:11:24,130 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.013159s. 11981 left in queue
2023-05-22 10:11:34,379 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004296s. 11158 left in queue
2023-05-22 10:11:52,142 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001561s. 10764 left in queue
2023-05-22 10:12:03,758 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004955s. 10111 left in queue
2023-05-22 10:12:20,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.013619s. 9434 left in queue
2023-05-22 10:12:35,700 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.003865s. 8757 left in queue
2023-05-22 10:12:50,083 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001116s. 7983 left in queue
2023-05-22 10:13:05,309 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.014509s. 7308 left in queue
2023-05-22 10:13:16,903 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.012791s. 6541 left in queue
2023-05-22 10:13:32,172 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000586s. 5804 left in queue
2023-05-22 10:13:45,588 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001132s. 5106 left in queue
2023-05-22 10:14:00,905 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001094s. 13375 left in queue
2023-05-22 10:14:09,699 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001778s. 13253 left in queue
2023-05-22 10:14:28,020 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.008381s. 14137 left in queue
2023-05-22 10:14:37,003 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001520s. 13688 left in queue
2023-05-22 10:15:06,004 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.003402s. 13087 left in queue
2023-05-22 10:15:13,608 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.002020s. 12604 left in queue
2023-05-22 10:15:20,765 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001408s. 12035 left in queue
2023-05-22 10:15:25,816 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001443s. 11035 left in queue
2023-05-22 10:15:31,057 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001367s. 10035 left in queue
2023-05-22 10:15:36,123 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001199s. 9035 left in queue
2023-05-22 10:15:41,267 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001048s. 8035 left in queue
2023-05-22 10:15:46,206 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000992s. 7035 left in queue
2023-05-22 10:15:51,013 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000887s. 6035 left in queue
2023-05-22 10:15:57,388 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000880s. 5035 left in queue
2023-05-22 10:16:02,627 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000696s. 4035 left in queue
2023-05-22 10:16:07,768 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000719s. 3035 left in queue
2023-05-22 10:16:14,433 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000349s. 2035 left in queue
2023-05-22 10:16:19,399 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000595s. 1035 left in queue
2023-05-22 10:16:25,038 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000469s. 35 left in queue
2023-05-22 10:16:30,131 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 35 events in 0.000038s. 0 left in queue
2023-05-22 10:16:32,474 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 0 events in 0.000015s. 0 left in queue
2023-05-22 10:16:32,479 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 0 events in 0.000008s. 0 left in queue
This correlation was very clear for the drogon case, in the sense that the ee's queue got busy / full exactly when the job_queue could not connect to the ee ws server anymore.
For the first user report, the case was less clear - the ee queue got busier when the job_queue could not reconnect, but not full to the same degree as above.
An inspection of the logs connected to the user report on yammer also shows a correlation, in the sense that the ee queue runs full when the job queue cannot connect anymore, but the ee queue also gets very full (even fuller than when the job_queue fails) while the job_queue reports no trouble...
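for reference, a small ad-hoc script along these lines can be used to pull the processing time and queue depth out of the ee logs (our own helper for log analysis, not part of ert):

# parse "processed N events in Xs. M left in queue" lines from an ee log
# (ad-hoc analysis helper, not part of ert)
import re
import sys

PATTERN = re.compile(
    r"^(?P<ts>\S+ \S+).*processed (?P<n>\d+) events in (?P<secs>[\d.]+)s\. (?P<left>\d+) left in queue"
)

with open(sys.argv[1]) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            print(match["ts"], match["n"], match["secs"], match["left"])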
create a personal dir (no data should leave the assets environment):
mkdir /project/johan_sverdrup/resmod/ff/users/[YOUR_USERNAME]
the latest model can be found in
/project/johan_sverdrup/resmod/ff/users/tomal/23.0.1_16022023_surface2_poly_newrms
copy the whole thing over, and you may find the ert config file we've been using in (relative paths now)
ert/model/newrms_23p0p1_history.ert
there is one little script that needs updating - in case your copy didn't have the update, here it is - apply this diff in your model in rms/bin/js_fmu_rms_scripts:
diff --git a/python/compress_grdecl.py b/python/compress_grdecl.py
index b5017eb..d4585bc 100644
--- a/python/compress_grdecl.py
+++ b/python/compress_grdecl.py
@@ -79,7 +79,14 @@ def main(files, keeporiginal=False, dryrun=False):
         if not keeporiginal:
             os.remove(filename + ".orig")
-
+        with open (filename, 'rt') as f:
+            data = f.read()
+            data = data.replace('METRES 3*', 'METRES')
+            f.close()
+        with open (filename, 'wt') as f:
+            f.write(data)
+            f.close()
+
 
 def chunks(l, n):
     """Yield successive n-sized chunks as strings from list l."""
     for i in range(0, len(l), n):
install ert with python3 -m pip install . (run from the source folder of the ert repo), or pip install --no-deps . from a local checkout
the scratch path will be /scratch/johan_sverdrup/[YOUR_USERNAME], so no need to modify anything there.
start it with ert gui newrms_23p0p1_history.ert, and run two pre-simulation workflows.
for rerunning, you might want to rename /scratch/johan_sverdrup/[USERNAME]/[CASENAME] or modify the case name (removing takes a long time, and you might want to hold on to the logs in the run path).
ran said model with a branch based on somewhat current main, with several fixes included (timeouts increased to 60 seconds, job queue reusing the websockets connection instead of creating a new connection every time it publishes a (possibly empty) list of changes), starting may 31st.
log files are in /project/johan_sverdrup/resmod/ff/users/vkra/23.0.1_16022023_surface2_poly_newrms/ert/model/logs
with timestamp 2023-05-31T1427
scratch has been moved to /scratch/johan_sverdrup/vkra/23p0p1_new_rms_test_2023-05-31
we have the following situation:
another thread / idea to follow is that one user is running a setup with 300 realizations successfully, using komodo 2022.11.rc7. could check the diff between that version and 2022.11.02 as per initial report...
TODO:
describe work-around for RMS update (see teams! - look at rms/bin/js_fmu_rms_scripts): apply the compress_grdecl.py diff shown earlier in this thread, in the above mentioned folder
- branch test-threading-sleep-with-poly on the valentin-krasontovitsch fork that makes the job queue node threads sleep longer
- comparison between ert versions in the mentioned komodo releases: https://github.com/equinor/ert/compare/4.1.0-rc3...4.1.4
NB: When the jobs are only pending we need to ssh into the compute cluster first, which in a default case in Stavanger would be: /usr/bin/ssh st-grid03.st.statoil.no
Updating the issue with the note from @valentin-krasontovitsch about when we got the heavy event processing:
2023-06-01 20:36:36,539 - ert.ensemble_evaluator.dispatch - processed 31 events in 4.172715s. 88 left in queue
2023-06-01 20:37:04,060 - ert.ensemble_evaluator.dispatch - processed 168 events in 25.486581s. 199 left in queue
2023-06-01 20:40:14,327 - ert.ensemble_evaluator.dispatch - processed 199 events in 148.126047s. 9526 left in queue
2023-06-01 20:47:51,125 - ert.ensemble_evaluator.dispatch - processed 1000 events in 446.345663s. 30881 left in queue
2023-06-01 20:49:33,615 - ert.ensemble_evaluator.dispatch - processed 1000 events in 28.171717s. 38835 left in queue
2023-06-01 20:55:31,811 - ert.ensemble_evaluator.dispatch - processed 1000 events in 210.772313s. 58862 left in queue
2023-06-01 20:58:12,007 - ert.ensemble_evaluator.dispatch - processed 1000 events in 149.863073s. 57862 left in queue
This problem can be reproduced on the poly_example with something along the following:
QUEUE_OPTION LOCAL MAX_RUNNING 500
NUM_REALIZATIONS 500
and repeat SIMULATION_JOB poly_eval maybe 50-100 times.
in poly_eval.py, add import time and a time.sleep(1) at the end of main (see the sketch below).
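for reference, a minimal sketch of what the modified poly_eval.py could look like - the evaluation part is paraphrased from the poly_example and may differ in detail from the real script; the point is just the added sleep:

#!/usr/bin/env python
# sketch of a slowed-down poly_eval.py; the evaluation part is paraphrased
# from the poly_example and may differ in detail from the real script
import json
import time


def _load_coeffs(filename):
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)["COEFFS"]


def _evaluate(coeffs, x):
    return coeffs["a"] * x**2 + coeffs["b"] * x + coeffs["c"]


def main():
    coeffs = _load_coeffs("parameters.json")
    output = [_evaluate(coeffs, x) for x in range(10)]
    with open("poly.out", "w", encoding="utf-8") as handle:
        handle.write("\n".join(map(str, output)))
    # artificial delay so each forward model step is slow enough to pile up events
    time.sleep(1)


if __name__ == "__main__":
    main()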
When reproducing the poly_example with the valentin/test-threading-sleep-with-poly branch, it seems that all went well 👍
No timeout errors.
Nevertheless, the GUI was extremely unresponsive during the actual run. We might need to look at that.
sadly the fixes were not enough, reopening
latest test run of the user model with the "fixes" finished iter-0, and then the gui was basically hanging. while the server was unresponsive, because it was very busy dealing with messages from other clients, the gui was trying to reconnect. previously, it failed after three unsuccessful attempts. the "fix" we implemented yesterday was to not let the gui fail, but let it retry indefinitely. the result is that it does not fail (yeahy), but while it's not connected, it also doesn't receive the messages it should get from the server (status updates about forward models and realizations) - once it reconnects, it waits for messages, but everything is finished (which it doesn't know), and it stays spinning, waiting for updates that never come. that is at least my current understanding.
several ideas:
exploring the idea of disabling the ping pong mechanism on the client side.
testing locally with a simple client and server script, we see that setting ping_timeout to None in the client's connect call stops the client from closing the connection on timed out pings - according to the docs, ping pongs are still sent, we just don't care about timeouts, so the TCP connection is still kept alive, hopefully. from the websockets docs:
"Once the connection is open, a Ping frame is sent every ping_interval seconds. This serves as a keepalive. It helps keeping the connection open, especially in the presence of proxies with short timeouts on inactive connections. Set ping_interval to None to disable this behavior." (Ping: https://www.rfc-editor.org/rfc/rfc6455.html#section-5.5.2)
"If the corresponding Pong frame isn't received within ping_timeout seconds, the connection is considered unusable and is closed with code 1011. This ensures that the remote endpoint remains responsive. Set ping_timeout to None to disable this behavior." (Pong: https://www.rfc-editor.org/rfc/rfc6455.html#section-5.5.3)
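in code, the client-side change we are testing amounts to something like this (illustrative only; the uri and the timeout values are placeholders):

# sketch: connect without a ping timeout so a busy server doesn't get the
# connection closed on missed pongs (uri and timeout values are placeholders)
import asyncio
from websockets.client import connect


async def open_connection(uri):
    # ping_interval keeps sending keepalive pings; ping_timeout=None means we
    # never close the connection just because a pong arrives late
    async with connect(uri, ping_interval=20, ping_timeout=None, open_timeout=60) as ws:
        await ws.send("hello")


# usage (placeholder address): asyncio.run(open_connection("wss://127.0.0.1:51827/dispatch"))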
we also see a ConnectionRefused error because nothing is listening on the port anymore. that's potentially bad news for us, as previous trials seemed to indicate that the client fails immediately if nothing is reachable under the given address. should explore undesirable states and figure out if it's a problem or not.
not wanting to waste much time on simulating things in silly scripts, i proceeded to try and run the beefed up poly example again, with main + gui reconnects indefinitely + job runner waits (sleeps) a bit with pushing events after the initial no-sleep time.
locally, on my own machine, using the local queue, and running with MAX_RUNNING 14 to not overstrain the meager 16 cores, the results were not satisfactory - it just ran, processing times were not immense, and i stopped it after a short while.
but on an RGS node, using the LSF queue and no limitation on how many realizations should run concurrently, event processing times went through the roof, and the gui soon disconnected. nice. reproduction of the user case achieved.
disabling all ping timeouts sadly did not fix the situation. somehow, connections are still getting dropped. i cannot discern how and why the connection gets dropped, but the queue soon drops into a reconnect loop:
websockets.client - LegacyEnsemble - INFO - ! connect failed again; retrying in 55 seconds
Traceback (most recent call last):
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 607, in __aiter__
async with self as protocol:
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
i'm also seeing a lot of bad request logs
websockets.server - ert_ee_run_server - INFO - connection failed (400 Bad Request)
to dig deeper here, one might turn on more debug logging, and perhaps repeat the experiment with just one realization, to reduce the amount of noise / data, while inserting an artificial sleep into the event processing to simulate a long processing time. this could help us understand how and why connections with the various clients are behaving.
regardless, the idea of just disabling the timeouts does not seem good enough.
EDIT: looking a bit more at the still running gui, it got updates - from the job runners. it's just the job queue that drops the connection - probably because it's trying to send! and the server is not responsive / ready to receive...
after running a bit more and experimenting a bit more with this and that, and finally starting to profile stuff (yappi, for easy multi-threaded profiling), we have come to the conclusion that the way forward is one of the following possibilities:
this would solve like almost all our issues immediately (maybe hopefully), but we decide to not go for this for now
we try to increase some more sleeps, decrease processing buffer, and maybe implement a pattern where the ee server waits for the gui to reconnect, and then hope for the best (with possible delays in the gui / for short running stuff)
profiling seems to show (i literally have no idea what i'm doing and whether i did it right) that recursive update takes a lot of time.
we could investigate what's up with that, in the particular use case of forward model events, and see if we can find a faster implementation.
could also add some tests for making sure it works and for benchmarking the speed.
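for the record, the profiling was done roughly along these lines (a generic yappi snippet, not the exact invocation we used; the work() function is a placeholder for the code under investigation):

# generic yappi profiling sketch - wall-clock timing across all threads,
# not the exact invocation used in the experiments above
import time

import yappi


def work():
    time.sleep(0.1)  # placeholder for the code under investigation


yappi.set_clock_type("wall")  # measure wall time, since we mostly wait on I/O
yappi.start()
work()
yappi.stop()
yappi.get_func_stats().sort("ttot").print_all()
yappi.get_thread_stats().print_all()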
re: duct-tape. we tried the adjustment, and this led to big_poly_case still not having finished, even though all realizations had finished. overall, I think that our adjustment maybe was good - it went in the right direction; this was, after all, the (hopefully) somewhat crazy artificial case with 200 realizations and 200 silly fast-running forward models - but perhaps too conservative.
gonna try a more dynamic approach next, where any time the event processing time goes >= 10 seconds, we reduce the batch size and increase the sleep (of the batching dispatcher), reverting if processing time goes down. a rough sketch of the idea follows below.
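this is a schematic sketch only; the real batching dispatcher in ert is structured differently, and the names, constants and process_events helper here are made up:

# schematic sketch of an adaptive batching loop: shrink the batch and sleep
# longer when processing gets slow, revert when it speeds up again
import time

BATCH_MAX, BATCH_MIN = 1000, 100
SLEEP_MIN, SLEEP_MAX = 2.0, 10.0
SLOW_THRESHOLD = 10.0  # seconds


def drain(pending, process_events):
    """pending: shared list of events; process_events: callable handling one batch."""
    batch_size, sleep_time = BATCH_MAX, SLEEP_MIN
    while True:  # runs for the lifetime of the evaluator
        batch = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]
        start = time.monotonic()
        process_events(batch)
        duration = time.monotonic() - start
        if duration >= SLOW_THRESHOLD:
            # back off: smaller batches, longer pauses between them
            batch_size = max(BATCH_MIN, batch_size // 2)
            sleep_time = min(SLEEP_MAX, sleep_time * 2)
        else:
            # recover towards the defaults when processing is fast again
            batch_size = min(BATCH_MAX, batch_size * 2)
            sleep_time = max(SLEEP_MIN, sleep_time / 2)
        time.sleep(sleep_time)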
this seems to do some tricks - events are not lagging behind anymore, they're trickling in and getting processed in due time. processing times are reasonable, with a handful above 10 seconds (top: 35).
yet it seems that execution of forward models is delayed on compute nodes: after 30 minutes, most realizations have proceeded to about forward model number 115. keep in mind that each of the first ca. 200 forward models is just a copy file job, with the source being a short text file.
perhaps the sleep in the job runner can be shortened or removed.
this feels like fiddling with something, and possibly making a brittle solution... user model tests will show more.
EDIT: when all realizations of the first iteration are finally done, we get a bit of a storm again, with a few timeouts. the expected and yet annoying thing is the delay in the gui / tracker though - here, instead of every five seconds, the next update (i'm using the CLI) is printed after minutes.
so after one hour, all realizations are finished, but at one hour 20 minutes, the cli is still saying all realizations are running...
perhaps remove the sleep in the evaluator tracker :grimacing: also want to try and remove / reduce the sleep in the job runner. let the batching dispatcher be our dynamic bottleneck, the gatekeeper, if you will.
EDIT EDIT: next morning: after the last
2023-06-27 23:48:32,555 - res.job_queue - Thread-174 - DEBUG - Set POLY_SUMMARY(170) to JOB_QUEUE_DONE
the job queue has only now started to log some lines of
2023-06-28 04:05:35,280 - res.job_queue - Thread-7 - DEBUG - Set POLY_SUMMARY(3) to JOB_QUEUE_SUCCESS
and
2023-06-28 04:05:35,366 - ert._c_wrappers.job_queue.job_queue_node - Thread-18 - INFO - Realization: 14 complete, starting to load results
which is strange.
the whole last block of logs I see at the moment is
...
2023-06-27 23:48:31,966 - res.job_queue - Thread-169 - DEBUG - Set POLY_SUMMARY(165) to JOB_QUEUE_DONE
2023-06-27 23:48:32,555 - res.job_queue - Thread-174 - DEBUG - Set POLY_SUMMARY(170) to JOB_QUEUE_DONE
2023-06-27 23:50:03,919 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - WARNING - job queue dropped connection to ensemble evaulator - going to try and reconnect
2023-06-28 04:05:35,280 - res.job_queue - Thread-7 - DEBUG - Set POLY_SUMMARY(3) to JOB_QUEUE_SUCCESS
2023-06-28 04:05:35,366 - ert._c_wrappers.job_queue.job_queue_node - Thread-18 - INFO - Realization: 14 complete, starting to load results
2023-06-28 04:05:52,701 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - WARNING - job queue dropped connection to ensemble evaulator - going to try and reconnect
2023-06-28 04:05:55,094 - res.job_queue - Thread-16 - DEBUG - Set POLY_SUMMARY(12) to JOB_QUEUE_SUCCESS
2023-06-28 04:05:55,388 - ert._c_wrappers.job_queue.job_queue_node - Thread-21 - INFO - Realization: 17 complete, starting to load results
2023-06-28 04:07:16,190 - res.job_queue - Thread-12 - DEBUG - Set POLY_SUMMARY(8) to JOB_QUEUE_SUCCESS
2023-06-28 04:07:16,581 - ert._c_wrappers.job_queue.job_queue_node - Thread-43 - INFO - Realization: 39 complete, starting to load results
2023-06-28 04:07:27,172 - res.job_queue - Thread-15 - DEBUG - Set POLY_SUMMARY(11) to JOB_QUEUE_SUCCESS
2023-06-28 04:07:27,226 - ert._c_wrappers.job_queue.job_queue_node - Thread-9 - INFO - Realization: 5 complete, starting to load results
2023-06-28 04:08:00,392 - res.job_queue - Thread-10 - DEBUG - Set POLY_SUMMARY(6) to JOB_QUEUE_SUCCESS
2023-06-28 04:08:00,546 - ert._c_wrappers.job_queue.job_queue_node - Thread-35 - INFO - Realization: 31 complete, starting to load results
2023-06-28 04:08:49,773 - res.job_queue - Thread-11 - DEBUG - Set POLY_SUMMARY(7) to JOB_QUEUE_SUCCESS
2023-06-28 04:08:49,780 - ert._c_wrappers.job_queue.job_queue_node - Thread-39 - INFO - Realization: 35 complete, starting to load results
2023-06-28 04:10:36,825 - res.job_queue - Thread-25 - DEBUG - Set POLY_SUMMARY(21) to JOB_QUEUE_SUCCESS
2023-06-28 04:10:36,832 - ert._c_wrappers.job_queue.job_queue_node - Thread-22 - INFO - Realization: 18 complete, starting to load results
2023-06-28 04:11:13,892 - res.job_queue - Thread-23 - DEBUG - Set POLY_SUMMARY(19) to JOB_QUEUE_SUCCESS
2023-06-28 04:11:14,395 - ert._c_wrappers.job_queue.job_queue_node - Thread-4 - INFO - Realization: 0 complete, starting to load results
2023-06-28 04:11:47,510 - res.job_queue - Thread-17 - DEBUG - Set POLY_SUMMARY(13) to JOB_QUEUE_SUCCESS
2023-06-28 04:11:47,534 - ert._c_wrappers.job_queue.job_queue_node - Thread-28 - INFO - Realization: 24 complete, starting to load results
2023-06-28 04:12:00,873 - res.job_queue - Thread-5 - DEBUG - Set POLY_SUMMARY(1) to JOB_QUEUE_SUCCESS
2023-06-28 04:12:01,150 - ert._c_wrappers.job_queue.job_queue_node - Thread-24 - INFO - Realization: 20 complete, starting to load results
i should also say that there were already a few connection errors yesterday, but i thought they were not significant - here are the logs:
... [the line below repeated many times, i think this was all the realizations / job runners coming to a close]
2023-06-27 23:48:18,348 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from _ert_job_runner.client done!
2023-06-27 23:48:18,349 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 100 events in 6.735008s. 1007 left in queue
2023-06-27 23:48:18,455 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from _ert_job_runner.client done!
2023-06-27 23:50:03,935 - ert.ensemble_evaluator.evaluator - ert_cli_simulation_thread - DEBUG - Connection closed unexpectedly in run_and_get_successful_realizations: sent 1011 (unexpected error) keepalive ping t
2023-06-27 23:50:04,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 103.618057s. 35 left in queue
2023-06-27 23:50:04,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-27 23:50:04,251 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection closed error: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
2023-06-27 23:50:04,252 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:50:08,859 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - ERROR - a dispatcher abruptly closed a websocket: received 1011 (unexpected error) keepalive ping timeout; then sent 1011 (unexpecte
2023-06-27 23:50:08,859 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from ert._c_wrappers.job_queue.queue done!
2023-06-27 23:50:09,667 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /dispatch from ert._c_wrappers.job_queue.queue
2023-06-27 23:50:19,673 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /client from ert.ensemble_evaluator.monitor
2023-06-27 23:51:07,276 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:51:07,277 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:52:11,094 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:52:11,095 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:53:14,338 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:53:14,339 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:54:00,126 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 100 events in 137.251784s. 43 left in queue
2023-06-27 23:54:00,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-27 23:54:01,816 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - ERROR - a dispatcher abruptly closed a websocket: received 1011 (unexpected error) keepalive ping timeout; then sent 1011 (unexpecte
2023-06-27 23:54:01,817 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from ert._c_wrappers.job_queue.queue done!
2023-06-27 23:54:11,718 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /client from ert.ensemble_evaluator.monitor
2023-06-27 23:54:12,150 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connected
2023-06-27 23:56:46,759 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 43 events in 57.427320s. 0 left in queue
2023-06-27 23:56:46,759 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-28 04:05:53,449 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /dispatch from ert._c_wrappers.job_queue.queue
2023-06-28 04:06:38,553 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1 events in 2.892798s. 0 left in queue
2023-06-28 04:07:36,981 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 2 events in 3.770428s. 0 left in queue
interpretation:
maybe the gui missed something while it got disconnected
maybe the JobQueue is not actually done yet, as the JobQueueNodes don't seem to be done yet, as they still have to run what i think is in the callbacks, i.e. load results (or is that a different part of the code? not sure, haven't looked at the details yet) - in that case the gui is correctly still waiting for iteration 0 to finish.
but why would the job queue (nodes) be lagging so much!?
EDIT: After further investigation: from job_queue_status.cpp:
JOB_QUEUE_DONE;    // Job is done (successful or not), temporary state - controlled/returned by by queue_driver
JOB_QUEUE_SUCCESS; // All good, comes after JOB_QUEUE_DONE, with additional checks, FINAL STATE - controlled by job_queue
and looking a bit further, indeed, the JobQueueNode (cf. job_queue_node.py) logs "starting to load results" and calls the callback after we have JOB_QUEUE_DONE; after that, we may get JOB_QUEUE_SUCCESS, and finally ThreadStatus.DONE for the JobQueueNode. Once all JobQueueNodes have ThreadStatus.DONE, the JobQueue will switch state to not is_active, and finish.
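schematically, the flow described above can be summarized like this (an illustration only, not the actual ert classes):

# schematic illustration of the state flow described above (not the actual ert code)
from enum import Enum, auto


class JobStatus(Enum):
    JOB_QUEUE_DONE = auto()     # temporary state, set by the queue driver
    JOB_QUEUE_SUCCESS = auto()  # final state, set after the done-callback has run


class ThreadStatus(Enum):
    RUNNING = auto()
    DONE = auto()


def node_lifecycle(load_results_callback):
    status = JobStatus.JOB_QUEUE_DONE   # driver says the job finished
    load_results_callback()             # "Realization ... complete, starting to load results"
    status = JobStatus.JOB_QUEUE_SUCCESS
    return ThreadStatus.DONE            # once all nodes are DONE, the JobQueue is no longer is_active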
but why is there so much time between the C part of JobQueueNode logging "Set POLY_SUMMARY(170) to JOB_QUEUE_DONE" and the python part noticing this, and going on to the callback???
to investigate the last question, i've
and run the big poly case again.
this is all btw happening on the branch detailed-study-many-realz on my fork...
so after a bit of work was done mostly by @berland, which as far as we could tell massively sped up the processing that was being logged, and after fixing all (that we found) remaining bugs that came with the rewrite of the (partial) snapshot class, we had another very optimistic and hopeful test of the user model. we failed. we saw timeouts again :sob: and we saw durations of up to 900 seconds!
2023-07-15 15:55:48,417 processed 413 events in 130.765714s. 0 left in queue
2023-07-15 16:03:14,224 processed 1000 events in 317.391545s. 4517 left in queue
2023-07-15 16:07:55,578 processed 1000 events in 197.217101s. 12116 left in queue
2023-07-15 16:10:40,980 processed 1000 events in 110.467765s. 12591 left in queue
2023-07-15 16:18:08,543 processed 1000 events in 351.514885s. 11459 left in queue
2023-07-15 16:21:09,084 processed 1000 events in 112.926649s. 17759 left in queue
2023-07-15 16:26:57,042 processed 1000 events in 161.048984s. 14759 left in queue
2023-07-15 16:45:45,262 processed 1000 events in 914.408085s. 15263 left in queue
2023-07-15 17:05:02,859 processed 1000 events in 741.990857s. 37063 left in queue
2023-07-15 17:12:23,042 processed 1000 events in 149.200687s. 56336 left in queue
and finally the monitor brought down ert again
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - \
run_and_get_successful_realizations caught Connection \
timeout to host wss://10.217.147.132:51839/healthcheck.\
3 unsuccessful attempts
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - Max connection attempts reached
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - Cancelling current ensemble
i'm kind of at a dead end again, and confused - seems like we've come round again to the original issue.
i'm confused, because I posted on slack that timings were awesome and whatnot - BUT it may be that we did not get a good enough test. in the case above, we start failing around 16:01 - realization 22 (which always fails) failed at 14:56, ca. 1 hour earlier. if i remember correctly, in the previous case which i thought was triumphant, the run failed because the failure of realization 22 was not handled correctly. this means that the times we saw there were before the actual bad storm that apparently occurs later for this model...
so to ascertain my own sanity, i'm gonna run some basic tests. i'm still unsure what is going on, and why we have these long processing times in the user model when we have them. we have one storm first, starting at around 12:17, but durations don't get above 2 seconds, with 40k max left in queue. then you have the storm starting at around 16:03, with max 68k left in queue, which takes down the app, and durations of up to 900 seconds. but at the same time you have outliers:
2023-07-15 15:29:13,617 processed 52 events in 69.471989s. 0 left in queue
2023-07-15 15:49:22,779 processed 138 events in 49.568064s. 0 left in queue
2023-07-15 15:50:32,318 processed 1000 events in 43.629527s. 606 left in queue
and these I cannot explain at all. This big variation in processing times, it either means that
we're gonna try to test the last assumption, and take as a culprit "the gui", trying to run CLI and GUI and compare times.
we run the dummy example: dummy.ert given below; dummy-file has as content a few random strings, like "hello world". config file:
QUEUE_SYSTEM LOCAL
QUEUE_OPTION LOCAL MAX_RUNNING 300
RUNPATH output/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 200
MIN_REALIZATIONS 1
MAX_SUBMIT 1
FORWARD_MODEL COPY_FILE(<FROM>=<CONFIG_PATH>/dummy-file,<TO>=<RUNPATH>/copy-of-dummy-file)
-- repeat above line ca. 228 times
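the repeated FORWARD_MODEL lines can be generated with a tiny helper like this (our own throwaway script, not part of the example; append its output to dummy.ert):

# throwaway helper to emit the ~228 repeated FORWARD_MODEL lines
N_COPIES = 228
LINE = "FORWARD_MODEL COPY_FILE(<FROM>=<CONFIG_PATH>/dummy-file,<TO>=<RUNPATH>/copy-of-dummy-file)"

for _ in range(N_COPIES):
    print(LINE)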
we run with the CLI, using ert ensemble_experiment dummy.ert
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 49ff6a1d
  - this gives us ee logs where
- we run on main, which at the moment means commit db9424ced0b
  - this gives us ee logs where
we run with the gui, changing to ensemble experiment and hitting go immediately
- we run on main, which at the moment means commit db9424ced0b
  - similar to the CLI run, except that during queue build-up, durations are slightly longer, between 2 and 6 seconds - after that, they stabilize below one second! total duration in the gui window: 8 minutes, ert log file last timestamp minus first timestamp confirms 8 minutes (and 13 seconds)
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 49ff6a1d
  - total duration in the gui: 5 minutes 48 seconds (so there's something we're doing right), logs confirm (5 seconds more)
run as in comment above, except on an RGS node
we needed to change queue options in dummy.ert to
QUEUE_SYSTEM LSF
QUEUE_OPTION LSF MAX_RUNNING 300
BTW we always cleared storage and output before running again, but sometimes kept logs around
we run with the CLI, using ert ensemble_experiment dummy.ert
we run with komodo bleeding (which was green, btw, awesome), which has ert main at commit 5e3164269.
this gives us ee logs where
2023-07-16 10:43:59,066 processed 1 events in 0.034413s. 0 left in queue
2023-07-16 10:44:41,281 processed 358 events in 14.002437s. 0 left in queue
2023-07-16 10:45:23,631 processed 99 events in 26.151736s. 0 left in queue
2023-07-16 10:45:59,836 processed 1000 events in 28.127329s. 44328 left in queue
2023-07-16 10:46:22,825 processed 1000 events in 20.342448s. 94801 left in queue
2023-07-16 10:46:25,912 processed 1000 events in 1.083446s. 134206 left in queue
2023-07-16 10:46:31,880 processed 1000 events in 3.955166s. 133212 left in queue
2023-07-16 10:46:37,741 processed 1000 events in 3.837912s. 132229 left in queue
even though the queue is logging JOB_QUEUE_SUCCESS, it's still showing 75 waiting and 103 running, only 20 finished. indeed, after 22 minutes, it says "Simulation completed", without making much progress on realization states... observed that the model was finished, waiting on tasks completion...
singleton long durations during cooldown - REASON?!
2023-07-16 10:48:07,737 processed 1000 events in 7.504578s. 118485 left in queue
2023-07-16 10:48:19,034 processed 1000 events in 9.287830s. 117485 left in queue
2023-07-16 10:48:46,323 processed 1000 events in 18.624433s. 115485 left in queue
2023-07-16 10:48:57,499 processed 1000 events in 8.231568s. 114485 left in queue
2023-07-16 10:55:10,196 processed 1000 events in 7.288574s. 54485 left in queue
2023-07-16 10:55:49,015 processed 1000 events in 23.923132s. 51485 left in queue
2023-07-16 11:01:14,219 processed 1000 events in 12.348842s. 1485 left in queue
we run on branch simple-snapshot-investigation-rebase of my own fork, commit 7cf31473adb, in a komodoenv based on bleeding (different commit compared to above - just a few formal changes to test config files, and rebased onto main of equinor fork)
this gives us ee logs where
we run with the gui, changing to ensemble experiment and hitting go immediately
- we run with komodo bleeding (which was green, btw, awesome), which has ert main at commit 5e3164269.
  - ensemble finished normally 23 minutes before the ee logs finished
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 7cf31473adb, in a komodoenv based on bleeding (different commit compared to above - just a few formal changes to test config files, and rebased onto main of equinor fork)
  - this gives us ee logs where the ensemble finished normally, in ..., ee logs and app/logs finishing
User on Yammer has reported this too: https://web.yammer.com/main/org/statoil.com/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjM2MzQyMTQyNDY4OTE1MiJ9
User on Slack reported this too: https://equinor.slack.com/archives/CDPJULWR4/p1692083517380579
reopening until we have confirmed working through user tests
Just to summarize here a bit our findings so far since merging and releasing this in komodo - we have several users that have run heavy (many realz) workflows, and they finished successfully (besides some RMS issues, that we suspect are related to lack of memory).
we have one or two users running stuff in azure, and they report failures - cf. the slack thread [INTERNAL LINK].
to just have the relevant observations here as well:
they seem to be running with 100 realz
we have long processing times for just a handful of events in the beginning of iter-0 (does not lead to timeouts):
2023-08-22 09:32:45,036 processed 8 events in 4.450264s. 0 left in queue
2023-08-22 09:32:55,920 processed 4 events in 7.749988s. 0 left in queue
2023-08-22 09:33:08,870 processed 1 events in 10.502106s. 0 left in queue
2023-08-22 09:33:34,052 processed 2 events in 22.767731s. 0 left in queue
2023-08-22 09:34:33,443 processed 9 events in 56.638394s. 0 left in queue
2023-08-22 09:35:16,219 processed 7 events in 39.758926s. 0 left in queue
2023-08-22 09:35:24,024 processed 16 events in 5.047750s. 0 left in queue
further processing is fast, no errors or timeouts
come iter-1, we get long processing again, just one instance:
14:24:29,328 Evaluator is done
14:24:34,334 connecting to new monitor...
14:25:17,050 Started evaluator, joining until shutdown
14:25:17,083 connected
14:25:19,009 processed 101 events in 0.007326s. 0 left in queue
14:25:22,372 no events to be processed in queue
14:25:26,195 no events to be processed in queue
14:25:42,516 processed 2 events in 12.302223s. 0 left in queue
14:28:48,042 connection closed error: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
14:28:48,050 connecting to new monitor...
14:28:48,736 processed 6 events in 184.077428s. 0 left in queue
- finally we get a bunch of connection closed errors, the usual, until the queue fails to send, taking everything down
this last occurrence is rather frustrating - one is tempted to explain the long processing times with slower disk and / or network performance, yet i am not aware of anything disk-intensive that might be happening at the same time as this error occurred (at the beginning of iteration 1). also, after our changes (moving the batching processor into its own thread), we would expect, based on our understanding of the python scheduler, that the ensemble evaluator should not experience trouble even though the dispatcher gets busy. it seems our mental model is still faulty.
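for context, the change referred to above (batching processor in its own thread) follows a pattern roughly like this sketch - not the actual ert implementation, and handle_batch here is a placeholder:

# schematic sketch of draining events in a dedicated worker thread so the
# websocket-serving event loop is never blocked by batch processing
import queue
import threading
import time

event_queue = queue.Queue()


def handle_batch(batch):
    time.sleep(0.01)  # placeholder for snapshot updates etc.


def batcher():
    while True:
        batch = [event_queue.get()]
        # grab whatever else is already waiting, up to a cap
        while len(batch) < 1000:
            try:
                batch.append(event_queue.get_nowait())
            except queue.Empty:
                break
        handle_batch(batch)


threading.Thread(target=batcher, daemon=True).start()

# the websocket handler (running in the asyncio event loop) only does
# event_queue.put_nowait(event), which never blocks on processing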
another potentially important note that i don't want to get lost in my private notes - a user at johan sverdrup told me that they had fewer difficulties running an ensemble experiment with 200 realizations when they reduced the number of parameters, using ../input/distributions rather than gen_kw_init_all.txt, used like this in the config (name included in <GEN_KW_INIT_NAME>):
GEN_KW INIT_FILES ../input/templates/gen_kw.tmpl dummy.txt <GEN_KW_CONFIG_INIT> INIT_FILES:../input/distributions/<CASE>/<GEN_KW_INIT_NAME>%d.txt
i don't have more details written down, if anybody ever wants to investigate this remark (and doesn't know who the user in question is), ping me
Definition of Done is when this issue gets resolved: https://github.com/equinor/scout/issues/903
Describe the bug
A user has reported [INTERNAL LINK] that ert fails in the start of iter-1. Iter-0 completes successfully, and the analysis step seems to run fine. The ensemble setup ran fine with 50 realizations, but 200 did not work, nor did 100. At first it was suspected that they ran out of memory, but that seems not to be the case.
Fails here: https://github.com/equinor/ert/blob/e21998046cc85d6dfbcc4eb0c3fa5c370b151a71/src/ert/_c_wrappers/job_queue/queue.py#L471
Environment
Note! Using komodo 2022.11.02, but the queue interaction should not be that different from main.