after some advice from Øyvind i found the poly template and used a script to generate lots of data, and ran esmda locally on my machine. no problem, went from iteration 0 to iteration 3.
on ert main prepare a big case
cd tests/performance_tests
python3 performance_utils.py
copy the case over to a local ert checkout, to have control over the files and add them to git; we assume to be in the ert dir:
cp /tmp/[TEMP_DIR]/poly test-data/big-poly-case
all config modifications are done in poly.ert, see branch for details
checked out above setup, changed queue system to LSF, and started by running
ert es_mda --realizations 0-199 --target-case default_%d --weights '4,2,1' poly.ert
(i prefer putty over the gui)
this could be connected to https://github.com/equinor/ert/issues/4708
branch on my fork for reproducing the issue: https://github.com/valentin-krasontovitsch/ert/tree/test-200-realizations ping me here or on slack in case you wonder if the latest changes have been pushed!
so after realizing that the local reproduction was due to moving the laptop and changing the wifi controller's ip address, we were kind of back to square one, but learned a lot about the websocket communication in ert.
now we are trying to reproduce with the drogon model: we copied the latest release of drogon [INTERNAL LINK], grabbing the tar archive from the repo and unpacking it locally on an rgs node. we're gonna run the drogon_ahm.ert config file, using the ESMDA algorithm.
to increase the number of realizations, we had to:
- set NUM_REALIZATIONS 300 in the config file
- modify ../input/distributions/design_matrix_ahm.xlsx, adding ca. 200 more rows by repeating (SENSNAME, SENSCASE) exactly (constant value in each row) - a sketch for duplicating the rows programmatically follows after the run commands below
finally, just run (for command line execution):
ert es_mda --realizations 0-299 --target-case default_%d --weights '4,2,1' drogon_ahm.ert
or start the gui and do the thing (esmda default options) there.
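as a side note, the extra design matrix rows could also be generated with a small throwaway script like the one below. this is our own helper, not part of the original setup; the path and the assumption that simply repeating the last data row is sufficient are ours.

# sketch: duplicate the last data row of the AHM design matrix ~200 times
# (throwaway helper; assumes openpyxl is available and that repeating the
# last row verbatim is what the setup needs)
from openpyxl import load_workbook

PATH = "../input/distributions/design_matrix_ahm.xlsx"
EXTRA_ROWS = 200

wb = load_workbook(PATH)
ws = wb.active
last_row = [cell.value for cell in ws[ws.max_row]]
for _ in range(EXTRA_ROWS):
    ws.append(last_row)
wb.save(PATH)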
what we see at the end of the ert logs (ert-... in the logs folder next to the config file), in this case from drogon:
2023-05-22 10:15:04,987 - res.job_queue - LegacyEnsemble - DEBUG - Set DROGON-298(298) to JOB_QUEUE_FAILED
2023-05-22 10:15:04,998 - res.job_queue - LegacyEnsemble - DEBUG - Set DROGON-299(299) to JOB_QUEUE_FAILED
2023-05-22 10:17:20,990 - root - MainThread - DEBUG - Resetting environment variable ERT_SITE_CONFIG
2023-05-22 10:17:20,991 - root - MainThread - DEBUG - Deleting temporary directory for site-config
2023-05-22 10:17:20,992 - ert.__main__ - MainThread - ERROR - Simulation failed! All realizations failed!
Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Realization: 122 failed after reaching max submit with:
Realization: 145 failed after reaching max submit with:
Realization: 161 failed after reaching max submit with:
Realization: 25 failed after reaching max submit with:
Realization: 41 failed after reaching max submit with:
unexpected exception in queue
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/__main__.py", line 602, in main
args.func(args, context.plugin_manager)
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/cli/main.py", line 169, in run_cli
raise ErtCliError(model.getFailMessage())
ert.cli.main.ErtCliError: Simulation failed! All realizations failed!
Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Realization: 122 failed after reaching max submit with:
Realization: 145 failed after reaching max submit with:
Realization: 161 failed after reaching max submit with:
Realization: 25 failed after reaching max submit with:
Realization: 41 failed after reaching max submit with:
unexpected exception in queue
2023-05-22 10:17:20,998 - root - MainThread - INFO - Peak memory use: 1490276 kB
this happens after trying to kill all (pending / running / waiting / i dunno) jobs, while the original exception typically occurs a bit earlier, like so:
2023-05-22 10:14:46,928 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - ERROR - Connection to websocket wss://143.97.182.178:51827/dispatch failed, unable to publish changes
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 390, in _publish_changes
async with connect(
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
2023-05-22 10:14:46,950 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - ERROR - unexpected exception in queue
Traceback (most recent call last):
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 454, in execute_queue_via_websockets
await JobQueue._publish_changes(
File "/private/vkra/venvs/2023.05-200-real/root/lib64/python3.8/site-packages/ert/_c_wrappers/job_queue/queue.py", line 390, in _publish_changes
async with connect(
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.00-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
this comes after retrying the connection 10 times, with a certain backoff, as programmed by us into our queue component.
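roughly, that retry logic amounts to something like the following sketch - not the actual ert queue code; the function name, retry count and backoff schedule here are illustrative:

# illustrative sketch of "retry connect with backoff" around a websocket publish;
# not the actual ert implementation, names and numbers are made up
import asyncio
from websockets.client import connect


async def publish_with_retries(uri, messages, retries=10, backoff=5):
    for attempt in range(retries):
        try:
            async with connect(uri, open_timeout=60) as ws:
                for msg in messages:
                    await ws.send(msg)
                return
        except (asyncio.TimeoutError, OSError):
            # wait a bit longer before giving the server another chance
            await asyncio.sleep(backoff * (attempt + 1))
    raise RuntimeError("unable to publish changes after retries")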
checking the logs of the ensemble evaluator (ee) around the timestamps where the job queue was not able to connect to the ee websockets (ws) server, we see a pattern of the ee's "queue" running full, or at least getting busy, like so (the result of running grep -e 'processed' ee-log...):
2023-05-22 10:09:22,075 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 15 events in 0.000029s. 0 left in queue
2023-05-22 10:09:31,989 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 22 events in 0.000033s. 0 left in queue
2023-05-22 10:09:44,877 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 996 events in 0.000561s. 0 left in queue
2023-05-22 10:09:53,879 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 39 events in 0.000040s. 0 left in queue
2023-05-22 10:10:05,196 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 167 events in 0.000113s. 0 left in queue
2023-05-22 10:10:16,641 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 30 events in 0.000021s. 0 left in queue
2023-05-22 10:10:24,468 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004492s. 1224 left in queue
2023-05-22 10:10:41,075 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000685s. 9404 left in queue
2023-05-22 10:10:52,382 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000919s. 8780 left in queue
2023-05-22 10:11:00,005 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.014857s. 9318 left in queue
2023-05-22 10:11:20,321 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004537s. 12905 left in queue
2023-05-22 10:11:24,130 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.013159s. 11981 left in queue
2023-05-22 10:11:34,379 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004296s. 11158 left in queue
2023-05-22 10:11:52,142 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001561s. 10764 left in queue
2023-05-22 10:12:03,758 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.004955s. 10111 left in queue
2023-05-22 10:12:20,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.013619s. 9434 left in queue
2023-05-22 10:12:35,700 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.003865s. 8757 left in queue
2023-05-22 10:12:50,083 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001116s. 7983 left in queue
2023-05-22 10:13:05,309 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.014509s. 7308 left in queue
2023-05-22 10:13:16,903 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.012791s. 6541 left in queue
2023-05-22 10:13:32,172 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000586s. 5804 left in queue
2023-05-22 10:13:45,588 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001132s. 5106 left in queue
2023-05-22 10:14:00,905 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001094s. 13375 left in queue
2023-05-22 10:14:09,699 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001778s. 13253 left in queue
2023-05-22 10:14:28,020 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.008381s. 14137 left in queue
2023-05-22 10:14:37,003 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001520s. 13688 left in queue
2023-05-22 10:15:06,004 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.003402s. 13087 left in queue
2023-05-22 10:15:13,608 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.002020s. 12604 left in queue
2023-05-22 10:15:20,765 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001408s. 12035 left in queue
2023-05-22 10:15:25,816 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001443s. 11035 left in queue
2023-05-22 10:15:31,057 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001367s. 10035 left in queue
2023-05-22 10:15:36,123 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001199s. 9035 left in queue
2023-05-22 10:15:41,267 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.001048s. 8035 left in queue
2023-05-22 10:15:46,206 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000992s. 7035 left in queue
2023-05-22 10:15:51,013 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000887s. 6035 left in queue
2023-05-22 10:15:57,388 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000880s. 5035 left in queue
2023-05-22 10:16:02,627 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000696s. 4035 left in queue
2023-05-22 10:16:07,768 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000719s. 3035 left in queue
2023-05-22 10:16:14,433 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000349s. 2035 left in queue
2023-05-22 10:16:19,399 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000595s. 1035 left in queue
2023-05-22 10:16:25,038 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 0.000469s. 35 left in queue
2023-05-22 10:16:30,131 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 35 events in 0.000038s. 0 left in queue
2023-05-22 10:16:32,474 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 0 events in 0.000015s. 0 left in queue
2023-05-22 10:16:32,479 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 0 events in 0.000008s. 0 left in queue
This correlation was very clear for the drogon case, in the sense that the ee's queue got busy / full exactly when the job_queue could not connect to the ee ws server anymore.
For the first user report, the case was less clear - the ee queue got busier when the job_queue could not reconnect, but not full to the same degree as above.
An inspection of the logs connected to the user report on yammer also shows a correlation, in the sense that the ee queue runs full when the job queue cannot connect anymore, but the ee queue also gets very full (even fuller than when the job_queue fails) while the job_queue reports no trouble...
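for reference, a small ad-hoc script along these lines can be used to pull the processing time and queue depth out of the ee logs (our own helper for log analysis, not part of ert):

# parse "processed N events in Xs. M left in queue" lines from an ee log
# (ad-hoc analysis helper, not part of ert)
import re
import sys

PATTERN = re.compile(
    r"^(?P<ts>\S+ \S+).*processed (?P<n>\d+) events in (?P<secs>[\d.]+)s\. (?P<left>\d+) left in queue"
)

with open(sys.argv[1]) as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            print(match["ts"], match["n"], match["secs"], match["left"])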
create a personal dir (no data should leave the assets environment):
mkdir /project/johan_sverdrup/resmod/ff/users/[YOUR_USERNAME]
the latest model can be found in
/project/johan_sverdrup/resmod/ff/users/tomal/23.0.1_16022023_surface2_poly_newrms
copy the whole thing over, and you may find the ert config file we've been using in (relative paths now)
ert/model/newrms_23p0p1_history.ert
there is one little script that needs updating - in case your copy didn't have the update, here it is - apply this diff in your model in rms/bin/js_fmu_rms_scripts:
diff --git a/python/compress_grdecl.py b/python/compress_grdecl.py
index b5017eb..d4585bc 100644
--- a/python/compress_grdecl.py
+++ b/python/compress_grdecl.py
@@ -79,7 +79,14 @@ def main(files, keeporiginal=False, dryrun=False):
         if not keeporiginal:
             os.remove(filename + ".orig")
-
+        with open (filename, 'rt') as f:
+            data = f.read()
+            data = data.replace('METRES 3*', 'METRES')
+            f.close()
+        with open (filename, 'wt') as f:
+            f.write(data)
+            f.close()
+
 
 def chunks(l, n):
     """Yield successive n-sized chunks as strings from list l."""
     for i in range(0, len(l), n):
install ert with python3 -m pip install . (run from the source folder of the ert repo), or pip install --no-deps . from a local checkout
the scratch path will be /scratch/johan_sverdrup/[YOUR_USERNAME], so no need to modify anything there.
start it with ert gui newrms_23p0p1_history.ert, and run two pre-simulation workflows.
for rerunning, you might want to rename /scratch/johan_sverdrup/[USERNAME]/[CASENAME] or modify the case name (removing takes a long time, and you might want to hold on to the logs in the run path).
ran said model with a branch based on somewhat current main, with several fixes included (timeouts increased to 60 seconds, job queue reusing the websockets connection instead of creating a new connection every time it publishes a (possibly empty) list of changes), starting may 31st.
log files are in /project/johan_sverdrup/resmod/ff/users/vkra/23.0.1_16022023_surface2_poly_newrms/ert/model/logs
with timestamp 2023-05-31T1427
scratch has been moved to /scratch/johan_sverdrup/vkra/23p0p1_new_rms_test_2023-05-31
we have the following situation:
another thread / idea to follow is that one user is running a setup with 300 realizations successfully, using komodo 2022.11.rc7. could check the diff between that version and 2022.11.02 as per initial report...
TODO:
describe work-around for RMS update (see teams! - look at rms/bin/js_fmu_rms_scripts): apply the compress_grdecl.py diff shown earlier in this thread, in the above mentioned folder
- branch test-threading-sleep-with-poly on the valentin-krasontovitsch fork that makes the job queue node threads sleep longer
- comparison between ert versions in the mentioned komodo releases: https://github.com/equinor/ert/compare/4.1.0-rc3...4.1.4
NB: When the jobs are only pending we need to ssh into the compute cluster first, which in a default case in Stavanger would be: /usr/bin/ssh st-grid03.st.statoil.no
Updating the issue with the note from @valentin-krasontovitsch about when we got the heavy event processing:
2023-06-01 20:36:36,539 - ert.ensemble_evaluator.dispatch - processed 31 events in 4.172715s. 88 left in queue
2023-06-01 20:37:04,060 - ert.ensemble_evaluator.dispatch - processed 168 events in 25.486581s. 199 left in queue
2023-06-01 20:40:14,327 - ert.ensemble_evaluator.dispatch - processed 199 events in 148.126047s. 9526 left in queue
2023-06-01 20:47:51,125 - ert.ensemble_evaluator.dispatch - processed 1000 events in 446.345663s. 30881 left in queue
2023-06-01 20:49:33,615 - ert.ensemble_evaluator.dispatch - processed 1000 events in 28.171717s. 38835 left in queue
2023-06-01 20:55:31,811 - ert.ensemble_evaluator.dispatch - processed 1000 events in 210.772313s. 58862 left in queue
2023-06-01 20:58:12,007 - ert.ensemble_evaluator.dispatch - processed 1000 events in 149.863073s. 57862 left in queue
This problem can be reproduced on the poly_example with something along the following:
QUEUE_OPTION LOCAL MAX_RUNNING 500
NUM_REALIZATIONS 500
and repeat SIMULATION_JOB poly_eval maybe 50-100 times.
in poly_eval.py, add import time and a time.sleep(1) at the end of main (see the sketch below).
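for reference, a minimal sketch of what the modified poly_eval.py could look like - the evaluation part is paraphrased from the poly_example and may differ in detail from the real script; the point is just the added sleep:

#!/usr/bin/env python
# sketch of a slowed-down poly_eval.py; the evaluation part is paraphrased
# from the poly_example and may differ in detail from the real script
import json
import time


def _load_coeffs(filename):
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)["COEFFS"]


def _evaluate(coeffs, x):
    return coeffs["a"] * x**2 + coeffs["b"] * x + coeffs["c"]


def main():
    coeffs = _load_coeffs("parameters.json")
    output = [_evaluate(coeffs, x) for x in range(10)]
    with open("poly.out", "w", encoding="utf-8") as handle:
        handle.write("\n".join(map(str, output)))
    # artificial delay so each forward model step is slow enough to pile up events
    time.sleep(1)


if __name__ == "__main__":
    main()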
When reproducing the poly_example with the valentin/test-threading-sleep-with-poly branch, it seems that all went well 👍
No timeout errors.
Nevertheless, the GUI was extremely unresponsive during the actual run. We might need to look at that.
sadly the fixes were not enough, reopening
latest test run of the user model with the "fixes" finished iter-0, and then the gui was basically hanging. while the server was unresponsive, because it was very busy dealing with messages from other clients, the gui was trying to reconnect. previously, it failed after three unsuccessful attempts. the "fix" we implemented yesterday was to not let the gui fail, but let it retry indefinitely. the result is that it does not fail (yeahy), but while it's not connected, it also doesn't receive the messages it should get from the server (status updates about forward models and realizations) - once it reconnects, it waits for messages, but everything is finished (which it doesn't know), and it stays spinning, waiting for updates that never come. that is at least my current understanding.
several ideas:
exploring the idea of disabling the ping pong mechanism on the client side.
testing locally with a simple client and server script, we see that setting ping_timeout to None in the client's connect call stops the client from closing the connection on timed out pings - according to the docs, ping pongs are still sent, we just don't care about timeouts, so the TCP connection is still kept alive, hopefully. from the websockets docs:
"Once the connection is open, a Ping frame is sent every ping_interval seconds. This serves as a keepalive. It helps keeping the connection open, especially in the presence of proxies with short timeouts on inactive connections. Set ping_interval to None to disable this behavior." (Ping: https://www.rfc-editor.org/rfc/rfc6455.html#section-5.5.2)
"If the corresponding Pong frame isn't received within ping_timeout seconds, the connection is considered unusable and is closed with code 1011. This ensures that the remote endpoint remains responsive. Set ping_timeout to None to disable this behavior." (Pong: https://www.rfc-editor.org/rfc/rfc6455.html#section-5.5.3)
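in code, the client-side change we are testing amounts to something like this (illustrative only; the uri and the timeout values are placeholders):

# sketch: connect without a ping timeout so a busy server doesn't get the
# connection closed on missed pongs (uri and timeout values are placeholders)
import asyncio
from websockets.client import connect


async def open_connection(uri):
    # ping_interval keeps sending keepalive pings; ping_timeout=None means we
    # never close the connection just because a pong arrives late
    async with connect(uri, ping_interval=20, ping_timeout=None, open_timeout=60) as ws:
        await ws.send("hello")


# usage (placeholder address): asyncio.run(open_connection("wss://127.0.0.1:51827/dispatch"))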
we also see a ConnectionRefused error because nothing is listening on the port anymore. that's potentially bad news for us, as previous trials seemed to indicate that the client fails immediately if nothing is reachable under the given address. should explore undesirable states and figure out if it's a problem or not.
not wanting to waste much time on simulating things in silly scripts, i proceeded to try and run the beefed up poly example again, with main + gui reconnects indefinitely + job runner waits (sleeps) a bit with pushing events after the initial no-sleep time.
locally, on my own machine, using the local queue, and running with MAX_RUNNING 14 to not overstrain the meager 16 cores, the results were not satisfactory - it just ran, processing times were not immense, and i stopped it after a short while.
but on an RGS node, using the LSF queue and no limitation on how many realizations should run concurrently, event processing times went through the roof, and the gui soon disconnected. nice. reproduction of the user case achieved.
disabling all ping timeouts sadly did not fix the situation. somehow, connections are still getting dropped. i cannot discern how and why the connection gets dropped, but the queue soon drops into a reconnect loop:
websockets.client - LegacyEnsemble - INFO - ! connect failed again; retrying in 55 seconds
Traceback (most recent call last):
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 607, in __aiter__
async with self as protocol:
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 642, in __aenter__
return await self
File "/prog/res/komodo/2023.05.02-py38-rhel7/root/lib64/python3.8/site-packages/websockets/legacy/client.py", line 659, in __await_impl_timeout__
return await asyncio.wait_for(self.__await_impl__(), self.open_timeout)
File "/opt/rh/rh-python38/root/usr/lib64/python3.8/asyncio/tasks.py", line 501, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
i'm also seeing a lot of bad request logs
websockets.server - ert_ee_run_server - INFO - connection failed (400 Bad Request)
to dig deeper here, one might turn on more debug logging, and perhaps repeat the experiment with just one realization, to reduce the amount of noise / data, while inserting an artificial sleep into the event processing to simulate a long processing time. this could help us understand how and why connections with the various clients are behaving.
regardless, the idea of just disabling the timeouts does not seem good enough.
EDIT: looking a bit more at the still running gui, it got updates - from the job runners. it's just the job queue that drops the connection - probably because it's trying to send! and the server is not responsive / ready to receive...
after running a bit more and experimenting a bit more with this and that, and finally starting to profile stuff (yappi, for easy multi-threaded profiling), we have come to the conclusion that the way forward is one of the following possibilities:
this would solve like almost all our issues immediately (maybe hopefully), but we decide to not go for this for now
we try to increase some more sleeps, decrease processing buffer, and maybe implement a pattern where the ee server waits for the gui to reconnect, and then hope for the best (with possible delays in the gui / for short running stuff)
profiling seems to show (i literally have no idea what i'm doing and whether i did it right) that recursive update takes a lot of time.
we could investigate what's up with that, in the particular use case of forward model events, and see if we can find a faster implementation.
could also add some tests for making sure it works and for benchmarking the speed.
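for the record, the profiling was done roughly along these lines (a generic yappi snippet, not the exact invocation we used; the work() function is a placeholder for the code under investigation):

# generic yappi profiling sketch - wall-clock timing across all threads,
# not the exact invocation used in the experiments above
import time

import yappi


def work():
    time.sleep(0.1)  # placeholder for the code under investigation


yappi.set_clock_type("wall")  # measure wall time, since we mostly wait on I/O
yappi.start()
work()
yappi.stop()
yappi.get_func_stats().sort("ttot").print_all()
yappi.get_thread_stats().print_all()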
re: duct-tape. we tried the adjustment, and this led to big_poly_case still not having finished, even though all realizations had finished. overall, I think that our adjustment maybe was good - it went in the right direction; this was, after all, the (hopefully) somewhat crazy artificial case with 200 realizations and 200 silly fast-running forward models - but perhaps too conservative.
gonna try a more dynamic approach next, where any time the event processing time goes >= 10 seconds, we reduce the batch size and increase the sleep (of the batching dispatcher), reverting if processing time goes down. a rough sketch of the idea follows below.
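this is a schematic sketch only; the real batching dispatcher in ert is structured differently, and the names, constants and process_events helper here are made up:

# schematic sketch of an adaptive batching loop: shrink the batch and sleep
# longer when processing gets slow, revert when it speeds up again
import time

BATCH_MAX, BATCH_MIN = 1000, 100
SLEEP_MIN, SLEEP_MAX = 2.0, 10.0
SLOW_THRESHOLD = 10.0  # seconds


def drain(pending, process_events):
    """pending: shared list of events; process_events: callable handling one batch."""
    batch_size, sleep_time = BATCH_MAX, SLEEP_MIN
    while True:  # runs for the lifetime of the evaluator
        batch = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]
        start = time.monotonic()
        process_events(batch)
        duration = time.monotonic() - start
        if duration >= SLOW_THRESHOLD:
            # back off: smaller batches, longer pauses between them
            batch_size = max(BATCH_MIN, batch_size // 2)
            sleep_time = min(SLEEP_MAX, sleep_time * 2)
        else:
            # recover towards the defaults when processing is fast again
            batch_size = min(BATCH_MAX, batch_size * 2)
            sleep_time = max(SLEEP_MIN, sleep_time / 2)
        time.sleep(sleep_time)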
this seems to do some tricks - events are not lagging behind anymore, they're trickling in and getting processed in due time. processing times are reasonable, with a handful above 10 seconds (top: 35).
yet it seems that execution of forward models is delayed on compute nodes: after 30 minutes, most realizations have proceeded to about forward model number 115. keep in mind that each of the first ca. 200 forward models is just a copy file job, with the source being a short text file.
perhaps the sleep in the job runner can be shortened or removed.
this feels like fiddling with something, and possibly making a brittle solution... user model tests will show more.
EDIT: when all realizations of the first iteration are finally done, we get a bit of a storm again, with a few timeouts. the expected and yet annoying thing is the delay in the gui / tracker though - here, instead of every five seconds, the next update (i'm using the CLI) is printed after minutes.
so after one hour, all realizations are finished, but at one hour 20 minutes, the cli is still saying all realizations are running...
perhaps remove the sleep in the evaluator tracker :grimacing: also want to try and remove / reduce the sleep in the job runner. let the batching dispatcher be our dynamic bottleneck, the gatekeeper, if you will.
EDIT EDIT: next morning: after the last
2023-06-27 23:48:32,555 - res.job_queue - Thread-174 - DEBUG - Set POLY_SUMMARY(170) to JOB_QUEUE_DONE
the job queue has only now started to log some lines of
2023-06-28 04:05:35,280 - res.job_queue - Thread-7 - DEBUG - Set POLY_SUMMARY(3) to JOB_QUEUE_SUCCESS
and
2023-06-28 04:05:35,366 - ert._c_wrappers.job_queue.job_queue_node - Thread-18 - INFO - Realization: 14 complete, starting to load results
which is strange.
the whole last block of logs I see at the moment is
...
2023-06-27 23:48:31,966 - res.job_queue - Thread-169 - DEBUG - Set POLY_SUMMARY(165) to JOB_QUEUE_DONE
2023-06-27 23:48:32,555 - res.job_queue - Thread-174 - DEBUG - Set POLY_SUMMARY(170) to JOB_QUEUE_DONE
2023-06-27 23:50:03,919 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - WARNING - job queue dropped connection to ensemble evaulator - going to try and reconnect
2023-06-28 04:05:35,280 - res.job_queue - Thread-7 - DEBUG - Set POLY_SUMMARY(3) to JOB_QUEUE_SUCCESS
2023-06-28 04:05:35,366 - ert._c_wrappers.job_queue.job_queue_node - Thread-18 - INFO - Realization: 14 complete, starting to load results
2023-06-28 04:05:52,701 - ert._c_wrappers.job_queue.queue - LegacyEnsemble - WARNING - job queue dropped connection to ensemble evaulator - going to try and reconnect
2023-06-28 04:05:55,094 - res.job_queue - Thread-16 - DEBUG - Set POLY_SUMMARY(12) to JOB_QUEUE_SUCCESS
2023-06-28 04:05:55,388 - ert._c_wrappers.job_queue.job_queue_node - Thread-21 - INFO - Realization: 17 complete, starting to load results
2023-06-28 04:07:16,190 - res.job_queue - Thread-12 - DEBUG - Set POLY_SUMMARY(8) to JOB_QUEUE_SUCCESS
2023-06-28 04:07:16,581 - ert._c_wrappers.job_queue.job_queue_node - Thread-43 - INFO - Realization: 39 complete, starting to load results
2023-06-28 04:07:27,172 - res.job_queue - Thread-15 - DEBUG - Set POLY_SUMMARY(11) to JOB_QUEUE_SUCCESS
2023-06-28 04:07:27,226 - ert._c_wrappers.job_queue.job_queue_node - Thread-9 - INFO - Realization: 5 complete, starting to load results
2023-06-28 04:08:00,392 - res.job_queue - Thread-10 - DEBUG - Set POLY_SUMMARY(6) to JOB_QUEUE_SUCCESS
2023-06-28 04:08:00,546 - ert._c_wrappers.job_queue.job_queue_node - Thread-35 - INFO - Realization: 31 complete, starting to load results
2023-06-28 04:08:49,773 - res.job_queue - Thread-11 - DEBUG - Set POLY_SUMMARY(7) to JOB_QUEUE_SUCCESS
2023-06-28 04:08:49,780 - ert._c_wrappers.job_queue.job_queue_node - Thread-39 - INFO - Realization: 35 complete, starting to load results
2023-06-28 04:10:36,825 - res.job_queue - Thread-25 - DEBUG - Set POLY_SUMMARY(21) to JOB_QUEUE_SUCCESS
2023-06-28 04:10:36,832 - ert._c_wrappers.job_queue.job_queue_node - Thread-22 - INFO - Realization: 18 complete, starting to load results
2023-06-28 04:11:13,892 - res.job_queue - Thread-23 - DEBUG - Set POLY_SUMMARY(19) to JOB_QUEUE_SUCCESS
2023-06-28 04:11:14,395 - ert._c_wrappers.job_queue.job_queue_node - Thread-4 - INFO - Realization: 0 complete, starting to load results
2023-06-28 04:11:47,510 - res.job_queue - Thread-17 - DEBUG - Set POLY_SUMMARY(13) to JOB_QUEUE_SUCCESS
2023-06-28 04:11:47,534 - ert._c_wrappers.job_queue.job_queue_node - Thread-28 - INFO - Realization: 24 complete, starting to load results
2023-06-28 04:12:00,873 - res.job_queue - Thread-5 - DEBUG - Set POLY_SUMMARY(1) to JOB_QUEUE_SUCCESS
2023-06-28 04:12:01,150 - ert._c_wrappers.job_queue.job_queue_node - Thread-24 - INFO - Realization: 20 complete, starting to load results
i should also say that there were already a few connection errors yesterday, but i thought they were not significant - here are the logs:
... [the line below repeated many times, i think this was all the realizations / job runners coming to a close]
2023-06-27 23:48:18,348 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from _ert_job_runner.client done!
2023-06-27 23:48:18,349 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 100 events in 6.735008s. 1007 left in queue
2023-06-27 23:48:18,455 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from _ert_job_runner.client done!
2023-06-27 23:50:03,935 - ert.ensemble_evaluator.evaluator - ert_cli_simulation_thread - DEBUG - Connection closed unexpectedly in run_and_get_successful_realizations: sent 1011 (unexpected error) keepalive ping t
2023-06-27 23:50:04,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1000 events in 103.618057s. 35 left in queue
2023-06-27 23:50:04,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-27 23:50:04,251 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection closed error: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
2023-06-27 23:50:04,252 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:50:08,859 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - ERROR - a dispatcher abruptly closed a websocket: received 1011 (unexpected error) keepalive ping timeout; then sent 1011 (unexpecte
2023-06-27 23:50:08,859 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from ert._c_wrappers.job_queue.queue done!
2023-06-27 23:50:09,667 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /dispatch from ert._c_wrappers.job_queue.queue
2023-06-27 23:50:19,673 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /client from ert.ensemble_evaluator.monitor
2023-06-27 23:51:07,276 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:51:07,277 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:52:11,094 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:52:11,095 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:53:14,338 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connection refused: Connection timeout to host wss://10.217.147.132:51820/healthcheck
2023-06-27 23:53:14,339 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connecting to new monitor...
2023-06-27 23:54:00,126 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 100 events in 137.251784s. 43 left in queue
2023-06-27 23:54:00,127 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-27 23:54:01,816 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - ERROR - a dispatcher abruptly closed a websocket: received 1011 (unexpected error) keepalive ping timeout; then sent 1011 (unexpecte
2023-06-27 23:54:01,817 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - handling connection from ert._c_wrappers.job_queue.queue done!
2023-06-27 23:54:11,718 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /client from ert.ensemble_evaluator.monitor
2023-06-27 23:54:12,150 - ert.ensemble_evaluator.drainer - DrainerThread - DEBUG - connected
2023-06-27 23:56:46,759 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 43 events in 57.427320s. 0 left in queue
2023-06-27 23:56:46,759 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - adjusting batch size and timeout...
2023-06-28 04:05:53,449 - ert.ensemble_evaluator.evaluator - ert_ee_run_server - INFO - got new connection for /dispatch from ert._c_wrappers.job_queue.queue
2023-06-28 04:06:38,553 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 1 events in 2.892798s. 0 left in queue
2023-06-28 04:07:36,981 - ert.ensemble_evaluator.dispatch - ert_ee_run_server - DEBUG - processed 2 events in 3.770428s. 0 left in queue
interpretation:
maybe the gui missed something while it got disconnected
maybe the JobQueue is not actually done yet, as the JobQueueNodes don't seem to be done yet, as they still have to run what i think is in the callbacks, i.e. load results (or is that a different part of the code? not sure, haven't looked at the details yet) - in that case the gui is correctly still waiting for iteration 0 to finish.
but why would the job queue (nodes) be lagging so much!?
EDIT: After further investigation: from job_queue_status.cpp:
JOB_QUEUE_DONE;    // Job is done (successful or not), temporary state - controlled/returned by by queue_driver
JOB_QUEUE_SUCCESS; // All good, comes after JOB_QUEUE_DONE, with additional checks, FINAL STATE - controlled by job_queue
and looking a bit further, indeed, the JobQueueNode (cf. job_queue_node.py) logs "starting to load results" and calls the callback after we have JOB_QUEUE_DONE; after that, we may get JOB_QUEUE_SUCCESS, and finally ThreadStatus.DONE for the JobQueueNode. Once all JobQueueNodes have ThreadStatus.DONE, the JobQueue will switch state to not is_active, and finish.
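schematically, the flow described above can be summarized like this (an illustration only, not the actual ert classes):

# schematic illustration of the state flow described above (not the actual ert code)
from enum import Enum, auto


class JobStatus(Enum):
    JOB_QUEUE_DONE = auto()     # temporary state, set by the queue driver
    JOB_QUEUE_SUCCESS = auto()  # final state, set after the done-callback has run


class ThreadStatus(Enum):
    RUNNING = auto()
    DONE = auto()


def node_lifecycle(load_results_callback):
    status = JobStatus.JOB_QUEUE_DONE   # driver says the job finished
    load_results_callback()             # "Realization ... complete, starting to load results"
    status = JobStatus.JOB_QUEUE_SUCCESS
    return ThreadStatus.DONE            # once all nodes are DONE, the JobQueue is no longer is_active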
but why is there so much time between the C part of JobQueueNode logging "Set POLY_SUMMARY(170) to JOB_QUEUE_DONE" and the python part noticing this, and going on to the callback???
to investigate the last question, i've
and run the big poly case again.
this is all btw happening on the branch detailed-study-many-realz on my fork...
so after a bit of work was done mostly by @berland, which as far as we could tell massively sped up the processing that was being logged, and after fixing all (that we found) remaining bugs that came with the rewrite of the (partial) snapshot class, we had another very optimistic and hopeful test of the user model. we failed. we saw timeouts again :sob: and we saw durations of up to 900 seconds!
2023-07-15 15:55:48,417 processed 413 events in 130.765714s. 0 left in queue
2023-07-15 16:03:14,224 processed 1000 events in 317.391545s. 4517 left in queue
2023-07-15 16:07:55,578 processed 1000 events in 197.217101s. 12116 left in queue
2023-07-15 16:10:40,980 processed 1000 events in 110.467765s. 12591 left in queue
2023-07-15 16:18:08,543 processed 1000 events in 351.514885s. 11459 left in queue
2023-07-15 16:21:09,084 processed 1000 events in 112.926649s. 17759 left in queue
2023-07-15 16:26:57,042 processed 1000 events in 161.048984s. 14759 left in queue
2023-07-15 16:45:45,262 processed 1000 events in 914.408085s. 15263 left in queue
2023-07-15 17:05:02,859 processed 1000 events in 741.990857s. 37063 left in queue
2023-07-15 17:12:23,042 processed 1000 events in 149.200687s. 56336 left in queue
and finally the monitor brought down ert again
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - \
run_and_get_successful_realizations caught Connection \
timeout to host wss://10.217.147.132:51839/healthcheck.\
3 unsuccessful attempts
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - Max connection attempts reached
2023-07-15 16:01:04,460 - ert.ensemble_evaluator.evaluator - \
ert_gui_simulation_thread - DEBUG - Cancelling current ensemble
i'm kind of at a dead end again, and confused - seems like we've come round again to the original issue.
i'm confused, because I posted on slack that timings were awesome and whatnot - BUT it may be that we did not get a good enough test. in the case above, we start failing around 16:01 - realization 22 (which always fails) failed at 14:56, ca. 1 hour earlier. if i remember correctly, in the previous case which i thought was triumphant, the run failed because the failure of realization 22 was not handled correctly. this means that the times we saw there were before the actual bad storm that apparently occurs later for this model...
so to ascertain my own sanity, i'm gonna run some basic tests. i'm still unsure what is going on, and why we have these long processing times in the user model when we have them. we have one storm first, starting at around 12:17, but durations don't get above 2 seconds, with 40k max left in queue. then you have the storm starting at around 16:03, with max 68k left in queue, which takes down the app, and durations of up to 900 seconds. but at the same time you have outliers:
2023-07-15 15:29:13,617 processed 52 events in 69.471989s. 0 left in queue
2023-07-15 15:49:22,779 processed 138 events in 49.568064s. 0 left in queue
2023-07-15 15:50:32,318 processed 1000 events in 43.629527s. 606 left in queue
and these I cannot explain at all. This big variation in processing times, it either means that
we're gonna try to test the last assumption, and take as a culprit "the gui", trying to run CLI and GUI and compare times.
we run the dummy example: dummy.ert given below; dummy-file has as content a few random strings, like "hello world". config file:
QUEUE_SYSTEM LOCAL
QUEUE_OPTION LOCAL MAX_RUNNING 300
RUNPATH output/realization-<IENS>/iter-<ITER>
NUM_REALIZATIONS 200
MIN_REALIZATIONS 1
MAX_SUBMIT 1
FORWARD_MODEL COPY_FILE(<FROM>=<CONFIG_PATH>/dummy-file,<TO>=<RUNPATH>/copy-of-dummy-file)
-- repeat above line ca. 228 times
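the repeated FORWARD_MODEL lines can be generated with a tiny helper like this (our own throwaway script, not part of the example; append its output to dummy.ert):

# throwaway helper to emit the ~228 repeated FORWARD_MODEL lines
N_COPIES = 228
LINE = "FORWARD_MODEL COPY_FILE(<FROM>=<CONFIG_PATH>/dummy-file,<TO>=<RUNPATH>/copy-of-dummy-file)"

for _ in range(N_COPIES):
    print(LINE)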
we run with the CLI, using ert ensemble_experiment dummy.ert
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 49ff6a1d
  - this gives us ee logs where
- we run on main, which at the moment means commit db9424ced0b
  - this gives us ee logs where
we run with the gui, changing to ensemble experiment and hitting go immediately
- we run on main, which at the moment means commit db9424ced0b
  - similar to the CLI run, except that during queue build-up, durations are slightly longer, between 2 and 6 seconds - after that, they stabilize below one second! total duration in the gui window: 8 minutes, ert log file last timestamp minus first timestamp confirms 8 minutes (and 13 seconds)
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 49ff6a1d
  - total duration in the gui: 5 minutes 48 seconds (so there's something we're doing right), logs confirm (5 seconds more)
run as in comment above, except on an RGS node
we needed to change queue options in dummy.ert to
QUEUE_SYSTEM LSF
QUEUE_OPTION LSF MAX_RUNNING 300
BTW we always cleared storage and output before running again, but sometimes kept logs around
we run with the CLI, using ert ensemble_experiment dummy.ert
we run with komodo bleeding (which was green, btw, awesome), which has ert main at commit 5e3164269.
this gives us ee logs where
2023-07-16 10:43:59,066 processed 1 events in 0.034413s. 0 left in queue
2023-07-16 10:44:41,281 processed 358 events in 14.002437s. 0 left in queue
2023-07-16 10:45:23,631 processed 99 events in 26.151736s. 0 left in queue
2023-07-16 10:45:59,836 processed 1000 events in 28.127329s. 44328 left in queue
2023-07-16 10:46:22,825 processed 1000 events in 20.342448s. 94801 left in queue
2023-07-16 10:46:25,912 processed 1000 events in 1.083446s. 134206 left in queue
2023-07-16 10:46:31,880 processed 1000 events in 3.955166s. 133212 left in queue
2023-07-16 10:46:37,741 processed 1000 events in 3.837912s. 132229 left in queue
even though the queue is logging JOB_QUEUE_SUCCESS, it's still showing 75 waiting and 103 running, only 20 finished. indeed, after 22 minutes, it says "Simulation completed", without making much progress on realization states... observed that the model was finished, waiting on tasks completion...
singleton long durations during cooldown - REASON?!
2023-07-16 10:48:07,737 processed 1000 events in 7.504578s. 118485 left in queue
2023-07-16 10:48:19,034 processed 1000 events in 9.287830s. 117485 left in queue
2023-07-16 10:48:46,323 processed 1000 events in 18.624433s. 115485 left in queue
2023-07-16 10:48:57,499 processed 1000 events in 8.231568s. 114485 left in queue
2023-07-16 10:55:10,196 processed 1000 events in 7.288574s. 54485 left in queue
2023-07-16 10:55:49,015 processed 1000 events in 23.923132s. 51485 left in queue
2023-07-16 11:01:14,219 processed 1000 events in 12.348842s. 1485 left in queue
we run on branch simple-snapshot-investigation-rebase of my own fork, commit 7cf31473adb, in a komodoenv based on bleeding (different commit compared to above - just a few formal changes to test config files, and rebased onto main of equinor fork)
this gives us ee logs where
we run with the gui, changing to ensemble experiment and hitting go immediately
- we run with komodo bleeding (which was green, btw, awesome), which has ert main at commit 5e3164269.
  - ensemble finished normally 23 minutes before the ee logs finished
- we run on branch simple-snapshot-investigation-rebase of my own fork, commit 7cf31473adb, in a komodoenv based on bleeding (different commit compared to above - just a few formal changes to test config files, and rebased onto main of equinor fork)
  - this gives us ee logs where the ensemble finished normally, in ..., ee logs and app/logs finishing
User on Yammer has reported this too: https://web.yammer.com/main/org/statoil.com/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjM2MzQyMTQyNDY4OTE1MiJ9
User on Slack reported this too: https://equinor.slack.com/archives/CDPJULWR4/p1692083517380579
reopening until we have confirmed working through user tests
Just to summarize here a bit our findings so far since merging and releasing this in komodo - we have several users that have run heavy (many realz) workflows, and they finished successfully (besides some RMS issues, that we suspect are related to lack of memory).
we have one or two users running stuff in azure, and they report failures - cf. the slack thread [INTERNAL LINK].
to just have the relevant observations here as well:
they seem to be running with 100 realz
we have long processing times for just a handful of events in the beginning of iter-0 (does not lead to timeouts):
2023-08-22 09:32:45,036 processed 8 events in 4.450264s. 0 left in queue
2023-08-22 09:32:55,920 processed 4 events in 7.749988s. 0 left in queue
2023-08-22 09:33:08,870 processed 1 events in 10.502106s. 0 left in queue
2023-08-22 09:33:34,052 processed 2 events in 22.767731s. 0 left in queue
2023-08-22 09:34:33,443 processed 9 events in 56.638394s. 0 left in queue
2023-08-22 09:35:16,219 processed 7 events in 39.758926s. 0 left in queue
2023-08-22 09:35:24,024 processed 16 events in 5.047750s. 0 left in queue
further processing is fast, no errors or timeouts
come iter-1, we get long processing again, just one instance:
14:24:29,328 Evaluator is done
14:24:34,334 connecting to new monitor...
14:25:17,050 Started evaluator, joining until shutdown
14:25:17,083 connected
14:25:19,009 processed 101 events in 0.007326s. 0 left in queue
14:25:22,372 no events to be processed in queue
14:25:26,195 no events to be processed in queue
14:25:42,516 processed 2 events in 12.302223s. 0 left in queue
14:28:48,042 connection closed error: sent 1011 (unexpected error) keepalive ping timeout; no close frame received
14:28:48,050 connecting to new monitor...
14:28:48,736 processed 6 events in 184.077428s. 0 left in queue
- finally we get a bunch of connection closed errors, the usual, until the queue fails to send, taking everything down
this last occurrence is rather frustrating - one is tempted to explain the long processing times with slower disk and / or network performance, yet i am not aware of anything disk-intensive that might be happening at the same time as this error occurred (at the beginning of iteration 1). also, after our changes (moving the batching processor into its own thread), we would expect, based on our understanding of the python scheduler, that the ensemble evaluator should not experience trouble even though the dispatcher gets busy. it seems our mental model is still faulty.
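for context, the change referred to above (batching processor in its own thread) follows a pattern roughly like this sketch - not the actual ert implementation, and handle_batch here is a placeholder:

# schematic sketch of draining events in a dedicated worker thread so the
# websocket-serving event loop is never blocked by batch processing
import queue
import threading
import time

event_queue = queue.Queue()


def handle_batch(batch):
    time.sleep(0.01)  # placeholder for snapshot updates etc.


def batcher():
    while True:
        batch = [event_queue.get()]
        # grab whatever else is already waiting, up to a cap
        while len(batch) < 1000:
            try:
                batch.append(event_queue.get_nowait())
            except queue.Empty:
                break
        handle_batch(batch)


threading.Thread(target=batcher, daemon=True).start()

# the websocket handler (running in the asyncio event loop) only does
# event_queue.put_nowait(event), which never blocks on processing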
another potentially important note that i don't want to get lost in my private notes - a user at johan sverdrup told me that they had fewer difficulties running an ensemble experiment with 200 realizations when they reduced the number of parameters, using ../input/distributions rather than gen_kw_init_all.txt, used like this in the config (name included in <GEN_KW_INIT_NAME>):
GEN_KW INIT_FILES ../input/templates/gen_kw.tmpl dummy.txt <GEN_KW_CONFIG_INIT> INIT_FILES:../input/distributions/<CASE>/<GEN_KW_INIT_NAME>%d.txt
i don't have more details written down, if anybody ever wants to investigate this remark (and doesn't know who the user in question is), ping me
Definition of Done is when this issue gets resolved: https://github.com/equinor/scout/issues/903
Describe the bug
A user has reported [INTERNAL LINK] that ert fails in the start of iter-1. Iter-0 completes successfully, and the analysis step seems to run fine. The ensemble setup ran fine with 50 realizations, but 200 did not work, nor did 100. At first it was suspected that they ran out of memory, but that seems not to be the case.
Fails here: https://github.com/equinor/ert/blob/e21998046cc85d6dfbcc4eb0c3fa5c370b151a71/src/ert/_c_wrappers/job_queue/queue.py#L471
Environment
Note! Using komodo 2022.11.02, but the queue interaction should not be that different from main.