Realization is not cancelled properly after MAX_RUNTIME

equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).

https://ert.readthedocs.io/en/latest/

GNU General Public License v3.0

103 stars 107 forks source link

Realization is not cancelled properly after MAX_RUNTIME #8359

Closed xjules closed 1 month ago

xjules commented 3 months ago

Flaky test it seems: FAILED tests/integration_tests/status/test_tracking_integration.py::test_tracking[ee_poly_experiment_cancelled_by_max_runtime]

It appears that after realization reaches MAX_RUNTIME the cancellation won't happen and the state would be PENDING instead of FAILED.

 assert match.value == expected, (
                f"{msg_start}{str(match.full_path)} value "
                f"({match.value}) is not equal to ({expected})"
            )
E           AssertionError: Snapshot 0 did not match:
E             reals.1.forward_models.0.status value (Pending) is not equal to (Failed)
E           assert 'Pending' == 'Failed'

Please see for more details:

https://github.com/equinor/ert/actions/runs/10143243141/job/28044195831

xjules commented 3 months ago

Reproduced when Scheduler._publisher() is slow - injecting some asyncio.sleep(5) before await conn.send(event) should make do. This naturally will prevent all the events to be sent when MAX_RUNTIME will cancel all realizations and thus the realization.STATUS will be either RUNNING or PENDING.

Update: it also looks like to _publisher stopped publishing, so suggestion is to wait if this error appears again.

eivindjahren commented 3 months ago

Another failure here: https://github.com/equinor/ert/actions/runs/10193785679/job/28199058436?pr=8362

xjules commented 3 months ago

Further updates: dispatcher_task that runs _batch_events_into_buffer is unable the correctly append the last batch of events due the task being cancelled after the realization is hit by max_runtime. Reproduction of the issue is not so straightforward, but injecting await asyncio.sleep(xx) into _batch_events_into_buffer in the beginning should do.

Couldn't reproduce it still. It seems that the source for the problem is the Scheduler._publisher

xjules commented 2 months ago

Another failure occurred: https://github.com/equinor/ert/actions/runs/10454681280/job/28947875443?pr=8424

xjules commented 2 months ago

Let's see if this PR might help: https://github.com/equinor/ert/pull/8424

xjules commented 2 months ago

I haven't seen this error since, so closing this one.

xjules commented 2 months ago

Another entry https://github.com/equinor/ert/actions/runs/10683277953/job/29610991386?pr=8629

larsevj commented 2 months ago

Getting this consistently on all of my PRs now, passes on second run and only happens for python3.8.

eivindjahren commented 2 months ago

Another failure here: https://github.com/equinor/ert/actions/runs/10719695610/job/29724320771?pr=8648 , but on 3.12

xjules commented 2 months ago

And here as well: https://github.com/equinor/ert/actions/runs/10719771718/job/29724565181?pr=8649

xjules commented 1 month ago

Seems like this issue still persists: https://github.com/equinor/ert/actions/runs/10738432908/job/29782008935

xjules commented 1 month ago

Copying a message from @eivindjahren, which includes suggestion to move the test into cli section.

I think this is a test concurrency problem (tests affecting each other when running at the same time). I also got a pretty persistent bug running test_tracking_missing_ecl after reorganizing the tests so that integration tests and unit tests ran at the same time. I had to fix it by making it a cli test instead:

sondreso commented 1 month ago

This works consistently now, though it's far from ideal that just moving the test around fixed it (there is some underlaying timing dependency here). However, since we cannot reproduce this, we will close until it re-surfaces.