equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Rerun LSF tests on ssh failures (error code 255) #8657

Open berland opened 2 months ago

berland commented 2 months ago

The commands for interacting with the LSF cluster are known to go through a shell wrapper that makes an ssh call to some LSF server. If that server is too busy to respond to the ssh login, the command returns with error code 255.

This error code can be detected in the integration tests, and the affected tests can then be retried a limited number of times.

https://github.com/equinor/ert/blob/2d21583bae5f52a367c3ea492b2b76bbf07608cc/tests/integration_tests/scheduler/test_lsf_driver.py#L187-L191

The suggestion is to raise a specific exception on this kind of error, and then use pytest-rerunfailures to wait a few seconds and retry, up to a fixed number of attempts:

https://pypi.org/project/pytest-rerunfailures/#re-run-individual-failures
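A minimal sketch of the "specific exception" part of this suggestion, assuming a hypothetical helper `run_lsf_command` and a hypothetical exception `LsfSshError` (neither name is taken from the ert code base). It only shows how exit code 255 could be turned into its own exception type so that a rerun mechanism can single it out:

```python
import subprocess
import sys
from typing import Sequence

# Exit code discussed in this issue for a failed ssh login.
SSH_FAILURE_EXIT_CODE = 255


class LsfSshError(RuntimeError):
    """Hypothetical exception for commands that exit with code 255."""


def run_lsf_command(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout.

    Raises LsfSshError on exit code 255, and
    subprocess.CalledProcessError on any other non-zero exit code.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode == SSH_FAILURE_EXIT_CODE:
        raise LsfSshError(
            f"{cmd[0]} exited with {SSH_FAILURE_EXIT_CODE}: {result.stderr.strip()}"
        )
    result.check_returncode()
    return result.stdout


if __name__ == "__main__":
    # Simulate the failing wrapper with a Python child process.
    try:
        run_lsf_command([sys.executable, "-c", "import sys; sys.exit(255)"])
    except LsfSshError as err:
        print(f"would be retried: {err}")
```

A test could then be marked with something like `@pytest.mark.flaky(reruns=3, reruns_delay=5)` from pytest-rerunfailures. Recent versions of the plugin also document an `only_rerun` argument that could restrict reruns to failures matching `LsfSshError`; the rerun count and delay here are placeholders, not values agreed on in this issue.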

berland commented 1 month ago

See https://github.com/equinor/ert/pull/8790, it is no (longer?) true that only SSH errors give 255.

jonathan-eq commented 1 month ago

> See #8790, it is no (longer?) true that only SSH errors give 255.

Hmm, you are right! Do we know the error message for flaky ssh? Maybe it is something like connection refused or similar.
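If the message turns out to be recognizable, the check could combine the exit code with a look at stderr. The sketch below is only an illustration: the function name is hypothetical, and the substrings are typical OpenSSH client messages assumed for the example, not messages confirmed from the LSF wrapper.

```python
SSH_FAILURE_EXIT_CODE = 255

# Assumed markers: common OpenSSH client error texts, not verified
# against the actual wrapper output on the cluster.
ASSUMED_SSH_ERROR_MARKERS = (
    "connection refused",
    "connection timed out",
    "connection reset by peer",
)


def looks_like_ssh_failure(returncode: int, stderr: str) -> bool:
    """Return True only for exit code 255 with an ssh-like error message."""
    if returncode != SSH_FAILURE_EXIT_CODE:
        return False
    lowered = stderr.lower()
    return any(marker in lowered for marker in ASSUMED_SSH_ERROR_MARKERS)
```

With this, a 255 caused by something other than ssh would not be classified as an ssh failure, at the cost of having to keep the marker list in sync with what the wrapper actually prints.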

jonathan-eq commented 1 month ago

@berland For the sake of the tests, maybe we should rerun on error code 255 anyway. If the cause is the cluster acting up rather than ssh, it wouldn't hurt to rerun the failing test.

berland commented 1 month ago

Cluster failures in LSF do not currently seem to be a problem in our tests, so maybe hold off on that until it is needed.