Open berland opened 2 months ago
See https://github.com/equinor/ert/pull/8790, it is no (longer?) true that only SSH errors give 255.
Hmm, you are right!
Do we know the error message for flaky ssh? Maybe it is something like connection refused
or similar.
@berland For the sake of the tests, maybe we should rerun on error code 255 anyways. If it is the cluster acting up and not due to ssh, it wouldn't hurt to rerun the failing test.
Cluster failures in LSF do not currently seem to be a problem in our tests, so maybe hold off on that until it is needed.
It is known that the commands for interacting with the LSF cluster go through a shell wrapper that makes an ssh call to some LSF server. If that server is too busy to respond to the ssh login, the command will return with error code 255.
This error code can be detected in the integration tests, and the affected tests can then be retried a limited number of times.
https://github.com/equinor/ert/blob/2d21583bae5f52a367c3ea492b2b76bbf07608cc/tests/integration_tests/scheduler/test_lsf_driver.py#L187-L191
The suggestion is to raise a specific exception on this kind of error, and then use pytest-rerunfailures to wait a few seconds and retry up to a certain number of times:
https://pypi.org/project/pytest-rerunfailures/#re-run-individual-failures
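A minimal sketch of the idea, not the actual ert code: `FlakySshError` and `run_cluster_command` are hypothetical names made up for illustration. Note that this treats every exit code 255 as possibly transient, which, as pointed out in the comments, may also catch failures that are not caused by ssh.

```python
import subprocess
import sys

# Exit code ssh returns when the connection itself fails (per the issue text).
SSH_FAILURE_EXIT_CODE = 255


class FlakySshError(RuntimeError):
    """Hypothetical exception: a cluster command exited with code 255."""


def run_cluster_command(cmd: list[str]) -> str:
    """Run a command and raise FlakySshError if it exits with code 255.

    Any other non-zero exit code raises subprocess.CalledProcessError,
    so only the possibly transient failure gets its own exception type.
    """
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode == SSH_FAILURE_EXIT_CODE:
        raise FlakySshError(
            f"{cmd[0]} exited with {SSH_FAILURE_EXIT_CODE}: {result.stderr.strip()}"
        )
    result.check_returncode()
    return result.stdout


# In a test module, the marker from pytest-rerunfailures could then be
# applied roughly like this (the rerun count and delay are arbitrary):
#
#   @pytest.mark.flaky(reruns=5, reruns_delay=3, only_rerun=["FlakySshError"])
#   def test_submit_to_lsf(): ...
```

With `only_rerun`, other failures would still fail the test on the first attempt, so real regressions are not hidden behind retries.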