bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Sporadic iRODS errors in CI after iRODS 4.3 env updates #2029

Open mikkonie opened 1 month ago

mikkonie commented 1 month ago

Earlier today I upgraded to python-irodsclient==2.2.0 (#2023), mainly because the release was supposed to fix the redirect problem I had previously been encountering.

The initial test run worked fine. However, a subsequent commit, which only contained documentation changes, had a large number of iRODS tests (possibly all of them) failing with connection errors. An example is in the comments below.

My guesses:

I'll try to gather more information and possibly contact iRODS support.

mikkonie commented 1 month ago

Example of one of the many identical test failures:

======================================================================
ERROR: test_update_parent_override (taskflowbackend.tests.test_plugins.TestPerformRoleModify.test_update_parent_override)
Test updating overriding parent assignment
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 326, in server_version
    return tuple(ast.literal_eval(reported_vsn))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 64, in literal_eval
    node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<unknown>", line 0

SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 332, in __server_version
    conn = next(iter(self.pool.active))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 70, in get_connection
    conn = self.idle.pop()
           ^^^^^^^^^^^^^^^
KeyError: 'pop from an empty set'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 470, in setUp
    super().setUp()
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 42, in setUp
    super().setUp()
  File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/base.py", line 265, in setUp
    self.irods = self.irods_backend.get_session_obj()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 473, in get_session_obj
    return self._init_irods()
           ^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 115, in _init_irods
    raise ex
  File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 102, in _init_irods
    irods.collections.exists(
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 87, in exists
    self.get(path)
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 22, in get
    query = self.sess.query(Collection).filter(*filters)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 290, in query
    return Query(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/query.py", line 45, in __init__
    if self.sess.server_version >= col.min_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 328, in server_version
    return self.__server_version()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 335, in __server_version
    conn = self.pool.get_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 17, in method_
    ret = method(self,*s,**kw)
          ^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 87, in get_connection
    conn = Connection(self, self.account)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 77, in __init__
    auth_module.login(self)
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/auth/native.py", line 3, in login
    conn._login_native()
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 617, in _login_native
    self.recv()
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 152, in recv
    exc = get_exception_by_code(msg.int_info, err_msg)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/exception.py", line 186, in get_exception_by_code
    exc_class = iRODSExceptionMeta.codes[ rounded_code( code ) ]
                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: -167000
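
The final KeyError means the server returned an error code the client has no exception class for. For reference, here is a minimal diagnostic sketch (not part of the test suite) that checks whether python-irodsclient knows the code seen above; iRODSExceptionMeta and rounded_code are the names appearing in the traceback, and a None result would simply mean the client has no mapping for that server error code.

# Diagnostic sketch: look up the iRODS error code from the traceback in
# python-irodsclient's exception registry. A missing entry is what makes
# get_exception_by_code() raise the KeyError shown above.
from irods.exception import iRODSExceptionMeta, rounded_code

code = -167000  # value from the KeyError above
exc_class = iRODSExceptionMeta.codes.get(rounded_code(code))
print(code, exc_class)  # None: no exception class registered for this code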

mikkonie commented 1 month ago

I totally forgot to check out the container output in the CI job shutdown phase. That tells us that iRODS is correctly initialized, but we have a bunch of "too many clients" errors from Postgres:

FATAL:  sorry, too many clients already

This would suggest that something is trying to open too many connections to Postgres. I've seen something similar when I had bugs in closing python-irodsclient connections and iRODS sessions were left open. However, those bugs have long since been fixed, so this should not be an issue. And if it were, it would show up consistently.

I don't think the Django server would end up in any situation where it would flood postgres with open connections. At least I have never encountered that, even with heavy use in production.

It is also possible that this is a symptom rather than the cause: something else goes wrong and Postgres ends up being flooded with open connections?
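
To narrow this down, one option (just a sketch using Django's default database connection; pg_stat_activity is a standard PostgreSQL view) would be to dump the open connection counts while the tests run:

# Sketch: count open PostgreSQL connections from within the Django test
# environment, grouped by connection state. Assumes the default database alias.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute('SELECT state, count(*) FROM pg_stat_activity GROUP BY state')
    for state, count in cursor.fetchall():
        print(state, count)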

mikkonie commented 1 month ago

Full log of the failed job here: ci_irods_crash.txt

I'll have to look into whether some clues can be derived from this. It would appear that none of the taskflow tests succeed. Perhaps they fail in a way that also leaves Postgres connections open.

Another thing of note is the following error, which I don't recall encountering before: irods.message.Bad_AVU_Field: AVU value ('') is zero-length.

That could be a symptom rather than the cause, though. I'll have to look at the log file in detail later; I'm currently too busy with other things.
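
For context, Bad_AVU_Field appears to be raised client-side when an AVU field is empty, before anything reaches the server. A minimal sketch of how it can be triggered (the connection parameters and path below are placeholders, not our actual setup):

# Sketch: python-irodsclient rejects zero-length AVU values up front.
# Host, credentials and path are illustrative placeholders only.
from irods.session import iRODSSession

with iRODSSession(host='localhost', port=1247, user='rods',
                  password='rods', zone='tempZone') as session:
    coll = session.collections.get('/tempZone/home/rods')
    # Passing an empty value should raise irods.message.Bad_AVU_Field
    coll.metadata.add('example_attr', '')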

mikkonie commented 1 month ago

I'm looking into the full crash log in detail, writing down some facts and thoughts here.

Facts and Observations

Error Types Raised

Conclusions

Since I currently have no way of reproducing this, I'll have to wait for this to happen again and compare the log to this one.

Another possibility would be to try running this in dev until the problem appears and then look at the irods-test logs. However, I haven't yet encountered this in dev. Maybe echo the iRODS log into the CI output?

mikkonie commented 3 days ago

Possibly related: my latest local test run resulted in a bunch of timeout messages being logged while running tests in landingzones.tests.test_views_taskflow.TestZoneMoveView:

ERROR:irods.connection:Unable to send message. Connection to remote host may have closed. Releasing connection from pool.

However, all the tests executed successfully. After the run, the expected number of sessions was open in irods-test and there was nothing in irods.log about any errors.

I should probably set the log level on the test server to debug (trace would most likely be overkill) and see if I can spot anything when this does fail.
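
On the client side, a complementary option (just a sketch, not something currently in our settings) would be to turn up the python-irodsclient loggers so connection errors come with more context in the test output. Only irods.connection is confirmed by the message above; the other logger names are assumptions based on the module paths in the tracebacks.

# Sketch: enable debug-level output for the python-irodsclient loggers.
import logging

logging.basicConfig(level=logging.INFO)
for name in ('irods.connection', 'irods.pool', 'irods.session'):
    logging.getLogger(name).setLevel(logging.DEBUG)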

mikkonie commented 20 hours ago

I'm setting this as ongoing for now, as it seems to be very rare (observed only twice so far) and it hasn't happened in a while.

If similar server freeze-ups are observed in staging or production, this will of course be re-prioritized higher.