mikkonie opened this issue 1 month ago
Example of one of the many identical test failures:
======================================================================
ERROR: test_update_parent_override (taskflowbackend.tests.test_plugins.TestPerformRoleModify.test_update_parent_override)
Test updating overriding parent assignment
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 326, in server_version
return tuple(ast.literal_eval(reported_vsn))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 64, in literal_eval
node_or_string = parse(node_or_string.lstrip(" \t"), mode='eval')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/ast.py", line 50, in parse
return compile(source, filename, mode, flags,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<unknown>", line 0
SyntaxError: invalid syntax
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 332, in __server_version
conn = next(iter(self.pool.active))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 70, in get_connection
conn = self.idle.pop()
^^^^^^^^^^^^^^^
KeyError: 'pop from an empty set'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 470, in setUp
super().setUp()
File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/test_plugins.py", line 42, in setUp
super().setUp()
File "/home/runner/work/sodar-server/sodar-server/taskflowbackend/tests/base.py", line 265, in setUp
self.irods = self.irods_backend.get_session_obj()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 473, in get_session_obj
return self._init_irods()
^^^^^^^^^^^^^^^^^^
File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 115, in _init_irods
raise ex
File "/home/runner/work/sodar-server/sodar-server/irodsbackend/api.py", line 102, in _init_irods
irods.collections.exists(
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 87, in exists
self.get(path)
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/manager/collection_manager.py", line 22, in get
query = self.sess.query(Collection).filter(*filters)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 290, in query
return Query(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/query.py", line 45, in __init__
if self.sess.server_version >= col.min_version:
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 328, in server_version
return self.__server_version()
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/session.py", line 335, in __server_version
conn = self.pool.get_connection()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 17, in method_
ret = method(self,*s,**kw)
^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/pool.py", line 87, in get_connection
conn = Connection(self, self.account)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 77, in __init__
auth_module.login(self)
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/auth/native.py", line 3, in login
conn._login_native()
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 617, in _login_native
self.recv()
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/connection.py", line 152, in recv
exc = get_exception_by_code(msg.int_info, err_msg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/irods/exception.py", line 186, in get_exception_by_code
exc_class = iRODSExceptionMeta.codes[ rounded_code( code ) ]
~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: -167000
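For reference, the final KeyError above just means the installed client has no exception class mapped to that error code. A quick way to check which codes the client actually knows (a sketch, assuming iRODSExceptionMeta and rounded_code can be imported directly, as the traceback implies):

```python
# Check whether the error codes seen in the log are known to the installed
# python-irodsclient; None means the lookup table has no entry, which is
# exactly what produces the KeyError above. (Sketch, not part of the codebase.)
from irods.exception import iRODSExceptionMeta, rounded_code

for code in (-167000, -1213000):
    print(code, iRODSExceptionMeta.codes.get(rounded_code(code)))
```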
I totally forgot to check out the container output in the CI job shutdown phase. That tells us that iRODS is correctly initialized, but we have a bunch of "too many clients" errors from Postgres:
FATAL: sorry, too many clients already
This would suggest something is opening too many connections to Postgres. I've seen something similar when I had bugs in closing python-irodsclient connections and iRODS sessions were left open. However, those bugs have long since been fixed, so this should not be an issue. And if it was, it would show up consistently.
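For the record, the session handling pattern that should prevent such leaks looks roughly like this; a minimal sketch with placeholder connection values, not our actual config:

```python
# Minimal sketch of keeping iRODS sessions from leaking connections: use the
# session as a context manager (or call session.cleanup() in a finally block)
# so its pooled connections are released even when a query raises.
from irods.session import iRODSSession

def collection_exists(path):
    with iRODSSession(host='localhost', port=1247, user='rods',
                      password='rods', zone='tempZone') as session:
        return session.collections.exists(path)
```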
I don't think the Django server would end up in any situation where it would flood postgres with open connections. At least I have never encountered that, even with heavy use in production.
It is also possible that this is the symptom rather than the cause: something else goes wrong and Postgres ends up being flooded with open connections?
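If this happens again, one quick check would be the actual connection count on the Postgres side while the tests run. A rough sketch via Django, which of course only works as long as we can still get a connection ourselves:

```python
# Rough diagnostic sketch: ask Postgres how many client connections it sees.
# Needs a working Django DB connection, so it only helps before the
# "too many clients" limit is actually hit.
from django.db import connection

def postgres_connection_count():
    with connection.cursor() as cursor:
        cursor.execute('SELECT count(*) FROM pg_stat_activity')
        return cursor.fetchone()[0]
```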
Full log of the failed job here: ci_irods_crash.txt
I'll have to look into whether some clues can be derived from this. It would appear that none of the taskflow tests succeed. Perhaps they fail in a way which also leaves Postgres connections open.
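One idea for the taskflow test base class: register the session cleanup with addCleanup() right after creating the session, since unittest still runs registered cleanups when setUp() itself fails, while tearDown() gets skipped. This is only a fragment to drop into the existing base class, assuming get_session_obj() returns an iRODSSession with a cleanup() method:

```python
# Sketch of a more defensive setUp() for the taskflow test base class:
# cleanups registered with addCleanup() still run when a later step of
# setUp() raises, whereas tearDown() is skipped, so a half-initialized test
# can't leave its iRODS session open.
def setUp(self):
    super().setUp()
    self.irods = self.irods_backend.get_session_obj()
    self.addCleanup(self.irods.cleanup)
    # ...rest of the existing setUp()...
```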
Another thing of note is the following error, which I don't recall encountering before: irods.message.Bad_AVU_Field: AVU value ('') is zero-length.
That could be a symptom and not the problem though. I'll have to look at this log file in detail later, currently too busy with other stuff.
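Regarding the Bad_AVU_Field error: the client refuses zero-length AVU values, so an empty value slipping into AVU creation somewhere would explain it. Something like the following guard would at least turn it into a clearer error; set_avu_safe and its arguments are hypothetical, not existing code:

```python
# Hypothetical guard: reject empty AVU values up front instead of letting
# python-irodsclient raise Bad_AVU_Field deep inside the message layer.
from irods.meta import iRODSMeta

def set_avu_safe(obj, name, value, units=None):
    if value is None or str(value) == '':
        raise ValueError('Empty AVU value for attribute "{}"'.format(name))
    meta = iRODSMeta(name, str(value), units) if units else iRODSMeta(name, str(value))
    obj.metadata.add(meta)
```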
I'm looking into the full crash log in detail, writing down some facts and thoughts here.

- The irodsadmin tests succeed.
- The tearDown() method in these tests looks OK at first glance.
- The irodsbackend.tests.test_api tests also succeed.
- The first failing test appears to be irodsbackend.tests.test_api_taskflow.TestIrodsBackendAPITaskflow.test_get_delete_ticket, coming after the irodsadmin tests (which all succeeded). The failure happens in setUp(), so what is in the actual test case doesn't matter. The log shows CRITICAL:irods.connection:Read 0 bytes from socket instead of expected 4 bytes.
- There are also setUpClass errors, haven't seen those before: projectroles.models.Project.DoesNotExist: Project matching query does not exist. This means make_project_taskflow() failed, in this case probably due to not being able to initiate an iRODS session.
- irods.exception.UserDoesNotExist is raised in irodsbackend.tests.test_signals_taskflow.TestCreateIrodsUser.test_create, along with KeyError: -1213000. In conn.recv(), an exception is raised but with a code the python-irodsclient apparently doesn't understand? This also shows up as taskflowbackend.api.TaskflowAPI.FlowSubmitException: Error running flow: -1213000.
- KeyError: -167000 appears as well, and the SyntaxError seems to be raised from the python-irodsclient.
- django.db.utils.OperationalError: connection to server at "127.0.0.1", port 5432 failed: FATAL: sorry, too many clients already also shows up, possibly related to the python-irodsclient failures.
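Since several of these surface when _init_irods() probes the connection with collections.exists(), one mitigation to consider would be retrying the probe with a fresh session before giving up. A rough sketch only; the helper name, retry counts and exception set are my own guesses, not the current implementation:

```python
# Rough sketch of a retrying connection probe: rebuild the session and retry
# a couple of times before re-raising, in case the first connection gets
# dropped mid-handshake. Catching KeyError is ugly, but it matches the raw
# errors the client surfaces in this log.
import time
from irods.exception import NetworkException

def probe_irods(make_session, path, retries=2, delay=1.0):
    for attempt in range(retries + 1):
        session = make_session()  # callable returning a fresh iRODSSession
        try:
            session.collections.exists(path)
            return session
        except (NetworkException, KeyError):
            session.cleanup()
            if attempt == retries:
                raise
            time.sleep(delay)
```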
Since I currently have no way of reproducing this, I'll have to wait for this to happen again and compare the log to this one.
Another possibility would be to try running this in dev until the problem appears and then look at the irods-test logs. However, I haven't yet encountered this in dev. Maybe echo the iRODS log into the CI output?
Possibly related: my latest local test run resulted in a bunch of timeout messages logged while running tests in landingzones.tests.test_views_taskflow.TestZoneMoveView:
ERROR:irods.connection:Unable to send message. Connection to remote host may have closed. Releasing connection from pool.
However, all the tests executed successfully. After running the tests, the expected number of sessions was open in irods-test and there was nothing in the irods.log about any errors.
I should probably set the log level on the test server to debug (trace would most likely be overkill) and see if I can spot anything when this does fail.
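On the client side it might also help to turn up python-irodsclient logging in the test settings, to get more context around messages like the connection errors above; the server log level is a separate iRODS server setting. A minimal sketch using standard logging:

```python
# Minimal sketch: raise python-irodsclient logging to DEBUG in the test
# settings. The module loggers (e.g. irods.connection seen in the output
# above) are children of the top-level 'irods' logger, so this covers them.
import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger('irods').setLevel(logging.DEBUG)
```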
I'm setting this as ongoing for now, as it seems to be very rare (observed only twice so far) and hasn't happened in a while. If similar server freeze-ups are observed in staging or production, this will of course get re-prioritized higher.
Earlier today I upgraded to python-irodsclient==2.2.0 (#2023), mainly because the release was supposed to fix the redirect problem I was previously experiencing. The initial test run worked fine. However, a subsequent commit which only contained documentation changes had a large number (all?) of iRODS tests failing with connection errors. Example in comments below.

My guesses:

- The SyntaxError is probably not intentional, at least.

I'll try to gather more information and possibly contact iRODS support.