galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.39k stars 999 forks source link

Specific step in workflow fails due to burst in DB memory usage #15244

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

Describe the bug

One of the last steps of our workflow for some reason triggers some very disruptive issues with the database.

On the workflow handler I see:

galaxy.workflow.scheduling_manager DEBUG 2022-12-20 02:08:14,256 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Workflow invocation [1] scheduled
galaxy.workflow.scheduling_manager DEBUG 2022-12-20 02:08:15,264 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Attempting to schedule workflow invocation [(1,)]
galaxy.managers.collections DEBUG 2022-12-20 02:08:16,112 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Created collection with 9 elements
galaxy.tools.actions.model_operations INFO 2022-12-20 02:08:16,124 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Calling produce_outputs, tool is <galaxy.tools.FilterFailedDatasetsTool object at 0x7f9839a77040>
galaxy.tools.execute DEBUG 2022-12-20 02:08:16,124 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Tool __FILTER_FAILED_DATASETS__ created job None (350.924 ms)
galaxy.web_stack.handlers INFO 2022-12-20 02:08:20,837 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] (Job[id=98,tool_id=__FILTER_FAILED_DATASETS__]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.tools.execute DEBUG 2022-12-20 02:08:20,843 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Executed 1 job(s) for tool __FILTER_FAILED_DATASETS__ request (5103.050 ms)
galaxy.workflow.run DEBUG 2022-12-20 02:08:20,853 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Workflow step 108 of invocation 1 invoked (5114.409 ms)
galaxy.tools.actions INFO 2022-12-20 02:08:21,661 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Handled output named output_h5ad for tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/anndata_ops/anndata_ops/1.8.1+2+galaxy0 (0.820 ms)
galaxy.tools.actions INFO 2022-12-20 02:08:21,667 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Added output datasets to history (5.261 ms)
galaxy.tools.actions INFO 2022-12-20 02:08:21,676 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Setup for job Job[unflushed,tool_id=toolshed.g2.bx.psu.edu/repos/ebi-gxa/anndata_ops/anndata_ops/1.8.1+2+galaxy0] complete, ready to be enqueued (9.483 ms)
galaxy.tools.execute DEBUG 2022-12-20 02:08:21,677 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/anndata_ops/anndata_ops/1.8.1+2+galaxy0 created job None (789.909 ms)
galaxy.workflow.run ERROR 2022-12-20 02:08:23,877 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_executemany(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 953, in do_executemany
    context._psycopg2_fetched_rows = xtras.execute_values(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/extras.py", line 1270, in execute_values
    cur.execute(b''.join(parts))
psycopg2.OperationalError: SSL SYSCALL error: EOF detected

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 163, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 232, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(
  File "/galaxy/server/lib/galaxy/workflow/modules.py", line 1976, in execute
    execution_tracker = execute(
  File "/galaxy/server/lib/galaxy/tools/execute.py", line 166, in execute
    trans.sa_session.flush()
  File "<string>", line 2, in flush
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
    self._flush(objects)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3522, in _flush
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
    flush_context.execute()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 453, in execute
    n.execute_aggregate(self, set_)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 555, in execute_aggregate
    self.execute(uow)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    _emit_insert_statements(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1156, in _emit_insert_statements
    c = connection._execute_20(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1799, in _execute_context
    self.dialect.do_executemany(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 953, in do_executemany
    context._psycopg2_fetched_rows = xtras.execute_values(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/extras.py", line 1270, in execute_values
    cur.execute(b''.join(parts))
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected

[SQL: INSERT INTO history_dataset_association_history (history_dataset_association_id, update_time, version, name, extension, metadata, extended_metadata_id) VALUES (%(history_dataset_association_id)s, %(update_time)s, %(version)s, %(name)s, %(extension)s, %(metadata)s, %(extended_metadata_id)s) RETURNING history_dataset_association_history.id]
[parameters: ({'history_dataset_association_id': 225, 'update_time': datetime.datetime(2022, 12, 20, 1, 44, 48, 189043), 'version': 1, 'name': 'tsne_perplexity_5.h5ad', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97e8207d80>, 'extended_metadata_id': None}, {'history_dataset_association_id': 219, 'update_time': datetime.datetime(2022, 12, 20, 1, 44, 48, 189029), 'version': 1, 'name': 'tsne_perplexity_15.h5ad', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97ea06d620>, 'extended_metadata_id': None}, {'history_dataset_association_id': 239, 'update_time': datetime.datetime(2022, 12, 20, 2, 8, 16, 908002), 'version': 1, 'name': 'Scanpy FindMarkers on data 148 and data 184: Marker genes AnnData', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97e8280750>, 'extended_metadata_id': None}, {'history_dataset_association_id': 235, 'update_time': datetime.datetime(2022, 12, 20, 1, 44, 48, 189072), 'version': 1, 'name': 'Scanpy RunUMAP on data 124: UMAP object AnnData', 'extension': 'h5', 'metadata': <psycopg2.extensions.Binary object at 0x7f97e8282c70>, 'extended_metadata_id': None}, {'history_dataset_association_id': 236, 'update_time': datetime.datetime(2022, 12, 20, 2, 8, 16, 907995), 'version': 1, 'name': 'Scanpy FindMarkers on data 145 and data 181: Marker genes AnnData', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97ca6a0d50>, 'extended_metadata_id': None}, {'history_dataset_association_id': 242, 'update_time': datetime.datetime(2022, 12, 20, 2, 8, 16, 908008), 'version': 1, 'name': 'Scanpy FindMarkers on data 151 and data 187: Marker genes AnnData', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97ea06db30>, 'extended_metadata_id': None}, {'history_dataset_association_id': 27, 'update_time': datetime.datetime(2022, 12, 20, 0, 30, 27, 895300), 'version': 3, 'name': 'Scanpy FilterGenes on data 13: Filtered cells AnnData', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97e8204990>, 'extended_metadata_id': None}, {'history_dataset_association_id': 237, 'update_time': datetime.datetime(2022, 12, 20, 2, 8, 16, 907998), 'version': 1, 'name': 'Scanpy FindMarkers on data 146 and data 182: Marker genes AnnData', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97e9e1e700>, 'extended_metadata_id': None}  ... displaying 10 of 32 total bound parameter sets ...  {'history_dataset_association_id': 217, 'update_time': datetime.datetime(2022, 12, 20, 1, 44, 48, 189023), 'version': 1, 'name': 'tsne_perplexity_1.h5ad', 'extension': 'h5ad', 'metadata': <psycopg2.extensions.Binary object at 0x7f97ca6a36f0>, 'extended_metadata_id': None}, {'history_dataset_association_id': 227, 'update_time': datetime.datetime(2022, 12, 20, 1, 44, 48, 189049), 'version': 1, 'name': 'Scanpy RunUMAP on data 116: UMAP object AnnData', 'extension': 'h5', 'metadata': <psycopg2.extensions.Binary object at 0x7f97ca6a3840>, 'extended_metadata_id': None})]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
    return fn()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 480, in checkout
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 478, in checkout
    dbapi_connection = rec.get_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 636, in get_connection
    self.__connect()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  the database system is in recovery mode

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 42, in __invoke
    outputs = invoker.invoke()
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 176, in invoke
    self.workflow_invocation.workflow.log_str(),
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 481, in __get__
    return self.impl.get(state, dict_)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 941, in get
    value = self._fire_loader_callables(state, key, passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 977, in _fire_loader_callables
    return self.callable_(state, passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/strategies.py", line 871, in _load_for_state
    primary_key_identity = self._get_ident_for_use_get(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/strategies.py", line 930, in _get_ident_for_use_get
    return [
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/strategies.py", line 931, in <listcomp>
    get_attr(state, dict_, self._equated_columns[pk], passive=passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/mapper.py", line 2920, in _get_state_attr_by_column
    return state.manager[prop.key].impl.get(state, dict_, passive=passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 941, in get
    value = self._fire_loader_callables(state, key, passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/attributes.py", line 972, in _fire_loader_callables
    return state._load_expired(state, passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 710, in _load_expired
    self.manager.expired_attribute_loader(self, toload, passive)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 1451, in load_scalar_attributes
    result = load_on_ident(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 407, in load_on_ident
    return load_on_pk_identity(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
    session.execute(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1706, in execute
    conn = self._connection_for_bind(bind, close_with_result=True)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1558, in _connection_for_bind
    conn = engine.connect(**kw)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3234, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3313, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3283, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2117, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
    return fn()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 480, in checkout
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 478, in checkout
    dbapi_connection = rec.get_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 636, in get_connection
    self.__connect()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  the database system is in recovery mode

(Background on this error at: https://sqlalche.me/e/14/e3q8)
galaxy.workflow.scheduling_manager DEBUG 2022-12-20 02:08:24,821 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Workflow invocation [1] scheduled

at the end it says again it is scheduling, but it doesn't or workflow comes back as failed. In the meantime, the job handler shows:

galaxy.tool_util.deps.container_resolvers.mulled DEBUG 2022-12-20 02:06:10,282 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] Image name for tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_find_markers/scanpy_find_markers/1.8.1+galaxy0: scanpy-scripts:1.1.0
galaxy.tool_util.deps.containers INFO 2022-12-20 02:06:10,304 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] Checking with container resolver [MulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/scanpy-scripts:1.1.0--pyhdfd78af_1,type=docker]]
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:06:10,589 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] PP Job is running..
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:06:34,552 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] PP Job is failed and not deleted, looking at failure
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:06:34,553 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] PP Trying with error file in _handle_job_failure
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:06:34,565 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] PP OOM reached!
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:06:34,623 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (95/gxy-galaxy-dev2-89954) Terminated at user's request
galaxy.jobs.runners.kubernetes DEBUG 2022-12-20 02:08:07,975 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] PP Job succeeded
galaxy.jobs.runners DEBUG 2022-12-20 02:08:08,231 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] executing external set_meta script for job 94: GALAXY_LIB="/galaxy/server/lib"; if [ "$GALAXY_LIB" != "None" ]; then if [ -n "$PYTHONPATH" ]; then PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"; else PYTHONPATH="$GALAXY_LIB"; fi; export PYTHONPATH; fi; GALAXY_VIRTUAL_ENV="None"; if [ "$GALAXY_VIRTUAL_ENV" != "None" -a -z "$VIRTUAL_ENV" -a -f "$GALAXY_VIRTUAL_ENV/bin/activate" ]; then . "$GALAXY_VIRTUAL_ENV/bin/activate"; fi; python "metadata/set.py"
galaxy.jobs.runners DEBUG 2022-12-20 02:08:15,387 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] execution of external set_meta for job 94 finished
galaxy.model.metadata DEBUG 2022-12-20 02:08:15,446 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 204
galaxy.model.metadata DEBUG 2022-12-20 02:08:15,458 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] loading metadata from file for: HistoryDatasetAssociation 205
galaxy.jobs INFO 2022-12-20 02:08:15,733 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] Collecting metrics for Job 94 in /galaxy/server/database/jobs_directory/000/94
galaxy.jobs DEBUG 2022-12-20 02:08:15,883 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] job_wrapper.finish for job 94 executed (464.900 ms)
galaxy.jobs.handler ERROR 2022-12-20 02:08:23,741 [pN:job_handler_0,p:8,tN:JobHandlerStopQueue.monitor_thread] Exception in monitor_step
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.OperationalError: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/handler.py", line 1097, in monitor
    self.monitor_step()
  File "/galaxy/server/lib/galaxy/jobs/handler.py", line 1139, in monitor_step
    .all()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2768, in all
    return self._iter().all()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2903, in _iter
    result = self.session.execute(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1712, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

[SQL: SELECT job.id AS job_id, job.create_time AS job_create_time, job.update_time AS job_update_time, job.history_id AS job_history_id, job.library_folder_id AS job_library_folder_id, job.tool_id AS job_tool_id, job.tool_version AS job_tool_version, job.galaxy_version AS job_galaxy_version, job.dynamic_tool_id AS job_dynamic_tool_id, job.state AS job_state, job.info AS job_info, job.copied_from_job_id AS job_copied_from_job_id, job.command_line AS job_command_line, job.dependencies AS job_dependencies, job.job_messages AS job_job_messages, job.param_filename AS job_param_filename, job.runner_name AS job_runner_name_1, job.job_stdout AS job_job_stdout, job.job_stderr AS job_job_stderr, job.tool_stdout AS job_tool_stdout, job.tool_stderr AS job_tool_stderr, job.exit_code AS job_exit_code, job.traceback AS job_traceback, job.session_id AS job_session_id, job.user_id AS job_user_id, job.job_runner_name AS job_job_runner_name, job.job_runner_external_id AS job_job_runner_external_id, job.destination_id AS job_destination_id, job.destination_params AS job_destination_params, job.object_store_id AS job_object_store_id, job.imported AS job_imported, job.params AS job_params, job.handler AS job_handler, EXISTS (SELECT history_dataset_collection_association.id
FROM history_dataset_collection_association, job_to_output_dataset_collection
WHERE job.id = job_to_output_dataset_collection.job_id AND history_dataset_collection_association.id = job_to_output_dataset_collection.dataset_collection_id AND history_dataset_collection_association.deleted = true) AS anon_1, EXISTS (SELECT history_dataset_association.metadata, history_dataset_association.id, history_dataset_association.history_id, history_dataset_association.dataset_id, history_dataset_association.create_time, history_dataset_association.update_time, history_dataset_association.state, history_dataset_association.copied_from_history_dataset_association_id, history_dataset_association.copied_from_library_dataset_dataset_association_id, history_dataset_association.name, history_dataset_association.info, history_dataset_association.blurb, history_dataset_association.peek, history_dataset_association.tool_version, history_dataset_association.extension, history_dataset_association.metadata_deferred, history_dataset_association.parent_id, history_dataset_association.designation, history_dataset_association.deleted, history_dataset_association.visible, history_dataset_association.extended_metadata_id, history_dataset_association.version, history_dataset_association.hid, history_dataset_association.purged, history_dataset_association.validated_state, history_dataset_association.validated_state_message, history_dataset_association.hidden_beneath_collection_instance_id
FROM history_dataset_association, job_to_output_dataset
WHERE job.id = job_to_output_dataset.job_id AND history_dataset_association.id = job_to_output_dataset.dataset_id AND history_dataset_association.deleted = true) AS anon_2
FROM job
WHERE job.state IN (%(state_1_1)s, %(state_1_2)s, %(state_1_3)s) AND job.handler = %(handler_1)s]
[parameters: {'handler_1': 'job_handler_0', 'state_1_1': <states.DELETED_NEW: 'deleted_new'>, 'state_1_2': <states.DELETING: 'deleting'>, 'state_1_3': <states.STOPPING: 'stop'>}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
galaxy.jobs.handler ERROR 2022-12-20 02:08:24,091 [pN:job_handler_0,p:8,tN:JobHandlerQueue.monitor_thread] Exception in monitor_step
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
    return fn()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 480, in checkout
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 478, in checkout
    dbapi_connection = rec.get_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 636, in get_connection
    self.__connect()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  the database system is in recovery mode

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/handler.py", line 369, in __monitor
    self.__monitor_step()
  File "/galaxy/server/lib/galaxy/jobs/handler.py", line 385, in __monitor_step
    self.job_grabber.grab_unhandled_items()
  File "/galaxy/server/lib/galaxy/jobs/handler.py", line 177, in grab_unhandled_items
    conn = self.sa_session.connection(execution_options=self._grab_conn_opts)
  File "<string>", line 2, in connection
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1542, in connection
    return self._connection_for_bind(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1558, in _connection_for_bind
    conn = engine.connect(**kw)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3234, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3313, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3283, in _wrap_pool_connect
    Connection._handle_dbapi_exception_noconnection(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2117, in _handle_dbapi_exception_noconnection
    util.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3280, in _wrap_pool_connect
    return fn()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 310, in connect
    return _ConnectionFairy._checkout(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 868, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 480, in checkout
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 478, in checkout
    dbapi_connection = rec.get_connection()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 636, in get_connection
    self.__connect()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 665, in __connect
    with util.safe_reraise():
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 590, in connect
    return dialect.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 597, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  the database system is in recovery mode

(Background on this error at: https://sqlalche.me/e/14/e3q8)
galaxy.util.task DEBUG 2022-12-20 03:05:51,320 [pN:job_handler_0,p:8,tN:HistoryAuditTablePruneTask] Executed periodic task HistoryAuditTablePruneTask (8.007 ms)

Last week I though that this was due to the database disk being nearly full, however, this is happening with a nearly empty disk now:

root@galaxy-galaxy-dev2-postgres-0:/home/postgres/pgdata# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdd         20G  484M   20G   3% /home/postgres/pgdata

Based on the time, it could be that the maintenance scripts where running at that time as well (around 2:05 AM), although this seems more like a coincidence as last week it also failed at the same place at a completely different time of the day.

The database container shows the following on its log around that time:

2022-12-20 02:04:43.971 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:04:51,420 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:01,400 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:11,386 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:14.000 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:05:21,415 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:31,445 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:41,396 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:05:44.027 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:05:51,440 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:01,412 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:11,417 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:14.035 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:06:21,417 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:31,422 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:41,405 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:06:44.051 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:06:51,405 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:01,403 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:11,409 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:14.070 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:07:21,386 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:31,344 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:41,411 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:07:44.071 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:07:51,414 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:01,361 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:11,413 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:14.075 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:08:21,404 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:31,690 INFO: Lock owner: galaxy-galaxy-dev2-postgres-0; I am galaxy-galaxy-dev2-postgres-0
2022-12-20 02:08:32,035 INFO: establishing a new patroni connection to the postgres cluster
2022-12-20 02:08:32,143 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:41,423 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:08:44.108 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:08:51,405 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:01,421 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:11,416 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:14.128 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:09:21,348 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:31,420 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:41,415 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock
2022-12-20 02:09:44.151 34 LOG {ticks: 0, maint: 0, retry: 0}
2022-12-20 02:09:51,418 INFO: no action. I am (galaxy-galaxy-dev2-postgres-0), the leader with the lock

At 2:08 it shows like there was a bump in the connection. There doesn't seem to be any abnormality due to the maintenance job at 02:05 or so. There is no restart of the pod at that time for postgres, but you can see a peak in memory usage and CPU:

image

however it says it only reaches 70% of memory utilisation (and k8s shows no signs of OOMs for that container). Perhaps a failsafe within the pod for too much memory being acquired?

Galaxy Version and/or server at which you observed the bug Galaxy Version: 22.05 Commit: 3b068eecb9f9573a90b51f3eeb1bc06d1d6eb88d

To Reproduce Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

The workflow should finish correctly or fail, but probably not due to the above database error.

Screenshots

This is what can be seen from the UI for the workflow invocation:

image

Additional context

This is running using the Galaxy Helm chart, which provisions postgres through an operator (perhaps this only happens there? as I haven't seen this issue in other instances).

pcm32 commented 1 year ago

Reproducing the same tool call with the same inputs, I see again a smaller peak (to 55% of memory) for postgres, and the same errors on the logs regarding the database being in recovery mode.

pcm32 commented 1 year ago

Changing the postgres Statefulset to use 1.5 Gi of RAM instead of the original 500Mi makes this work. I find mysterious that

Apologies, happy to move this issue to Galaxy Helm as it is a particularity of that setup.

pcm32 commented 1 year ago

I think that Galaxy is putting an awful load of data on the database for some complex tool executions. Since what I posted above, I have re-run the same tool (which takes 3 different collections, 1 of them in two separate inputs) and the size usage of the database has grown considerably (and I'm not running anything else):

root@galaxy-galaxy-dev2-postgres-0:/home/postgres/pgdata# df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb         20G  1.1G   19G   6% /home/postgres/pgdata

so it has doubled with only 6 or 7 more executions of the same tool (due to resubmissions).

bgruening commented 1 year ago

Can you share this tool? Can you maybe narrow down the query that is causing this behavior? Maybe by turning on the slow query logging in galaxy.yml?

pcm32 commented 1 year ago

I think that the table receiving the increase is history_dataset_association_history. After some 10 or so executions of the same workflow, db disk space looks like this:

Filesystem      Size  Used Avail Use% Mounted on
/dev/vdb         20G  2.0G   18G  10% /home/postgres/pgdata

tool generating this is https://humancellatlas.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/ebi-gxa/anndata_ops/anndata_ops/1.8.1+3+galaxy0 in a usage where you activate entrypoints for various collections of AnnData files:

image

where in this case inputs 2 to 6 are collections (usually of around 10 to 20 datasets). I will try the slow query logging as suggested.