expaso / hassos-addon-timescaledb

A HomeAssistant add-on containing PostgreSQL, psql client and TimeScaleDb
Apache License 2.0

LTSS makes TimescaleDB crash after upgrade to Home Assistant 2024.8.1 #46

Closed: balk77 closed this issue 1 week ago

balk77 commented 1 month ago

I just upgraded to Home Assistant 2024.8.1 (skipped .0) from 2024.7.2. Upon load of the LTSS integration, TimescaleDB restarts, which makes the HA recorder lose its connection. I am not sure where this issue belongs, as it relates to three products.

TimeScaleDB log:

[15:08:06] INFO: Starting PostgreSQL..
[15:08:06] INFO: done
s6-rc: info: service legacy-services successfully started
2024-08-11 13:08:06.248 UTC [242] LOG:  starting PostgreSQL 16.2 on aarch64-alpine-linux-musl, compiled by gcc (Alpine 13.2.1_git20231014) 13.2.1 20231014, 64-bit
2024-08-11 13:08:06.248 UTC [242] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-08-11 13:08:06.249 UTC [242] LOG:  listening on IPv6 address "::", port 5432
2024-08-11 13:08:06.253 UTC [242] LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
2024-08-11 13:08:06.262 UTC [251] LOG:  database system was shut down at 2024-08-11 13:08:00 UTC
2024-08-11 13:08:06.282 UTC [242] LOG:  database system is ready to accept connections
2024-08-11 13:08:06.284 UTC [254] LOG:  TimescaleDB background worker launcher connected to shared catalogs
[15:08:07] INFO: Starting PgAgent..
[15:08:07] INFO: Create database if not exist: 'homeassistant'
[15:08:07] INFO: Create database if not exist: 'ha_ltss'
[15:08:07] INFO: done
[15:08:07] INFO: done
[15:08:07] NOTICE: TimescaleDb is running!
2024-08-11 13:10:55.164 UTC [242] LOG:  server process (PID 385) was terminated by signal 11: Segmentation fault
2024-08-11 13:10:55.164 UTC [242] DETAIL:  Failed process was running: ALTER TABLE states ALTER event_id TYPE BIGINT, ALTER old_state_id TYPE BIGINT, ALTER attributes_id TYPE BIGINT, ALTER metadata_id TYPE BIGINT
2024-08-11 13:10:55.164 UTC [242] LOG:  terminating any other active server processes
2024-08-11 13:10:55.182 UTC [242] LOG:  all server processes terminated; reinitializing
2024-08-11 13:10:55.223 UTC [439] LOG:  database system was interrupted; last known up at 2024-08-11 13:08:06 UTC
2024-08-11 13:10:56.204 UTC [439] LOG:  database system was not properly shut down; automatic recovery in progress
2024-08-11 13:10:56.218 UTC [439] LOG:  redo starts at 74/B0339790
2024-08-11 13:10:56.335 UTC [439] LOG:  invalid record length at 74/B0B86448: expected at least 24, got 0
2024-08-11 13:10:56.335 UTC [439] LOG:  redo done at 74/B0B86420 system usage: CPU: user: 0.06 s, system: 0.04 s, elapsed: 0.11 s
2024-08-11 13:10:56.352 UTC [440] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-08-11 13:10:56.435 UTC [440] LOG:  checkpoint complete: wrote 1065 buffers (3.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.043 s, sync=0.029 s, total=0.086 s; sync files=91, longest=0.006 s, average=0.001 s; distance=8499 kB, estimate=8499 kB; lsn=74/B0B86448, redo lsn=74/B0B86448
2024-08-11 13:10:56.445 UTC [242] LOG:  database system is ready to accept connections
2024-08-11 13:10:56.449 UTC [444] LOG:  TimescaleDB background worker launcher connected to shared catalogs
Sun Aug 11 15:10:57 2024 WARNING: pgAgent Query error: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
 with status: PGRES_FATAL_ERROR Connection Error: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

Sun Aug 11 15:10:57 2024 ERROR: Failed to query jobs table!
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[15:10:57] INFO: Service code-server exited with code 1 (by signal 0)
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service pgagent: stopping
s6-rc: info: service pgagent successfully stopped
s6-rc: info: service postgres: stopping
[15:10:57] INFO: Service code-server exited with code 256 (by signal 15)
s6-rc: info: service postgres successfully stopped
s6-rc: info: service init-addon: stopping
s6-rc: info: service init-addon successfully stopped
s6-rc: info: service init-user: stopping
s6-rc: info: service init-user successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service base-addon-log-level: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service base-addon-log-level successfully stopped
s6-rc: info: service base-addon-banner: stopping
s6-rc: info: service base-addon-banner successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
2024-08-11 13:10:57.613 UTC [242] LOG:  received smart shutdown request
2024-08-11 13:10:57.613 UTC [444] FATAL:  terminating background worker "TimescaleDB Background Worker Launcher" due to administrator command
2024-08-11 13:10:57.613 UTC [447] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2024-08-11 13:10:57.613 UTC [451] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2024-08-11 13:10:57.623 UTC [242] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 444) exited with exit code 1
2024-08-11 13:10:57.623 UTC [242] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 447) exited with exit code 1
2024-08-11 13:10:57.623 UTC [242] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 451) exited with exit code 1
2024-08-11 13:10:57.625 UTC [242] LOG:  background worker "logical replication launcher" (PID 445) exited with exit code 1
2024-08-11 13:10:57.625 UTC [440] LOG:  shutting down
2024-08-11 13:10:57.628 UTC [440] LOG:  checkpoint starting: shutdown immediate
2024-08-11 13:10:57.640 UTC [440] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.003 s, total=0.015 s; sync files=2, longest=0.003 s, average=0.002 s; distance=8 kB, estimate=7650 kB; lsn=74/B0B88470, redo lsn=74/B0B88470
2024-08-11 13:10:57.655 UTC [242] LOG:  database system is shut down
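
For anyone wanting to reproduce this outside Home Assistant, the crash can in principle be triggered from psql with the exact statement shown in the log above (a minimal sketch; the version checks are just for context, and the homeassistant database is the one created by the add-on per the startup log):

-- Confirm which PostgreSQL and TimescaleDB versions are actually loaded
SELECT version();
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';

-- The statement the recorder schema migration was running when the backend
-- segfaulted (copied from the log above); re-running it manually may reproduce
-- the crash on affected setups
ALTER TABLE states
    ALTER event_id      TYPE BIGINT,
    ALTER old_state_id  TYPE BIGINT,
    ALTER attributes_id TYPE BIGINT,
    ALTER metadata_id   TYPE BIGINT;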

LTSS-related log:

This error originated from a custom integration.

Logger: custom_components.ltss
Source: custom_components/ltss/__init__.py:220
integration: Long Time State Storage (LTSS) (documentation)
First occurred: 15:10:56 (3 occurrences)
Last logged: 15:11:03

Error in database connectivity: (psycopg2.OperationalError) server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. (Background on this error at: https://sqlalche.me/e/20/e3q8). (retrying in 3 seconds)
Error in database connectivity: (psycopg2.OperationalError) connection to server at "192.168.1.7", port 5432 failed: Connection refused Is the server running on that host and accepting TCP/IP connections? (Background on this error at: https://sqlalche.me/e/20/e3q8). (retrying in 3 seconds)

Migration error:

Logger: homeassistant.components.recorder.core
Source: components/recorder/core.py:988
integration: Recorder (documentation, issues)
First occurred: 15:10:56 (1 occurrences)
Last logged: 15:10:56

Error during schema migration
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/recorder/core.py", line 988, in _migrate_schema
    new_schema_status = migrator(
                        ^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 335, in migrate_schema_live
    schema_status = _migrate_schema(
                    ^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 301, in _migrate_schema
    _apply_update(instance, hass, engine, session_maker, new_version, start_version)
  File "/usr/src/homeassistant/homeassistant/components/recorder/util.py", line 703, in wrapper
    job(instance, *args, **kwargs)
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 712, in _apply_update
    migrator_cls(instance, hass, engine, session_maker, old_version).apply_update()
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 758, in apply_update
    self._apply_update()
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 1492, in _apply_update
    _modify_columns(
  File "/usr/src/homeassistant/homeassistant/components/recorder/migration.py", line 559, in _modify_columns
    with session_scope(session=session_maker()) as session:
  File "/usr/local/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/src/homeassistant/homeassistant/helpers/recorder.py", line 104, in session_scope
    session.commit()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2017, in commit
    trans.commit(_to_root=True)
  File "<string>", line 2, in commit
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/state_changes.py", line 139, in _go
    ret_value = fn(self, *arg, **kw)
                ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1309, in commit
    trans.commit()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2629, in commit
    self._do_commit()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2734, in _do_commit
    self._connection_commit_impl()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2705, in _connection_commit_impl
    self.connection._commit_impl()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1146, in _commit_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2356, in _handle_dbapi_exception
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1144, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
                                  ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 585, in connection
    return self._revalidate_connection()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 677, in _revalidate_connection
    self._invalid_transaction()
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 667, in _invalid_transaction
    raise exc.PendingRollbackError(
sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back.  Please rollback() fully before proceeding (Background on this error at: https://sqlalche.me/e/20/8s2b)
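
To see how far the failed migration got, the affected columns can be inspected from psql (a minimal sketch, run against the homeassistant database the recorder uses; if the types still read integer, the BIGINT migration did not complete):

-- Column types the recorder migration was trying to change (see the log above)
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'states'
  AND column_name IN ('event_id', 'old_state_id', 'attributes_id', 'metadata_id');
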
balk77 commented 3 weeks ago

I have retried with LTSS disabled (start with 2024.7.2 --> disable LTSS in config --> upgrade). The same error occurs. This excludes LTSS from the equation.

expaso commented 3 weeks ago

Hi @balk77,

I am looking into this issue right now.

expaso commented 3 weeks ago

This looks like a bug in Timescale itself.

I have just successfully upgraded a database from 2024.7.1 to 2024.8.2 without problems, but the segfault I see in your logs tells me that it fails for some users, perhaps with other compression schemes.

I assume you are already using the latest version of the timescaledb add-on?

Either way, I will prepare a new version of the add-on with an upgraded TimescaleDB.

Stay tuned.

balk77 commented 3 weeks ago

Thanks a lot! I forgot to mention: add-on version 4.0.1. Strange that you don't experience this issue. Maybe my database is too large? 15 GB for HA and 10 GB for LTSS. I need to clean up.

Let me know if you need more info or testing

expaso commented 3 weeks ago

The size is not shocking; I have 40 GB running on a Raspberry Pi, so that should not be an issue. I am currently building (it takes a few hours to compile everything) and will then do some more testing.

skank01 commented 3 weeks ago

Just to inform: I don't have any errors either right now (HA version 2024.8.2 and the latest Timescale add-on).

expaso commented 3 weeks ago

Hi @skank01, oh, that's nice to know. That takes some of the pressure off this issue. I am currently preparing an update, because I saw that Timescale has fixed a few segfaults on their side. Either way, glad you raised the issue.

expaso commented 1 week ago

Version 4.1.0 should fix the above.
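
After upgrading the add-on, it may also be worth confirming that the TimescaleDB extension inside each database actually got updated (a minimal sketch, assuming the homeassistant and ha_ltss databases from this thread, and only needed if the add-on does not already do this at startup):

-- Run as the first statement in a fresh psql session, once per database
ALTER EXTENSION timescaledb UPDATE;
-- Then verify the loaded extension version
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';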

balk77 commented 1 week ago

Thanks for the update! But it did not help :( I may have another issue.

(image attached)

I will start recovery, migrate the HA database to a fresh plain PostgreSQL server, and try to get LTSS healthy again.

expaso commented 1 day ago

Have you already succeeded in this, or do you need any assistance?

balk77 commented 1 day ago

Thanks for checking. None of my recent backups worked; I lost the states table. I was surprised that I could still import the statistics tables, so I lost nothing important, at least on the HA side. I did not try to export and import the ltss and dsmrreader databases.

Not sure what caused the corruption. Maybe I should upgrade my hardware...
