Postgres server won't start

malewpro commented 10 months ago

Issue Summary

My self-hosted Redash instance won't fully boot and it appears the cause is an error in the Postgres container which causes the database to try to start, fail, and restart in a constant loop. I am hosting these containers in an Ubuntu VM hosted on a TrueNAS server.

Steps to Reproduce

I am self hosting Redash via a docker stack with the following containers:

redash_adhoc_worker_1
redash_nginx_1
redash_postgres_1
redash_redis_1
redash_scheduled_worker_1
redash_scheduler_1
redash_server_1

I use Redash to combine data from two different SQLite databases by creating queries of each database and then using the internal query feature to combine the two. I created a number of queries for this and had a solid system up and running but sometime over the last two months the redash_server_1 container would need to be reboot in order to access the web UI. Without rebooting the UI would just load. Recently, though, the problem has gotten more significant. Now I can't access the WebUI at all and can no longer use Redash. Rebooting all containers, ensuring that their images are updated, etc... does not work.

Looking at logs for redash_server_1 I get the following error when restarting the stack:


> `[2024-01-02 21:30:33,692][PID:12][INFO][metrics] method=GET path=/ endpoint=redash_index status=302 content_type=text/html; charset=utf-8 content_length=301 duration=0.68 query_count=0 query_duration=0.00
> 
> [2024-01-02 21:30:36,872] ERROR in app: Exception on /login [GET]
> 
> Traceback (most recent call last):
> 
>   File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
> 
>     response = self.full_dispatch_request()
> 
>   File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
> 
>     rv = self.handle_user_exception(e)
> 
>   File "/usr/local/lib/python2.7/site-packages/flask_restful/__init__.py", line 271, in error_router
> 
>     return original_handler(e)
> 
>   File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
> 
>     reraise(exc_type, exc_value, tb)
> 
>   File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
> 
>     rv = self.dispatch_request()
> 
>   File "/usr/local/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
> 
>     return self.view_functions[rule.endpoint](**req.view_args)
> 
>   File "/usr/local/lib/python2.7/site-packages/flask_limiter/extension.py", line 442, in __inner
> 
>     return obj(*a, **k)
> 
>   File "/app/redash/handlers/authentication.py", line 147, in login
> 
>     if current_org == None and not settings.MULTI_ORG:
> 
>   File "/usr/local/lib/python2.7/site-packages/werkzeug/local.py", line 365, in <lambda>
> 
>     __eq__ = lambda x, o: x._get_current_object() == o
> 
>   File "/usr/local/lib/python2.7/site-packages/werkzeug/local.py", line 302, in _get_current_object
> 
>     return self.__local()
> 
>   File "/app/redash/authentication/org_resolving.py", line 18, in _get_current_org
> 
>     g.org = Organization.get_by_slug(slug)
> 
>   File "/app/redash/models/organizations.py", line 33, in get_by_slug
> 
>     return cls.query.filter(cls.slug == slug).first()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2895, in first
> 
>     ret = list(self[0:1])
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2687, in __getitem__
> 
>     return list(res)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2995, in __iter__
> 
>     return self._execute_and_instances(context)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 3016, in _execute_and_instances
> 
>     close_with_result=True)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 3025, in _get_bind_args
> 
>     **kw
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 3007, in _connection_from_session
> 
>     conn = self.session.connection(**kw)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1035, in connection
> 
>     execution_options=execution_options)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1040, in _connection_for_bind
> 
>     engine, execution_options)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 409, in _connection_for_bind
> 
>     conn = bind.contextual_connect()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2123, in contextual_connect
> 
>     self._wrap_pool_connect(self.pool.connect, None),
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2162, in _wrap_pool_connect
> 
>     e, dialect, self)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1476, in _handle_dbapi_exception_noconnection
> 
> ������
>     exc_info
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 265, in raise_from_cause
> 
>     reraise(type(exception), exception, tb=exc_tb, cause=cause)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 2158, in _wrap_pool_connect
> 
>     return fn()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 400, in connect
> 
>     return _ConnectionFairy._checkout(self)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 788, in _checkout
> 
>     fairy = _ConnectionRecord.checkout(pool)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 529, in checkout
> 
>     rec = pool._do_get()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 1193, in _do_get
> 
>     self._dec_overflow()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
> 
>     compat.reraise(exc_type, exc_value, exc_tb)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 1190, in _do_get
> 
>     return self._create_connection()
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 347, in _create_connection
> 
>     return _ConnectionRecord(self)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 474, in __init__
> 
>     self.__connect(first_connect_check=True)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/pool.py", line 671, in __connect
> 
>     connection = pool._invoke_creator(self)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 106, in connect
> 
>     return dialect.connect(*cargs, **cparams)
> 
>   File "/usr/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 412, in connect
> 
>     return self.dbapi.connect(*cargs, **cparams)
> 
>   File "/usr/local/lib/python2.7/site-packages/psycopg2/__init__.py", line 130, in connect
> 
>     conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
> 
> OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
> 
>   Is the server running on host "postgres" (172.18.0.5) and accepting
> 
>   TCP/IP connections on port 5432?
> 
>  (Background on this error at: http://sqlalche.me/e/e3q8)
> 
> [2024-01-02 21:30:36,880][PID:12][INFO][metrics] method=GET path=/login endpoint=redash_login status=500 content_type=? content_length=-1 duration=3179.04 query_count=0 query_duration=0.00`

Looking at the redash_postgres_1 container logs I see that it's stuck in a loop of trying to start the database, encountering an error and restarting:

PostgreSQL Database directory appears to contain a database; Skipping initialization

LOG: database system was interrupted; last known up at 2024-01-02 04:31:22 UTC

LOG: invalid resource manager ID 56 at 13/78B326F8

LOG: invalid primary checkpoint record

LOG: invalid resource manager ID 114 at 13/78B32688

LOG: invalid secondary checkpoint record

PANIC: could not locate a valid checkpoint record

LOG: startup process (PID 21) was terminated by signal 6: Aborted

LOG: aborting startup due to startup process failure

LOG: database system is shut down

Since initially setting up the queries and dashboards in my Redash instance, I have made no changes other than to update the SQLite files and refresh queries. At some point in this process an additional "redash_postgres_1-old" container was created which I have no recollection of. Neither postgres container is able to start a database.

Lastly, I'm mentioning this in case it's a possible cause: I recently learned that the drives I built my TrueNAS server with are SMR drives which are known to cause long read times and some filesystem instability in a server implementation. I have replacement drives arriving this week. Perhaps someone with more expertise than me will be able to identify that as the source of the issue.

Technical details:

Redash Version: redash:latest (can't get into Web UI to see more specifics)
Browser/OS: Firefox on Mac OS
How did you install Redash: Docker Stack

malewpro commented 10 months ago

If nothing else, I'm hoping there is a way to salvage the queries I wrote. I spent a lot of time writing those and would love to not have to recreate them from scratch. I'm not sure where I could find those queries or if they're easy to access outside of the GUI. Thanks for the help in advance!

justinclift commented 10 months ago

PANIC: could not locate a valid checkpoint record

That sounds like the database structure on disk is damaged. :frowning_face:

Probably the very, very best place to ask for assistance is on the PostgreSQL mailing lists:

https://www.postgresql.org/list/

Out of the list of potentials there, the pgsql-general mailing list is probably the right place.

Note that's not me fobbing you off, the PostgreSQL Community is really good at helping out when problems happen in a database.

And once you have the database working again, Redash should be fine. :smile:

Also, take a backup of the complete PostgreSQL database data directory before you do anything, just to be on the safe side. It's always a good idea to have a fall back point so you can try new recovery options out (etc) until you have things working 100%. :smile:

justinclift commented 10 months ago

As a 2nd note, once you get things operating again it's a good idea (practically mandatory) to put some kind of automated process in place to back things up.

TrueNAS has the ability to do timed snapshots of the file system (eg "every hour", "every day", etc) and you can give them a defined retention period (eg "keep for a week", "keep for 6 months", etc) which helps dramatically when stuff goes wrong.

Additionally, it can also do timed jobs to copy those snapshots to another machine, or even to a cloud provider (eg rsync.net) for a potential "off site backup" approach.

Hopefully that'll be useful in future, once you get the immediate problem sorted out. :smile:

malewpro commented 10 months ago

@justinclift Thank you! I'm fairly new to self hosting and have learned a lot over the last six months. I suspected it was a PostrgreSQL issue so it's good to get that affirmed. I'll get on the mailing list and see what can be done.

And I think you're totally right re: backups. Now that I've got things in place in a way that I like, it's all about stability and fallbacks. I'll look into rsync and will make sure I have the right snapshots set up.

Closing this for now and will reply if the PostgreSQL people give me reason to believe it's not a database corruption issue.

getredash / redash