If nothing else, I'm hoping there is a way to salvage the queries I wrote. I spent a lot of time writing them and would love not to have to recreate them from scratch. I'm not sure where to find those queries or whether they're easy to access outside of the GUI. Thanks in advance for the help!
PANIC: could not locate a valid checkpoint record
That sounds like the database structure on disk is damaged. :frowning_face:
Probably the very, very best place to ask for assistance is on the PostgreSQL mailing lists:
https://www.postgresql.org/list/
Out of the list of potentials there, the pgsql-general mailing list is probably the right place.
Note that's not me fobbing you off; the PostgreSQL community is really good at helping out when problems happen in a database.
And once you have the database working again, Redash should be fine. :smile:
Also, take a backup of the complete PostgreSQL database data directory before you do anything, just to be on the safe side. It's always a good idea to have a fall back point so you can try new recovery options out (etc) until you have things working 100%. :smile:
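In case it helps, the manual version of that backup looks roughly like this. It's only a sketch: the bind mount path and volume name are assumptions, so check your compose file and `docker ps` output for the real ones.

```bash
# Stop the containers first so nothing writes to the data directory mid-copy
docker-compose stop

# Option A: Postgres data is a bind mount on the host (path is an assumption)
sudo tar czf ~/redash-postgres-backup-$(date +%F).tar.gz /opt/redash/postgres-data

# Option B: Postgres data lives in a named Docker volume (volume name is an assumption)
docker run --rm -v redash_postgres_data:/data -v "$PWD":/backup alpine \
  tar czf /backup/redash-postgres-backup-$(date +%F).tar.gz -C /data .
```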
As a second note, once you get things operating again, it's a good idea (practically mandatory) to put some kind of automated backup process in place.
TrueNAS has the ability to do timed snapshots of the file system (eg "every hour", "every day", etc) and you can give them a defined retention period (eg "keep for a week", "keep for 6 months", etc) which helps dramatically when stuff goes wrong.
It can also do timed jobs to copy those snapshots to another machine, or even to a cloud provider (eg rsync.net) for a potential "off site backup" approach.
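For reference, those GUI tasks boil down to standard ZFS operations underneath. Here's a rough sketch of the equivalent commands; the dataset names `tank/vms` and `backuppool/vms` are assumptions, so use whatever your pools are actually called:

```bash
# Take a snapshot of the dataset holding the VM / database data
zfs snapshot tank/vms@nightly

# List snapshots to confirm the retention policy is doing its job
zfs list -t snapshot -r tank/vms

# Replicate a snapshot to a second box (or an off-site host) over SSH
zfs send tank/vms@nightly | ssh backup-host zfs receive backuppool/vms
```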
Hopefully that'll be useful in future, once you get the immediate problem sorted out. :smile:
@justinclift Thank you! I'm fairly new to self hosting and have learned a lot over the last six months. I suspected it was a PostgreSQL issue, so it's good to have that confirmed. I'll get on the mailing list and see what can be done.
And I think you're totally right re: backups. Now that I've got things in place in a way that I like, it's all about stability and fallbacks. I'll look into rsync and will make sure I have the right snapshots set up.
Closing this for now and will reply if the PostgreSQL people give me reason to believe it's not a database corruption issue.
Issue Summary
My self-hosted Redash instance won't fully boot. The cause appears to be an error in the Postgres container, which leaves the database stuck in a constant loop of starting, failing, and restarting. The containers run in an Ubuntu VM hosted on a TrueNAS server.
Steps to Reproduce
I am self-hosting Redash via a Docker stack with the following containers:
I use Redash to combine data from two different SQLite databases by creating queries against each database and then using the internal query feature to combine the two. I created a number of queries for this and had a solid system up and running, but sometime over the last two months the redash_server_1 container started needing a reboot in order to access the web UI; without a reboot, the UI would just keep loading. Recently, though, the problem has gotten more significant: now I can't access the web UI at all and can no longer use Redash. Rebooting all the containers, ensuring their images are updated, and so on does not work.
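For reference, the reboot/update steps I've been trying look roughly like this (run from the directory containing my docker-compose.yml; the container name is from my stack):

```bash
# Restarting just the server container, which used to be enough to get the UI back
docker restart redash_server_1

# Pulling updated images and recreating the whole stack
docker-compose pull
docker-compose down
docker-compose up -d
```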
Looking at the logs for redash_server_1, I get the following error when restarting the stack:
Looking at the redash_postgres_1 container logs, I see that it's stuck in a loop of trying to start the database, encountering an error, and restarting:
Since initially setting up the queries and dashboards in my Redash instance, I have made no changes other than updating the SQLite files and refreshing queries. At some point in this process an additional "redash_postgres_1-old" container was created, which I have no recollection of creating. Neither Postgres container is able to start the database.
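For what it's worth, this is how I'm inspecting the two Postgres containers and their logs (container names as they appear in `docker ps -a` on my host):

```bash
# Both Postgres containers show up here
docker ps -a --filter "name=redash_postgres"

# The start/fail/restart loop is visible in the logs of each
docker logs --tail 100 redash_postgres_1
docker logs --tail 100 redash_postgres_1-old
```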
Lastly, I'm mentioning this in case it's a possible cause: I recently learned that the drives I built my TrueNAS server with are SMR drives, which are known to cause long read times and some filesystem instability in a server setting. I have replacement drives arriving this week. Perhaps someone with more expertise than I have will be able to identify that as the source of the issue.
Technical details: