LemmyNet / lemmy

🐀 A link aggregator and forum for the fediverse
https://join-lemmy.org
GNU Affero General Public License v3.0

[Bug]: Lemmy 0.18.3 BE - Couldn't run DB Migrations: Failed to run 2023-07-08-101154_fix_soft_delete_aggregates #3756

Closed · wpuckering closed this 1 year ago

wpuckering commented 1 year ago

Summary

When upgrading Lemmy BE from the 0.18.2 container image to the 0.18.3 container image, database migrations fail with this error:

thread 'main' panicked at 'Couldn't run DB Migrations: Failed to run 2023-07-08-101154_fix_soft_delete_aggregates with: syntax error at or near "trigger"', crates/db_schema/src/utils.rs:221:25

Reverted back to the 0.18.2 tag and my instance is running normally again without issue.

For the Postgres database I'm using postgres:13.2-alpine.

Steps to Reproduce

  1. Spin down Lemmy 0.18.2 BE container.
  2. Change container image to reference 0.18.3 tag.
  3. Spin up Lemmy 0.18.3 BE container.

Technical Details

thread 'main' panicked at 'Couldn't run DB Migrations: Failed to run 2023-07-08-101154_fix_soft_delete_aggregates with: syntax error at or near "trigger"', crates/db_schema/src/utils.rs:221:25 note: run with 'RUST_BACKTRACE=1' environment variable to display a backtrace

Version

BE 0.18.3

Lemmy Instance URL

No response

dessalines commented 1 year ago

Where did you find that postgres version?

wpuckering commented 1 year ago

I've been using this container image for some of my self-hosted Postgres containers for a while now: https://hub.docker.com/layers/library/postgres/13.2-alpine/images/sha256-3335d0494b62ae52f0c18a1e4176a83991c9d3727fe67d8b1907b569de2f6175?context=explore

Admittedly I should get around to upgrading them all to a more recent version, but everything's been working fine for the 20 or so other services I host that use this image, so I wasn't in a rush. I didn't think the likelihood of hitting an incompatibility would be all that high, but maybe that's the case now?

Lemmy 0.18.2 runs fine with the database using that image. Does Lemmy 0.18.3 introduce some syntax for database migrations that's only valid on a higher version of Postgres?
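For context, the "syntax error at or near \"trigger\"" strongly suggests the failing migration uses `CREATE OR REPLACE TRIGGER`, which PostgreSQL only accepts from version 14 onward; on 13.x the parser rejects it. A minimal illustration of the difference (the trigger, table, and function names here are made up, not Lemmy's actual ones):

    -- Accepted on PostgreSQL 14+, fails with 'syntax error at or near "trigger"'
    -- on 13.x and older, where OR REPLACE for triggers does not exist:
    CREATE OR REPLACE TRIGGER example_aggregates_trigger
    AFTER UPDATE ON example_table
    FOR EACH ROW EXECUTE FUNCTION example_aggregates_fn();

    -- The pre-14 equivalent has to drop and recreate the trigger instead:
    DROP TRIGGER IF EXISTS example_aggregates_trigger ON example_table;
    CREATE TRIGGER example_aggregates_trigger
    AFTER UPDATE ON example_table
    FOR EACH ROW EXECUTE FUNCTION example_aggregates_fn();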

dessalines commented 1 year ago

That's not a postgres version we support. All our CI jobs would fail if we used older postgres versions. Use the backup and restore docs to do a postgres upgrade.
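For the docker setup, the upgrade flow in those docs boils down to dumping the old database and restoring it into a newer postgres container. A rough sketch of the shape of it (the service name `postgres`, the user `lemmy`, and the file name are assumptions here, so follow the actual docs for your compose file):

    # dump everything from the running (old) postgres container
    docker compose exec postgres pg_dumpall -U lemmy > lemmy_dump.sql

    # stop the stack, then edit docker-compose.yml: bump the postgres image tag
    # and point the data volume at a fresh directory so the new version initialises clean
    docker compose down

    # bring only postgres back up and restore the dump into it before starting lemmy
    docker compose up -d postgres
    docker compose exec -T postgres psql -U lemmy -d postgres < lemmy_dump.sql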

wpuckering commented 1 year ago

Okay, I'll upgrade and see if it resolves the issue. Thanks!

RocketDerp commented 1 year ago

That's not a postgres version we support. All our CI jobs would fail if we used older postgres versions.

The supported "From Scratch" install instructions yielded PostgreSQL 14 on my Ubuntu 22.04.2 system, and those "From Scratch" installs were recommending Ubuntu 20.04 back at the start of June (I'm not sure what PostgreSQL version is the default in that distro). There is no step in those instructions to use a specific version of PostgreSQL...

RocketDerp commented 1 year ago

Further follow-up:

"Ubuntu 20.04 comes with Postgres 12 from it's universe repository." source: https://pgdash.io/blog/postgres-13-getting-started.html

On September 14, 2021, "From Scratch" instructions were committed to the Lemmy project recommending Ubuntu 20.04, cite: https://github.com/LemmyNet/lemmy-docs/commit/9ca86abb39fc4ea15b062faa3daf246ec7965ba4

If one tests and follows these directions as they were published, it is my understanding that you would be on PostgreSQL version 12.

I believe it is important to keep this issue open. "Lemmy From Scratch" installations could be impacted by these upgrade issues.

dessalines commented 1 year ago

We can't support from-scratch installs, just docker and ansible installs, and neither of those use the postgres version above.

RocketDerp commented 1 year ago

I added an EMERGENCY pull request about documenting that you do not support from-scratch installs and do not warn people of the upgrade failures: https://github.com/LemmyNet/lemmy/issues/3756

dessalines commented 1 year ago

We already have that warning: https://join-lemmy.org/docs/administration/administration.html

RocketDerp commented 1 year ago

We already have that warning

Can you be precise, exact quote or GitHub line-link to precise line of content, exact warning that upgrades will crash in 0.18.3 that I added?

wpuckering commented 1 year ago

Just chiming in, I took a backup of my Postgres 13.2 database and restored it into Postgres 15.3. It's still having issues running the migrations; it fails at the same point but yields a more accurate error:

thread 'main' panicked at 'Couldn't run DB Migrations: Failed to run 2023-07-08-101154_fix_soft_delete_aggregates with: relation "comment" does not exist', crates/db_schema/src/utils.rs:221:25

Might possibly have something to do with this? https://github.com/LemmyNet/lemmy/issues/3501
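In case it helps anyone hitting the same `relation "comment" does not exist` error after a restore, it is worth checking whether the table is really gone or just not visible because the restore put it in a different schema or the role's search_path changed. A quick check in plain psql (nothing Lemmy-specific assumed):

    -- list every schema that contains a table named "comment"
    SELECT schemaname, tablename
    FROM pg_tables
    WHERE tablename = 'comment';

    -- and see which schemas the current connection actually searches
    SHOW search_path;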

wpuckering commented 1 year ago

Looks like somehow my comment table is missing. I didn't do anything on my part which should have caused it to be deleted, but that appears to be the root cause.

wpuckering commented 1 year ago

I'm back in business. Something went very wrong with the migrations in 0.18.3, but I wasn't able to pinpoint what it was (I didn't try too hard to be honest, I just wanted to get up and running again).

Here's what I did that put me back into a good state in the end:

I don't know if that will help anyone else or not. Something seems janky with the migration scripts, maybe only if you had been running with an older version of Postgres (like 13) since before 0.18.3. Just glad I got up to the new version. Will definitely exercise more caution when upgrading from here on.

dessalines commented 1 year ago

Were you by any chance running an rc, or main branch? I can't imagine why the comment table would go missing like that.

wpuckering commented 1 year ago

Never ran any release candidates, always stuck to the stable container images.

dessalines commented 1 year ago

Can you be precise, exact quote or GitHub line-link to precise line of content, exact warning that upgrades will crash in 0.18.3 that I added?

[screenshot attached: Screenshot_2023-07-28-18-48-04-445_mark.via.gp.png]

RocketDerp commented 1 year ago

Thank you for responding. Has Lemmy.ml for the past 60 days been an example of a "supported" system? As the constant crashes caused by issue #2910 have confused me every single day since June 4. And why are you doing everything you can to avoid people opening PostgreSQL related topics?

I am BEYOND bewildered about what a supported system looks like. I know I have brain damage, which is not a joke, but I am having to develop advanced social theories to explain why issue #2910 was so ignored when it could be fixed with a simple SQL removal of the TRIGGER? PostgreSQL is the core engine of lemmy, it holds everything, I just don't understand why the rush to close issues immediately after a release instead of doing due diligence to be sure nobody else has the problem (which they won't likely find on a CLOSED support issue).

"The buck stops before PostgreSQL", support doesn't include #2910?

If this is all some elaborate social hazing, I think you guys are masters of it! #2910 since June 4! I personally created a community on Lemmy June 13 called "!lemmypperformance@lemmy.ml" in response to it. If it is all social hazing, you sure got me until about July 15.

RocketDerp commented 1 year ago

Is lemmy.ca a "supported system"? As they personally cloned their database and ran PostgreSQL EXPLAINS to find their own crashes on Saturday and Sunday. I had a pull request waiting for you Monday morning! "Ohh PostgreSQL is icky, close any reports and avoid touching #2910" seems to be the very problem - with constant crashes and overloads. I am beyond bewildered and again, having to develop sophisticated social theories to explain why you ignored #2910 when it is so clearly written and spelled out as a server-crash issue. "Supported"? I can't believe how much you RUSHED to close #3756 and avoid anyone seeing a PostgreSQL topic. Bewildered, dumbfounded, advanced social theories is all I have left! Hazing of some extreme order! June 4, JUNE FOUR, #2910 SUPPORTED!

RocketDerp commented 1 year ago

Explain your social behavior with issue #2910 - not supported?

Is lemmy.ml your server, or not? Do you not have the ability to fix GitHub issue #2910 with the simple REMOVAL of a TRIGGER? June 4.... 11 days later, on June 15 I am submitting lemmy.ml support tickets about the crashes related to #2910 PostgreSQL crashes:

[screenshot attached]

Go ahead, RED CIRCLE what supported means in Lemmy project. Is lemmy.ml a SUPPORTED SERVER running DOCKER? Is issue #2910 going to be ignored the rest of your life, June 4 2023?

Are you TROLLING ME? Do you think ALL OF US here have never run PostgreSQL in time-sensitive production for over a decade? Nobody else has written a messaging system like Lemmy? That nobody will notice the massive disk I/O and errors coming out of lemmy_server? TROLLING ME about lemmy.ca, lemmy.ml, and issue #2910 not being "SUPPORTED"?

RocketDerp commented 1 year ago

More from June, over a MONTH ago, about your personal labor to AVOID making PostgreSQL the central focus of Lemmy development TO FIX CRASHES of LEMMY.ML and OTHER SERVERS and support: https://github.com/LemmyNet/lemmy-docs/issues/233

THIS IS NOT A RHETORICAL QUESTION: Are you TROLLING ME AND OTHERS? With avoiding #2910 as a topic of "support" response time to clearly-described total server crashing that was obviously impacting lemmy.ml and others too? 50+ days to fix what anyone could do in 30 minutes with a SQL removal of the TRIGGER? Do you dispute these facts of times, dates, and estimates to fix the problem?

Why in GOD's NAME, did you close this issue #3756 as fast as you possibly could when you never tested Lemmy 0.18.3 on "From Scratch" installs? WHY? NOT RHETORICAL!

RocketDerp commented 1 year ago

You make a RELEASE on Friday, when you obviously do not work on Saturday and Sunday, ignoring just this recent Sunday pull-request with an EASY and FAST fix to server-crashing crisis of the entire Lemmy network of servers... (shall I bring up how you edited out the word "Emergency" first thing Monday when you started working?)

Was June 4, not June 1, July, Not June 3 but June 4 long enough to leave this issue ignored because it was PostgreSQL and not your Rust coding focus? Was #2910 not CLEAR ENOUGH in ENGLISH:

[screenshot attached]

AGAIN: ARE YOU TROLLING THE ENTIRE REDDIT COMMUNITY WITH lemmy.ml CRASHING every day since May 25 that I have documented? And how RAPIDLY and "Johnny ON THE SPOT" you were closing out PostgreSQL related upgrade reports on fresh RELEASE of 0.18.3 - issue #3756?

Am I being trolled with the illusion that "PostgreSQL isn't supported, https://lemmy.ml/ crashing every single hour of every single day since May 25, 2023 because of PostgreSQL overload in lemmy_server logic isn't what we support! Again, again: is lemmy.ml an EXAMPLE of your "supported" servers? ignoring #2910 since June 4?

I am having to research and order additional learning material (books) about social hazing practices. It is all far beyond my personal experience and understanding what you think you are doing closing this PostgreSQL issue #3756 as rapidly as possible while avoiding the truth of PostgreSQL issue #2910.

RocketDerp commented 1 year ago

I personally suffer from extensive brain damage, and I am in the final time of my life. I have been hospitalized in mental institutions several times in recent years with brain damage problems.

So I have learned a lot about psychological behavior changes. And I know PostgreSQL seems to trigger total avoidance in the Lemmy project leadership. So I am spending my weekend reviewing, once again, how you could close issue #3756 so rapidly, while ignoring issue #2910 while your official project lemmy.ml server has been crashing every-hour of every-day since May 25 that I have personally documented it.

Let's use Lemmy community: https://lemmy.ml/post/2561833

[screenshot attached]

I KNOW from over 35 years of production application development with SQL that issue #2910 was fixable in 30 minutes of labor: remove the TRIGGER with an SQL statement. So I can only get into advanced social theories of why, since June 4, you have been behaving this way about PostgreSQL topics.

RocketDerp commented 1 year ago

I am sorry I am a brain-damaged fuck-up

https://lemmy.ml/post/2564614

RocketDerp commented 1 year ago

YOU TROLLED US SINCE JUNE 4, #2910

[screenshot attached]

RocketDerp commented 1 year ago

Do you all death-wish brain-damaged humans?

RocketDerp commented 1 year ago

Ok, I'm sorry about my reaction to this. I have to face up to the fact that the Lemmy community has values that I overlooked.

I’ve had the wrong attitude about the server needing to be reliable and about the importance of data loss. It was bewildering to me, beyond my imagination, that social things worked like this in the modern day. But I need to face up to the fact that Elon Musk does it, and I don't get as bewildered about that. Social things can be this way, and I have too much brain damage and too many personal memories and experiences with PostgreSQL, MaxDB, and DB/2 that trained me to treat data differently.

I really failed to be open-minded about the fact that the community thinks the crashes are fine. I'm sorry, and I guess that's baggage from past jobs and server responsibilities.

When I step back and study the history of Microsoft vs. Apple, the “open hardware” approach of Microsoft proved to be abandoned. Apple always had the best loyalty by NOT offering hardware choice to their customers. And Linux, which offered the most hardware choices of all, was even less popular. Apple knew people didn’t like to pay money for software, so they were an operating system company that packaged it with hardware.

If you look at how Hollywood productions are run in terms of expenses for sets, clothes, shooting locations - it can be millions of dollars for 90 seconds of a movie. Because it’s in the sphere of social media, films are social.

It was hard for me to come to terms with. I drank for the first time since New Year, I was in so much mental pain and anguish, which has more to do with deteriorating hardware than anything else. I tried to get words out to convey my PostgreSQL experiences running with data in production, and I just didn't see that with social sites the rules are social and not about data integrity.

I should have stuck to topics like ChatGPT, Neil Postman, Marshall McLuhan. Sorry I'm such a mess and had all the wrong attitude about a project like this.

RocketDerp commented 1 year ago

Sorry I went so angry

This is why, Sunday night

[screenshot attached]

k4r4b3y commented 1 year ago

I compiled from source in order to upgrade from 0.18.2 to 0.18.4. I am following the "Install from scratch" guide (I know it is not the official way to install Lemmy; however, I am not familiar with Docker and like running programs directly on my Debian 12 system).

After compilation, I tried restarting the lemmy systemd service, however it fails to run. In the journalctl, I see the following error message:

lemmy_server[17631]: thread 'main' panicked at 'Couldn't run DB Migrations: Failed to run 2023-07-08-101154_fix_soft_delete_aggregates with: syntax error at or near "trigger"', crates/db_schema/src/utils.rs:221:25 

Is there a solution to that? I have lost my lemmy instance because of that.

RocketDerp commented 1 year ago

Do you have a backup from 0.18.2 before the upgrade? PostgreSQL restore is amazingly fast; it does all the INDEX and constraint work at the end. What is your version of PostgreSQL, 15.4?

Here is the migration file that error is referencing: https://github.com/LemmyNet/lemmy/blob/main/migrations/2023-07-08-101154_fix_soft_delete_aggregates/up.sql

k4r4b3y commented 1 year ago

I have a backup from before the upgrade. However, on my server I have both PostgreSQL 13 and 15. I had the server running Debian 11, then dist-upgraded to Debian 12, and along with it came PostgreSQL 15.

However, the Postgres database that Lemmy runs on is still 13 (up until Lemmy v0.18.2).

To make things trickier, I have been running a Matrix Dendrite server on the same VPS, on the psql 13 database (under a different database username, of course).

So, how would a database restore work in this case? I would like the Dendrite database to be left unscathed and to migrate only the Lemmy data from the psql 13 database to the psql 15 database.

RocketDerp commented 1 year ago

Yeah, the issue you are responding to is, I think, validation that older PostgreSQL versions stopped working with recent Lemmy versions... you have to be on 15.

PostgreSQL pretty much has you dump the entire database to a text file and re-import it; the backup/restore process is very fast, even if you have 400,000 posts and an equal number of comments. From what I've seen, you should be able to back up from 13, restore to 15, change the parameters to point to the new server, and start lemmy_server, and it should get past the error you are having.
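On a Debian-style machine with both clusters installed side by side, the flow is roughly this; the ports, database name, and role name below are examples, so check `pg_lsclusters` and your lemmy.hjson for the real values. Dumping only the Lemmy database also leaves the Dendrite database untouched:

    # see which cluster listens on which port
    pg_lsclusters

    # dump only the lemmy database from the old 13 cluster
    sudo -u postgres pg_dump -p 5432 -d lemmy > lemmy_pg13_dump.sql

    # create the role and an empty database on the 15 cluster, then restore into it
    sudo -u postgres createuser -p 5433 --pwprompt lemmy
    sudo -u postgres createdb -p 5433 -O lemmy lemmy
    sudo -u postgres psql -p 5433 -d lemmy -f lemmy_pg13_dump.sql

    # finally point lemmy.hjson's database settings at the 15 cluster's port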

k4r4b3y commented 1 year ago

@RocketDerp are you in the Lemmy Matrix support rooms? If so, can you DM me, or ping me on one of the Lemmy Upgrade or Lemmy Install matrix rooms? My matrix username is @k4r4b3y:karapara.net

I would like to learn more about how I can do the lemmy database migration from psql 13 to 15. I have never used psql (or any other SQL software before).

Also, afaik, Lemmy is by default using psql 15 on my system; however, the Debian psql 15 has inherited the pre-existing psql 13 database and is running it. So the psql 15 software is running the psql 13 database on my Debian 12. How do I fix these things?

RocketDerp commented 1 year ago

are you in the Lemmy Matrix support rooms?

I'll be honest, I've been involved in development since the 1980s, and I've always preferred search-engine-facing content like Reddit over black holes where nothing gets into search engines, be it IRC chat or Discord. I know there is a long tradition of open source with IRC, but every time I visited those rooms on obscure projects, nobody was around to answer. Not to say people always answer a Reddit or BBS posting.

I would like to learn more about how I can do the lemmy database migration from psql 13 to 15. I have never used psql (or any other SQL software before).

I did create a Lemmy community for this topic: https://lemmy.ml/c/lemmy_postgresql@bulletintree.com

I can try to make time to help you today... but Matrix Chat is just another platform to eat up time when I'm struggling to keep up with all the learning my self.

Also, afaik, Lemmy is by default using psql 15 on my system; however, the Debian psql 15 has inherited the pre-existing psql 13 database and is running it. So the psql 15 software is running the psql 13 database on my Debian 12. How do I fix these things?

I assume these are running on different ports, tcp ports, I'd start there and talk specific port numbers.

k4r4b3y commented 1 year ago

@RocketDerp thank you for the detailed answer. I will create an account on the lemmy server you posted and detail my issue on the sub-lemmy you posted there. Is that ok?

I'd start there and talk specific port numbers.

I looked into those yesterday with the phind.com AI assistant, and I realized that ports 5432 and 5433 are being used. Afaik, 5432 belongs to the psql 13 and 5433 belongs to the psql 15.

RocketDerp commented 1 year ago

I will create an account on the lemmy server you posted and detail my issue on the sub-lemmy you posted there. Is that ok?

Fine with me, or you can pick any other active instance and subscribe to it.

Afaik, 5432 belongs to the psql 13 and 5433 belongs to the psql 15.

I'll look for your post on Lemmy. I'm RoundSparrow on Lemmy.ml, and I'll discuss it further there.

k4r4b3y commented 1 year ago

Thanks. I should have a thread posted there within an hour.

k4r4b3y commented 1 year ago

@RocketDerp , I posted here: https://bulletintree.com/post/3047685

RocketDerp commented 1 year ago

OK, I'm going there now.

akohlsmith commented 1 year ago

I have the same issue. I'm using an EXTERNAL postgres database, and for similar reasons to others, the schema update failed. For me, however, restoring from backup after upgrading to PostgreSQL 15 (which deletes the old db) didn't work, and I still have the same error.

Now I can manually fix the up.sql/down.sql (and have tested this), but docker-compose up still fails, presumably because the up/down scripts in the docker image are still "not fixed".

I did try docker-compose run lemmy /bin/sh to try to poke around inside the container but I'm quite sure that's not the right way to do this. I know enough about docker to be dangerous, but that's it.
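One thing that might help anyone stuck at this point: as far as I can tell, the migrations are compiled into the lemmy_server binary rather than shipped as loose scripts inside the image, so patching up.sql/down.sql inside the container won't change what the server runs. Assuming Lemmy uses Diesel's standard bookkeeping table, you can at least see exactly where the run stopped by querying it directly:

    -- list the most recently applied migrations; the failed one
    -- (2023-07-08-101154_fix_soft_delete_aggregates) should be absent
    SELECT version, run_on
    FROM __diesel_schema_migrations
    ORDER BY run_on DESC
    LIMIT 10;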

gribodyr commented 1 year ago

For anyone who lands here trying to run a fresh "install from scratch", here's what worked for me.

  1. Make sure you have PostgreSQL 16 installed
  2. Make sure YOU DO NOT have previous versions of PostgreSQL installed (like 12, which is bundled with Ubuntu 20.04). To check:
    user@lemmy:~/lemmy$ pg_lsclusters 
    Ver Cluster Port Status Owner    Data directory              Log file
    16  main    5432 online postgres /var/lib/postgresql/16/main /var/log/postgresql/postgresql-16-main.log
  3. If you have older PG clusters running, do something like this. MAKE SURE YOU DON'T HAVE ANY DATABASES RUNNING ON THE OLD CLUSTER(S).
    pg_dropcluster --stop 12 main
  4. Make sure your PG16 port number is 5432 (if you had other clusters, it might be different). Change it to 5432 if necessary by editing postgresql.conf and restarting PG (see the sketch after this list).
  5. Repeat "Install from Scratch" instructions for setting up a DB on the PG16's cluster if necessary.
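For step 4, on the stock Debian/Ubuntu package layout that port change looks roughly like this (the paths and cluster name assume the distro packaging, so adjust for your install):

    # check current cluster/port assignments
    pg_lsclusters

    # make the 16/main cluster listen on 5432
    sudo sed -i 's/^port = .*/port = 5432/' /etc/postgresql/16/main/postgresql.conf

    # restart just that cluster so the change takes effect
    sudo systemctl restart postgresql@16-main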