craftcms / cms

Build bespoke content experiences with Craft.
https://craftcms.com
Other
3.29k stars 638 forks source link

[4.x]: SQLSTATE[HY000]: General error: 1712 Index matrixblocks is corrupted #13021

Open rob-c-baker opened 1 year ago

rob-c-baker commented 1 year ago

What happened?

Possibly related to: #12456.

Description

I have just seen this https://github.com/craftcms/cms/issues/12456#issuecomment-1378652643 SQL exception in our production environment with the matrixblocks table getting it's index corrupted.

It happened on Craft v4.3.3 and caused the exception to be thrown when anything read or attempted to write to the matrixblocks table.

Steps to reproduce

This occurred after a few hours of several (I believe 3 or 4) different content editors simultaneously changing a variety of different entries in different sections, but the common factor seems to be they were all editing entries that has a common field that is a Matrix content builder. The logic seems to fit: Many writes to the table from different db connections. The transactional approach should take care of this but not in this case it seems.

Expected behavior

No db table corruption

Actual behavior

500 Server errors on the front-end and in the CMS when doing any action that interacts with the matrixblocks db table.

DB Version

Control panel reports: MySQL 5.5.5
But SELECT VERSION(); reports: 10.6.12-MariaDB-0ubuntu0.22.04.1

Note

The fix suggested here works and rebuilds the broken db index: https://stackoverflow.com/questions/37153406/need-assistance-in-repair-of-corrupted-inndob-table which is essentially just:

ALTER TABLE matrixblocks ENGINE=InnoDB;

Craft CMS version

4.3.3

PHP version

8.1.2

Operating system and version

Ubuntu 22

Database type and version

No response

Image driver and version

No response

Installed plugins and versions

No response

rob-c-baker commented 1 year ago

Having read through the changelog it looks like there may be a few things in updates I don't yet have that may affect or improve things in regard to this issue. I'm going to have a look at getting up-to-date before attempting to re-create this issue (easier said than done considering the unknown nature / unpredictability of it's trigger).

angrybrad commented 1 year ago

@rob-baker-ar Just to verify, your matrixblocks table is already InnoDB and not MyISAM, correct?

rob-c-baker commented 1 year ago

Yes, all tables are InnoDB.

CREATE SQL for matrixblocks looks like this:

CREATE TABLE `matrixblocks` (
    `id` INT(11) NOT NULL,
    `primaryOwnerId` INT(11) NOT NULL,
    `fieldId` INT(11) NOT NULL,
    `typeId` INT(11) NOT NULL,
    `deletedWithOwner` TINYINT(1) NULL DEFAULT NULL,
    `dateCreated` DATETIME NOT NULL,
    `dateUpdated` DATETIME NOT NULL,
    PRIMARY KEY (`id`) USING BTREE,
    INDEX `matrixblocks_ownerId_idx` (`primaryOwnerId`) USING BTREE,
    INDEX `matrixblocks_fieldId_idx` (`fieldId`) USING BTREE,
    INDEX `matrixblocks_typeId_idx` (`typeId`) USING BTREE,
    CONSTRAINT `matrixblocks_fieldId_fk` FOREIGN KEY (`fieldId`) REFERENCES `fields` (`id`) ON UPDATE RESTRICT ON DELETE CASCADE,
    CONSTRAINT `matrixblocks_id_fk` FOREIGN KEY (`id`) REFERENCES `elements` (`id`) ON UPDATE RESTRICT ON DELETE CASCADE,
    CONSTRAINT `matrixblocks_ownerId_fk` FOREIGN KEY (`primaryOwnerId`) REFERENCES `elements` (`id`) ON UPDATE RESTRICT ON DELETE CASCADE,
    CONSTRAINT `matrixblocks_typeId_fk` FOREIGN KEY (`typeId`) REFERENCES `matrixblocktypes` (`id`) ON UPDATE RESTRICT ON DELETE CASCADE
)
COLLATE='utf8mb3_general_ci'
ENGINE=InnoDB
;
rob-c-baker commented 1 year ago

The installation that this happened to is now updated to Craft 4.4.5. All of our content editors are aware of what to look out for, although it's a hard ask for them to try and recreate the issue seeing as it takes down any part of the CMS or front-end that interacts with the matrixblocks table. I will try and replicate in a non-live environment but it's just me trying to reproduce an issue that may only appear with a number of users saving entries at around the same time.

I will update here when/if it happens again.

approached commented 1 year ago

I had the simliar error:

[web.ERROR] [PDOException] SQLSTATE[HY000]: General error: 1712 Index elements_sites is corrupted {"trace":["#0 /app/vendor/yiisoft/yii2/db/Command.php(1302): 

I made database dump and than full import.

rob-c-baker commented 1 year ago

@approached This would probably have fixed your issue instead of a database dump / restore:

ALTER TABLE elements_sites ENGINE=InnoDB;

Interesting this happened on a different table to the original issue above - I have been told this happened to the elements_sites table on our install a few weeks ago and they just rebuilt the index thinking it was a one-off.

angrybrad commented 1 year ago

Sounds a lot like this underlying MySQL bug that bit people on 8.0.29 https://xenforo.com/community/threads/psa-for-any-sites-on-mysql-8-0-29-data-corruption-incl-backups.207755/

It was "fixed" in MySQL 8.0.30, but only for freshly provisioned 8.0.30 instances: i.e. an 8.0.29 box that was upgraded to 8.0.30 would still be affected.

I'm not exactly sure of the MariaDB versions that would map to 8.0.29 and 8.0.30, but my guess is it's the same underlying bug.

rob-c-baker commented 1 year ago

This issue just happened again on our production environment (same table, same exception).

I'm planning to a rebuild of all the database tables during the least busy time for the site. Because it's InnoDB for all tables, a simple OPTIMIZE [table]; for each table should be sufficient, recreating each table on disk. Hopefully this will counter any underlying data inconsistencies at the database level that could be present based on @angrybrad's suggestion.

rob-c-baker commented 1 year ago

Rebuild from above done, however:

This just happened again twice within minutes. Only 1 content manager working on the site at the time. They reported that there were changing a simple text field a few times within a couple of seconds - adding text then removing. They could see the activity spinner indicating a draft was auto saving and they said that "the exclamation mark icon appeared a couple of times" (I know this indicates a server error). But they also said that the icon appearance is not unusual - it happens every now and then.

The relevant logs entries seem to be a couple of lock wait timeouts:

[2023-04-19T09:58:07+00:00] web.ERROR: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction The SQL being executed was: INSERT INTO `matrixblocks_owners` (`blockId`, `ownerId`, `sortOrder`)  SELECT `o`.`blockId`, '2130120', `o`.`sortOrder`  FROM `matrixblocks_owners` AS `o` INNER JOIN `matrixblocks` AS `b` ON `b`.`id` = `o`.`blockId` AND `b`.`primaryOwnerId` = '786642' AND `b`.`fieldId` = '77' WHERE `o`.`ownerId` = '786642'
[2023-04-19T10:00:17+00:00] web.ERROR: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction The SQL being executed was: INSERT INTO `matrixblocks_owners` (`blockId`, `ownerId`, `sortOrder`)  SELECT `o`.`blockId`, '2130945', `o`.`sortOrder`  FROM `matrixblocks_owners` AS `o` INNER JOIN `matrixblocks` AS `b` ON `b`.`id` = `o`.`blockId` AND `b`.`primaryOwnerId` = '786642' AND `b`.`fieldId` = '77' WHERE `o`.`ownerId` = '786642'

And then the next DB error, nearly 23 minutes later, an INSERT failing due to index corruption on matrixblocks:

[2023-04-19T10:22:50+00:00] web.ERROR: SQLSTATE[HY000]: General error: 180 InnoDB: Index matrixblocks_fieldId_idx for table `live-db`.`matrixblocks` is marked as corrupted The SQL being executed was: INSERT INTO `matrixblocks_owners` (`blockId`, `ownerId`, `sortOrder`)  SELECT `o`.`blockId`, '2133590', `o`.`sortOrder`  FROM `matrixblocks_owners` AS `o` INNER JOIN `matrixblocks` AS `b` ON `b`.`id` = `o`.`blockId` AND `b`.`primaryOwnerId` = '694109' AND `b`.`fieldId` = '49' WHERE `o`.`ownerId` = '694109'

And then shortly afterwards, a deluge of errors like this coming from the front-end:

[2023-04-19T10:22:50+00:00] web.ERROR: SQLSTATE[HY000]: General error: 180 InnoDB: Index matrixblocks_fieldId_idx for table `live-db`.`matrixblocks` is marked as corrupted The SQL being executed was: INSERT INTO `matrixblocks_owners` (`blockId`, `ownerId`, `sortOrder`)  SELECT `o`.`blockId`, '2133590', `o`.`sortOrder`  FROM `matrixblocks_owners` AS `o` INNER JOIN `matrixblocks` AS `b` ON `b`.`id` = `o`.`blockId` AND `b`.`primaryOwnerId` = '694109' AND `b`.`fieldId` = '49' WHERE `o`.`ownerId` = '694109'
[2023-04-19T10:24:34+00:00] web.ERROR: SQLSTATE[HY000]: General error: 1712 Index matrixblocks is corrupted The SQL being executed was: SELECT `matrixblocks_owners`.`ownerId` AS `source`, `matrixblocks`.`id` AS `target` FROM `matrixblocks` `matrixblocks` INNER JOIN `matrixblocks_owners` `matrixblocks_owners` ON (`matrixblocks_owners`.`blockId` = `matrixblocks`.`id`) AND (`matrixblocks_owners`.`ownerId` IN (785346, 2097636, 784833, 2000431)) WHERE `matrixblocks`.`fieldId`=177 ORDER BY `matrixblocks_owners`.`sortOrder`

These keep going until I run ALTER TABLE matrixblocks ENGINE=InnoDB; which takes a few seconds, temporarily fixes the issue but it comes back soon after, related to how much activity is in the CMS.

Note: the last 2 errors are slightly different but refer to the same thing it seems.

angrybrad commented 1 year ago

@rob-baker-ar

Rebuild from above done

Does that mean you rebuilt the database container/VM so that the new database was started from a fresh MySQL 8.0.30+ install and not an update?

rob-c-baker commented 1 year ago

@angrybrad No - I dropped all db tables and re-imported an SQL dump. Database version is as-above: 10.6.12-MariaDB-0ubuntu0.22.04.1 which is the latest version for a mainline Ubuntu Server 22.04 install.

It may be possible to switch to MySQL 8.x but that's not a trivial action - the database server has a number of other databases for other purposes and quite a few apps would need adjusting / testing so making the switch would need to be evidence based for justification to the business.

I have since found this issue which has very similar effects and refers to a potential bug that's still open but seemingly very elusive:

https://jira.mariadb.org/browse/MDEV-30531

However, what's going through my mind is that the DB server is on a mainline version of a popular server OS I should not be the only one that has this issue.

angrybrad commented 1 year ago

@rob-baker-ar

However, what's going through my mind is that the DB server is on a mainline version of a popular server OS I should not be the only one that has this issue.

Ubuntu 22 also shipped with MySQL 8.0.29 (which had the bug in it) for a decent amount of time, causing all sorts of headaches before they bumped it to 8.0.30. It wouldn’t surprise me if MariaDB has the same underlying issue, and Ubuntu also shipped with that for a while.

Assuming it’s the same problem, I wouldn’t switch to MySQL 8 but a newer MariaDB version (10.7 up to 10.11) and see if the behavior changes. And I also wouldn’t do an upgrade from your existing version but go straight to the newly provisioned version.

rob-c-baker commented 1 year ago

Update: all our sites are now on MariaDB 10.11.x and have been for a few weeks. This was not done as an upgrade. A fresh MariaDB install was set-up into which we pushed SQL dumps (as opposed to upgrading our installs relying on the DB to update it's in place data files).

We have not seen the headline issue since but the needs of the various sites have not dictated the more intensive content editing by several simultaneous users that we saw to trigger this in the first place.

rob-c-baker commented 1 year ago

This happened again with our fresh MariaDB 10.11 install and newly created databases (from SQL dump). It was the matrixblocks table this time.

As far as I am aware, nothing notable occurring in the CMS as was the case prior: only 1 person editing content at the time rather than multiple.

Logs show similar lines to https://github.com/craftcms/cms/issues/13021#issuecomment-1514661503

Exact same hack to restore functionality worked: ALTER TABLE matrixblocks ENGINE=InnoDB;.

I am starting to think about setting up a cron task that does nothing except run this SQL periodically. This is not a pleasant thought!

angrybrad commented 1 year ago

Assuming it’s the same problem, I wouldn’t switch to MySQL 8 but a newer MariaDB version (10.7 up to 10.11) and see if the behavior changes.

It’s been four months since I said that, but I’d like to retract it. :P

We haven’t had any other reports of this specific issue, but we’ve had plenty of other odd MariaDB issues we’ve spent countless hours debugging.

So much so that we’ve removed MariaDB from our recommended system specs for 4.x, and we'll be removing it as a supported database engine from Craft 5.

https://craftcms.com/docs/4.x/requirements.html#recommended-system-specs

And we’re using MySQL 8 as the default database engine for new DDEV Craft projects.

https://craftcms.com/knowledge-base/craft-database-options-and-best-practices#mariaDB

rob-c-baker commented 1 year ago

I understand that view. From our point of view it's a bit frustrating.

From the following, it looks like our issue is being experienced out in the wild (but not necessarily in Craft installs) and there is a fix coming down the line https://jira.mariadb.org/browse/MDEV-30531, although the point release with the fix is not out for our major version yet.

We originally started using MariaDB a few years ago based on the concept that it is significantly faster than MySQL and still open source. Based on my most recent tests/research the speed thing is still true.

We don't currently have good experience with PostgreSQL available in house, although it's looking like that might be the preferable option for us if we were only considering our Craft applications - in reality we have a number of other apps that rely on the fact that our Craft installs use MySQL / MariaDB databases so doing this sort of transfer is far from straightforward (custom Craft modules using secondary databases that other apps also query).

At least gives us a fair amount of notice that we need to get to grips with this before Craft 5.

angrybrad commented 1 year ago

@rob-baker-ar all fair. Random MariaDB bugs aside, you might want a backup plan for MariaDB, regardless.

https://medium.com/@imashadowphantom/mariadb-com-is-dead-long-live-mariadb-org-b8a0ca50a637

https://www.theregister.com/2023/07/05/mariadb_stock_exchange_compliance/

fvwanja commented 1 year ago

Hey there 👋 , just wanted to join the conversation to let you know, you are not allone 😉 So we also get the 1712.

MicrosoftTeams-image (4)

This error occurs on a specific page (/admin/entries/pages/303631-latin-america-office/) and also in the Queue Manager at /admin/utilities/queue-manager/370330 when trying to prune exta versions of that entry.

Bug was reported by our client on 23.10.2023.

Application Info

Plugins

We just ran some updates

We also did:

./craft db/backup
./craft db/restore

and cleaned the DB cache. We also ran ./craft gc/run.

We are currently waiting for feedback by our client. After the update we as developers could access the broken site (/admin/entries/pages/303631-latin-america-office/) without any error messages – but experience has taught us that this error could occur again.

We are currently working on updating MariaDB to >= 10.6.16, which should in theory fix that MariaDB Bug (https://jira.mariadb.org/browse/MDEV-30531) that was mentioned https://github.com/craftcms/cms/issues/13021#issuecomment-1681856883.

We will be back with updates ✌️

rob-c-baker commented 1 year ago

In way of an update from our point of view, the plan over here now is to side-step the issue and move to MySQL 8.x. I have been experimenting with switching between MariaDB and dockerised MySQL - doing it this way has meant I can test / prove the system by having both running, with different data directories and listening on different port numbers. Switching between them for testing is trivial.

It's not great seeing how much of a performance difference there is. We used to have query cache switched on for MariaDB which was concealing things a bit, but switching that off and levelling the playing field still results in around a 15-30% drop in performance depending what particular CMS screen / front end page we are testing. Our TTFB timings are not good enough yet. I am sure that the larger differences are down to things specific to our sites, especially on our front-end(s), so we are working through all the issues we can before we do the move.

There are few technical advantages to doing this switch, a somewhat tenuous one of note is one that the MySQL docs specifically talk about: consistency. MariaDB may query things faster on average but times for each specific query may vary a lot, especially with the query cache active. MySQL is far more consistent in it's query times.

fvwanja commented 1 year ago

Hello @rob-baker-ar ,

thank you for further information!

I would appreciate it, if you could warn users about that issue on /admin/utilities/system-report/ at „Requirements”, so that we all know that MariaDB is a poor choice for Craft CMS in general.

Also adjust the information here: https://craftcms.com/docs/4.x/requirements.html

brandonkelly commented 1 year ago

I have been experimenting with switching between MariaDB and dockerised MySQL

@rob-baker-ar Is MariaDB dockerized as well? If not, that could explain the performance difference.

@fvwanja We’ll discuss that internally, thanks for the suggestion.

angrybrad commented 1 year ago

@rob-baker-ar

but switching that off and levelling the playing field still results in around a 15-30% drop in performance depending what particular CMS screen / front end page we are testing

Are you seeing that on an individual query basis? i.e., find a query executed in the control panel on MariaDB, execute that same query on the MySQL 8 box, and it’s consistently 15-30% slower?

MariaDB and MySQL will have different sets of default configs that affect performance as well - it's probably worth running a SHOW VARIABLES query on both to compare how they are both configured.

rob-c-baker commented 1 year ago

Are you seeing that on an individual query basis?

Yes. It's mostly in that range. Some of the bigger differences are exhibited on the occasionally heavily customised / front-end queries we need, that may not have good index optimisations at the moment. I have been extracting some individual queries as test cases, which I have been running outside of the web context to remove some of the variables related to cache etc. These test cases tend to happen firstly on local dev, then on test env (which is a closer representation of live) and have virtually identical database content.

MariaDB and MySQL will have different sets of default configs that affect performance as well

I am certain there are differences here that are not fully optimised yet. While there are some similar / identical config tweaks to make between the 2, as you say, there are differences too. The way I'm looking at it is that we are making this move anyway, so I just need to get things to a level that is "OK" for now, via whatever mechanism and we can further optimise down the line.

@rob-baker-ar Is MariaDB dockerized as well? If not, that could explain the performance difference.

It's not in our testing environment, but both are dockerised in the local dev environments we use. There is a difference attributable to this fact but based on what I have seen, it's not everything. In all dockerised cases, the data directories for the DB servers exist outside of the container / volumes via bind mounts in an attempt to mitigate performance issues (and because this is possibly how it will need to work when deployed, at least until. the MariaDB part of the equation can be removed.)

fvwanja commented 1 year ago

Short update:

We upgraded to "MariaDB 11.1.2" and are now monitoring whether the error occurs again. Should this be the case, I will give a detailed report here.

Application Info

Plugins