OpenMage / magento-lts

Official OpenMage LTS codebase | Migrate easily from Magento Community Edition in minutes! Download the source code for free or contribute to OpenMage LTS | Security vulnerability patches, bug fixes, performance improvements and more.
https://www.openmage.org
Open Software License 3.0
865 stars 436 forks source link

High CPU Usage and MySQL Server Inaccessibility during catalog_product_flat reindexing #3309

Closed Winfle closed 1 year ago

Winfle commented 1 year ago

Preconditions (*)

Open Mage 19.5.0-rc3 Mysql server: AWS Aurora MySQL (3.03.0 - 8.0.26 compatiable) Mysql CPU 100% during reindex process

The issue

Hello, guys I'm encountering a critical issue during the reindexing process of the catalog_product_flat index in my OpenMage project. The problem leads to a spike in CPU usage, causing the MySQL server to become inaccessible. I have also noticed significant waits in SYNCH and IO Mysql processes, which may be related to the issue.

Here are some statistics regarding the problem. These are waits generated by Performance Insights of Aurora SQL:

wait/synch/sxlock/innodb/trx_purge_latch: 55.54 AAS wait/io/table/sql/handler: 21.32 AAS wait/synch/mutex/innodb/fil_system_mutex: 14.02 AAS These statistics indicate abnormal behavior, as the normal total AAS should be under 16.

Waits graph: image

CPU graph: image

From the official documentation: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/ams-waits.sx-lock-hash-table-locks.html I can see, that wait/synch/sxlock/innodb/trx_purge_latch wait event occurs when the workload must access the data in files on disk or when blocks are freed from or added to the buffer pool's LRU list. So, it could be related, to that fact, that indexes are used in non-optimal way.

The only fix for now is to restart mysql server and disable reindex process at all. Maybe somebody faced with this issue before? Any help will be appritiated.

Thanks

pquerner commented 1 year ago

My knowledge about the magento (CE) indexer is limited and possibly outdated (or forgotten), but I have had my fun time with it aswell.

How many products and categories are we talking about in your instance? Is it the stock-indexer or do you use some external module on top of it?

In the past we had to resort to a custom module (I think it was AsyncIndex by Magento-Hackathon, which indexed the products for us in badges (since we had many product changes, which also happened quite often, on a very large catalog). (I think we had something else on top of it, but I am not sure anymore... I will edit it once I remember)

Does the same thing happen on a developer machine, so you can pin-point the issue better perhaps?

luigifab commented 1 year ago

Sorry for my stupid question, is it new with rc3? or do you have the same problem with previous version?

Winfle commented 1 year ago

@pquerner Hey, thank you for the feedback! We have around 368 000 products and around 1180 categories with 4 stores. We have a plugin for another index type - https://www.dnd.fr/magento-patch-how-to-optimize-re-index-processing-time-for-url-rewrite/ (URL Rewrites reindex)

As far I know, the issue appeared around 10 days ago, but we didn't make any change. So it could be related to some internal limit, and then it started to crash..

So, my goal for now is to investigate what is the factor affects MySQL state. Thanks

Winfle commented 1 year ago

Hello, @luigifab. Thank you for your input! It was stable for 3 months, but 10 days before this index started to crash. We migrated to OpenMage 3 month ago, but we haven't noticed this issue until now.

ADDISON74 commented 1 year ago

If the solution worked until 10 days ago, it means that the problem is elsewhere than in OpenMage, provided that the source code has not been modified. I would check the updates of the packages on the server from the last month, especially the version of MySQL.

In such situations, it must be established from the beginning if the problem is from the source code of the framework or from the applications that run it. I would move the whole OpenMage directory and database to a local test environment.

Winfle commented 1 year ago

@ADDISON74 Hello. I think, that is complex issue in terms, that during particular index workload, it can lock some specific resources, that are used by current frontend load. Of cource, it could not be a code issue, but understanding of processes, that happens inside is important, since we have a correlation between particular index and mysql server outrage. I just want to investigate all factors, that can lead to this problem, including codebase. Maybe solution will be just delete not needed products, but I want to know the exact problem. Also, I would like to add, that certain queries that stuck are frontend ones, but I think, they are just "victims" of corrupted / locked state of reindex.

So, this is probably more about specific case issue, but not OpenMage in general. However, it can be useful for other people, because soltion solving generally lays not only in code part area. Appritiate your input!

ADDISON74 commented 1 year ago

In this case we will keep the ticket open for a period of 2 weeks, then if there are no updates we will move it to the Discussions > Q&A section.

Winfle commented 1 year ago

@ADDISON74 thank you for the help!

rvelhote commented 1 year ago

@Winfle Just some brainstorming of things I thought of when reading this:

tmotyl commented 1 year ago

FYI, This PR will help make the issue smaller https://github.com/OpenMage/magento-lts/pull/3267 by limiting the number of entries to index

Winfle commented 1 year ago

@ADDISON74 I will post a solition here tomorrow. I had to use another strategy of reindexing

pquerner commented 1 year ago

The oldest lie on the internet "will post solution tomorrow" :D

Winfle commented 1 year ago

Attribute Optimization: We sorted the attributes by usage and disabled those that were associated with less than 10,000 products. By reducing the number of attributes in the flat table, we minimized the workload during reindexing. (we had 128 attributes and 400k of products)

Temporary Flat Table: (actual fix) Instead of altering the current catalog flat table, we created a temporary flat table to perform the necessary updates. This approach eliminated the CPU lock that occurred during the alteration process. After completing reindex, we renamed the temporary table to replace the current flat table. I think, main problem, is that it uses foreign key from flat to catalog_product_entity table. So, when it performs update, a row lock is occured. As reindex process completes reindex horisontally (one attribute for all products), it makes significant impact.

Disable foreign key checks: Before starting the reindexing process, we temporarily disabled foreign key checks in MySQL. This can help alleviate the row locking issue. However, it's important to note that this approach should be used with caution and only in a controlled environment.

By implementing these steps, we successfully resolved the CPU usage spike and improved the efficiency of the reindexing process for the catalog_product_flat index in our OpenMage project.

Additional tools: we used OtterTune service to monitor performance and provide recommended settings for MySQL server. I think, it was really helpful, despite it didn't make an inpact on the issue. Just a general optiomization for webserwer.

If somebody is interested, I can post this solution here. It's not perfect, but did the trick.

Thanks for the support all!