Pierre-Lannoy / wp-decalog

Capture and log events, metrics and traces on your site. Make WordPress observable - finally!
https://perfops.one/
GNU General Public License v3.0
64 stars 8 forks source link

WP-CLI Out Of Memory on search-replace actions with large serialized objects - Buffer? #64

Open JanThiel opened 2 months ago

JanThiel commented 2 months ago

Describe the bug Hey @Pierre-Lannoy hope you are doing well :-) This is quite an "early" bug report while we have not yet 100% identified the root cause but having some quite reasonable ideas of it. Hoping for some highly valued input of yours.

The issue: We face(d) massive memory spikes when using WP-CLIs search-replace on tables with many large (~100kb each, several 100s of them) serialized objects. These serialized objects leads WP-CLI to use PHP for the search-replace on these fields. And as such using the WP PHP API (update_... functions) to de-serialize, search-replace and then serialize and update the data again. Performing this on a table of ~80MB size lead to a memory consumption of >20GB ending with Out of Memory killer being triggered. Running the same wp search-replace ... with --skip-plugins=decalog mitigated this issue. The search-replace went through in seconds. No memory spikes, no mentionable memory consumption.

We assume - but again have not verified it yet (sorry!) - the buffer option of Decalog to be the root of this. If this applies to WP-CLI calls as well, it might explain the seen behaviour as the log calls with all the data in it are stored in memory and would only be logged once the CLI "request" ends?! Does that makes sense?

If you agree on this as a possible source, could you suggest a "hotfix" we could try out to ignore the buffer option in CLI sessions? We could try this out and give you feedback on it as we have this issue reproducible ;-)

Notable: We use a redis object cache (Object Cache Pro) which is inflicted as well. Yet when disabling Decalog that component does not seem to cause any issues. And we still use Datadog as target log system.

To Reproduce Steps to reproduce the behavior (not verified yet!):

  1. Fill the wp_postmeta table with many large, serialized and nested objects
  2. Make sure the buffer option in Decalog is enabled
  3. Run wp search-replace old-value new-value wp_postmeta --dry-run

Memory should spike and the operation should take a long time.

Expected behavior No memory spike and only some seconds execution time. Maybe ignore the buffer option in CLI sessions and always flush the log calls directly? This might increase the CLI runtimes slightly but will prevent scaling effects. If these are the case here.

Environment (please complete the following information):

Thank you very much as always :-)

Pierre-Lannoy commented 2 months ago

Hello @JanThiel ! Hope you're doing well too… Thank you so much for this comprehensive report, which will - once again - really help me 😍

I will try to reproduce it and see if it's the buffering mechanism that produces this "effect". My preliminary analysis is that your assumption seems correct… I'm more concerned (if that's the case) about finding a way to solve this problem. My intuition tells me that omitting the buffer is likely to make processing time incredibly long (one API call for each event). Maybe processing bulk-calls (like if the buffer has a "limit" before sending) would be better. Don't know yet.

Anyway, let's start by confirming your analysis.

Can you just tell me if you can see if such a wpcli command produces many events?

Pierre-Lannoy commented 2 months ago

Oh, and by the way, if you really need to execute this command now, the best approach for me seems to pause the logger before the command and reactivate it after. You will be "blind" and I agree it's a dirty workaround. But it is better than other methods I can think of…

JanThiel commented 2 months ago

Hey @Pierre-Lannoy,

very well over here in Germany except of the common spreading stupidity. But let's not jump into politics :-) Thank you very much!

Using the WP-CLI flag --skip-plugins=decalog works fine :-) As it skips loading decalog within the WP bootstrapping it's an easy workaround with no need of manual deactivation of the plugin. So it's an easy and acceptable workaround we added to our standard procedures internally.

Can you just tell me if you can see if such a wpcli command produces many events?

In general thre aren't really any entries within Datadog about these operations as we have the log level reduced in our Prod env. We investigated further today and can confirm that WP-CLI does not use the WP PHP API to update the DB values. It's only happening via SQL and a custom PHP script for serialized data ( https://github.com/wp-cli/search-replace-command/blob/main/src/Search_Replace_Command.php#L539 ).

That does make me wonder where Decalog enters the call stack.

What did cross my mind was the question whether it might be helpful to be able to get the Decalog "outputs" printed out within a CLI session when they happen. Like the --debug flat of WP-CLI. That would allow us to actually identify what it might be that happen in Decalog while the search-replace command runs. Would be interesting for manual app debugging as well in general.

Regarding your thoughts on the buffer and a potential limit: I highly agree that no buffer is no solution either. Especially when HTTP calls are invoked. Kind of a buffer limit (time or memory based?) with forced flushes seems smart to me too on second thought. In addition to the current issues we commonly hit situations in the past where the log output of Decalog was lost whenever a script hit the PHP memory limits and died with OOM as the buffer was unable to be flushed at all. That might be solved by such an approach as well. At least partly. Currently we were kind of blind in these situations. Only the local log files contained the information about the OOM event.

Have a great night,

Jan

Pierre-Lannoy commented 2 months ago

Hello @JanThiel ! I haven't forgotten you ;) So far, I haven't been able to reproduce your problem. I have, however, discovered a memory management problem that could lead to what you're experiencing. I'm still working on it, but I'll be releasing 4.2.0 version in some hours/days (which adds a few new features for those who use Datadog on custom endpoints - not your case) but that won't solve this issue. I plan to fix this issue just as soon the 4.2.0 is released. Sorry for this delay.

What did cross my mind was the question whether it might be helpful to be able to get the Decalog "outputs" printed out within a CLI session when they happen.

There is already a way to do (a sort of) this (but, yes, it requires shmop php extension to be enable): just launch a second terminal and do a wp log tail --level=info. But I understand it is not really what you ask for. Let me think about such a feature…

Have a nice day :)

Pierre-Lannoy commented 1 month ago

Hello @JanThiel 😊 So, here are the news about this issue…

I haven't been able to reproduce your issue. Even with high events-load. But I understand you experience it and so, maybe, others. So, I started from the assumption that the root cause that seemed the most likely was the right one and I implemented:

The default values are buffer size = 1000 events and no buffering for WP CLI. It is possible to change that by revealing in DecaLog options tab the "advanced settings". To do that, just add: add_filter( 'perfopsone_show_advanced', '__return_true' ); where it seems the more appropriate…

But, as I'm not able to reproduce the issue, I have no way to verify this modification fixes it… If you want to test it, it would be marvelous 😬

JanThiel commented 1 month ago

@Pierre-Lannoy Thank you very much - as always. Highly appreciate your thoughtful and forthcoming work :-)

I just needed to confirm that we have reliable test-cases to check against before getting back to you. And I am now comfortable that we have. I just ran some benchmarks we created and have reproducible increased runtimes of wp search-replace of ~30% when decalog is active. With --skip-plugins=decalog the same CLI runs about 30% faster.

Let me know how to give the fix a shot :-)

Pierre-Lannoy commented 3 weeks ago

Hello @JanThiel ! Have you had a chance to try out the fix?

JanThiel commented 3 weeks ago

Still waiting for your reply:

Let me know how to give the fix a shot :-)

Pierre-Lannoy commented 3 weeks ago

OMG! Sorry… The current version in master branch contains this fix. To try it, just checkout the master branch, and use perfopsone_show_advanced hook if you want to tweak/test other parameters. So sorry for the misunderstanding…

JanThiel commented 3 weeks ago

No problem and thank you very much for the clarification! 😀

I will run the test on Monday and get back to you with the result.