Pierre-Lannoy / wp-decalog

Capture and log events, metrics and traces on your site. Make WordPress observable - finally!
https://perfops.one/
GNU General Public License v3.0

OPcache logging cannot handle multi-server environments #21

Closed JanThiel closed 2 years ago

JanThiel commented 2 years ago

// UPDATE

Describe the bug

The logging of a manual OPcache restart is triggered frequently because the comparison can only handle a single server. When there are multiple servers (cluster setup), the state stored in the WP site meta only reflects one of them. Comparing it against the state of the other running servers leads to false assertions and false log entries.

https://github.com/Pierre-Lannoy/wp-decalog/blob/33f890c8e9cd40b8cc50e18e412734a877f404c4/includes/listeners/class-phplistener.php#L252

Steps to reproduce the behavior:

  1. Run a single WordPress installation with a shared DB on two (or more) servers
  2. Trigger a manual opcache_reset on one of the servers
  3. The log is then flooded with the false-positive message because manual_restarts => 0 (WP site meta) gets compared with manual_restarts => 1 (return of opcache_get_status(false)), or the other way around, whenever the server holding the other value executes code.

Fix: Store the OPcache status with a server identifier such as the hostname (a minimal sketch of the idea follows below the quickfix).

Quickfix: Restart php-fpm to align the status counter
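For illustration only, a minimal sketch of what keying the stored status per server could look like (this is not DecaLog's actual code; the meta key name and the decalog_opcache_manual_restart hook are made up for the example):

<?php
// Sketch: store and compare the OPcache status per host instead of using a
// single shared value, so each node is only compared with its own snapshot.
$host     = gethostname() ?: 'unknown-host';
$meta_key = 'decalog_php_opcache_' . sanitize_key( $host ); // hypothetical key name

$old = (array) get_site_option( $meta_key, array() );
$new = function_exists( 'opcache_get_status' ) ? opcache_get_status( false ) : false;
$new = is_array( $new ) ? $new : array();

$old_restarts = isset( $old['opcache_statistics']['manual_restarts'] ) ? (int) $old['opcache_statistics']['manual_restarts'] : 0;
$new_restarts = isset( $new['opcache_statistics']['manual_restarts'] ) ? (int) $new['opcache_statistics']['manual_restarts'] : 0;

if ( $new_restarts > $old_restarts ) {
    // Only this node's counter increased, so the warning concerns this node only.
    do_action( 'decalog_opcache_manual_restart', $host ); // hypothetical hook
}

update_site_option( $meta_key, $new );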

Pierre-Lannoy commented 2 years ago

Hi Jan!

And thanks - once again - for this very accurate issue :)

I understand your issue very well and I would like to investigate/discuss 2 ways to solve it.

1 - This event is a false positive. To (in)validate this assumption, I suggest you install (even temporarily) a plugin I've made and let it run for 24-36 hours to see what comes out (analytics mode). For my part, I will review my code to see whether an error on my side could produce a false positive. If it turns out to be a false positive, I will of course fix it!

2 - This is not a false positive. Well, in this case you have a very big issue. Calling opcache_reset() at the frequency you indicate completely destroys your opcode caching strategy. On top of that, it is completely counterproductive: calling this function forces the complete deletion of the cache and thus forces all scripts to be reloaded and reparsed the next time they are hit. Countless CPU cycles lost, and a slower site…

That's exactly why this event is leveled as a "warning". It should not appear more than once or twice a day in the most extreme cases (excluding maintenance). It's not just an info or a notice; its frequency indicates a true issue on your site. As a reminder, warnings are defined as "A significant condition indicating a situation that may lead to an error if recurring or if no action is taken."

In this case (not a false positive), my best advice would be to find the guilty plugin/theme/dev and fix it! DecaLog just played its role: to warn you!

JanThiel commented 2 years ago

@Pierre-Lannoy Thank you very much for your insights. I am all in for identifying root causes instead of hiding the log messages, as the log usually does not lie ;-)

I just installed the opcache-manager plugin to trace back the root cause.

I will get back to you on this in a couple of days when there is enough data to analyze.

Thanks again :)

JanThiel commented 2 years ago

Just FYI: I might actually have a lead... It seems the PHP config for the CLI was missing OPcache, so all cron tasks were running without OPcache whereas all web calls had OPcache available -.- I set the log level back to warning and will see what happens, besides checking the OPcache Manager :-)
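For anyone running into the same thing, a quick standalone check along these lines (the file name is arbitrary), run with the CLI binary, reports whether OPcache is available for the CLI SAPI:

<?php
// check-opcache-cli.php - run as: php check-opcache-cli.php
// Reports whether the OPcache extension is loaded and enabled for this SAPI.
$loaded  = function_exists( 'opcache_get_status' );
$enabled = $loaded && filter_var( ini_get( 'opcache.enable_cli' ), FILTER_VALIDATE_BOOLEAN );

echo 'SAPI: ' . PHP_SAPI . PHP_EOL;
echo 'OPcache extension loaded: ' . ( $loaded ? 'yes' : 'no' ) . PHP_EOL;
echo 'opcache.enable_cli: ' . ( $enabled ? 'on' : 'off' ) . PHP_EOL;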

Pierre-Lannoy commented 2 years ago

Unfortunately, I don't think so... DecaLog is able to dissociate OPcache events between web and CLI activity (as the message you pointed out shows). Note: just to be clear, OPcache Manager will not help you identify where the call comes from; nevertheless, it will clearly tell you whether the event is a false positive generated by DecaLog or a true warning...

JanThiel commented 2 years ago

Hmm. Thanks for the hint. That would have been too easy... anyway. Regarding OPcache Manager: it has been running for ~4h now but the dashboard is totally empty. Is this expected behaviour?

[screenshot: empty OPcache Manager dashboard]

Pierre-Lannoy commented 2 years ago

That's a strange behaviour. Could you confirm you have an operational cron (regardless of how it is "technically" done)?

JanThiel commented 2 years ago

Our logging says yes: [screenshot]

Pierre-Lannoy commented 2 years ago

OK, I just had a deeper look at your screenshot, so forget my last question: availability is OK (so, you have an operational cron ;) ), but "cached files" is a bad value; you should have something around 4K-6K files on a "standard" WP install with some plugins. And, worse: "hit ratio" is uncomputable. The main reason for that may be that OPcache is reset more frequently than every 5 minutes (the frequency of the analytics hits)...
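To illustrate where that ratio comes from (simplified; not the exact OPcache Manager code): it is derived from the cumulative hits/misses counters returned by opcache_get_status(), and those counters go back to zero on every reset, so two samples taken 5 minutes apart that straddle a reset cannot produce a meaningful delta.

<?php
// Simplified illustration: the counters are cumulative since the last
// OPcache (re)start, so a reset between two analytics samples invalidates
// any delta-based hit ratio.
$status = function_exists( 'opcache_get_status' ) ? opcache_get_status( false ) : false;
if ( is_array( $status ) ) {
    $stats = $status['opcache_statistics'];
    $total = $stats['hits'] + $stats['misses'];
    printf(
        "hits=%d misses=%d cumulative hit rate=%.2f%%\n",
        $stats['hits'],
        $stats['misses'],
        $total > 0 ? 100 * $stats['hits'] / $total : 0
    );
}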

Pierre-Lannoy commented 2 years ago

It seems what DecaLog has detected is a true warning... You really have something that resets OPcache more frequently than every 5 minutes.

JanThiel commented 2 years ago

Looks that way :-/ Still no clue what it might be... Anyway, thanks for your support :-) I will close this issue as there is nothing wrong with DecaLog :-)

Pierre-Lannoy commented 2 years ago

Let me know when you find the guilty plugin, I'm very curious about this...

JanThiel commented 2 years ago

Hmm ... strange things. The OPcache Manager dashboard is still empty. Yet I checked the stats that DecaLog writes into the site meta, and those actually look fine to me:

$ wp network meta get 1 decalog_web_server_php_opcache
array (
  'opcache_enabled' => true,
  'cache_full' => false,
  'restart_pending' => false,
  'restart_in_progress' => false,
  'memory_usage' =>
  array (
    'used_memory' => 188010616,
    'free_memory' => 80424840,
    'wasted_memory' => 0,
    'current_wasted_percentage' => 0.0,
  ),
  'interned_strings_usage' =>
  array (
    'buffer_size' => 12582464,
    'used_memory' => 12582464,
    'free_memory' => 0,
    'number_of_strings' => 178676,
  ),
  'opcache_statistics' =>
  array (
    'num_cached_scripts' => 6498,
    'num_cached_keys' => 9007,
    'max_cached_keys' => 65407,
    'hits' => 45844488,
    'start_time' => 1634283969,
    'last_restart_time' => 0,
    'oom_restarts' => 0,
    'hash_restarts' => 0,
    'manual_restarts' => 0,
    'misses' => 6875,
    'blacklist_misses' => 0,
    'blacklist_miss_ratio' => 0.0,
    'opcache_hit_rate' => 99.98500589829794,
  ),
)

In addition to that, there are no new OPcache-related warnings in the log. It seems kind of solved, which is weird because all I changed was the mentioned addition of OPcache to the CLI PHP config. I also added some debug code to the only plugin possibly calling opcache_reset (Yoast SEO, btw; the call is guarded by a condition but relies on versions and an option value), but that hasn't fired so far.
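(For context, a simplified illustration of that kind of debug code - logging a backtrace right before the suspect call; wp_debug_backtrace_summary() is a WordPress core helper:)

<?php
// Simplified illustration of logging a backtrace before the suspect reset.
if ( function_exists( 'opcache_reset' ) ) {
    error_log(
        '[opcache debug] opcache_reset() about to run on ' . gethostname()
        . ' - ' . wp_debug_backtrace_summary()
    );
    opcache_reset();
}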

JanThiel commented 2 years ago

I did some more digging @Pierre-Lannoy and I believe I found the issue.

We have a server cluster running. Thus the state of the OPcache can differ per server, as calls to opcache_reset only reset the OPcache on one server... Yet the comparison does not account for such scenarios, as it only stores one global value. So, depending on the server executing the call, the OPcache check compares the OPcache status of server A with the globally stored information of server B, C or D, etc.

I believe this is also the reason why the Opcache Manager does not work.

Pierre-Lannoy commented 2 years ago

Wow. Thanks for the hint. This would definitely make sense in light of the symptoms… And you're right, none of my OPcache routines are compatible with this architecture scenario. Once again, thanks for this report and its assessment / root cause analysis.

I will fix it in DecaLog for the next release (less than a month now)… and will try to understand how to do it in OPcache Manager - and that will not be so simple, I think :(

Pierre-Lannoy commented 2 years ago

Hi @JanThiel ! Currently working on this issue and I have some questions… Can I assume the instances taking part in the cluster are fully synchronised in terms of local fs, PHP versions, installed modules, etc. (if so, how? - are they k8s nodes?)? If yes, the only "differences" between them are memory, opcode caching and maybe temp fs. Can you confirm this?

Thanks for your help.

JanThiel commented 2 years ago

Hi @Pierre-Lannoy Yes, all nodes are synced in terms of software, config and version of the software. In our scenario all nodes are bare metal. Config Management is done using Ansible.

PHP, OS, Patchlevel and such are all identical on all machines ( if nothing goes wrong ;-) ). Anything else would be insane.

The file system is NOT synced, as network file systems have a massive negative impact on PHP workload performance, as far as I am aware. Keeping the FS in sync is the job of the deployment processes and tools.

Speaking of which: this means the OPcache state might be slightly different on each node. One cannot assume that all nodes build up identically and remain in an identical state at all times. We already saw different values on the nodes while debugging this issue. As written in the issue, an easy way to trigger such a state is opcache_reset(). Or, more generally: any operation executed by software on only one of the nodes.

Hope this helps. If you require any other information, let me know :-)

Pierre-Lannoy commented 2 years ago

YES, this helps 👍🏼 Thanks for this, I will focus my work on "volatile" things (like OPcache) and not on static ones (like versions).

Just a last question: so, in your scenario, all nodes have a different hostname?

JanThiel commented 2 years ago

Yup, each machine has a unique hostname.

Pierre-Lannoy commented 2 years ago

Haha. Thanks a lot. I confirm the 3.3.0 version will fix your issue.

Pierre-Lannoy commented 2 years ago

I'm closing this issue thanks to commit 20ffedc. Don't hesitate to re-open it if needed after the 3.3.0 release.

Thanks again for your report and help.