larssn commented 7 years ago

Last Thursday, our entire cluster was brought to its knees, including all the Wordpress sites we host.

We identified Jetpack being indirectly responsible.

Before I get into that, a little background on the relevant part of our setup.

So we're hosted on Amazon, and we use their Elastic File System (EFS) to host our Wordpress files. This allows multiple instances to share the same Wordpress files, and we don't have to worry about instances having different files. The EFS has burstable performance, so if you need a sudden throughput of several GB/sec of data, it can do that. When it does that, you spend what are called Burst Credits. For us, it's pretty important that these don't run out.

We don't use WP Cron, as we require a reliable job scheduler, that runs at the right times. For this we use Cavalcade, which makes sure that the same job doesn't run on multiple instances. A job it does really well.

October 24, we had a credit balance of 32.4TB, and then something happened - not sure if Jetpack was updated, or the stars had the wrong alignment, but 3 days later our burst credits hit zero.

[Image: EFS dying when credits hit 0]

We've since determined that Jetpack has sync jobs running nonstop, which use a lot of disk IO and also a lot of CPU (at zero load, our instance CPUs averaged 21%). We're seeing the jobs jetpack_sync_cron and jetpack_sync_full_cron running a lot. Like every minute... And they take more than a minute to finish, so we ended up seeing 3 of each of these jobs, per site (in our multisite setup), running constantly! This is crazy stuff!
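A rough sketch of why the jobs pile up: if a runner starts a new copy every interval without waiting for the previous copy to finish, the peak number of overlapping copies is roughly the job duration divided by the interval, rounded up. The function name and the 150-second duration are hypothetical; the thread only says the jobs take "more than a minute".

```php
<?php
/**
 * Estimate the peak number of copies of one recurring job that run at
 * the same time when a new copy is started every $interval_seconds,
 * whether or not the previous copy has finished.
 */
function jp5513_overlapping_runs(int $duration_seconds, int $interval_seconds): int {
    if ($interval_seconds <= 0) {
        throw new InvalidArgumentException('Interval must be positive.');
    }
    // A job that runs at all occupies at least one slot.
    return max(1, (int) ceil($duration_seconds / $interval_seconds));
}

// Hypothetical figures: a 150-second job started every 60 seconds.
$per_hook = jp5513_overlapping_runs(150, 60);

// Two hooks per site: jetpack_sync_cron and jetpack_sync_full_cron.
$per_site = 2 * $per_hook;

echo "{$per_hook} per hook, {$per_site} per site\n";
```

Under these assumed numbers the estimate matches the 3 copies of each job per site described above, and it scales linearly with the number of sites in the network.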

Quickfix

If you're reading this, and are in a similar situation, modify sync/class.jetpack-sync-actions.php:

static function sync_allowed() {
    // Quickfix: never allow sync. The original condition is kept below for reference.
    return false;
    /*return ( ! Jetpack_Sync_Settings::get_setting( 'disable' ) && Jetpack::is_active() && ! ( Jetpack::is_development_mode() || Jetpack::is_staging_site() ) )
        || defined( 'PHPUNIT_JETPACK_TESTSUITE' );*/
}

We also changed the job running every minute, to once per hour. If it actually does anything with the above change, don't know. We don't really care at this point. Our main concern is our credits staying up. With the above change, the credits stay at a nice even balance.
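Editing a plugin file is lost on the next plugin update. Since the original sync_allowed() condition reads the 'disable' sync setting, a less invasive variant might be to set that setting instead. This is only a sketch: it assumes Jetpack_Sync_Settings::update_settings() accepts the same 'disable' key that get_setting() reads, which this thread does not confirm, and the function name is hypothetical.

```php
<?php
/**
 * Turn Jetpack sync off through its settings class instead of editing
 * sync/class.jetpack-sync-actions.php.
 *
 * Assumption: update_settings() accepts the 'disable' key that
 * sync_allowed() checks via get_setting( 'disable' ).
 *
 * @return bool True if the setting was written, false if the Jetpack
 *              sync settings class is not loaded.
 */
function jp5513_disable_sync(): bool {
    if (!class_exists('Jetpack_Sync_Settings')) {
        // Not running inside WordPress with Jetpack active: do nothing.
        return false;
    }
    Jetpack_Sync_Settings::update_settings(array('disable' => 1));
    return true;
}
```

On a multisite network the setting would presumably have to be applied once per site, for example from wp shell.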

We're very dependent on Jetpack, and we've always considered it a good plugin. We still do; this is IT, these things happen.

We have to ask though, what can be so important, that you must run these jobs pretty much back to back? If you absolutely must run these sync jobs, then they either require a redesign, or a higher running interval, or a filter, so we, at least, can change the interval ourselves.

Thanks for reading.

TLDR; Jetpack sync jobs + Amazon EFS + external job runner = bad.

jeherve commented 7 years ago

Sorry for all the trouble!

> We're seeing the jobs jetpack_sync_cron and jetpack_sync_full_cron running, a lot. Like every minute

They do indeed run every minute by default.

> what can be so important, that you must run these jobs pretty much back to back?

This synchronization allows your data to be synchronized back with WordPress.com, and consequently gives us more reliable / up to date data to be used with modules relying on WordPress.com, like Publicize, Subscriptions, Related Posts, Stats, and the site management tools on WordPress.com.

> If you absolutely must run these sync jobs, then they either require a redesign, or a higher running interval, or a filter, so we, at least, can change the interval ourselves.

We've made a lot of changes to Sync in the past few weeks to address issues like those you've experienced, so it might be worth giving our current Alpha version a try. It's available here: https://github.com/Automattic/jetpack/archive/master-stable.zip

If you'd rather not use a development version of Jetpack on your production sites, you can use the jetpack_sync_incremental_sync_interval and jetpack_sync_full_sync_interval filters to change the frequency of the synchronization. The 2 filters are available in the current version of Jetpack (4.3.2). Here is how you could change the overall frequency to 10 minutes, for example:

function jp_support_5513_sync_schedule( $schedules ) {
    if ( ! isset( $schedules['10min'] ) ) {
        $schedules['10min'] = array(
            'interval' => 10 * MINUTE_IN_SECONDS,
            'display' => __( 'Every 10 minutes' ),
        );
    }
    return $schedules;
}
add_filter( 'cron_schedules', 'jp_support_5513_sync_schedule' );

function __return_jp_support_5513_10_min() {
    return '10min';
}
add_filter( 'jetpack_sync_incremental_sync_interval', '__return_jp_support_5513_10_min' );
add_filter( 'jetpack_sync_full_sync_interval', '__return_jp_support_5513_10_min' );

If you get the chance to test the development version of Jetpack, let us know how it goes!

larssn commented 7 years ago

Thanks for replying, and don't worry about it. We realise these things happen.

We normally don't run alphas in production, but thought we'd give it a go.

Results

Our nodes immediately saw a big spike in CPU (>50%), and our cluster started scaling.

So we had to revert back to our previous solution.

Neither we, nor our customers use wordpress.com for anything atm. So having a job this heavy, doing a task we don't need, is just unwanted. And tbh, I think we'll just turn it off entirely.

Does this functionality benefit your customers, or you? Because if the answer is the latter, then the right course of action is a redesign. Maybe it just runs unfortunately on our setup, hard for us to say.

Still think nothing could be that urgent, that you'd need to run this job back to back. If it truly is, then a non-php solution might be preferred.

Anyway, thanks for the quick response!

jeherve commented 7 years ago

Thanks for giving that a try!

> Neither we, nor our customers use wordpress.com for anything atm. I think we'll just turn it off entirely.

> Does this functionality benefit your customers, or you?

If your customers do not use any of the modules that rely on WordPress.com, you could activate Jetpack's Development mode.

If, however, some of your customers use Jetpack features like Subscriptions or Publicize, completely disabling sync will be problematic, as posts will stop being sent to their subscribers, or posted to their connected Social Networks.

> Maybe it just runs unfortunately on our setup, hard for us to say.

That's most likely the case here, but since you gave us specific details about your setup we should be able to look into this, understand what happens, and find a way to fix things. However, in order to be able to debug the problem, we would need a few examples of the site URLs affected by the problem so we can check our logs and try to understand why sync is so slow. Could you post a few examples here, or send them to us via this contact form?

Thanks!

larssn commented 7 years ago

If you need more details on the internals of our cluster, let me know.

Our customers are mainly small businesses and restaurants, and none of those use Subscriptions/Publicize. We'll check out development mode.

Here's a few sites running on this setup: https://www.cmiile.com https://www.yesushi.dk https://www.chinawokhouse.dk

What exactly can you gauge from having these?

EDIT: Dev mode is a no-go, as it disables Photon.

EDIT2: Tried the filters from your first reply. Monitored the CPU and the running cron jobs (via WP Crontrol), it seems to have zero effect: The jobs seem to still chain, and actually ignore the fact that they should now only run every 30 min (we set it to 30). So the effect is the same - high CPU.

We also saw this last week when we were scrambling to fix the issue. Initially we hardcoded your 1min job to 3600 sec instead, and it had zero effect. Only the quickfix above had an effect.

And thus we're back at the quickfix.

lezama commented 7 years ago

@larssn thanks for the detailed report.

We are working on a PR (#5528) that adds the possibility to completely opt out from using cron for sync purposes.

It still needs some testing so don't try it on production yet 😅

larssn commented 7 years ago

Exciting stuff. Thanks for the attention, means a lot to us. :)

tillkruss commented 7 years ago

Same issue here using Cavalcade as the cronjob runner. WordPress is hosted on Heroku with DISABLE_WP_CRON set to true.

Running a cronjob every minute would be okay, but the jetpack_sync_cron and jetpack_sync_full_cron cronjobs are multiplying over time and Cavalcade is ending up running dozens of cronjobs simultaneously.
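One way to check whether the sync events really are multiplying is to count scheduled events per hook. The helper below is a minimal sketch with a hypothetical name; it takes an array in the shape WordPress uses for its cron schedule (timestamp, then hook, then event key). Whether _get_cron_array() reflects jobs held by an external runner such as Cavalcade is an assumption here.

```php
<?php
/**
 * Count scheduled events per hook for hooks starting with $prefix.
 *
 * $cron is expected in the shape of the WordPress cron schedule:
 * timestamp => hook => key => event details. Entries that are not
 * arrays (such as a 'version' element) are skipped.
 */
function jp5513_count_events_by_hook(array $cron, string $prefix = 'jetpack_sync_'): array {
    $counts = array();
    foreach ($cron as $hooks) {
        if (!is_array($hooks)) {
            continue;
        }
        foreach ($hooks as $hook => $events) {
            if (strpos((string) $hook, $prefix) !== 0 || !is_array($events)) {
                continue;
            }
            $counts[$hook] = ($counts[$hook] ?? 0) + count($events);
        }
    }
    ksort($counts);
    return $counts;
}

// Inside WordPress, pass in the live schedule.
if (function_exists('_get_cron_array')) {
    print_r(jp5513_count_events_by_hook((array) _get_cron_array()));
}
```

Running this a few minutes apart and comparing the counts would show whether the number of jetpack_sync_* events keeps growing.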


We disabled the jetpack_sync_* cronjobs using the filter below until this is resolved.

add_filter('schedule_event', function ($event) {
    return strpos($event->hook, 'jetpack_sync_') === 0 ? false : $event;
});

larssn commented 7 years ago

Here's Amazon's analysis of our 3 day crisis:

We looked into your file system, and found that during the 3-day window you mentioned, your file system was driving ~600 NFS open operations per second, ~600 NFS close operations per second, ~600 NFS access operations per second, and ~1,700 NFS getattr operations per second. These operations collectively generated metadata throughput of more than ~10-11 MB/s, or ~35-40 GB/hour, which is the level we see on your CloudWatch chart.
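As a quick sanity check of the figures Amazon quoted, the per-second numbers can be totalled and converted to an hourly rate. This is only arithmetic on the values above, assuming decimal units (1 GB = 1000 MB); the function name is hypothetical.

```php
<?php
// Per-second NFS operation counts quoted in Amazon's analysis.
$ops_per_second = array(
    'open'    => 600,
    'close'   => 600,
    'access'  => 600,
    'getattr' => 1700,
);
$total_ops = array_sum($ops_per_second);

/**
 * Convert a throughput in MB/s to GB/hour, using decimal units.
 */
function jp5513_mb_per_s_to_gb_per_hour(float $mb_per_second): float {
    return $mb_per_second * 3600 / 1000;
}

$low  = jp5513_mb_per_s_to_gb_per_hour(10.0);
$high = jp5513_mb_per_s_to_gb_per_hour(11.0);

printf("%d ops/s, %.1f-%.1f GB/hour\n", $total_ops, $low, $high);
```

That works out to about 3,500 metadata operations per second and roughly 36 to 40 GB/hour, in line with the ~35-40 GB/hour range in the quoted analysis.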

lezama commented 7 years ago

@larssn, @tillkruss, we just merged #5528, which turns off using cron for sync purposes by default. Cron-based sync was working great on standard setups, but it was causing nightmares on some particular configurations like yours.

We would love to see if it improves the situation for you all. If you could try https://github.com/Automattic/jetpack/archive/master-stable.zip (the built version of what's in master right now) or wait for the first 4.4-beta1 (it's going to be released very soon) that would be a big help to us.

> TLDR; Jetpack sync jobs + Amazon EFS + external job runner = bad.

@larssn I am still curious to know what was causing the misbehaviour on your setup, how could I replicate a similar stack with the same external job runner?

Thanks for all your help and patience.

larssn commented 7 years ago

@lezama Are you familiar with the AWS stack at all? It would make explaining a lot easier.

lezama commented 7 years ago

@larssn, basic knowledge, but I am pretty sure @gravityrail has the required knowledge to help me if I don't get something :)

larssn commented 7 years ago

@lezama Ok, let's see. I think the bare minimum for a proof of concept would be one EC2 instance (with Ubuntu 16.04.1) with a mounted EFS that has WP on it (mount instructions are included when the EFS is provisioned). You also need a database, which can be pretty much anything. Basically a standard WP setup, with the only difference being that the WP files are on a network file system.

PHP needs to be compiled with pcntl; I'm sure it's possible to find a precompiled build with that (if they aren't all built that way already). Cavalcade-Runner comes with a systemd script, which needs to be placed in /lib/systemd/system and point to where the cavalcade-runner executable is. Finally, WP-CLI.

I think that's the minimum.

larssn commented 7 years ago

@lezama We'll see if we can't squeeze in a test of your update, next week.

lezama commented 7 years ago

Great!

larssn commented 7 years ago

We have a busy Friday today, so might not be able to test it until next week. We'll see how it goes.

larssn commented 7 years ago

@lezama Should we still use https://github.com/Automattic/jetpack/archive/master-stable.zip for our test?

lezama commented 7 years ago

@larssn, Yesterday we shipped a new version, you can just upgrade the plugin from the .org repo.

In the end we still use cron, but we reduced the number of jobs created and also improved the way we unschedule them.

Please, let me know how it goes.

larssn commented 7 years ago

It didn't go well. Our cluster immediately jumped to about 46% CPU utilization. We let the sync run for 20 min to see if the spike was temporary, but it stayed at around 46% throughout.

And so we've turned it off again.

This was with v4.4.1.

@lezama Did you get a test setup with an EFS up and running?

lezama commented 7 years ago

Not yet, it's on our priority list to figure out what's going on here.

tillkruss commented 7 years ago

https://github.com/humanmade/Cavalcade/issues/28 https://github.com/humanmade/Cavalcade/issues/29

lezama commented 7 years ago

Thanks for the links @tillkruss

@larssn have you tried @dd32's patch?

It is possible to completely disable Jetpack cron usage, via wp shell doing:

Jetpack_Sync_Settings::update_settings( array( 'sync_via_cron' => 0 ) );

dd32 commented 7 years ago

Just to follow up on this, since I ran into it..

Honestly, this is 100% Cavalcade's problem and not a Jetpack issue - although Jetpack triggered it, a bunch of other plugins can trigger it too (I don't have a list handy). https://github.com/dd32/Cavalcade/commit/fbd23d263521cd4192598deed98706defb8d9bac is a good temporary patch, but it has shortcomings and Cavalcade needs fixing via humanmade/Cavalcade#29

larssn commented 7 years ago

Thanks for investigating.

EDIT: So we've tried dd32's patch. Unfortunately it made no difference. A proper fix for Cavalcade might be required.

enejb commented 7 years ago

@larssn Does https://github.com/Automattic/jetpack/pull/5879 help with this issue? It is not merged into master yet.

ebinnion commented 7 years ago

#5879 is now merged into master.

As a note, we also merged #5996 which aims to limit the length that jobs run. This change has helped a couple of cases where we were overloading servers when syncing.

larssn commented 7 years ago

@enejb We have sync turned off for now. When 4.5 is officially released, we'll try again.

rmccue commented 7 years ago

As the author of Cavalcade, sorry about this. I also agree that this is a Cavalcade issue, and I think you're fine to close this issue out on the Jetpack end.

larssn commented 7 years ago

@rmccue No worries mate, these things happen.

@lezama I don't mind this being closed. However, if you are working on other related multisite optimisations, it might be relevant to keep it open?

Your call; we've definitely determined where the main problem is.

lezama commented 7 years ago

Thank you all, closing it for the moment.

larssn commented 7 years ago

Just a follow up:

Using the latest version of Jetpack, the extra CPU is insignificant, maybe 5% extra per node, which is great!

Thank you.

earcos commented 7 years ago

@lezama Is the disable sync option coming anytime soon? Thanks a lot and thank you for all your great work 😄

lezama commented 7 years ago

@earcos the option to disable sync via cron has been there for a while now 😅

One can set the blog option jetpack_sync_settings_sync_via_cron to 0 and Jetpack should stop using wp-cron in order to sync.
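A minimal sketch of setting that option for the current site, using the option name given above. The function name is hypothetical, and the guard only exists so the snippet is harmless when run outside WordPress.

```php
<?php
/**
 * Set the jetpack_sync_settings_sync_via_cron option to 0 for the
 * current site, as described in the comment above.
 *
 * @return bool True if update_option() was called, false when the
 *              function is not available (not running in WordPress).
 */
function jp5513_disable_sync_via_cron(): bool {
    if (!function_exists('update_option')) {
        return false;
    }
    update_option('jetpack_sync_settings_sync_via_cron', 0);
    return true;
}
```

On a multisite network this would presumably need to be repeated for each site, since the option is stored per blog.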

Automattic / jetpack

Critical: Using an external scheduler (cron) on AWS EC2 instances causes MASSIVE disk and cpu use. #5513
