apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0

Deadlock in shm statistics #1775

Open Dead2 opened 6 years ago

Dead2 commented 6 years ago

This has been a problem in our systems for years now, but we only recently got around to doing some debugging on it. Since the mod_pagespeed EL7 rpm does not include debug symbols, we had to compile it ourselves, which added delay because figuring out how kept getting postponed. Once compiled, it was fired up on one of our production servers (the problem is very hard to reproduce). It took a couple of weeks or so and bam, apache was deadlocked. We got 4 core files; there should probably have been more, but I'm not certain what happened there, as this occurred during the night shift and I was not the one who restarted apache. Due to this bug, mod_pagespeed has been disabled on nearly all our production servers for nearly a year now.

It looks like 3 of the backtraces are trying to obtain a token to optimize a .png, a .css and a .css file respectively, and one backtrace seems to be idle. There should have been more apache processes, but we have no core files for them, so unfortunately we don't know what they were doing at the time. The server was probably running up to ~200 processes at the time of the deadlock. A tiny trickle of connections still got served during the deadlock, probably the lucky ones that got in after a locked connection timed out, and then only if they accessed a vhost without pagespeed (/rewrites) enabled.

This server used mpm-itk.

mpm-itk (despite the name, not actually an MPM) is a module for the prefork MPM that makes each forked child run as root (with limits) until it receives a connection, at which point it knows which vhost, and thus which user, it needs to switch to. This might possibly be a factor in this bug, but we have not found any evidence suggesting so.
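For illustration, a typical mpm-itk vhost ties the vhost to its owner with the AssignUserID directive (the vhost below is a hypothetical example, not our actual config):

    <VirtualHost *:80>
        ServerName site.example
        DocumentRoot /home/site/www
        # mpm-itk: the child drops root and serves this vhost as this user/group
        AssignUserID site site
    </VirtualHost>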

Relevant parts of the global mod_pagespeed config are below. Which filters were enabled for the 3 specific vhosts seen in the backtraces is unknown.

    ModPagespeedEnableCachePurge on
    ModPagespeedCacheFlushPollIntervalSec 5
    ModPagespeedLRUCacheKbPerProcess   768
    ModPagespeedDefaultSharedMemoryCacheKB 262144
    ModPagespeedShmMetadataCacheCheckpointIntervalSec 1800

    ModPagespeedNumRewriteThreads 3
    ModPagespeedNumExpensiveRewriteThreads 1
    ModPagespeedRewriteRandomDropPercentage 10
    ModPagespeedRewriteDeadlinePerFlushMs 16

    ModPagespeedInPlaceResourceOptimization on
    ModPagespeedInPlaceRewriteDeadlineMs 12
    ModPagespeedEnableFilters in_place_optimize_for_browser

root.core

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fdb162d8dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007fdb162d8c98 in __GI___pthread_mutex_lock (mutex=0x7fdac05e5e80) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fdb0e2eab6c in net_instaweb::(anonymous namespace)::PthreadSharedMemMutex::Lock (this=0x563970318368) at pagespeed/kernel/thread/pthread_shared_mem.cc:67
#4  0x00007fdb0dd05670 in net_instaweb::ScopedMutex::ScopedMutex (this=0x7fdab01eece0, mutex=0x563970318368) at ./pagespeed/kernel/base/abstract_mutex.h:50
#5  0x00007fdb0de7dc41 in net_instaweb::MutexedScalar::AddHelper (this=0x563970306ee0, delta=1) at pagespeed/kernel/base/statistics.cc:90
#6  0x00007fdb0e569dfb in net_instaweb::UpDownTemplate<net_instaweb::SharedMemVariable>::AddHelper (this=0x563970306ed8, delta=1)
    at ./pagespeed/kernel/base/statistics_template.h:342
#7  0x00007fdb0de09f54 in net_instaweb::UpDownCounter::Add (this=0x563970306ed8, delta=1) at ./pagespeed/kernel/base/statistics.h:94
#8  0x00007fdb0e28466e in net_instaweb::NamedLockScheduleRewriteController::LockObtained (this=0x5639703f47e8, callback=0x7fdaa800e148, 
    key="rc:rname/ic_rtIOJOYF7lak15A4Fzbf/https://www.SITE1.no/wp-content/uploads/sites/88/2016/11/logo1.png@x@.mss_", named_lock=0x7fdaa8003278)
    at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:93
#9  0x00007fdb0e2880da in net_instaweb::MemberFunction3<net_instaweb::NamedLockScheduleRewriteController, net_instaweb::Function*, std::string const, net_instaweb::NamedLock*>::Run (this=0x7fdaa800eab8) at ./pagespeed/kernel/base/function.h:236
#10 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa800eab8) at pagespeed/kernel/base/function.cc:51
#11 0x00007fdb0e0166b8 in net_instaweb::SchedulerBasedAbstractLock::LockTimedWaitStealOld (this=0x7fdaa8003278, wait_ms=0, steal_ms=30000, callback=0x7fdaa800eab8)
    at pagespeed/kernel/thread/scheduler_based_abstract_lock.cc:204
#12 0x00007fdb0e284a10 in net_instaweb::NamedLockScheduleRewriteController::ScheduleRewrite (this=0x5639703f47e8, 
    key="rc:rname/ic_rtIOJOYF7lak15A4Fzbf/https://www.SITE1.no/wp-content/uploads/sites/88/2016/11/logo1.png@x@.mss_", callback=0x7fdaa800e148)
    at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:167
#13 0x00007fdb0e282ec2 in net_instaweb::(anonymous namespace)::ScheduleRewriteContextImpl::ScheduleRewriteContextImpl (this=0x7fdaa800e118, controller=0x5639703f47e8, 
    callback=0x7fdaa800e0c8) at pagespeed/controller/in_process_central_controller.cc:80
#14 0x00007fdb0e28329f in net_instaweb::InProcessCentralController::ScheduleRewrite (this=0x5639703f47c0, callback=0x7fdaa800e0c8)
    at pagespeed/controller/in_process_central_controller.cc:149
#15 0x00007fdb0de3165e in net_instaweb::RewriteContext::ObtainLockForCreation (this=0x563970452eb8, server_context=0x56396fcb4388, callback=0x7fdaa800dd38)
    at net/instaweb/rewriter/rewrite_context.cc:1440
#16 0x00007fdb0de313a7 in net_instaweb::RewriteContext::OutputCacheMiss (this=0x563970452eb8) at net/instaweb/rewriter/rewrite_context.cc:1408
#17 0x00007fdb0de31193 in net_instaweb::RewriteContext::OutputCacheDone (this=0x563970452eb8, cache_result=0x7fdaa80032e8) at net/instaweb/rewriter/rewrite_context.cc:1380
#18 0x00007fdb0de4618b in net_instaweb::MemberFunction1<net_instaweb::RewriteContext, net_instaweb::RewriteContext::CacheLookupResult*>::Run (this=0x7fdaa8005058)
    at ./pagespeed/kernel/base/function.h:171
#19 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa8005058) at pagespeed/kernel/base/function.cc:51
#20 0x00007fdb0e00bcd3 in net_instaweb::QueuedWorkerPool::Run (this=0x5639703d2d58, sequence=0x5639704124d8, worker=0x56397043d3b8)
    at pagespeed/kernel/thread/queued_worker_pool.cc:157
#21 0x00007fdb0e012ba6 in net_instaweb::MemberFunction2<net_instaweb::QueuedWorkerPool, net_instaweb::QueuedWorkerPool::Sequence*, net_instaweb::QueuedWorker*>::Run (
    this=0x56397042fbf8) at ./pagespeed/kernel/base/function.h:202
#22 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x56397042fbf8) at pagespeed/kernel/base/function.cc:51
#23 0x00007fdb0e0199d6 in net_instaweb::Worker::WorkThread::Run (this=0x56397042bef8) at pagespeed/kernel/thread/worker.cc:85
#24 0x00007fdb0e2ed33b in net_instaweb::PthreadThreadImpl::InvokeRun (self_ptr=0x56397042bfa8) at pagespeed/kernel/thread/pthread_thread_system.cc:121
#25 0x00007fdb162d6e25 in start_thread (arg=0x7fdab01f0700) at pthread_create.c:308
#26 0x00007fdb15e0034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

1.core

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fdb162d8dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007fdb162d8c98 in __GI___pthread_mutex_lock (mutex=0x7fdac05e5e80) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fdb0e2eab6c in net_instaweb::(anonymous namespace)::PthreadSharedMemMutex::Lock (this=0x563970318368) at pagespeed/kernel/thread/pthread_shared_mem.cc:67
#4  0x00007fdb0dd05670 in net_instaweb::ScopedMutex::ScopedMutex (this=0x7fdab01eece0, mutex=0x563970318368) at ./pagespeed/kernel/base/abstract_mutex.h:50
#5  0x00007fdb0de7dc41 in net_instaweb::MutexedScalar::AddHelper (this=0x563970306ee0, delta=1) at pagespeed/kernel/base/statistics.cc:90
#6  0x00007fdb0e569dfb in net_instaweb::UpDownTemplate<net_instaweb::SharedMemVariable>::AddHelper (this=0x563970306ed8, delta=1)
    at ./pagespeed/kernel/base/statistics_template.h:342
#7  0x00007fdb0de09f54 in net_instaweb::UpDownCounter::Add (this=0x563970306ed8, delta=1) at ./pagespeed/kernel/base/statistics.h:94
#8  0x00007fdb0e28466e in net_instaweb::NamedLockScheduleRewriteController::LockObtained (this=0x5639703f47e8, callback=0x7fdaa8002c58, 
    key="rc:rname/cf_rtIOJOYF7lak15A4Fzbf/https://www.SITE2.no/wp-content/plugins/revslider/public/assets/css/settings.css?ver=201821101921@A.@.mssA_", named_lock=
    0x7fdaa8005f58) at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:93
#9  0x00007fdb0e2880da in net_instaweb::MemberFunction3<net_instaweb::NamedLockScheduleRewriteController, net_instaweb::Function*, std::string const, net_instaweb::NamedLock*>::Run (this=0x7fdaa8006ad8) at ./pagespeed/kernel/base/function.h:236
#10 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa8006ad8) at pagespeed/kernel/base/function.cc:51
#11 0x00007fdb0e0166ef in net_instaweb::SchedulerBasedAbstractLock::LockTimedWaitStealOld (this=0x7fdaa8005f58, wait_ms=0, steal_ms=30000, callback=0x7fdaa8006ad8)
    at pagespeed/kernel/thread/scheduler_based_abstract_lock.cc:209
#12 0x00007fdb0e284a10 in net_instaweb::NamedLockScheduleRewriteController::ScheduleRewrite (this=0x5639703f47e8, 
    key="rc:rname/cf_rtIOJOYF7lak15A4Fzbf/https://www.SITE2.no/wp-content/plugins/revslider/public/assets/css/settings.css?ver=201821101921@A.@.mssA_", 
    callback=0x7fdaa8002c58) at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:167
#13 0x00007fdb0e282ec2 in net_instaweb::(anonymous namespace)::ScheduleRewriteContextImpl::ScheduleRewriteContextImpl (this=0x7fdaa8004388, controller=0x5639703f47e8, 
    callback=0x7fdaa8005578) at pagespeed/controller/in_process_central_controller.cc:80
#14 0x00007fdb0e28329f in net_instaweb::InProcessCentralController::ScheduleRewrite (this=0x5639703f47c0, callback=0x7fdaa8005578)
    at pagespeed/controller/in_process_central_controller.cc:149
#15 0x00007fdb0de3165e in net_instaweb::RewriteContext::ObtainLockForCreation (this=0x563970569498, server_context=0x56396fcb4388, callback=0x7fdaa8000c38)
    at net/instaweb/rewriter/rewrite_context.cc:1440
#16 0x00007fdb0de313a7 in net_instaweb::RewriteContext::OutputCacheMiss (this=0x563970569498) at net/instaweb/rewriter/rewrite_context.cc:1408
#17 0x00007fdb0de31193 in net_instaweb::RewriteContext::OutputCacheDone (this=0x563970569498, cache_result=0x7fdaa80008c8) at net/instaweb/rewriter/rewrite_context.cc:1380
#18 0x00007fdb0de4618b in net_instaweb::MemberFunction1<net_instaweb::RewriteContext, net_instaweb::RewriteContext::CacheLookupResult*>::Run (this=0x7fdaa8000d58)
    at ./pagespeed/kernel/base/function.h:171
#19 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa8000d58) at pagespeed/kernel/base/function.cc:51
#20 0x00007fdb0e00bcd3 in net_instaweb::QueuedWorkerPool::Run (this=0x5639703d2d58, sequence=0x563970416188, worker=0x5639705273c8)
    at pagespeed/kernel/thread/queued_worker_pool.cc:157
#21 0x00007fdb0e012ba6 in net_instaweb::MemberFunction2<net_instaweb::QueuedWorkerPool, net_instaweb::QueuedWorkerPool::Sequence*, net_instaweb::QueuedWorker*>::Run (
    this=0x5639703efc18) at ./pagespeed/kernel/base/function.h:202
#22 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x5639703efc18) at pagespeed/kernel/base/function.cc:51
#23 0x00007fdb0e0199d6 in net_instaweb::Worker::WorkThread::Run (this=0x56397047a628) at pagespeed/kernel/thread/worker.cc:85
#24 0x00007fdb0e2ed33b in net_instaweb::PthreadThreadImpl::InvokeRun (self_ptr=0x56397047a6d8) at pagespeed/kernel/thread/pthread_thread_system.cc:121
#25 0x00007fdb162d6e25 in start_thread (arg=0x7fdab01f0700) at pthread_create.c:308
#26 0x00007fdb15e0034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

2.core

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007fdb0e2ed7d7 in net_instaweb::PthreadCondvar::Wait (this=0x5639702f6578) at pagespeed/kernel/thread/pthread_condvar.cc:44
#2  0x00007fdb0dfe27dd in net_instaweb::CheckingThreadSystem::CheckingCondvar::Wait (this=0x5639702f6418) at pagespeed/kernel/base/checking_thread_system.cc:50
#3  0x00007fdb0e0198dc in net_instaweb::Worker::WorkThread::GetNextTask (this=0x5639702f6338) at pagespeed/kernel/thread/worker.cc:64
#4  0x00007fdb0e0199e2 in net_instaweb::Worker::WorkThread::Run (this=0x5639702f6338) at pagespeed/kernel/thread/worker.cc:82
#5  0x00007fdb0e2ed33b in net_instaweb::PthreadThreadImpl::InvokeRun (self_ptr=0x56396fd665c8) at pagespeed/kernel/thread/pthread_thread_system.cc:121
#6  0x00007fdb162d6e25 in start_thread (arg=0x7fdaaf6c8700) at pthread_create.c:308
#7  0x00007fdb15e0034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

4.core

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fdb162d8dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007fdb162d8c98 in __GI___pthread_mutex_lock (mutex=0x7fdac05e5e80) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fdb0e2eab6c in net_instaweb::(anonymous namespace)::PthreadSharedMemMutex::Lock (this=0x563970318368) at pagespeed/kernel/thread/pthread_shared_mem.cc:67
#4  0x00007fdb0dd05670 in net_instaweb::ScopedMutex::ScopedMutex (this=0x7fdab01eece0, mutex=0x563970318368) at ./pagespeed/kernel/base/abstract_mutex.h:50
#5  0x00007fdb0de7dc41 in net_instaweb::MutexedScalar::AddHelper (this=0x563970306ee0, delta=1) at pagespeed/kernel/base/statistics.cc:90
#6  0x00007fdb0e569dfb in net_instaweb::UpDownTemplate<net_instaweb::SharedMemVariable>::AddHelper (this=0x563970306ed8, delta=1)
    at ./pagespeed/kernel/base/statistics_template.h:342
#7  0x00007fdb0de09f54 in net_instaweb::UpDownCounter::Add (this=0x563970306ed8, delta=1) at ./pagespeed/kernel/base/statistics.h:94
#8  0x00007fdb0e28466e in net_instaweb::NamedLockScheduleRewriteController::LockObtained (this=0x5639703f47e8, callback=0x7fdaa8002e78, 
    key="rc:rname/cf_rtIOJOYF7lak15A4Fzbf/https://SITE3.no/wp-content/plugins/woocommerce/assets/css/woocommerce-layout.css?ver=201762116833@A.@amssA_", 
    named_lock=0x7fdaa8004488) at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:93
#9  0x00007fdb0e2880da in net_instaweb::MemberFunction3<net_instaweb::NamedLockScheduleRewriteController, net_instaweb::Function*, std::string const, net_instaweb::NamedLock*>::Run (this=0x7fdaa8005078) at ./pagespeed/kernel/base/function.h:236
#10 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa8005078) at pagespeed/kernel/base/function.cc:51
#11 0x00007fdb0e0166b8 in net_instaweb::SchedulerBasedAbstractLock::LockTimedWaitStealOld (this=0x7fdaa8004488, wait_ms=0, steal_ms=30000, callback=0x7fdaa8005078)
    at pagespeed/kernel/thread/scheduler_based_abstract_lock.cc:204
#12 0x00007fdb0e284a10 in net_instaweb::NamedLockScheduleRewriteController::ScheduleRewrite (this=0x5639703f47e8, 
    key="rc:rname/cf_rtIOJOYF7lak15A4Fzbf/https://SITE3.no/wp-content/plugins/woocommerce/assets/css/woocommerce-layout.css?ver=201762116833@A.@amssA_", 
    callback=0x7fdaa8002e78) at pagespeed/controller/named_lock_schedule_rewrite_controller.cc:167
#13 0x00007fdb0e282ec2 in net_instaweb::(anonymous namespace)::ScheduleRewriteContextImpl::ScheduleRewriteContextImpl (this=0x7fdaa8002de8, controller=0x5639703f47e8, 
    callback=0x7fdaa80009e8) at pagespeed/controller/in_process_central_controller.cc:80
#14 0x00007fdb0e28329f in net_instaweb::InProcessCentralController::ScheduleRewrite (this=0x5639703f47c0, callback=0x7fdaa80009e8)
    at pagespeed/controller/in_process_central_controller.cc:149
#15 0x00007fdb0de3165e in net_instaweb::RewriteContext::ObtainLockForCreation (this=0x56397043f6f8, server_context=0x56396fcb4388, callback=0x7fdaa8000a38)
    at net/instaweb/rewriter/rewrite_context.cc:1440
#16 0x00007fdb0de313a7 in net_instaweb::RewriteContext::OutputCacheMiss (this=0x56397043f6f8) at net/instaweb/rewriter/rewrite_context.cc:1408
#17 0x00007fdb0de31193 in net_instaweb::RewriteContext::OutputCacheDone (this=0x56397043f6f8, cache_result=0x7fdaa80008c8) at net/instaweb/rewriter/rewrite_context.cc:1380
#18 0x00007fdb0de4618b in net_instaweb::MemberFunction1<net_instaweb::RewriteContext, net_instaweb::RewriteContext::CacheLookupResult*>::Run (this=0x7fdaa8000d58)
    at ./pagespeed/kernel/base/function.h:171
#19 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x7fdaa8000d58) at pagespeed/kernel/base/function.cc:51
#20 0x00007fdb0e00bcd3 in net_instaweb::QueuedWorkerPool::Run (this=0x5639703d2d58, sequence=0x563970412a58, worker=0x5639704b9378)
    at pagespeed/kernel/thread/queued_worker_pool.cc:157
#21 0x00007fdb0e012ba6 in net_instaweb::MemberFunction2<net_instaweb::QueuedWorkerPool, net_instaweb::QueuedWorkerPool::Sequence*, net_instaweb::QueuedWorker*>::Run (
    this=0x5639704311c8) at ./pagespeed/kernel/base/function.h:202
#22 0x00007fdb0dfe709a in net_instaweb::Function::CallRun (this=0x5639704311c8) at pagespeed/kernel/base/function.cc:51
#23 0x00007fdb0e0199d6 in net_instaweb::Worker::WorkThread::Run (this=0x5639704e2ac8) at pagespeed/kernel/thread/worker.cc:85
#24 0x00007fdb0e2ed33b in net_instaweb::PthreadThreadImpl::InvokeRun (self_ptr=0x5639704e2b78) at pagespeed/kernel/thread/pthread_thread_system.cc:121
#25 0x00007fdb162d6e25 in start_thread (arg=0x7fdab01f0700) at pthread_create.c:308
#26 0x00007fdb15e0034d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
oschaaf commented 6 years ago

Anything suspicious in the logs that you can correlate?

In any case, the code involving manipulation of locks_currently_held_ in NamedLockScheduleRewriteController looks tricky/complex to me, especially when manipulation of the counter happens simultaneously with other child processes shutting down or restarting.

Looking at it, I think manipulation of locks_currently_held_ can safely be commented out. You could try patching in the following and see if it eliminates the deadlocks: https://github.com/apache/incubator-pagespeed-mod/compare/master...oschaaf-avoid-locks-currently-held

Longer term, I'm thinking that maybe we want to keep the statistics thread-local for manipulation, and periodically sync them with the shared-memory ones (for internal use and display in the admin pages).
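A rough sketch of what I mean (a hypothetical class, not the existing SharedMemVariable/MutexedScalar code): each worker thread owns one buffered counter per statistic and only takes the cross-process mutex when it publishes accumulated deltas:

    #include <cstdint>
    #include <mutex>

    // Hypothetical sketch: one instance per worker thread per statistic.
    class BufferedStat {
     public:
      BufferedStat(int64_t* shm_value, std::mutex* shm_mutex)
          : shm_value_(shm_value), shm_mutex_(shm_mutex) {}

      ~BufferedStat() { Flush(); }  // publish whatever is left on shutdown

      void Add(int64_t delta) {
        local_delta_ += delta;                       // no locking on the hot path
        if (++ops_since_flush_ >= kFlushEvery) Flush();
      }

      void Flush() {
        ops_since_flush_ = 0;
        if (local_delta_ == 0) return;
        std::lock_guard<std::mutex> lock(*shm_mutex_);  // brief, bounded hold
        *shm_value_ += local_delta_;
        local_delta_ = 0;
      }

     private:
      static constexpr int kFlushEvery = 64;  // arbitrary batching threshold
      int64_t local_delta_ = 0;
      int ops_since_flush_ = 0;
      int64_t* shm_value_;     // the value living in the shared-memory segment
      std::mutex* shm_mutex_;  // stand-in for the real PthreadSharedMemMutex
    };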

Dead2 commented 6 years ago

We do not have a snapshot of the logs from this deadlock unfortunately.

We did consider turning statistics off completely, but that would cause ImageMaxRewritesAtOnce to be ignored and potentially turn into a DOS attack vector.

I do wish there were a setting for disabling all "fluff" statistics, keeping only the few things needed for communication between processes. Investigating this also made me wonder why ImageMaxRewritesAtOnce is global while NumRewriteThreads and NumExpensiveRewriteThreads are process-local (so effectively global when using a threaded apache and per-connection when using prefork). So this might be something worth changing long-term too.

But I am confused: the documentation says NumRewriteThreads and NumExpensiveRewriteThreads are per-process, so those should not be able to cause a deadlock across processes. And ImageMaxRewritesAtOnce is global, so it could cause this, but two of the processes above are trying to rewrite .css files and should thus not be affected. So either I am missing something or the documentation is misleading; probably the former. :)

We will try out your patch, but proving the problem is fixed can take a while, since it can take weeks to trigger the bug (sometimes twice in a few days, sometimes once in a month, so it seems very random).

oschaaf commented 6 years ago

It would be great if you can try out the patch and report back if the issue seems to be gone with it. It's probably best if that takes a while :-)

Reading the rest of your comment: This is still experimental and undocumented, but you may find the following pretty interesting: https://github.com/apache/incubator-pagespeed-mod/wiki/Design-Doc:-Central-Controller. The feature is actually available in the latest stable release, and can be enabled by setting the central controller port:

ModPagespeedExperimentalCentralControllerPort 12345
morlovich commented 6 years ago

The stats code should never hold more than one lock, so I can't see how deadlocks would happen. Any chance some process crashes happened?

jmarantz commented 6 years ago

I've always had concerns about using mod_pagespeed with ITK because, IIUC, ITK shuts down processes after the request completes, whereas MPS wants to keep running to complete fetches and optimize resources in background threads.

So what happens when ITK kills a child process with MPS running fetches and optimizations in background threads?

Nothing good I think.

HTTP fetches will be aborted mid-stream and have to be repeated, and the same goes for image optimizations. This wastes CPU and network bandwidth in both MPS and the origin server. LoadFromFile can help here, but will not fully solve the problem.

Eventually optimizations may land in cache and the system will start providing some benefit.

But I think depending on the timing a subprocess could in theory be killed while holding a cross-process lock, resulting in the deadlock you see.

I wonder if there are ways to tell ITK to allow processes to linger for a few hundred requests so MPS can get some work done? That would be the best remedy.

Another option is to give MPS very long deadlines to complete its optimizations, say 5 or 10 seconds. MPS will then block delivery of the HTML until the page is fully optimized, reducing the likelihood that MPS is still active, taking locks, when ITK kills the child process. The drawback is long HTML latency for the first visitor to hit a cold server cache.
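Concretely, that would mean raising the deadline directives already shown in your config, e.g. (values purely illustrative):

    ModPagespeedRewriteDeadlinePerFlushMs 10000
    ModPagespeedInPlaceRewriteDeadlineMs 10000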

Dead2 commented 6 years ago

@morlovich No crashes in relation to this unfortunately.

@jmarantz That is indeed a concern with mpm-itk. But having it live through several connections is impossible, since it cannot re-acquire root permissions once it has given them up, so it has to be a fresh process for each connection. What would help is if it were possible to tell mpm-itk to wait until pagespeed is done processing data before ending the process. I'm not sure how mpm-itk ends the process, but shouldn't mod_pagespeed ideally get notified and be able to release the lock before going away?

An alternative would be to spawn a process (instead of a thread) for handling background rewrites, cache maintenance, etc. (possibly with the optimization that only one such process per vhost exists, launched on demand and optionally quitting after X minutes idle).

This problem cannot easily be solved by a central daemon/process, since it requires permissions to access the various users' files, and permanently running complex code as root is not desirable.

oschaaf commented 6 years ago

@Dead2 To float an idea, how about running a proxy with a processing model more suitable for pagespeed in front of the webserver?

jmarantz commented 6 years ago

I don't think a notification strategy would help: MPS doesn't hold stats locks for very long, and there's no message-handling loop during a stats update in which to receive such a notification. Example code:

https://github.com/apache/incubator-pagespeed-mod/blob/e7f30337cc33d4f9b8873b0b0b073a8f17706bdd/pagespeed/kernel/base/statistics.cc#L105

Spawning a process for background optimizations would be a pretty complex architectural change, and I doubt anyone has the time to do it, though I have actually considered it in the past as an option given the new controller process.

I really don't know much about ITK, but I thought maybe processes could be made to linger for a while handling the requests for a single user, without having to revert to root. I'll consider that avenue dead, but I still propose extending MPS's deadlines to make the failure scenario less likely. That's the easiest thing to try, because it's just tweaking a couple of pagespeed configuration options.

But I really think -- without extensive new development -- @oschaaf's idea of putting a proxy (e.g. https://github.com/We-Amp/ats_pagespeed or ngx_pagespeed) downstream of the ITK-enabled server, is the best one. You'd have to set it up to proxy all the different hosts handled by Apache though, which would of course need to be maintained.

More food for thought for the future: PageSpeed's usage of shared-memory mutexes for stats is not the only solution. Envoy Proxy also has shared-mem stats but uses atomics. That doesn't address histograms or PageSpeed's shared-memory metadata cache, but it might address the most common case of cross-process deadlocks when used with ITK.
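A minimal sketch of that atomics approach (generic C++, not PageSpeed or Envoy code): a lock-free std::atomic counter placed in a shared mapping can be bumped from any child without taking a lock, so a process dying mid-update cannot block the others:

    #include <atomic>
    #include <cstdint>
    #include <cstdio>
    #include <new>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main() {
      // Anonymous shared mapping, inherited across fork() like prefork children.
      void* mem = mmap(nullptr, sizeof(std::atomic<int64_t>),
                       PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (mem == MAP_FAILED) return 1;
      // Works cross-process where std::atomic<int64_t> is lock-free (e.g. x86-64).
      auto* counter = new (mem) std::atomic<int64_t>(0);

      if (fork() == 0) {                            // child: bump the shared stat
        counter->fetch_add(1, std::memory_order_relaxed);
        _exit(0);
      }
      wait(nullptr);                                // parent: also bump it
      counter->fetch_add(1, std::memory_order_relaxed);
      std::printf("count = %lld\n", static_cast<long long>(counter->load()));
      munmap(mem, sizeof(std::atomic<int64_t>));
      return 0;
    }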

Dead2 commented 6 years ago

@oschaaf This has been considered a few times as a workaround, but due to the way things are, it would mean having two layers of proxies, which would further complicate things. Also, it would not be able to load files directly and would have to do HTTP requests instead.

jmarantz commented 6 years ago

What's the other proxy? Or is the ITK-enabled Apache also a proxy?

Are you using LoadFromFile now?

I actually don't see why it shouldn't be LoadFromFile-enabled itself, if all the files are available locally.

Dead2 commented 6 years ago

The other proxy is an Nginx one for load-balancing and caching without touching the application server, and we don't want mod_pagespeed at that level. Thus it would have to be a middle-proxy if anything.

LoadFromFile is in use now, and no other user than the owner is allowed access to the files/folders as a security measure, thus the need for the whole apache process to change user.

But it seems to me that we don't have any concrete evidence that mpm-itk kills a process prematurely. I had a look into the code, and while I don't know the apache code well enough to be sure of anything, from what I saw there are very few cases where mpm-itk will kill the process (mostly error handling).

But if mod_pagespeed can deadlock due to a killed/crashed apache child process, that in itself is not good, is it? It sounds like it needs some way to detect the deadlock and either recover or somehow notify the rest of the system of the situation, instead of creating a DOS situation for the whole server. At the very least, put a timeout on the lock and fail the connection instead of waiting endlessly for something that should take <1ms. Ideally it would know which pid holds the lock and print a warning to that effect to ease debugging, or check whether that pid still exists and remove the lock if the pid is gone.
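One existing primitive along those lines (a sketch of the general technique; as far as I know not something mod_pagespeed does today) is a robust, process-shared pthread mutex: if the owning process dies, the next locker gets EOWNERDEAD instead of blocking forever and can repair the protected state:

    #include <pthread.h>
    #include <cerrno>

    // The mutex must live in shared memory visible to all child processes.
    void InitRobustSharedMutex(pthread_mutex_t* mu) {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
      pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
      pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
      pthread_mutex_init(mu, &attr);
      pthread_mutexattr_destroy(&attr);
    }

    // Returns true once the lock is held, recovering if the owner died with it.
    bool LockWithRecovery(pthread_mutex_t* mu) {
      int rc = pthread_mutex_lock(mu);
      if (rc == EOWNERDEAD) {
        // Previous owner was killed while holding the lock; the protected data
        // may be inconsistent. Repair or reset it here, then mark the mutex
        // usable again for subsequent lockers.
        pthread_mutex_consistent(mu);
        return true;
      }
      return rc == 0;
    }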

oschaaf commented 6 years ago

It may be possible to improve robustness regarding this issue, but considering that no other reports have been filed, hitting it is probably super-rare with the standard processing models.

For completeness, posting one other report I remember, which may have been hitting the same issue: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/pagespeed-dev/54rKRQnZheE/ySk5W6IlBAAJ

(Note: I suspect this report actually involved a customized apache and/or mod_pagespeed)

Dead2 commented 6 years ago

Apache supposedly has hooks to prevent shutdown of the process until dependent tasks have finished; does mod_pagespeed register with these at all? Does mod_pagespeed register any atexit() code, for example? I'd guess this is probably the most robust solution to this problem, and it would also help handle the case of processes killed for whatever reason.

Apache connection-process exit code:

  ap_run_process_connection(c);
  ap_lingering_close(c);
  exit(0);
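For reference, a minimal sketch of how a module can register a child-exit cleanup from its child_init hook (hypothetical names, not mod_pagespeed's actual registration code); whether mpm-itk's exit(0) path destroys the child pool and runs such cleanups is something I have not verified:

    #include "httpd.h"
    #include "http_config.h"
    #include "apr_pools.h"

    /* Runs when the child's pool is destroyed, i.e. as the child process exits. */
    static apr_status_t my_child_cleanup(void* data) {
      /* Hypothetical: finish or abandon background rewrites, release named locks. */
      return APR_SUCCESS;
    }

    static void my_child_init(apr_pool_t* pchild, server_rec* s) {
      apr_pool_cleanup_register(pchild, s, my_child_cleanup, apr_pool_cleanup_null);
    }

    static void my_register_hooks(apr_pool_t* pool) {
      ap_hook_child_init(my_child_init, NULL, NULL, APR_HOOK_MIDDLE);
    }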
jmarantz commented 6 years ago

That certainly sounds interesting. I suppose if the process shutdowns are graceful, that could work. But it may also leave a lot of lingering processes, and could result in a lot of server load depending on timing.

Short term, with the existing code in place, I would use the deadline controls we already have in MPS to try to improve things. Does that seem reasonable?

Dead2 commented 6 years ago

Increasing the deadlines to allow blocking rewrites is unfortunately not acceptable in our case. We are testing out the patch posted earlier that removes the locking; it has been running on a single server for about a week now without any new crashes (which pretty much only means it didn't make things worse, though).

A thought for a long-term fix might be to either have a global limit on how many processes can linger and do cleanup/rewrites in the background, or to set a time limit on how long each one can delay its exit before it is forced.

oschaaf commented 6 years ago

I think that the patch may improve the odds, but it's not going to reliably solve the problem. The code all looks good afaict, and that makes me think that the scenario @jmarantz suggested (a process killed while holding a lock) may be a realistic explanation.

Also, I worked on some shutdown stuff for ngx_pagespeed [1], which I think actually implemented some of what you suggest. Perhaps it is translatable to the httpd world.

One thing I remember, however, is that the stock HTTP fetcher, Serf, is not always able to cancel outstanding fetches in a timely manner. This could still cause trouble. Maybe moving the fetching over to the experimental central controller process could be part of a solution to that. With the image optimization and the fetching happening in another process, we should be able to shut down consistently quickly in all cases.

[1] https://github.com/apache/incubator-pagespeed-ngx/commit/d7f1c0dc4817f153da29ced58882cb0bc9787c2c