Apache stuck indefinitely waiting for PSOL

GoogleCodeExporter commented 9 years ago

I just installed mod_pagespeed on my CentOS 7 server and got tons of errors in
the httpd log within 1 minute. An example error line:
<code>
[Tue Feb 10 11:05:14.311755 2015] [pagespeed:warn] [pid 21132:tid 139634310850304] [mod_pagespeed 1.9.32.3-4448 @21132] Waiting for completion of URL http://exampledomain.com/example-slug/ for 45.001 sec
</code>

All requests produced this error, including image requests.
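
To see how many URLs are affected and how long each one has been stuck, the warnings can be summarized per URL. A minimal sketch, assuming the message format of the example line above; the function name `summarize_waits` is made up for illustration and is not part of mod_pagespeed:

```python
import re
from collections import defaultdict

# Matches the tail of a warning such as:
#   "... Waiting for completion of URL http://host/path for 45.001 sec"
WAIT_RE = re.compile(r"Waiting for completion of URL (\S+) for (\d+(?:\.\d+)?) sec")


def summarize_waits(lines):
    """Return {url: (warning_count, longest_wait_seconds)} for the given log lines."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = WAIT_RE.search(line)
        if match is None:
            continue  # not a "Waiting for completion" line
        url, seconds = match.group(1), float(match.group(2))
        stats[url][0] += 1
        stats[url][1] = max(stats[url][1], seconds)
    return {url: (count, longest) for url, (count, longest) in stats.items()}
```

Passing it an open error log (any iterable of lines works) gives one entry per URL, which makes it easy to tell a handful of slow rewrites apart from requests that never complete.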

My server hardware specs:
* Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz, 8 cores
* 32 GB DDR3 RAM
* 2 x 2 TB SATA 6 Gb/s 7200 rpm HDD (Software-RAID 1) Class Enterprise

Software specs:
Operating system: CentOS Linux 7.0.1406
Kernel: Linux 3.10.0-123.20.1.el7.x86_64 on x86_64

Server Version: Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips mod_fcgid/2.3.9 
PHP/5.6.5 mod_perl/2.0.9dev Perl/v5.16.3
Server MPM: event

What version of the product are you using (please check X-Mod-Pagespeed
header)?
mod-pagespeed-stable-1.9.32.3-4448.x86_64

URL of broken page:
I removed the module after 1 minute of terror. If a Google developer wants to
learn more about the server, mail me an IP address that I can grant permission
to view the mod_info output.

Original issue reported on code.google.com by unsalkor...@gmail.com on 10 Feb 2015 at 9:35

jeffkaufman commented 9 years ago

I don't think this is the problem, but looking at instaweb_in_place_filter I don't think it's quite right. It needs to be prepared to be called multiple times, in case it's processing output from something that's intermittent, but the handling of first and calling ConsiderResponseHeaders looks like it expects to be called only once. Specifically, recorder->ConsiderResponseHeaders() could be called on each bucket brigade instead of only on the first one.

This was added by @morlovich in https://github.com/pagespeed/mod_pagespeed/commit/051b2f91a15b6f1acba8e551e85750e51df5bc96#diff-f315ba73469d44612ff2edcef2c1ce64 , but I don't see how it could be causing the current problem.

eldk commented 9 years ago

Hello,

I haven't tried to reproduce it with Firefox.

But for the messages that are repeated many times, from 5 s to x s, I notice that the UA is almost always (9 times out of 10, maybe more) Chrome and/or Safari.

@jeffkaufman so perhaps this is the reason it can't be reproduced with curl. Could it have something to do with preloading and optimization by the browser?

https://groups.google.com/forum/#!topic/mod-pagespeed-discuss/Fh-phRTPBP8

eldk commented 9 years ago

A - Go to http://www.loband.org/loband/simulator.jsp and try it in Firefox:
1 - you only get messages when the image is being rewritten (it does not yet exist in the mod_pagespeed cache)
2 - when the image has already been rewritten and is in the mod_pagespeed cache: no messages at all (after more tries you sometimes get one warning message for a 5 second wait), but never a long flow of messages.

B - Do the same with Chrome and low bandwidth:
1 - a flow of messages (>2) appears when the image is being rewritten
2 - a flow of messages (>2) appears (every time) even if the image has already been rewritten and is in the mod_pagespeed cache.

Addendum: B-2 should be solved first, because A-2 could be produced by B-2.

Addendum 2: grepping the log for lines other than "waiting ...", I found a few: [error] [mod_pagespeed 1.9.32.3-4448 @22484] ServerContext: 1 leaked_rewrite_drivers on destruction. I don't know if it helps.

jeffkaufman commented 9 years ago

To test instaweb_in_place_filter on being called multiple times I set up my Apache config as:

ProxyPass /proxy http://www.jefftk.com
ProxyPassReverse /proxy http://www.jefftk.com

And then looked at localhost:8080/proxy/test/quiberville-plage-est.jpg?cachebuster=somethingrandom

This did make Apache work over multiple buckets, calling instaweb_in_place_filter repeatedly, but even with --limit-rate all the buckets first move through PageSpeed's instaweb_in_place_filter and so also ApacheFetch quickly before slowly being sent out to the browser.

jeffkaufman commented 9 years ago

I've set an outgoing bandwidth limit on http://www.jefftk.com/test*, but I'm seeing Done() called on the ApacheFetch almost immediately, before all the data gets to the client.

I'm also seeing what might be the same bug or might be different. To reproduce:

1) Set up PageSpeed as proxying, as shown above
2) start fetching localhost:8080/proxy/test/quiberville-plage-est.jpg?a=somethingrandom
3) kill the fetch after it's partially downloaded
4) fetch it again
5) observe that you receive a truncated file
6) in fact, the cache is poisoned with this truncated file, and we'll serve it out every time

I haven't found the culprit in the code yet, but it sounds like we're recording into our cache as if we've done things successfully when we should be aborting the recording.

morlovich commented 9 years ago

That sounds like we're ignoring a Done(false) (or not producing one!)


morlovich commented 9 years ago

Also perhaps instaweb_in_place_filter might be missing some cases to call recorder->Fail?


jeffkaufman commented 9 years ago

Looking at the Apache code, I don't currently see any way for instaweb_in_place_check_headers_filter to know that the request got aborted by the user and so should not be cached.

Separately, it's minorly wrong that instaweb_in_place_check_headers_filter expects that there can only be one EOS bucket. There should be, but "ignore any buckets after the first EOS bucket" is something all modules are supposed to do.
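
To make the points in this part of the thread concrete, here is a rough sketch, in Python rather than the module's C++, of a filter that tolerates being called once per brigade: it looks at the headers only on the first call, ignores anything after the first EOS bucket, and writes to the cache only if the response was not aborted. All names (`InPlaceRecorder`, `filter_brigade`, `EOS`) are invented for illustration and none of this is the actual mod_pagespeed code. The `aborted` flag is exactly the piece of information that, per the comment above, Apache does not currently seem to hand to the filter; it appears here only to show where it would be needed.

```python
EOS = object()  # stand-in for an end-of-stream bucket


class InPlaceRecorder:
    """Hypothetical recorder: buffers a response and caches it only if complete."""

    def __init__(self, cache, url):
        self.cache = cache
        self.url = url
        self.chunks = []
        self.headers_seen = False
        self.cacheable = False
        self.finished = False  # True once EOS or a failure has been handled

    def consider_headers(self, headers):
        # Headers are inspected once, on the first brigade only.
        if not self.headers_seen:
            self.headers_seen = True
            self.cacheable = "no-store" not in headers.get("Cache-Control", "")

    def fail(self):
        # Abort recording: drop the buffered data and never write to the cache.
        self.chunks.clear()
        self.finished = True

    def done(self):
        # Commit the complete body, but only for cacheable responses.
        if self.cacheable:
            self.cache[self.url] = b"".join(self.chunks)
        self.finished = True


def filter_brigade(recorder, headers, brigade, aborted=False):
    """Process one brigade; safe to call repeatedly for the same response."""
    if recorder.finished:
        return  # ignore any buckets after the first EOS (or after a failure)
    recorder.consider_headers(headers)
    if aborted:
        recorder.fail()
        return
    for bucket in brigade:
        if bucket is EOS:
            recorder.done()
            return  # anything after EOS in this brigade is ignored too
        recorder.chunks.append(bucket)
```

With this shape, a fetch killed halfway through ends in `fail()` instead of `done()`, so a truncated body never reaches the cache.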

jeffkaufman commented 9 years ago

Debugging this I found https://github.com/pagespeed/mod_pagespeed/issues/1078

eldk commented 9 years ago

Hello,

I've disabled :

jw Convert Jpeg To Webp
io In-place optimize for browser
rw Recompress Webp

I'll wait and see whether the long-lasting warning messages still occur, other than on the first cache write.

eldk commented 9 years ago

With those 3 options disabled, warning messages appear only while the optimized jpg is being created, which is when they are expected to last. After that, there are no more warnings for Firefox or Chrome, even on a slow connection; the image is also displayed faster in Chrome (up to the speed of a slow connection).

I don't know why, but every time Apache is restarted, the cached images are regenerated. I'd prefer that images that have already been cached are reused, even after Apache is restarted. Is this a feature? Thanks, Eric

PS: @jeffkaufman I have sent you a screenshot: for Chrome, there are no more long-lasting messages on each browser request if the image is in cache, only sometimes a 5 sec waiting message (3 x 5 sec messages for 6 requests for the image).

jeffkaufman commented 9 years ago

I don't know why but every time apache is restarted, cached image are regenerated. I don't know why this behavior. I'd prefer that images that have yet been cached are used again, even if apache is restarted.

This one I understand! You have an explicitly configured shared memory metadata cache (shmmc) enabled. By default there's a shmmc that is write-through, which means writes are slower because they have to go to disk but they also persist across restarts. If you enable one with explicit configuration, as you have, then we turn off write-through for faster writes but we lose persistence.

Yes, this is confusing (though it is documented) and I have a draft change to fix it up that switches us to only write cache snapshots to disk. But that draft change is stuck behind fixing this bug and a couple others.

eldk commented 9 years ago

This one I understand! You have an explicitly configured shared memory metadata cache (shmmc) enabled. By default there's a shmmc that is write-through, which means writes are slower because they have to go to disk but they also persist across restarts. If you enable one with explicit configuration, as you have, then we turn off write-through for faster writes but we lose persistence.

OK, that was the rname folder that had disappeared.

So disabling shmmc will give persistence. Now I understand too. On restart, Apache frees the memory.

Thanks.

eldk commented 9 years ago

Hello, I've disabled ModPagespeedCreateSharedMemoryMetadataCache and enabled convert_jpeg_to_webp,recompress_webp,in_place_optimize_for_browser again. Thank you, Eric

eldk commented 9 years ago

And it's really better: jpg images are now rewritten only once, when they are not in the cache. After being rewritten to webp, every request is served with the cached image.

So I think there is another bug: "when ModPagespeedCreateSharedMemoryMetadataCache is used, a jpg to webp conversion is thrown on every request".

I don't know if memcached would change this, but that is another case.

jeffkaufman commented 9 years ago

when ModPagespeedCreateSharedMemoryMetadataCache is used, a jpg to webp conversion is thrown on every request

Could you say more about this? What about your current setup was telling you it was making webp conversions every time?

eldk commented 9 years ago

Hello,

I've disabled ModPagespeedCreateSharedMemoryMetadataCache.

And now, on the second and later accesses (third state: optimized resource in cache) to an image cached as webp, I no longer get long runs of "waiting for completion ..." messages (more than 2 messages per image). This morning, for an image accessed more than 20 times after it had been rewritten, I got two warning messages only once.

Before, with ModPagespeedCreateSharedMemoryMetadataCache enabled, more than 9 out of 10 accesses threw long runs of "waiting ..." messages (more than 2 messages per image), as if the image were being rewritten, even in the third state (optimized resource in cache). This occurred only with webp images.

CASE A -> Is ModPagespeedCreateSharedMemoryMetadataCache enabled: YES -> are we in the third state (optimized resource in cache): yes -> should we send a webp image: yes -> send it and throw a long sequence of "Warning ..." messages, as if the image were being rewritten although it is in cache (more than 9 times out of 10).

CASE B -> Is ModPagespeedCreateSharedMemoryMetadataCache enabled: NO -> are we in the third state (optimized resource in cache): yes -> should we send a webp image: yes -> send the webp image and throw a sequence of at most 2 warning messages in some circumstances (fewer than 1 in 10 responses).

thanks,

Eric

eldk commented 9 years ago

CASE C -> Is ModPagespeedCreateSharedMemoryMetadataCache enabled: NO -> are we in the third state (optimized resource in cache): yes -> should we send a webp image: NO -> should we send a jpg image: yes -> send the jpg image and throw 4 warnings.

Circumstances:

UA: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2)
Client connection: presumably slow (I have a lot of visitors from low-bandwidth countries, 25%)

Time image served:

low-bandwith-ip - - [14/May/2015:14:11:37 +0200] "GET /IMG/truite_et_saumon_atlantique.jpg HTTP/1.1" 200 506653 "http://www.bing.com/images/search?q=LE+POISSON+SAUMON&qs=ds&form=QBIR" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2)"

Error logs:

[Thu May 14 14:11:42 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @27928] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 5.009 sec
[Thu May 14 14:11:47 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @27928] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 10.128 sec
[Thu May 14 14:11:52 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @27928] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 15.134 sec
[Thu May 14 14:11:57 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @27928] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 20.134 sec

Images in cache, by last modification time:

506530 mai 14 10:09 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.3xrlA59JejXVZ89m41dU.jpg
348170 mai 14 04:27 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.Bejme5ak9LXVZ89m41dU.jpg
307837 mai 14 09:44 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.q-FNdkJK56XVZ89m41dU.webp
393479 mai 14 03:36 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.T5NGePK7znXVZ89m41dU.webp

The same, by last access time:

506530 mai 14 10:34 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.3xrlA59JejXVZ89m41dU.jpg
307837 mai 14 09:45 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.q-FNdkJK56XVZ89m41dU.webp
348170 mai 14 06:10 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.Bejme5ak9LXVZ89m41dU.jpg
393479 mai 14 03:36 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.T5NGePK7znXVZ89m41dU.webp

eldk commented 9 years ago

CASE D (#1081?) -> Is ModPagespeedCreateSharedMemoryMetadataCache enabled: NO -> are we in the third state (optimized resource in cache): yes -> should we send a webp image: yes -> send the webp image -> is the download cancelled (stopped before the image has fully loaded): yes -> the send is not aborted and a stream of warnings is thrown until the process is killed by FCGID.

Circumstances:

UA: iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) GSA/5.2.43972 Mobile/12F70 Safari/600.1.4
Client connection: GPRS/3G/4G or wifi (guessed from the IP)

Time image served:

ip - - [14/May/2015:16:03:25 +0200] "GET /IMG/truite_et_saumon_atlantique.jpg HTTP/1.1" 200 186792 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) GSA/5.2.43972 Mobile/12F70 Safari/600.1.4"

Beginning of warnings:

[Thu May 14 16:03:31 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @2888] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 5 sec

End of warnings:

[Thu May 14 16:10:07 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @2888] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 401.066 sec

Images in cache, by last modification time:

506530 mai 14 10:09 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.3xrlA59JejXVZ89m41dU.jpg
348170 mai 14 04:27 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.Bejme5ak9LXVZ89m41dU.jpg
307837 mai 14 09:44 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.q-FNdkJK56XVZ89m41dU.webp
393479 mai 14 03:36 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.T5NGePK7znXVZ89m41dU.webp

The same, by last access time:

506530 mai 14 10:34 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.3xrlA59JejXVZ89m41dU.jpg
307837 mai 14 09:45 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.q-FNdkJK56XVZ89m41dU.webp
348170 mai 14 06:10 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.Bejme5ak9LXVZ89m41dU.jpg
393479 mai 14 03:36 xtruite_et_saumon_atlantique.jpg.pagespeed.ic.T5NGePK7znXVZ89m41dU.webp

Notes:

1 - Another request was made by another client after CASE D began and before it ended; it succeeded but was served the cached jpg image:

XDSL IP - - [14/May/2015:16:06:18 +0200] "GET /IMG/truite_et_saumon_atlantique.jpg HTTP/1.1" 200 506659 "-" "Mozilla/5.0 (Linux; Android 4.4.2; S1052 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Safari/537.36 GSA/3.3.12.1106182.arm"

One warning was thrown:

[Thu May 14 16:06:02 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @3175] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 5 sec

2 - Not sure that it's related: the next first access to the same image, after case D and note 1, threw one warning but sent a good image:

XDSL IP - - [14/May/2015:16:43:47 +0200] "GET /IMG/truite_et_saumon_atlantique.jpg HTTP/1.1" 200 393608 "http://www.google.fr/imgres?imgurl=http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg&imgrefurl=http://www.opalesurfcasting.net/la_faune_aquatique/la_truite_de_mer_-_salmo_trutta_trutta_article1194.html&h=1125&w=1687&tbnid=lQEFiCFvKhea8M:&zoom=1&tbnh=90&tbnw=135&usg=__Ah3FQsSlPMgw-wnfrzHqfni4D0E=&docid=-PqeWmAXkSWWkM&client=aff-maxthon-newtab" "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.4.3000 Chrome/30.0.1599.101 Safari/537.36"

One warning was thrown:

[Thu May 14 16:43:52 2015] [warn] [mod_pagespeed 1.9.32.3-4448 @6129] Waiting for completion of URL http://www.opalesurfcasting.net/IMG/truite_et_saumon_atlantique.jpg for 5 sec

eldk commented 9 years ago

As I see it now, case D is the main concern: under heavy load there are more and more cancelled requests and more and more waiting, while the cache is still being built. Things degrade rapidly, with hotlinked images (not always legally hotlinked) timing out, plus loads cancelled by users ... I will try to build the cache (IPRO) tonight from local requests.

eldk commented 9 years ago

Hello,

The image cache is still building; I'm doing it slowly. But things are better: fewer cancelled image requests (direct access to the full-size image) and far fewer warnings.

I think that with #1081 it will be perfect.

Thank you,

Eric

jeffkaufman commented 9 years ago

Ok: I think I understand this now. There are two cases where we'll emit this message, one mostly benign and the other because of a kernel bug.

1) When your cache is slow to respond (slow memcached, file cache on nfs) PageSpeed will temporarily print out Waiting for completion of URL messages. To fix this, move your cache to tmpfs or figure out how to get your memcached server to be more responsive.
2) If you're running a Linux kernel that has https://github.com/torvalds/linux/commit/b0c29f79ecea0b6fbcefc999e70f2843ae8306db but not https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0 then there's a race condition in the Linux futex code that underlies our condition variables, where even though the code correctly signals us the kernel keeps us waiting. In this case PageSpeed will permanently [1] print out Waiting for completion of URL messages. To fix this, upgrade your kernel.

More discussion about the futex issue: https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64 and https://news.ycombinator.com/item?id=9542548 . It seems like it's most common on Haswell processors, though people have seen it on others.

[1] I think in some cases (not stock apache) people have modules that will clean up threads if they get stuck and take too long, so you might only see these messages for two minutes or something.

crowell commented 9 years ago

@Cyriltra what is your kernel version? Can you paste the output of uname -a?

jeffkaufman commented 9 years ago

Offline discussion with @jmaessen: the ApacheFetch::Wait() code is probably not vulnerable to the futex bug because of the way it uses TimedWait and looping. We do have other places where we use ordinary Wait, though, which are vulnerable. For example, Worker::WorkThread::GetNextTask calls Wait to wait for there to be some work on its queue, and if that doesn't wake up then our callbacks will never run and we'll get stuck here.
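
The difference between the two waiting styles can be sketched with Python's `threading.Condition`. This is an illustration only, not PageSpeed's C++ code, and `wait_for_done` and the `state` dict are hypothetical names. A plain `wait()` depends entirely on the wakeup being delivered, while a timed wait in a loop re-checks the predicate after every timeout, so a lost wakeup costs at most one interval:

```python
import threading
import time


def wait_for_done(cond, state, interval=5.0, on_timeout=None):
    """Block until state["done"] is true, re-checking every `interval` seconds.

    Even if a notify() is never delivered, the loop notices the flag on the
    next timeout instead of sleeping forever.
    """
    start = time.monotonic()
    with cond:
        while not state["done"]:
            signaled = cond.wait(timeout=interval)
            if not signaled and not state["done"] and on_timeout is not None:
                # Still not done after a full interval: report elapsed time,
                # for example by logging a "still waiting" warning.
                on_timeout(time.monotonic() - start)
```

A worker loop that blocks with an untimed `cond.wait()` has no such safety net: if the wakeup is swallowed, it never looks at its queue again, which matches the kind of stall described above for threads that use ordinary Wait.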

jeffkaufman commented 9 years ago

There are two cases where we'll emit this message, one mostly benign and the other because of a kernel bug.

Looking at people's reports and looking into which systems have the kernel bug there are some systems that don't have the bug that do get stuck indefinitely on Waiting for completion of URL. All of these are using memcached, so I'm going to look deeper into that code.

capn3m0 commented 9 years ago

Same problem here. We have used pagespeed since 2012 and have been experiencing problems for the last 5 days. We updated the whole system (yum update) and something changed after that. We use the latest version of pagespeed with memcached, and since this update the whole server (Apache 2.2, MPM worker) has become unstable. This is how our server dies after the last update: [image: ovh_grafico]

We don't use any of these:

admin-page handling (which I think is not at issue here based on the log messages)
ModPagespeedInPlaceResourceOptimization (on by default starting in 1.9)
ModPagespeedMapProxyDomain

and after this problem started we tried adding

ModPagespeedInPlaceResourceOptimization Off

but nothing changed.

Every little rewrite now takes a lot of CPU, and at the moment we have set PageSpeed off for about 90% of the sites (97 sites in total on the server).

The errors in the log are the same as those of the other users. The main problem is the high CPU load, but there is also a problem with memcached. When these problems began, memcached also started producing a lot of messages like this: aprMemCache::Put error: Could not find specified socket in poll list

I tried removing the latest stable version and reinstalling the last 1.8.x stable, but the problem persists. I also tried reinstalling Apache 2.2, memcached, apc and all the "components" of the web server.

I'm not an expert, but I think this is a bug related to the kernel, or to a library or something similar that it uses for rewriting.

We use CentOS release 6.6 (Final) 2.6.32-504.16.2.el6.x86_64

This is the list of the latest updates to my system before I started having problems with pagespeed:

nss-softokn-freebl-3.14.3-22.el6_6.i686 gio 30 apr 2015 00:34:29 CEST
man-pages-overrides-6.6.3-2.el6.noarch gio 30 apr 2015 00:34:29 CEST
xorg-x11-drv-ati-firmware-7.3.99-2.el6.noarch gio 30 apr 2015 00:34:28 CEST
strace-4.5.19-1.19.el6.x86_64 gio 30 apr 2015 00:34:28 CEST
rsync-3.0.6-12.el6.x86_64 gio 30 apr 2015 00:34:28 CEST
alsa-utils-1.0.22-9.el6_6.x86_64 gio 30 apr 2015 00:34:28 CEST
unzip-6.0-2.el6_6.x86_64 gio 30 apr 2015 00:34:27 CEST
unixODBC-2.2.14-14.el6.x86_64 gio 30 apr 2015 00:34:27 CEST
pigz-2.3.3-1.el6.x86_64 gio 30 apr 2015 00:34:27 CEST
perl-Time-HiRes-1.9721-136.el6_6.1.x86_64 gio 30 apr 2015 00:34:27 CEST
perl-TimeDate-1.16-13.el6.noarch gio 30 apr 2015 00:34:27 CEST
perl-Digest-SHA-5.47-136.el6_6.1.x86_64 gio 30 apr 2015 00:34:27 CEST
perl-CGI-3.51-136.el6_6.1.x86_64 gio 30 apr 2015 00:34:27 CEST
numactl-2.0.9-2.el6.x86_64 gio 30 apr 2015 00:34:27 CEST
libxml2-python-2.7.6-17.el6_6.1.x86_64 gio 30 apr 2015 00:34:27 CEST
tcsh-6.17-25.el6_6.x86_64 gio 30 apr 2015 00:34:26 CEST
grub-0.97-93.el6.x86_64 gio 30 apr 2015 00:34:26 CEST
boost-program-options-1.41.0-25.el6.centos.x86_64 gio 30 apr 2015 00:34:26 CEST
authconfig-6.1.12-19.el6.x86_64 gio 30 apr 2015 00:34:26 CEST
audit-2.3.7-5.el6.x86_64 gio 30 apr 2015 00:34:26 CEST
mailx-12.4-8.el6_6.x86_64 gio 30 apr 2015 00:34:25 CEST
cyrus-sasl-plain-2.1.23-15.el6_6.2.x86_64 gio 30 apr 2015 00:34:25 CEST
xz-lzma-compat-4.999.9-0.5.beta.20091007git.el6.x86_64 gio 30 apr 2015 00:34:23 CEST
fontconfig-2.8.0-5.el6.x86_64 gio 30 apr 2015 00:34:23 CEST
efibootmgr-0.5.4-12.el6.x86_64 gio 30 apr 2015 00:34:23 CEST
e2fsprogs-1.41.12-21.el6.x86_64 gio 30 apr 2015 00:34:23 CEST
dbus-1.2.24-8.el6_6.x86_64 gio 30 apr 2015 00:34:23 CEST
wget-1.12-5.el6_6.1.x86_64 gio 30 apr 2015 00:34:22 CEST
perl-Archive-Tar-1.58-136.el6_6.1.x86_64 gio 30 apr 2015 00:34:22 CEST
pciutils-3.1.10-4.el6.x86_64 gio 30 apr 2015 00:34:22 CEST
nmap-5.51-4.el6.x86_64 gio 30 apr 2015 00:34:22 CEST
cyrus-sasl-md5-2.1.23-15.el6_6.2.x86_64 gio 30 apr 2015 00:34:21 CEST
sudo-1.8.6p3-15.el6.x86_64 gio 30 apr 2015 00:34:20 CEST
gnupg2-2.0.14-8.el6.x86_64 gio 30 apr 2015 00:34:20 CEST
postgresql-libs-8.4.20-2.el6_6.x86_64 gio 30 apr 2015 00:34:19 CEST
php-pecl-msgpack-0.5.6-1.el6.remi.5.4.x86_64 gio 30 apr 2015 00:34:19 CEST
php54-php-pecl-msgpack-0.5.6-1.el6.remi.x86_64 gio 30 apr 2015 00:34:18 CEST
libcgroup-0.40.rc1-15.el6_6.x86_64 gio 30 apr 2015 00:34:18 CEST
rsyslog-5.8.10-10.el6_6.x86_64 gio 30 apr 2015 00:34:17 CEST
libdrm-2.4.52-4.el6.x86_64 gio 30 apr 2015 00:34:17 CEST
cyrus-sasl-2.1.23-15.el6_6.2.x86_64 gio 30 apr 2015 00:34:17 CEST
crda-1.1.3_2014.06.13-1.el6.x86_64 gio 30 apr 2015 00:34:17 CEST
bfa-firmware-3.2.23.0-2.el6.noarch gio 30 apr 2015 00:34:17 CEST
parted-2.1-25.el6.x86_64 gio 30 apr 2015 00:34:16 CEST
openssh-server-5.3p1-104.el6_6.1.x86_64 gio 30 apr 2015 00:34:16 CEST
openssh-clients-5.3p1-104.el6_6.1.x86_64 gio 30 apr 2015 00:34:16 CEST
mdadm-3.3-6.el6_6.1.x86_64 gio 30 apr 2015 00:34:16 CEST
system-config-firewall-tui-1.2.27-7.2.el6_6.noarch gio 30 apr 2015 00:33:39 CEST
ntp-4.2.6p5-3.el6.centos.x86_64 gio 30 apr 2015 00:33:39 CEST
php54-php-pear-1.9.5-9.el6.remi.noarch gio 30 apr 2015 00:33:38 CEST
yum-utils-1.1.30-30.el6.noarch gio 30 apr 2015 00:33:37 CEST
php-pear-1.9.5-10.el6.remi.noarch gio 30 apr 2015 00:33:37 CEST
mysql-server-5.5.43-1.el6.remi.x86_64 gio 30 apr 2015 00:33:37 CEST
mod_ssl-2.2.15-39.el6.centos.x86_64 gio 30 apr 2015 00:33:32 CEST
mod_python-3.3.1-16.el6.x86_64 gio 30 apr 2015 00:33:32 CEST
dhclient-4.1.1-43.P1.el6.centos.1.x86_64 gio 30 apr 2015 00:33:32 CEST
kernel-2.6.32-504.16.2.el6.x86_64 gio 30 apr 2015 00:33:31 CEST
kernel-firmware-2.6.32-504.16.2.el6.noarch gio 30 apr 2015 00:33:24 CEST
libX11-common-1.6.0-2.2.el6.noarch gio 30 apr 2015 00:33:22 CEST
libX11-1.6.0-2.2.el6.x86_64 gio 30 apr 2015 00:33:22 CEST
libevent-last-2.0.22-1.el6.remi.x86_64 gio 30 apr 2015 00:33:21 CEST
dhcp-common-4.1.1-43.P1.el6.centos.1.x86_64 gio 30 apr 2015 00:33:21 CEST
mysql-5.5.43-1.el6.remi.x86_64 gio 30 apr 2015 00:33:20 CEST
httpd-tools-2.2.15-39.el6.centos.x86_64 gio 30 apr 2015 00:33:20 CEST
yum-plugin-fastestmirror-1.1.30-30.el6.noarch gio 30 apr 2015 00:33:18 CEST
yum-3.2.29-60.el6.centos.noarch gio 30 apr 2015 00:33:18 CEST
rpm-python-4.8.0-38.el6_6.x86_64 gio 30 apr 2015 00:33:18 CEST
system-config-firewall-base-1.2.27-7.2.el6_6.noarch gio 30 apr 2015 00:33:17 CEST
ntpdate-4.2.6p5-3.el6.centos.x86_64 gio 30 apr 2015 00:33:16 CEST
nfs-utils-lib-1.1.5-9.el6.x86_64 gio 30 apr 2015 00:33:16 CEST
nfs-utils-1.2.3-54.el6.x86_64 gio 30 apr 2015 00:33:16 CEST
iptables-ipv6-1.4.7-14.el6.x86_64 gio 30 apr 2015 00:33:16 CEST
dracut-kernel-004-356.el6_6.2.noarch gio 30 apr 2015 00:33:16 CEST
dracut-004-356.el6_6.2.noarch gio 30 apr 2015 00:33:16 CEST
openssh-5.3p1-104.el6_6.1.x86_64 gio 30 apr 2015 00:33:10 CEST
udev-147-2.57.el6.x86_64 gio 30 apr 2015 00:33:09 CEST
policycoreutils-2.0.83-19.47.el6_6.1.x86_64 gio 30 apr 2015 00:33:09 CEST
util-linux-ng-2.17.2-12.18.el6.x86_64 gio 30 apr 2015 00:33:08 CEST
iptables-1.4.7-14.el6.x86_64 gio 30 apr 2015 00:33:07 CEST
iproute-2.6.32-33.el6_6.x86_64 gio 30 apr 2015 00:33:07 CEST
initscripts-9.03.46-1.el6.centos.1.x86_64 gio 30 apr 2015 00:33:07 CEST
openldap-2.4.39-8.el6.x86_64 gio 30 apr 2015 00:33:06 CEST
rpm-libs-4.8.0-38.el6_6.x86_64 gio 30 apr 2015 00:33:05 CEST
rpm-4.8.0-38.el6_6.x86_64 gio 30 apr 2015 00:33:05 CEST
libssh2-1.4.2-1.el6_6.1.x86_64 gio 30 apr 2015 00:33:05 CEST
libcurl-7.19.7-40.el6_6.4.x86_64 gio 30 apr 2015 00:33:05 CEST
curl-7.19.7-40.el6_6.4.x86_64 gio 30 apr 2015 00:33:05 CEST
openssl-1.0.1e-30.el6.8.x86_64 gio 30 apr 2015 00:33:04 CEST
mysql-libs-5.5.43-1.el6.remi.x86_64 gio 30 apr 2015 00:33:04 CEST
libpciaccess-0.13.3-0.1.el6.x86_64 gio 30 apr 2015 00:33:02 CEST
hwdata-0.233-11.1.el6.noarch gio 30 apr 2015 00:33:02 CEST
ethtool-3.5-5.el6.x86_64 gio 30 apr 2015 00:33:02 CEST
module-init-tools-3.9-24.el6.x86_64 gio 30 apr 2015 00:33:01 CEST
binutils-2.20.51.0.2-5.42.el6.x86_64 gio 30 apr 2015 00:33:01 CEST
php54-runtime-2.1-1.el6.remi.x86_64 gio 30 apr 2015 00:33:00 CEST
scl-utils-20120927-27.el6_6.x86_64 gio 30 apr 2015 00:32:54 CEST
perl-Test-Harness-3.17-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:54 CEST
perl-ExtUtils-ParseXS-2.2003.0-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:54 CEST
perl-ExtUtils-MakeMaker-6.55-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:54 CEST
perl-devel-5.10.1-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:54 CEST
libxcb-1.9.1-2.el6.x86_64 gio 30 apr 2015 00:32:54 CEST
less-436-13.el6.x86_64 gio 30 apr 2015 00:32:54 CEST
gzip-1.3.12-22.el6.x86_64 gio 30 apr 2015 00:32:54 CEST
procps-3.2.8-30.el6.x86_64 gio 30 apr 2015 00:32:53 CEST
perl-Package-Constants-0.02-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:53 CEST
libtirpc-0.2.1-10.el6.x86_64 gio 30 apr 2015 00:32:53 CEST
shared-mime-info-0.70-6.el6.x86_64 gio 30 apr 2015 00:32:52 CEST
libss-1.41.12-21.el6.x86_64 gio 30 apr 2015 00:32:52 CEST
grubby-7.0.15-7.el6.x86_64 gio 30 apr 2015 00:32:52 CEST
glib2-2.28.8-4.el6.x86_64 gio 30 apr 2015 00:32:52 CEST
e2fsprogs-libs-1.41.12-21.el6.x86_64 gio 30 apr 2015 00:32:52 CEST
device-mapper-persistent-data-0.3.2-1.el6.x86_64 gio 30 apr 2015 00:32:51 CEST
xz-4.999.9-0.5.beta.20091007git.el6.x86_64 gio 30 apr 2015 00:32:50 CEST
keyutils-1.4-5.el6.x86_64 gio 30 apr 2015 00:32:50 CEST
file-5.04-21.el6.x86_64 gio 30 apr 2015 00:32:50 CEST
at-3.1.10-44.el6_6.2.x86_64 gio 30 apr 2015 00:32:50 CEST
perl-IO-Zlib-1.09-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
perl-IO-Compress-Zlib-2.021-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
perl-IO-Compress-Base-2.021-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
perl-Compress-Zlib-2.021-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
perl-Compress-Raw-Zlib-2.021-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
pciutils-libs-3.1.10-4.el6.x86_64 gio 30 apr 2015 00:32:49 CEST
libselinux-utils-2.0.94-5.8.el6.x86_64 gio 30 apr 2015 00:32:49 CEST
freetype-2.3.11-15.el6_6.1.x86_64 gio 30 apr 2015 00:32:49 CEST
dbus-libs-1.2.24-8.el6_6.x86_64 gio 30 apr 2015 00:32:49 CEST
nss-softokn-3.14.3-22.el6_6.x86_64 gio 30 apr 2015 00:32:48 CEST
libuuid-2.17.2-12.18.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
libudev-147-2.57.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
libblkid-2.17.2-12.18.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
krb5-libs-1.10.3-37.el6_6.x86_64 gio 30 apr 2015 00:32:48 CEST
keyutils-libs-1.4-5.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
file-libs-5.04-21.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
elfutils-libelf-0.158-3.2.el6.x86_64 gio 30 apr 2015 00:32:48 CEST
xz-libs-4.999.9-0.5.beta.20091007git.el6.x86_64 gio 30 apr 2015 00:32:47 CEST
pam-1.1.1-20.el6.x86_64 gio 30 apr 2015 00:32:47 CEST
cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 gio 30 apr 2015 00:32:47 CEST
coreutils-8.4-37.el6.x86_64 gio 30 apr 2015 00:32:47 CEST
shadow-utils-4.1.4.2-19.el6_6.1.x86_64 gio 30 apr 2015 00:32:45 CEST
libstdc++-4.4.7-11.el6.x86_64 gio 30 apr 2015 00:32:45 CEST
grep-2.6.3-6.el6.x86_64 gio 30 apr 2015 00:32:45 CEST
coreutils-libs-8.4-37.el6.x86_64 gio 30 apr 2015 00:32:45 CEST
audit-libs-2.3.7-5.el6.x86_64 gio 30 apr 2015 00:32:45 CEST
perl-5.10.1-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:44 CEST
libxml2-2.7.6-17.el6_6.1.x86_64 gio 30 apr 2015 00:32:44 CEST
libselinux-2.0.94-5.8.el6.x86_64 gio 30 apr 2015 00:32:44 CEST
libcom_err-1.41.12-21.el6.x86_64 gio 30 apr 2015 00:32:44 CEST
perl-version-0.77-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:38 CEST
perl-Pod-Simple-3.13-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:38 CEST
perl-Pod-Escapes-1.04-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:38 CEST
perl-Module-Pluggable-3.90-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:38 CEST
perl-libs-5.10.1-136.el6_6.1.x86_64 gio 30 apr 2015 00:32:38 CEST
tzdata-2015d-1.el6.noarch gio 30 apr 2015 00:32:23 CEST
nss-softokn-freebl-3.14.3-22.el6_6.x86_64 gio 30 apr 2015 00:32:23 CEST
bash-4.1.2-29.el6.x86_64 gio 30 apr 2015 00:32:23 CEST
kernel-headers-2.6.32-504.16.2.el6.x86_64 gio 30 apr 2015 00:32:22 CEST
centos-release-6-6.el6.centos.12.2.x86_64 gio 30 apr 2015 00:32:22 CEST
libgcc-4.4.7-11.el6.x86_64 gio 30 apr 2015 00:32:21 CEST

Thanks for any helpful information, and excuse my English :)

I hope to find a solution because, before this problem, as you can see in the attachment, the server was very fast and stable. It never used more than 50% of the CPU, but now it is always at 100%.

Bye

eldk commented 9 years ago

Hello,

I think you should first check that memcached is running and, if it is, run "echo stats | nc ip port", where ip and port are the IP address and port memcached listens on. You should also check your memcached configuration (ip, port, user) and the memcached setting in your mod_pagespeed.conf (ip:port).
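For anyone who prefers a script to nc, here is a minimal Python sketch of the same check. It assumes memcached speaks the standard text protocol on the given address; the function names parse_stats and fetch_stats are made up for this example and are not part of mod_pagespeed or memcached.

```python
import socket


def parse_stats(raw: bytes) -> dict:
    """Turn a memcached 'stats' reply into a {name: value} dict of strings."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str = "127.0.0.1", port: int = 11211,
                timeout: float = 2.0) -> dict:
    """Send 'stats' over the text protocol and read until the END marker."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:  # server closed the connection early
                break
            buf += chunk
    return parse_stats(buf)
```

Calling fetch_stats() with the ip and port from your mod_pagespeed.conf should return a dict containing keys such as "version" and "uptime"; an OSError (connection refused or timeout) suggests memcached is not reachable at that address.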

Eric

capn3m0 commented 9 years ago

Hi eldk, memcached is working fine, and mod_pagespeed is properly configured:

ModPagespeedMemcachedServers 127.0.0.1:11211
ModPagespeedMemcachedTimeoutUs 50000
ModPagespeedMemcachedThreads 1

This is the output of the memcached stats command:

Every 2,0s: echo stats | nc 127.0.0.1 11211 Sun May 31 14:12:33 2015

STAT pid 22027
STAT uptime 42988
STAT time 1433074353
STAT version 1.4.22
STAT libevent 2.0.22-stable
STAT pointer_size 64
STAT rusage_user 12.063166
STAT rusage_system 33.538901
STAT curr_connections 66
STAT total_connections 2771
STAT connection_structures 169
STAT reserved_fds 20
STAT cmd_get 153577
STAT cmd_set 65655
STAT cmd_flush 0
STAT cmd_touch 0
STAT get_hits 83218
STAT get_misses 70359
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 0
STAT incr_hits 0
STAT decr_misses 0
STAT decr_hits 0
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT touch_hits 0
STAT touch_misses 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 676621121
STAT bytes_written 1487768975
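As a quick sanity check on these numbers, the get hit ratio can be computed from cmd_get and get_hits. This is only a sketch: the helper name hit_ratio is invented for this example, and SAMPLE copies just the three relevant counters from the output above.

```python
# Counters copied from the stats output above.
SAMPLE = "STAT cmd_get 153577 STAT get_hits 83218 STAT get_misses 70359"


def hit_ratio(stats_text: str) -> float:
    """Return get_hits / cmd_get from 'STAT name value' text.

    Works whether the stats are one per line or run together on one line,
    because it only looks at whitespace-separated tokens.
    """
    tokens = stats_text.split()
    stats = {}
    for i, tok in enumerate(tokens):
        if tok == "STAT" and i + 2 < len(tokens):
            stats[tokens[i + 1]] = tokens[i + 2]
    gets = int(stats["cmd_get"])
    return int(stats["get_hits"]) / gets if gets else 0.0


print(f"hit ratio: {hit_ratio(SAMPLE):.1%}")
```

For the posted counters this comes to roughly 54% (83218 hits out of 153577 gets), so memcached is answering and being used; on its own the ratio says nothing about the stalls.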

It was configured a long time ago, and only in the last 5 days has it started having problems.

As you can see in the CPU graph, the whole system only started using a lot of CPU around 14:00 on 27 May. Before 27 May nothing changed; memcached was configured 2 years ago and has always worked, and PageSpeed too.

When we disable PageSpeed the whole system works fine with no CPU peaks, but we don't understand what has changed. We updated the whole system on 30 April, but we think (we aren't sure) that the system restart and httpd restart were only executed on 27 May, when the problem started.

Thanks

eldk commented 9 years ago

Hello @jeffkaufman ,

Since my first attempts, I have migrated to a more powerful server: more CPU, RAM and bandwidth (100 Mbps).

I monitored all "waiting for completion ..." messages, and they seem to occur only on "canceled" downloads (partial send/get? of a file). 99.9% are pictures (the larger ones) and some "big" JS files.

I have tried and monitored apache2.2/fcgid/mpm-prefork and apache2/fcgid/mpm-worker, with or without memcached (1.4.24) - libevent 2.0.16-stable: always the same result.

It occurs for files cached in memcached (< 1024 KB) or on disk (>= 1024 KB).

All my pages are generated by PHP and cached (some once a day, others once a week ..., or when content is modified) before mod_pagespeed gets them and serves them to visitors, so the server resource usage is below 5% on average (top).

Thanks,

Eric

eldk commented 9 years ago

Hello @capn3m0 ,

Have you tried disabling memcached, and all other caches except the file cache, for mod_pagespeed?

https://developers.google.com/speed/pagespeed/module/system

That way (for testing), you should be able to restart your web server without losing the files that have already been rewritten.

Eric

capn3m0 commented 9 years ago

Yes, we tried:

Memcached On for Pagespeed | Memcached Daemon On
Memcached Off for Pagespeed | Memcached Daemon On (for other apps that use it)
Memcached Off for Pagespeed | Memcached Daemon Off

but nothing changed.

capn3m0 commented 9 years ago

I ran some tests and I may have found what is causing all this trouble: MMAP.

I'm not an expert, but after disabling MMAP in Apache by adding this to httpd.conf:

EnableMMAP off

part of my problem is solved. No more CPU peaks or generally high CPU load.

I downgraded mod_pagespeed to 1.7.30 and for about 1 hour the whole system was stable with PageSpeed (finally) on.

Afterwards I tried the latest 1.9.32.3, but the high CPU and the error "Waiting for completion of URL http://exampledomain.com/example-slug/" started again.

Now I have downgraded to 1.8.31.6 and the system is stable, with no high CPU and no "Waiting for completion.."

Some problems related to memcached remain, like this kind of error:

AprMemCache::Put error: Could not find specified socket in poll list. (70015)

but the system is stable with PageSpeed on.

I will investigate..

Can someone check whether using "EnableMMAP off" in Apache solves the problem of high CPU load?

jmarantz commented 9 years ago

capn3m0: this is some great detective work. Thanks for your persistence. We hadn't thought of this EnableMMAP issue and will check it out.

Note that InPlaceResourceOptimization existed in 1.8, and might be vulnerable to the same problems as 1.9, but simply did not print a message when it got stuck in that loop. So the problem may be happening without a message.

The AprMemCache issues may simply be timeout tuning issues. The default timeout settings might not be appropriate for your setup, depending on networking delays and whether you have mod_pagespeed communicating directly with standard memcached, or whether you are using a proxy or something else that speaks memcached protocol. You can increase the timeout with ModPagespeedMemcachedTimeoutUs timeout_in_microseconds

One thing I'd love to learn from your setup is to narrow down whether or not 1.9 gives you stable system load with all 8 combinations of:

memcached enabled/disabled (use file cache)
InplaceOptimization enabled/disabled
MMAP enabled/disabled

In other words, please fill out this table:

                                     1.9 is stable Y/N
memcached off, ipro off, mmap off    ____
memcached off, ipro off, mmap on     ____
memcached off, ipro on,  mmap off    ____
memcached off, ipro on,  mmap on     ____
memcached on,  ipro off, mmap off    ____
memcached on,  ipro off, mmap on     ____
memcached on,  ipro on,  mmap off    ____
memcached on,  ipro on,  mmap on     ____
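To avoid missing a combination while testing, the eight rows can be generated rather than typed. The sketch below is only an illustration: the function names are invented, the directive names are the ones already mentioned in this thread, and treating "memcached off" as simply omitting ModPagespeedMemcachedServers (so the file cache is used) is an assumption about your setup.

```python
from itertools import product


def stability_matrix():
    """Return the 8 (memcached, ipro, mmap) combinations in table order."""
    return list(product(("off", "on"), repeat=3))


def directives(memcached: str, ipro: str, mmap: str) -> list:
    """Config lines to try for one row (assumed mapping, adjust to your setup)."""
    lines = [
        f"EnableMMAP {mmap}",
        f"ModPagespeedInPlaceResourceOptimization {ipro}",
    ]
    if memcached == "on":
        # Address taken from the configuration posted earlier in this thread.
        lines.append("ModPagespeedMemcachedServers 127.0.0.1:11211")
    return lines


for memcached, ipro, mmap in stability_matrix():
    print(f"memcached {memcached}, ipro {ipro}, mmap {mmap}  ____")
```

Each printed row corresponds to one restart of Apache with the directives returned by directives(...) for that row.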

Also, when you ran 'yum update', do you know what else got updated besides mod_pagespeed?

Thanks! -Josh

jeffkaufman commented 9 years ago

2.6.32-504.16.2.el6.x86_64

This is the most current 2.6.32 kernel for 6.6, and I just looked at the source and verified that it is not vulnerable to the futex bug (get_futex_key_refs does have default: smp_mb(); /* explicit MB (B) */). So it doesn't look like this is a kernel issue.

jeffkaufman commented 9 years ago

@capn3m0

We don't use nothing of this: ...

ModPagespeedInPlaceResourceOptimization (on by default starting in 1.9)

When you say you don't use ModPagespeedInPlaceResourceOptimization do you mean that when you upgraded to 1.9 you added ModPagespeedInPlaceResourceOptimization off to your config? Otherwise you were actually running it: with the 1.9 upgrade we changed this feature from default-off to default-on.

jeffkaufman commented 9 years ago

@capn3m0

Someone can try if using "EnableMMAP off" in apache solve problem related to cpu high load?

Are you serving static files from NFS that might be edited elsewhere? That's known to cause problems with mmap: http://httpd.apache.org/docs/2.2/misc/perf-tuning.html (This is probably not it, but it's worth asking.) There may also be things PageSpeed does differently when mmap is enabled, as a result of calling lower-level apache functions; I'll look into that.

And to confirm, EnableMMAP off fixed your high cpu usage issues, but it didn't fix your "waiting for completion of fetch" issues, right?

jeffkaufman commented 9 years ago

There may also be things PageSpeed does differently when mmap is enabled, as a result of calling lower-level apache functions; I'll look into that.

It looks like EnableMMAP (via conf->enable_mmap) controls only what happens when Apache needs to load a file into memory (file_bucket_read). When PageSpeed needs to load files into memory, which it does when serving files from its cache, it uses its own file manipulation functions, which don't consider enable_mmap.

giggioman00 commented 9 years ago

Hi guys, any news about this bug? I use the default mod_pagespeed configuration and sometimes I get the Waiting for completion of URL problem too...

How can I fix it? I didn't make any changes to the default mod_pagespeed config.

jmarantz commented 9 years ago

Not yet, although we are working on a patch that will (a) help us debug it, and (b) avoid tying up an Apache process indefinitely.

There are a number of workarounds suggested, depending on your config and details about your symptoms. Questions about your setup:

1. Do you use memcached?
2. If you use memcached, do you have a direct connection to it, or do you go through a proxy?
3. If you are using a file-cache, is it on a local hard disk or a remote-mounted one?
4. Do you use LoadFromFile directives?
5. Do you use MapProxyDomain directives?
6. What is your linux kernel version (there's a linux kernel futex bug https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64 we suspect is involved in some cases)
7. What is your Apache MPM?
8. What other modules do you have installed (e.g. fcgi)?

Questions about your symptoms:

What sort of files are being reported in the WaitingForCompletion messages?
Large ones only, or both large & small
Resources only, or also HTML?
Do your WaitingForCompletion messages keep going indefinitely (1 per second) or does completion eventually occur?
What other error messages do you see in the logs?

giggioman00 commented 9 years ago

1. No
3. PageSpeed uses a cache on the local disk, and the Joomla cache also saves its cached files on the local disk.
4. No
5. This directive doesn't appear in my PageSpeed configuration, so I don't think so.
6. Linux 3.10.0-229.1.2.el7.x86_64
7. prefork
8. Any other modules

Then

Images and page links
I don't use large images on my site...
Images and page links (for example: homepage, category, etc.)
I get 3-4 messages per second
I get two other errors:
1. Could not create directories for file /var/cache/mod_pagespeed/*\ (but permissions are right... I get this problem only when mod_pagespeed goes wild)
2. Fetch timed out:

jmarantz commented 9 years ago

Those directory creation errors are suspicious. Could you be out of disk space? Or inodes? Can you give more details on the permissions of /var/cache/mod_pagespeed and what user the Apache children run as?

giggioman00 commented 9 years ago

Nope, I'm not out of space... My VPS has 23.04 GB of total space but only 6 GB are used. The Apache children run as: user, and the ownership of /var/cache/mod_pagespeed is user:user. Usually I don't get this error. It appears only when the Waiting for completion of URL problem appears.

If possible I want to fix the Waiting for completion of URL problem, but I have to say that it doesn't seem to be a big problem... My VPS goes offline due to this error for only 2-5 minutes every 5-7 days.

jmarantz commented 9 years ago

Can you check if you are out of inodes? Sometimes we run into file-systems' inode limits before we run out of disk, but you can set our inode-limit in the file-cache so we start deleting old cache entries to reach the inode-target and the disk-space target.

df -i should tell you the number of inodes used and free on the file system.
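If it is easier to check from a script (for example from a monitoring job that fires when the site goes down), the same inode numbers are available through statvfs. This is a small sketch, not something PageSpeed provides; the function name inode_usage is invented, and /var/cache/mod_pagespeed is the cache path mentioned in this thread.

```python
import os


def inode_usage(path: str) -> dict:
    """Report inode totals for the filesystem holding `path` (like `df -i`)."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    pct = 100.0 * used / total if total else 0.0
    return {"total": total, "used": used, "free": free, "used_pct": pct}


print(inode_usage("/"))
```

Pointing it at /var/cache/mod_pagespeed instead of / reports the filesystem that actually holds the file cache.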

This is probably a red herring but it's worth checking. Something is preventing you from doing file-cache operations that normally work, and you must be running out of something. Could be open file-descriptors. Could be i-nodes. Not sure yet what else it could be.

Maybe run 'lsof (http://linux.die.net/man/8/lsof)' next time you see these Waiting for Completion messages.

I can think of two things that would cause the Waiting For Completion messages:

Some file-operation blocks indefinitely (as you are not using memcached).
The linux futex bug (have to wait for someone to help look up whether your kernel version is affected).

Could you paste more of your log file? If you want to hide your site just replace your domain name in the file first.

-Josh

giggioman00 commented 9 years ago

This is what I get if I type df -i:

Filesystem       Inodes  IUsed    IFree IUse% Mounted on
/dev/sda1      24165376 120030 24045346    1% /
devtmpfs         254703    318   254385    1% /dev
tmpfs            256195      1   256194    1% /dev/shm
tmpfs            256195    339   255856    1% /run
tmpfs            256195     13   256182    1% /sys/fs/cgroup

But I have to say that today my site didn't go offline.

Next time I'll try with lsof, but it's hard... Usually I get a notification from Pingdom when my VPS goes offline, but usually I'm not online at that time so I cannot check. And when I'm online the problem is gone (as I said, it appears only for 2-5 minutes...)

Here is yesterday's log: http://pastebin.com/c0sBSXLV After that the VPS came back online. The domain and VPS IP are hidden.

eldk commented 9 years ago

Hello,

For one JavaScript file that was giving "Waiting for ..." messages, with the Apache2 debug log on, I got a mod_pagespeed error message saying there is a problem with it. I'm looking for it so I can post it. I have disallowed mod_pagespeed from rewriting that file; not a big concern.

For images, the ones giving "Waiting for ..." messages seem to always be the same, so there is clearly something redundant. And most of them have disappeared from the top positions in Google Images.

I have changed nothing in what is used: IPRO, LoadFromFile ...BandWidthSave. I have given more resources to the domain (which has been moved on its own): 8 GB RAM, 6 cores, 3 GB memcached, 100 Mb bandwidth up and down, for about 100000 Apache requests/day, and the messages still occur.

I will let it run for a few days (without the debug log) and see.

Greetings,

Eric

eldk commented 9 years ago

Hello, this command should help count how many messages there are and see which files are affected:

awk -F'[][]+' '$4 == "warn" {print $7}' /var/log/apache2/error.log | sort | uniq -c | sort -nr | grep -E "for 5\.| 5 sec" | more

See here if explanations needed : http://sudarmuthu.com/blog/how-to-print-unique-errorswith-count-from-apache-error-logs/
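For those who would rather not depend on the exact error-log field layout (it differs between Apache 2.2 and 2.4), here is a rough Python sketch of the same count. It keys on the message text quoted in this thread; the function name count_waiting is invented, and keeping only reports between 5 and 6 seconds is my reading of the grep filter above (so that each stalled request is counted once), which is an assumption.

```python
import re
from collections import Counter

# Matches e.g. "Waiting for completion of URL http://host/x.jpg for 5.001 sec"
WAITING = re.compile(
    r"Waiting for completion of URL (\S+) for (\d+(?:\.\d+)?) sec"
)


def count_waiting(lines, first_report_s: float = 5.0):
    """Count stalled URLs, using only the first report (about 5 s) per stall."""
    counts = Counter()
    for line in lines:
        match = WAITING.search(line)
        if not match:
            continue
        elapsed = float(match.group(2))
        if first_report_s <= elapsed < first_report_s + 1:
            counts[match.group(1)] += 1
    return counts.most_common()  # most frequent URL first
```

Typical use would be to open /var/log/apache2/error.log and pass the file object to count_waiting, then print the (url, count) pairs it returns.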

In my case it appears most frequently on some pictures (though the single file producing the most messages was a JavaScript file). But I also found, much less frequently, some HTML and some small image assets.

I will let it run for a few days.

Greetings,

Eric

crowell commented 9 years ago

We've built a test release with @jeffkaufman's fixes to the ApacheFetch.

If you've been experiencing this bug, please give it a try and let us know if it fixes the issue for you.

The packages can be found here https://github.com/pagespeed/mod_pagespeed/releases/tag/1.9.32.5

Thanks in advance for your feedback!

jeffkaufman commented 9 years ago

Is anyone still having this bug?

We're preparing a release with a mitigation for the bug, but because we haven't been able to reproduce it here it would be really very helpful if someone could install the packages @crowell built [1] and see if this issue recurs.

[1] https://github.com/pagespeed/mod_pagespeed/releases/tag/1.9.32.5

jeffkaufman commented 9 years ago

Closing this issue, because the ApacheFetch mitigations should be sufficient for the cases we've seen reported. If this is still happening to you after 1.9.32.5 please reopen.

jeffkaufman commented 9 years ago

Update to my comment above:

Having your cache on memcached should not trigger this message, even if memcached is slow. PageSpeed has good timeouts around memcached and this code is stable.
Having your filecache or LoadFromFile target on a slow filesystem can trigger this bug, if the filesystem takes more than 5s to respond. If fread hangs, perhaps because of an unreachable NFS server, then in mod_pagespeed < 1.9.32.5 the relevant Apache and PageSpeed optimizer threads will be stuck, while in mod_pagespeed >= 1.9.32.5 the Apache thread will be released after 2 min and only the PageSpeed optimizer thread will be stuck.

To reproduce this:

Set up load from file for some file named style.css.

Patch mod_pagespeed:

--- a/kernel/base/stdio_file_system.cc
+++ b/kernel/base/stdio_file_system.cc
@@ -107,6 +107,16 @@ class StdioInputFile : public FileSystem::InputFile {
   }

   virtual int Read(char* buf, int size, MessageHandler* message_handler) {
+    StringPiece spfname(file_helper_.filename_);
+    if (spfname.ends_with("style.css")) {
+      while (true) {
+        LOG(WARNING) << "Read("
+                     << file_helper_.filename_ << ") is sleeping";
+        sleep(1);
+      }
+    } else {
+      LOG(WARNING) << "Read(" << file_helper_.filename_ << ") allowed";
+    }
     int ret = fread(buf, 1, size, file_helper_.file_);
     file_helper_.CountNewlines(buf, ret);
     if ((ret == 0) && (ferror(file_helper_.file_) != 0)) {

fetch style.css from your patched server
observe "waiting for completion of url" messages

Like most software, PageSpeed assumes that file reads are effectively non-blocking, and doesn't perform well if they take a significant amount of time.
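To illustrate the general idea behind that kind of mitigation (the caller stops waiting after a deadline while the blocked read stays stuck on its own thread), here is a small Python sketch. It is not mod_pagespeed's code: read_with_deadline and slow_read are invented names, and the half-second sleep merely stands in for an fread that hangs.

```python
import threading
import time


def read_with_deadline(read_fn, timeout_s: float):
    """Run read_fn on a worker thread and wait at most timeout_s for it.

    Returns (True, data) if the read finished in time, else (False, None).
    The worker is a daemon thread: a read that never returns stays stuck
    in the background but no longer blocks the calling thread.
    """
    outcome = {}

    def worker():
        try:
            outcome["data"] = read_fn()
        except Exception as exc:  # hand the failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False, None  # caller is released; worker is still blocked
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["data"]


def slow_read():
    time.sleep(0.5)  # stands in for a read on an unreachable filesystem
    return b"late"
```

With a short deadline, read_with_deadline(slow_read, 0.05) gives up and returns (False, None), whereas a read that completes promptly is returned as (True, data). The stuck worker still holds its thread, which matches the point above: releasing the caller limits the damage but does not make the slow filesystem any faster.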

apache / incubator-pagespeed-mod

Apache stuck indefinitely waiting for PSOL #1048