apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0

Add ModPagespeedMapRewriteFailedDomain directive #551

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Using mod_pagespeed in some infrastructures for supporting high-traffic sites 
has the potential to dramatically slow down site access.

Consider the following situation:

Initially: 

* The site is served by a set of web-servers behind a load-balancer that only 
serve dynamic html pages.  The servers do not store any JS, CSS, or image 
files.  The pages are served with all JS, CSS and image URLs referencing a CDN 
domain whose origin stores those resources (and which isn't the site end-point 
and doesn't run Apache).

* The servers can support the traffic for only the HTML requests, but would not 
be able to handle the added traffic if the same clients also made requests for 
JS/CSS/Images.

After mod_pagespeed is implemented:

* All the CSS/JS/Image resources are made available to the web-servers as 
subresources (local filesystem files) and mod_pagespeed is enabled.  The 
ModPagespeedMapRewriteDomain directive is used to have rewritten resources 
reference the CDN domain.  The ModPagespeedLoadFromFile directive is used so 
that all JS/CSS/Image files can be fetched by mod_pagespeed directly from the 
local filesystems (which are mirrored, but not shared - i.e. cache files 
generated by mod_pagespeed are not propagated across servers).

* The CDN is reconfigured to make 'origin pull' requests to the web-servers 
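
A minimal sketch of the setup described in the two points above, for illustration only.  The hostnames (www.example.com, cdn.example.com) and the filesystem path are assumptions, not values from the actual deployment:

```apache
# Hypothetical hostnames and paths, for illustration only.
ModPagespeed on

# Rewritten resources reference the CDN domain instead of the site domain.
ModPagespeedMapRewriteDomain http://cdn.example.com http://www.example.com

# Read JS/CSS/image sources straight from the local filesystem
# instead of fetching them over HTTP.
ModPagespeedLoadFromFile "http://www.example.com/static/" "/var/www/static/"
```

Note that ModPagespeedMapRewriteDomain on its own only affects resources that mod_pagespeed actually rewrites, which is the gap this issue describes.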

Result:

* Whereas previously no JS/CSS/Image requests at all were made to the 
web-servers, they now experience a significant volume of traffic for the 
JS/CSS/Image files.  They are unable to handle the load, so they respond 
slowly and frequently return HTTP 500 responses, degrading the client 
experience.

The reason for this is that, despite the fact that the JS/CSS/Image resources 
are on the web-server local filesystems and are not changed (neither content 
nor fs file modification timestamps change), mod_pagespeed often serves html 
pages in which the unmodified resources are referenced directly using the site 
domain.  Since they are not always rewritten, the ModPagespeedMapRewriteDomain 
mapping is not always applied, so those URLs do not go through the CDN (which 
will have cached copies of the resources with far future expiry times and 
max-ages).

(As an aside, I can't determine any pattern to the rewrite behaviour of 
mod_pagespeed.  In testing, if I access the same site page repeatedly at 1min 
intervals - which ideally would just return the same exact response each time - 
there are different combinations of rewritten and non-rewritten CSS files in 
each response and sometimes CSS files that were combined in a previous response 
are separate in a subsequent one.  There is no pattern - like a 5min period, 
for example - sometimes it returns the same response for minutes at a time 
(from <5 to >5) and then suddenly won't give the same response twice for 
several minutes).

To ensure that no JS/CSS/Image resource is ever requested by client browsers 
directly from the site domain, I'd like to propose a 
ModPagespeedMapRewriteFailedDomain directive which works like the 
ModPagespeedMapRewriteDomain directive, except that the specified domain is 
used when an attempt to rewrite a resource fails.

That way, the configuration can ensure that both rewritten and not-rewritten 
resources will always be requested by clients via the CDN domain, which will 
only request non-rewritten resources from the origin site server rarely due to 
the long expires/max-age headers used (along with query-string version 
parameters).

Thank you!

Original issue reported on code.google.com by da...@foremosttravel.com on 28 Oct 2012 at 1:39

GoogleCodeExporter commented 9 years ago
Thanks for the thoughtful analysis.  I think what you are looking for to get 
domain-mapping on unrewritten resources is the filter 'rewrite_domains'.  Hope 
that solves that aspect of the problem:

https://developers.google.com/speed/docs/mod_pagespeed/filter-domain-rewrite
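
As a rough sketch of that suggestion (hostnames are hypothetical), enabling the filter alongside an existing rewrite-domain mapping would look something like:

```apache
# Hypothetical hostnames, for illustration only.
# rewrite_domains applies the MapRewriteDomain mapping to resource URLs
# even when the resource itself is not otherwise optimized.
ModPagespeedEnableFilters rewrite_domains
ModPagespeedMapRewriteDomain http://cdn.example.com http://www.example.com
```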

We definitely want to get to the bottom of the situation where (say) unchanged 
css files that are loaded-from-file are served unoptimized after they were 
previously served optimized.  Does this also occur with js/images or only css?  
css is more complex because it's got optimizable images nested inside. 

One potential explanation is that you have load-from-file on CSS files but are 
using http-fetching to get the images embedded in the CSS files, and if those 
have expired (no expires/cache-control headers imply 5 minute TTL) then that 
might explain the behavior you see. 

Original comment by jmara...@google.com on 28 Oct 2012 at 12:08

GoogleCodeExporter commented 9 years ago
Helpful suggestion.  I've done some further 'playing around'.  The scenario I 
outlined above was intended to be hypothetical for illustration, but obviously 
corresponds closely to our situation - but not exactly.  In particular, we have 
two sources for images: images related to the site's actual content (stored in 
AWS S3) and those related to the site interface (site background, button 
decorations, that sort of thing - stored on each web-server) - which are served 
by the *same domain*.  Specifically, CloudFront allows configuring path 
patterns to map to different origins from the same CDN domain.

So, I modified the setup in light of your suggestion about images being 
referenced from CSS.  To simplify, I've used a separate CDN domain which maps 
paths 1-1 to the site origin web-servers.  I've set the load-from-file to 
specify the entire site from the root - so that it should apply to all 
CSS/JS and Images referenced from CSS and generally through the site's domain.  
(The 'content' images are still served from the previous CDN domain with S3 
origin, but mod_pagespeed isn't set up to rewrite that domain at all - 
keeping them independent).
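
For reference, the modified setup would look roughly like the sketch below.  The hostnames and document root are placeholders I've made up for illustration, not our real values:

```apache
# Assumed hostnames and document root; placeholders only.
# Map the whole site root so that CSS, JS and images referenced from CSS
# are all read from disk rather than fetched over HTTP.
ModPagespeedLoadFromFile "http://www.example.com/" "/var/www/html/"

# Resources are rewritten onto the dedicated CDN domain that maps
# paths 1-1 to the origin web-servers.
ModPagespeedMapRewriteDomain http://static-cdn.example.com http://www.example.com
```

One caveat with mapping the root: load-from-file reads files directly and bypasses Apache, so it is only appropriate for URLs under that prefix that correspond to static files.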

Result: That did bring consistency to what is being rewritten (or not) from 
request to request (after the initial request with empty cache).  Enabling 
the rewrite_domains filter as you suggest also ensures that all resources use 
CDN URLs, hence eliminating the problem I described.  While all the CSS files 
are now being served from CDN paths with *.pagespeed.<hash>.css names, none of 
them are being combined - which presumably has a separate cause.

So, strictly, the feature suggestion of a ModPagespeedMapRewriteFailedDomain 
directive isn't necessary, though something like it would give more flexibility.
It would also be immensely helpful if the ModPagespeedMapOriginDomain and 
ModPagespeedMapRewriteDomain would allow specifying the request domain (e.g. 
HOST header value) or perhaps a wildcard *.  (If you think that is a reasonable 
feature request I'll open a new ticket for that).

Thanks for your help.  You can close this request out.
(now I need to figure out why my CSS files aren't being combined).

Original comment by da...@foremosttravel.com on 30 Oct 2012 at 5:24

GoogleCodeExporter commented 9 years ago

Original comment by jmara...@google.com on 26 Nov 2012 at 8:06

GoogleCodeExporter commented 9 years ago
> It would also be immensely helpful if the ModPagespeedMapOriginDomain and 
> ModPagespeedMapRewriteDomain would allow specifying the request domain (e.g. 
> HOST header value) or perhaps a wildcard *.

MapOriginDomain accepts a HOST header argument, new in 1.8: 
https://developers.google.com/speed/pagespeed/module/domains#mapping_origin
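
A hedged sketch of what that looks like, with made-up hostnames; the optional third argument is the value sent in the Host: header when fetching from the mapped origin:

```apache
# Hypothetical hostnames, for illustration only.
# Fetch resources referenced as cdn.example.com from localhost,
# sending "Host: www.example.com" on the fetch request.
ModPagespeedMapOriginDomain http://localhost http://cdn.example.com www.example.com
```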

What would specifying a HOST header for MapRewriteDomain give us?

Original comment by jefftk@google.com on 23 Jun 2014 at 7:11