apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0

Summary of mod_rewrite and mod_pagespeed interactions #676

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This is not a defect per se but an explanation of how we work with 
mod_rewrite and what can (and does) go wrong. It's here so that it's 
searchable by devs and users.

* mod_rewrite is a translate_name hook in slot APR_HOOK_FIRST-1
* mod_rewrite also sets the handler to 'rewrite-handler' (I think that's it;
  the point is that it sets the handler)
* mod_pagespeed adds its own translate_name hook in slot APR_HOOK_FIRST-2
* this hook saves the original URL before mod_rewrite rewrites it
* mod_pagespeed adds its own map_to_storage handler (instaweb_handler)
  that does the heavy lifting of rewriting/delivering rewritten content.
* mod_pagespeed uses various 'virtual' URLs, incl: mod_pagespeed_beacon,
  mod_pagespeed_statistics, mod_pagespeed_console, etc.
* These are enabled by <Location> directives in pagespeed.conf.
* Each directive changes the request's handler to the specified string; for
  example, in the case of mod_pagespeed_console it's 'mod_pagespeed_console'.
* mod_pagespeed's instaweb_handler routine first looks at the request's
  handler and processes its set of special cases; after that it looks at
  the URL and processes it accordingly; there are some special cases:
  mod_pagespeed_static and now/soon mod_pagespeed_beacon [I know I said
  above it uses a <Location> directive; at the time of writing this it
  does but I have a change in flight to also handle it here].
* I don't know whether <Location> directives are processed before mod_rewrite
  runs or whether mod_rewrite's processing effectively disables <Location>
  processing, but one thing is for sure: if a URL is rewritten by
  mod_rewrite then the <Location> directive does NOT set the request's
  handler; it stays as 'rewrite-handler'.
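
To make the <Location> mechanism above concrete, the blocks in pagespeed.conf look roughly like the following. This is a sketch, not a verbatim copy of any release: the access-control lines use Apache 2.2 syntax and are an assumption, and the exact set of blocks varies between versions. The SetHandler lines are what change the request's handler.

```apache
# Virtual URL with access control: only local clients may view statistics.
<Location /mod_pagespeed_statistics>
    Order allow,deny
    Allow from localhost
    Allow from 127.0.0.1
    SetHandler mod_pagespeed_statistics
</Location>

# Virtual URL without access control: any client must be able to post beacons.
<Location /mod_pagespeed_beacon>
    SetHandler mod_pagespeed_beacon
</Location>
```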

So, given all that, what happens when a URL is handled by mod_rewrite and 
then by mod_pagespeed?
1. mod_pagespeed saves the URL via its translate_name hook.
2. mod_rewrite looks at the URL; if it matches a rule it rewrites it AND
   sets the request's handler to 'rewrite-handler'.
3. mod_pagespeed handles the URL: since step 2 set the request's handler to
   'rewrite-handler', it doesn't match any of ours, so we fall back to normal
   processing.
** This means that all URLs that are handled by <Location> directives do
   not work!
4. If we decline to handle the URL, in particular a beacon POST, Apache
   tries again and the above loop happens all over again, and again,
   until eventually Apache gives up and returns a 500 HTTP status.

The work-around is to put this into each and every <Location> directive in 
pagespeed.conf:
  <IfModule mod_rewrite.c>
    RewriteEngine Off
  </IfModule>
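
As an illustration, this is roughly how one block would look with the work-around applied. It is a sketch: the SetHandler line is assumed to match the stock pagespeed.conf, and the same <IfModule> stanza would need to be repeated in every other <Location> block.

```apache
<Location /mod_pagespeed_beacon>
    # Work-around: turn mod_rewrite off for this virtual URL so that, per the
    # explanation above, the handler set below is the one the request keeps.
    <IfModule mod_rewrite.c>
        RewriteEngine Off
    </IfModule>
    SetHandler mod_pagespeed_beacon
</Location>
```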

The 'fix' is to not rely on <Location> directives but to handle the URLs as 
special cases in the fallback path; this is what my change to 
mod_pagespeed_beacon does.

The most common setup we've seen where this arises is WordPress sites, because 
WordPress adds some mod_rewrite rules to map URLs that aren't files or 
directories to the /index.php file.
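
For comparison, the rules that WordPress typically writes to .htaccess when pretty permalinks are enabled look like this for a site installed at the web root. RewriteBase and the target path differ for subdirectory installs, so treat the paths as an assumption. Note that the catch-all target in a real WordPress install is /index.php.

```apache
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Leave requests for the front controller itself untouched.
RewriteRule ^index\.php$ - [L]
# Anything that is not an existing file or directory goes to index.php.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
```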

How to set up mod_rewrite to replicate the 500 HTTP status problem:
1. In httpd.conf, in the <Directory "/usr/local/apache2/htdocs"> area
   make sure you have AllowOverride All.
2. Create a .htaccess file in /usr/local/apache2/htdocs with:
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.html$ - [L]
    RewriteRule ^favicon\.ico$ - [L]
    RewriteRule ^.*\.pagespeed\..*$ - [L]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.html [L]
   Explanation: Leave index.html untouched; leave favicon.ico untouched;
   leave .*.pagespeed..* untouched; if the file corresponding to the URL
   does NOT exist as a file or directory, rewrite the URL to /index.html.
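
Step 1 above corresponds to a fragment like this in httpd.conf. The path is the one used in this write-up; the Options line and the Apache 2.2-style access lines are assumptions about a default install and may differ on your system. FollowSymLinks (or SymLinksIfOwnerMatch) matters here because mod_rewrite requires it for rules in .htaccess files.

```apache
<Directory "/usr/local/apache2/htdocs">
    # mod_rewrite in .htaccess context needs symlink following enabled.
    Options Indexes FollowSymLinks
    # Let .htaccess files under this directory override configuration, which
    # is what allows the RewriteRule lines from step 2 to take effect.
    AllowOverride All
    Order allow,deny
    Allow from all
</Directory>
```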

Original issue reported on code.google.com by matterb...@google.com on 18 Apr 2013 at 2:19

GoogleCodeExporter commented 9 years ago

Original comment by matterb...@google.com on 18 Apr 2013 at 2:19

GoogleCodeExporter commented 9 years ago
Nice write up. Note that we can't really avoid <Location> in general since it's 
also tied up with access control, which we do want for most stuff (but not 
beacons).

Original comment by morlov...@google.com on 18 Apr 2013 at 2:27

GoogleCodeExporter commented 9 years ago
Clearing from my open issues list; it should still be searchable, though.

Original comment by matterb...@google.com on 27 Sep 2013 at 1:17

GoogleCodeExporter commented 9 years ago
OK, reopening but removing me as the assignee, since otherwise it's not easy to 
search for 'starred' bugs.

Original comment by matterb...@google.com on 27 Sep 2013 at 1:23

GoogleCodeExporter commented 9 years ago

Original comment by matterb...@google.com on 21 Oct 2013 at 11:41

JialuZhang commented 3 years ago

@GoogleCodeExporter

Thanks for the workaround. However, in your posted configuration, the line "RewriteEngine Off" is a misconfiguration, and adding it to your system will not change any system behavior. Apache allows "RewriteEngine Off" so that, if your configuration includes multiple "RewriteRule" directives, you can disable them all explicitly instead of commenting each one out.

More importantly, the default value of "RewriteEngine" is already "off", so adding "RewriteEngine Off" is unnecessary and may confuse users.

Since there is no "RewriteRule" here, deleting "RewriteEngine Off" would be ideal.

Related Apache source code snippet:

run_rewritemap_programs(server_rec *s, apr_pool_t *p) {
    if (conf->state == ENGINE_DISABLED) {  // usage of "RewriteEngine"
        return APR_SUCCESS;                // early return
    }
    rewritemap_program(...);               // usage of "RewriteRule"
}