apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0

ModPagespeedRewriteUncacheableResources does not work in mod_pagespeed #818

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
in a virtual host i have

RewriteEngine on
RewriteRule ^/prefix(.*)$  http://127.0.0.1:31005/otherdir$1

Images that have the prefix /prefix will not be rewritten by mod pagespeed
(other images are)

This image url will be rewritten:
<img  src="cmsSplashScreen.jpg">

This one is not rewritten:
<img src="prefix/DE/repos/evoscripts/musica_sacra/returnBinaryImage/3/teaser/Bild.jpg">

What version of the product are you using (please check X-Mod-Pagespeed
header)?
X-Mod-Pagespeed:1.4.26.5-3533

On what operating system?
CentOS release 6.4 (Final)

Which version of Apache?
Server:Apache/2.2.15 (CentOS)

Which MPM?
don't know

URL of broken page:
http://musicasacra.lemon42.com/test.html

Original issue reported on code.google.com by bernhard...@lemon42.com on 9 Nov 2013 at 11:07

GoogleCodeExporter commented 9 years ago
mod_rewrite tends to screw up mod_pagespeed unless you're very careful (and 
often they just can't work together even if you are careful :-).

That said, could you please check your Apache logs for any mod_pagespeed 
messages about the images that start with /prefix?

Original comment by matterb...@google.com on 11 Nov 2013 at 1:15

GoogleCodeExporter commented 9 years ago
hm. i need apache to proxy requests to the backend server, so i better go with 
mod_proxy? 

However, I compiled from source, so I have the latest(?) version now, and the
issue is the same: for whatever reason mod_pagespeed thinks this stuff is not
cacheable.

I played around with setting Last-Modified, Expires and Cache-Control headers,
but no luck. Is there any way to find out why mod_pagespeed thinks this stuff is
not cacheable?

i set DebugLevel to info, this is what shows up in the apache error.log:

[Sun Nov 10 21:12:42 2013] [info] [mod_pagespeed 1.7.0.0-3616 @22674] HTTPCache key=http://musicasacra.lemon42.com/DE/repos/evoscripts/musica_sacra/returnBinaryImage/2/konzert/Bild_teaser.jpg: remembering not-cacheable status for 298 seconds.

Thanks!

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 4:36

GoogleCodeExporter commented 9 years ago
Ahh, that's weird. Now I'm using mod_proxy instead of mod_rewrite, and
pagespeed says "remembering not-found status" for an image that's actually
there. I can see it :)

[Mon Nov 11 17:10:41 2013] [info] [mod_pagespeed 1.7.0.0-3616 @23956] HTTPCache key=http://musicasacra.lemon42.com/cms/web/DE/mode/work/repos/evoscripts/musica_sacra/returnBinaryImage/31/kuenstler/Foto: remembering not-found status for 259 seconds.
[Mon Nov 11 17:10:59 2013] [info] [mod_pagespeed 1.7.0.0-3616 @23958] HTTPCache key=http://musicasacra.lemon42.com/cms/web/DE/mode/work/repos/evoscripts/musica_sacra/returnBinaryImage/31/kuenstler/Foto: remembering not-found status for 241 seconds.
[Mon Nov 11 17:11:05 2013] [info] [mod_pagespeed 1.7.0.0-3616 @23979] HTTPCache key=http://musicasacra.lemon42.com/cms/web/DE/mode/work/repos/evoscripts/musica_sacra/returnBinaryImage/31/kuenstler/Foto: remembering not-found status for 235 seconds.

but then again it has an expired cache entry?

[Mon Nov 11 17:15:12 2013] [info] [mod_pagespeed 1.7.0.0-3616 @23970] Cache entry is expired: http://musicasacra.lemon42.com/cms/web/DE/mode/work/repos/evoscripts/musica_sacra/returnBinaryImage/31/kuenstler/Foto

The image comes with these headers:
Cache-Control:max-age=600
Connection:close
Content-Length:10280
Content-Type:image/jpeg
Date:Mon, 11 Nov 2013 17:17:46 GMT
Last-Modified:Sun, 10 Nov 2013 17:17:46 GMT
Server:EvoWebBase/2.0
X-Extra-Header:1

I'm confused...

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 5:18

GoogleCodeExporter commented 9 years ago
The issue is (probably) that mod_pagespeed isn't using the right URL because it 
has been rewritten. The interaction between mod_pagespeed and mod_rewrite is 
explained here:
https://code.google.com/p/modpagespeed/issues/detail?id=676

Were there no other messages in the log about mod_pagespeed not being able to 
fetch the original resource?

Original comment by matterb...@google.com on 11 Nov 2013 at 5:52

GoogleCodeExporter commented 9 years ago
i don't think so but i can check again - what should i look out for? 

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 6:04

GoogleCodeExporter commented 9 years ago
hm and btw why is my comment regarding mod_proxy deleted?

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 6:05

GoogleCodeExporter commented 9 years ago
Re messages: anything that mentioned mod_pagespeed and the URL (such as 
Bild_teaser.jpg).

Re deleted comment, isn't it #3?
We almost never delete comments, and there's no evidence of that having been done here.

Original comment by matterb...@google.com on 11 Nov 2013 at 6:54

GoogleCodeExporter commented 9 years ago
hm strange. i see it as deleted, maybe i somehow managed to delete it myself :)

I've attached a log file i got from using:
more /var/log/httpd/error_log | grep Bild_teaser.jpg > teaser.log

Do you think it's better to use mod_proxy? Or are there issues with that too?

I read the interaction you posted on
https://code.google.com/p/modpagespeed/issues/detail?id=676 but I don't think I
really got the gist of it.
If I run
> more /etc/httpd/conf.d/pagespeed.conf | grep Location

I get the output below. Should I include the <IfModule mod_rewrite.c>
RewriteEngine Off </IfModule> in these locations?

 <Location /mod_pagespeed_statistics>
    </Location>
    <Location /pagespeed_console>
    </Location>
    <Location /mod_pagespeed_message>
    </Location>
    ModPagespeedDownstreamCachePurgeLocationPrefix "http://localhost:8020"
<Location /mod_pagespeed_log_request_headers.js>
</Location>
<Location ~ "/mod_pagespeed_test/response_headers.html*">
</Location>
<Location /mod_pagespeed_global_statistics>
</Location>
  <Location /mod_pagespeed_beacon>
  </Location>
  <Location /mod_pagespeed_beacon>
  </Location>
<Location /mod_pagespeed_temp_statistics_graphs>
</Location>

Thanks again,

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 7:38

GoogleCodeExporter commented 9 years ago
I don't think you need to disable mod_rewrite for those Locations, since
they're not under /prefix and so won't be affected anyway.

As for using mod_proxy, I don't know enough about it to say if it can work or 
not.

I believe we could configure mod_pagespeed to fetch (and rewrite) files under 
/prefix using various directives, but I also think that needs a later version 
of mod_pagespeed than what you're using. If you can wait, we're in the process 
of building a new stable release; if you can't wait, you can try upgrading to 
the latest beta version.

Original comment by matterb...@google.com on 11 Nov 2013 at 8:01

GoogleCodeExporter commented 9 years ago
FWIW, the URL
http://musicasacra.lemon42.com/DE/repos/evoscripts/musica_sacra/returnBinaryImage/1/konzert/Bild_teaser.jpg

is served with Pragma:no-cache when fetched like this:

wget --header ModPagespeed:off --save-headers http://musicasacra.lemon42.com/DE/repos/evoscripts/musica_sacra/returnBinaryImage/1/konzert/Bild_teaser.jpg

That pragma:no-cache prevents mod_pagespeed from optimizing the resource in the
HTML flow, which is where the image URL would be changed.  That explains the
symptom seen by the user.

However, it appears that in this configuration, in-place resource optimization 
is enabled, and it appears to omit the pragma:no-cache header.  This is a 
little concerning.  To help us reproduce, it would be useful to understand 
where and why the pragma:no-cache is getting added in the configuration.
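
To check which response headers might be blocking optimization, the headers saved by wget --save-headers can be scanned with a small script. The following is a minimal sketch, not mod_pagespeed's actual rule set: the function names are hypothetical, and the list of blocking directives (Pragma: no-cache plus Cache-Control private, no-cache and no-store) is a simplifying assumption.

```python
def parse_headers(raw):
    """Parse 'Name: value' lines (as saved by wget --save-headers) into a dict.

    Header names are lower-cased. Parsing stops at the first blank line,
    which separates the headers from the body.
    """
    headers = {}
    for line in raw.strip().splitlines():
        if not line.strip():
            break  # end of the header block
        if ":" not in line:
            continue  # skip the status line, e.g. "HTTP/1.1 200 OK"
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def cache_blockers(headers):
    """Return the headers that typically stop a shared cache from storing a response.

    Assumption: this short list is illustrative only.
    """
    blockers = []
    if "no-cache" in headers.get("pragma", "").lower():
        blockers.append("Pragma: no-cache")
    directives = [d.strip().lower() for d in headers.get("cache-control", "").split(",")]
    for directive in ("private", "no-cache", "no-store"):
        if directive in directives:
            blockers.append(f"Cache-Control: {directive}")
    return blockers


if __name__ == "__main__":
    saved = "HTTP/1.1 200 OK\nCache-Control:max-age=600\nPragma:no-cache\n\n"
    print(cache_blockers(parse_headers(saved)))
```

Running it against the output of a cookie-less wget and against headers copied from the browser's developer tools makes any difference between the two responses easy to spot.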

Original comment by jmara...@google.com on 11 Nov 2013 at 8:07

GoogleCodeExporter commented 9 years ago
Well as of now I'm using Release 1.7.30.1-beta. Will the next stable release be 
a later one? 
Is there any way to find out what triggers the "remembering not-cacheable 
status" message?

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 8:12

GoogleCodeExporter commented 9 years ago
Oh that's really strange!! Using
http://musicasacra.lemon42.com/DE/repos/evoscripts/musica_sacra/returnBinaryImage/1/konzert/Bild_teaser.jpg?ModPagespeed=off
I don't see the Pragma:no-cache header in the developer tools, but I get it
from wget. I will look into that and let you know!

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 8:21

GoogleCodeExporter commented 9 years ago
OK, so the Pragma: no-cache header is triggered by the session handling.
When you request the image in the browser you will see that a Set-Cookie header
is present, but on the first request only.
Does Google PageSpeed fetch the images using cookies?

Anyway I will try to get rid of the session on the image and see what happens :)

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 8:43

GoogleCodeExporter commented 9 years ago
In the HTML-rewriting flow, the images are fetched without any cookies (or, if
LoadFromFile is specified, they are read directly from the disk, again without
cookies).

In the in-place flow, the images are not fetched, but are collected as an 
Apache output filter, and so any cookies sent in the request will affect the 
response.
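
The difference between the two flows can be sketched with a toy model of the origin described earlier in this thread, where session handling adds Set-Cookie and Pragma: no-cache to any request that arrives without a session cookie. Everything here is hypothetical (the cookie name, the session value, the function names); it only illustrates why a cookie-less fetch and a browser request can see different headers.

```python
SESSION_COOKIE = "session"  # hypothetical cookie name


def origin_response(request_cookies):
    """Toy origin: starts a session and forbids caching when no session cookie is sent."""
    headers = {"Content-Type": "image/jpeg", "Cache-Control": "max-age=600"}
    if SESSION_COOKIE not in request_cookies:
        headers["Set-Cookie"] = SESSION_COOKIE + "=abc123"  # made-up session id
        headers["Pragma"] = "no-cache"
    return headers


def html_flow_fetch():
    """HTML-rewriting flow: a loopback fetch that never sends cookies."""
    return origin_response({})


def in_place_flow(browser_cookies):
    """In-place flow: the output filter sees the response to the browser's own request."""
    return origin_response(browser_cookies)


if __name__ == "__main__":
    print("HTML flow sees:    ", html_flow_fetch())
    print("In-place flow sees:", in_place_flow({SESSION_COOKIE: "abc123"}))
```

In this model the HTML flow always sees Pragma: no-cache, while the in-place flow sees a cacheable response whenever the browser already holds a session cookie.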

What's your intended policy about delivering images?  Do you want to see a 
valid cookie in a request before responding with any images?

Original comment by jmara...@google.com on 11 Nov 2013 at 8:48

GoogleCodeExporter commented 9 years ago
Can I get more information about those flows somewhere to help me understand it?
And regarding the images yes that was the idea - I will try to find a better 
solution, but the most important thing is that I know now what triggered the 
behaviour. 
And I will use wget for testing :)

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 8:53

GoogleCodeExporter commented 9 years ago
OK, long explanation follows....the short version is simply "I don't believe 
MPS can currently optimize resources for authorized clients but refuse to serve 
them to unauthorized clients".

In the flow where we find an <img> tag in HTML, and want to rewrite the image 
URL to point to the optimized version, we do a loopback fetch with no cookies 
to get the image content.

In the in-place flow, we let Apache handle the image request normally, but 
insert an extra output filter to collect the image bytes, optimize them, and 
store the optimized bits in a cache.  On subsequent requests, mod_pagespeed 
handles the request directly from its cache, bypassing the default handler for 
the cached resource.

Given that you only want images served to clients with a valid cookie, I think 
mod_pagespeed doesn't currently have a correct solution that optimizes your 
resource but avoids sending it to unauthorized clients.  In the cookie-less 
loopback fetch we'll either consider the response to be fully proxy-cacheable, 
or we won't optimize it at all.

The mechanism you currently have of responding with pragma:no-cache is correct, 
and makes mod_pagespeed avoid violating your privacy concern in its HTML flow.  
So as far as I can tell, mod_pagespeed is working correctly with respect to 
your policy.  In the future we might consider implementing optimization of 
private resources in the HTML flow but we definitely don't have that now.

The in-place optimization mechanism in mod_pagespeed right now appears to 
bypass your privacy control.  I *think* that rather than returning 
pragma:no-cache, you'll get the same effect by responding always with 
cache-control:private.  mod_pagespeed will respect that.

But by responding sometimes with pragma:no-cache and sometimes not, I think 
mod_pagespeed may wind up caching the response without the pragma and serving 
it to all clients.  In theory you could use Vary:Cookie in your response to 
inform proxy caches to include the cookie in the cache key.  However, 
mod_pagespeed ignores vary headers on resources by default, and if you turn on 
the switch that tells us to respect Vary, mod_pagespeed will simply give up on 
trying to cache the resource.
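
The risk described above can be illustrated with a toy cache keyed by URL only. This is a sketch under stated assumptions, not mod_pagespeed's implementation: the class, the origin factory and the "session" cookie name are all hypothetical, and the shareability check is deliberately simplified.

```python
def is_shareable(headers):
    """Simplified check: may a shared cache store this response?"""
    if "no-cache" in headers.get("Pragma", "").lower():
        return False
    directives = {d.strip().lower() for d in headers.get("Cache-Control", "").split(",")}
    return "private" not in directives


def make_origin(always_private):
    """Build a hypothetical origin that only serves the image to requests with a session cookie."""
    def origin(url, cookies):
        if "session" in cookies:
            cache_control = "private, max-age=600" if always_private else "max-age=600"
            return {"Cache-Control": cache_control}, b"image-bytes"
        denied = {"Cache-Control": "private"} if always_private else {"Pragma": "no-cache"}
        return denied, b"denied"
    return origin


class UrlKeyedCache:
    """Toy shared cache: the key is the URL, so cookies play no part in lookups."""

    def __init__(self):
        self._store = {}

    def handle(self, url, cookies, origin):
        if url in self._store:
            return self._store[url]  # served without consulting the origin
        headers, body = origin(url, cookies)
        if is_shareable(headers):
            self._store[url] = (headers, body)
        return headers, body


if __name__ == "__main__":
    for always_private in (False, True):
        cache, origin = UrlKeyedCache(), make_origin(always_private)
        cache.handle("/img/Foto", {"session": "s"}, origin)  # authorized request first
        _, body = cache.handle("/img/Foto", {}, origin)      # then an unauthorized one
        print("always_private =", always_private, "->", body)
```

When the origin only sometimes forbids caching, the authorized response is stored and later handed to a client without a cookie. When every response carries Cache-Control: private, nothing is stored and the origin decides each time.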

Original comment by jmara...@google.com on 11 Nov 2013 at 9:09

GoogleCodeExporter commented 9 years ago
Got it! Really appreciate your help!

Original comment by bernhard...@lemon42.com on 11 Nov 2013 at 9:12

GoogleCodeExporter commented 9 years ago
Bernhard, one action item for you: I would suggest you add cache-control:private
to resources you don't want proxy caches (e.g. CDNs and ISPs) to serve to
unauthorized users.

I am going to rename and refocus this issue on the fact that we strip your 
pragma:no-cache when serving via the in-place flow.

Original comment by jmara...@google.com on 12 Nov 2013 at 3:46

GoogleCodeExporter commented 9 years ago
Summary was: mod_rewrite and mod_pagespeed image urls not rewritten

Original comment by jmara...@google.com on 12 Nov 2013 at 3:46

GoogleCodeExporter commented 9 years ago
Note: it's possible that the pragma:no-cache stripping happens as a result of 
caching the response to an authenticated request, and using it to respond to an 
unauthenticated request.  In that case IMO it's the responsibility of the site 
to add cache-control:private to ensure this doesn't happen.

But I want to verify that we will not strip the pragma when it's delivered
unconditionally.

Original comment by jmara...@google.com on 12 Nov 2013 at 3:49

GoogleCodeExporter commented 9 years ago
thx, already added the cache-control header, I'm aware that using pragma: 
no-cache is bad practice anyway. is there anything you still want me to test?

Original comment by bernhard...@lemon42.com on 12 Nov 2013 at 3:59

GoogleCodeExporter commented 9 years ago
No I'm all set.  I tested that this sequence works as I think it should.

1. Start a local Apache on port 8080 with our examples installed.
2. wget --save-headers http://localhost:8080/mod_pagespeed_example/images/Puzzle.jpg
   Repeat three times.  The first two requests deliver the origin image (241k).  The
   third request and thereafter will deliver an optimized image (98k).
3. Add "header add pragma no-cache" and restart Apache.
4. PageSpeed will not see this header and will deliver the optimized image from
   its cache, mimicking the broken behavior that you saw.
5. Flush the cache (for me, touch /usr/local/apache2/pagespeed_cache/cache.flush).
6. Now no matter how many times I wget that image, it will never be optimized, and
   will always pass through "pragma:no-cache".

I will now check out behavior with cc:private.

Original comment by jmara...@google.com on 13 Nov 2013 at 2:27

GoogleCodeExporter commented 9 years ago
Summary was: pragma:no-cache is stripped in in-place flow.

OK, cc:private prevents optimization of this resource, even when we add

ModPagespeedRewriteUncacheableResources on

I am now hijacking this bug to track fixing that problem.  Note that the option
is settable in pagespeed.conf, but is not documented.  There is an
implementation in the source code to support other integrations (PageSpeed
Service), but it is not live in mod_pagespeed.

See also: Issue 661

Original comment by jmara...@google.com on 13 Nov 2013 at 2:34

GoogleCodeExporter commented 9 years ago
Thinking about this further, I think this feature requires a change in 
mod_pagespeed's current IPRO implementation.

Currently IPRO has two components, a resource generator and an output filter.
The output filter is used to collect bytes on a new resource and initiate
optimization.  Once the optimized result is stored in cache, the substitution
occurs in the resource generator, which runs very early and subverts the normal
resource handling.

To enable optimization of uncacheable resources, we'd instead do the 
substitution in the output filter.  It would be a bit wasteful because the 
origin resource generator would have to fully run even when the results were 
cached, and all we'd do with the bytes is buffer them in our output filter and 
make sure their hash matches the optimized result we pulled from our cache.

This wouldn't be hard to implement (IMO).  However, it would force full
buffering of the resource in our output filter, because we'd want to make sure
that the origin resource didn't change before we start streaming out
pre-optimized bytes.

This might make it perform poorly for large resources (e.g. images) that ought 
to be streamed from the disk.  Consider a large PNG that gets optimized to a 
tiny WEBP.  We'd still have to let Apache generate the PNG fully and collect it 
in our output filter, to verify it corresponds to the same PNG we optimized to 
get a small WEBP.
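
The hash-verified substitution proposed above can be sketched as follows. This is only an illustration of the idea, not the IPRO code: the class and function names are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class OptimizedStore:
    """Toy cache mapping a hash of the origin bytes to the optimized bytes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def digest(data):
        return hashlib.sha256(data).hexdigest()

    def put(self, origin_bytes, optimized_bytes):
        self._by_hash[self.digest(origin_bytes)] = optimized_bytes

    def get(self, origin_bytes):
        return self._by_hash.get(self.digest(origin_bytes))


def output_filter(origin_chunks, store):
    """Buffer the whole origin response, then substitute only if its hash is known.

    Nothing can be sent to the client until the last chunk has arrived,
    which is the buffering cost discussed in the comment.
    """
    buffered = b"".join(origin_chunks)
    optimized = store.get(buffered)
    return optimized if optimized is not None else buffered


if __name__ == "__main__":
    store = OptimizedStore()
    store.put(b"large-png-bytes", b"tiny-webp")
    print(output_filter([b"large-", b"png-bytes"], store))  # origin unchanged
    print(output_filter([b"edited-png-bytes"], store))      # origin changed
```

If the origin bytes no longer match what was optimized, the filter falls back to passing the origin response through unchanged.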

And I don't see how to get around the need to run the full Apache filter stack
for the resource, considering this testcase on http://musicasacra.lemon42.com
where cookies are used to authenticate the user before sending back the image.

We could avoid the buffering delay, however, if we considered the specification 
of ModPagespeedRewriteUncacheableResources as a signal from the site owner to 
PageSpeed that these resources don't vary in content by user (or user-agent).  
We'd still make Apache generate the bits but we could send out the optimized 
content immediately from our output filter without waiting for the full 
response from the origin resource generator.

Original comment by jmara...@google.com on 13 Nov 2013 at 2:53