apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0
696 stars 158 forks source link

RespectVary ignored when cache-extending #1373

Closed eljoest closed 8 years ago

eljoest commented 8 years ago

X-Mod-Pagespeed: 1.11.33.2-0 on apache

I'm seeing partial rewrites of the HTML src attributes.

The issue can be reproduced with the request parameters ?ModPagespeed=on&ModPagespeedFilters=extend_cache,rewrite_images but also happens with the production filters:

ModPagespeedEnableFilters extend_cache,remove_comments,rewrite_images,collapse_whitespace,prioritize_critical_css,fallback_rewrite_css_urls,defer_javascript,combine_css,combine_javascript,rewrite_css,rewrite_domains

Original headers of the various images don't differ regarding cache info. This one fails but can't be compressed further:

Cache-Control: public, max-age=43200
Connection: close
Date: Tue, 09 Aug 2016 09:40:47 GMT
Accept-Ranges: bytes
ETag: "5327-5358c48346b4e"
Vary: Origin
Content-Length: 21287
Content-Type: image/jpeg
Expires: Thu, 08 Sep 2016 09:04:17 GMT
Last-Modified: Sat, 18 Jun 2016 12:05:06 GMT
X-Content-Type-Options: nosniff

This one works but is getting reduced to 6204 bytes:

Cache-Control: public, max-age=43200
Date: Tue, 09 Aug 2016 09:44:15 GMT
Accept-Ranges: bytes
ETag: "1841-5357e9b4cf669"
Vary: Origin
Content-Length: 6209
Content-Type: image/jpeg
Expires: Thu, 08 Sep 2016 09:04:17 GMT
Last-Modified: Fri, 17 Jun 2016 19:46:11 GMT
X-Content-Type-Options: nosniff

In mod_pagespeed_message I can see:

[page url]:163: Recompressing image `[original image URL].jpg' (21287 -> 21287 bytes) doesn't save space; dropped.
ipro: Recompressing image `[original image URL].jpg' (21287 -> 21287 bytes) doesn't save space; dropped.
Shrink of image `[...].png' (4887 -> 5782 bytes) doesn't save space; dropped.
oschaaf commented 8 years ago

@eljoest Could you post the html source you receive when you also turn on the debug filter?: ?ModPagespeed=on&ModPagespeedFilters=extend_cache,rewrite_images,debug

eljoest commented 8 years ago

Sorry for the long delay, the relevant section shows:

<a href="/uploads/pics/OBD4HDD.jpg" class="fbox" rel="fb" target="_blank"><img src="/typo3temp/_processed_/csm_OBD4HDD_4d2d95dc52.jpg" width="240" height="170" alt="" data-pagespeed-url-hash="1636956066" onload="pagespeed.CriticalImages.checkImageForCriticality(this);"/><!--Image does not appear to need resizing.--><!--The image was not inlined because it has too many bytes.--></a>

full source

jeffkaufman commented 8 years ago

I just tried reproducing this naively on a test site, and I do see images getting rewritten to be cache extended:

<img src=optimized.jpg.pagespeed.ce.0aKT3Yvs2R.jpg>

Copying /typo3temp/_processed_/csm_OBD4HDD_4d2d95dc52.jpg locally, I also see it getting cache extended, so it's not a problem with the image:

<img src=csm.jpg.pagespeed.ce.d9AGCf-0p-.jpg><!--Image does not appear to need resizing.--><!--The image was not inlined because device does not support inlinling.-->

The problem isn't that PageSpeed can't fetch the image or something, since it's able to tell that it doesn't need resizing, and that it's too large to inline. Additionally, when fetched from Chrome it can make a webp:

xcsm_OBD4HDD_4d2d95dc52.jpg.pagespeed.ic.y7s9JIrRdq.webp

If I load your page with extend_cache_images,debug, I still don't see it getting cache-extended, so it doesn't look like a problem with different filters interacting badly:

<img src="/typo3temp/_processed_/csm_OBD4HDD_4d2d95dc52.jpg" width="240" height="170" alt=""/>

Copying the full tag also doesn't fix it:

<div class="news-single-img"><a href="uploads/pics/OBD4HDD.jpg" class="fbox" rel="fb" target="_blank"><img src="typo3temp/_processed_/csm_OBD4HDD_4d2d95dc52.jpg.pagespeed.ce.d9AGCf-0p-.jpg" width="240" height="170" alt=""/></a><p class="news-single-imgcaption" style="width:240px;"></p></div>

I'm at a bit of a loss. Is there anything mentioning this image in your apache log file?

morlovich commented 8 years ago

How good are the original caching headers? If it's already cacheable for at least a month we won't touch it. Unusual mimetypes may matter, too.

On Wed, Aug 24, 2016 at 3:10 PM, Jeff Kaufman notifications@github.com wrote:

I just tried reproducing this naively on a test site, and I do see images getting rewritten to be cache extended:

Copying /typo3temp/processed/csm_OBD4HDD_4d2d95dc52.jpg locally, I also see it getting cache extended, so it's not a problem with the image:

The problem isn't that PageSpeed can't fetch the image or something, since it's able to tell that it doesn't need resizing, and that it's too large to inline. Additionally, when fetched from Chrome it can make a webp:

xcsm_OBD4HDD_4d2d95dc52.jpg.pagespeed.ic.y7s9JIrRdq.webp

If I load your page with extend_cache_images,debug, I still don't see it getting cache-extended, so it doesn't look like a problem with different filters interacting badly:

Copying the full tag also doesn't fix it:

I'm at a bit of a loss. Is there anything mentioning this image in your apache log file?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1373#issuecomment-242175968, or mute the thread https://github.com/notifications/unsubscribe-auth/ADl1ROl2LdmAGepGLz3BaTPi32O2apypks5qjJc1gaJpZM4Jf5w6 .

jeffkaufman commented 8 years ago

The origin response headers, with ?PageSpeed=off and requested from Firefox, are:

Accept-Ranges: bytes
Cache-Control: public, max-age=43200
Connection: keep-alive
Content-Length: 22962
Content-Type: image/jpeg
Date: Wed, 24 Aug 2016 19:32:28 GMT
Etag: "59b2-5357a6d22953f"
Expires: Fri, 23 Sep 2016 19:32:28 GMT
Last-Modified: Fri, 17 Jun 2016 14:46:56 GMT
Server: nginx
Vary: Origin
X-Backend-Server: w3ext1

Normal mime, max-age isn't that far out.

jeffkaufman commented 8 years ago

It is something about the headers, though! If I set up my local server with:

    pagespeed Domain https://www.tracetronic.de;

And reference the image directly:

<img src="https://www.tracetronic.de/typo3temp/_processed_/csm_OBD4HDD_4d2d95dc5\
2.jpg">

Then I get xcsm_OBD4HDD_4d2d95dc52.jpg.pagespeed.ic.y7s9JIrRdq.webp in chrome, but if I turn off convert_webp_to_jpeg I get csm_OBD4HDD_4d2d95dc52.jpg.

Very strange!

jeffkaufman commented 8 years ago

reproduced locally with netcat; now trying to find out which headers are responsible

jeffkaufman commented 8 years ago

It's the Vary: Origin header that's doing it.

If I have no Vary header, then it works properly.

If I have Vary: Cookie or Vary: Accept then I get "Uncacheable content, preventing rewriting" (and it also won't make a webp)

If I have an unknown Vary header like Vary: Cabbage then I get the same strange behavior as Vary: Origin.

jeffkaufman commented 8 years ago

Vary: User-Agent also has the same strange behavior as Vary: Origin and Vary: Cabbage.

I'm guessing this is something weird about how RespectVary is implemented: https://developers.google.com/speed/pagespeed/module/configuration#respectvary

eljoest commented 8 years ago

@jeffkaufman Thanks a lot for the thorough analysis. I have removed Vary: Origin for the images (mostly used for js/"static" ajax) and after wiping the cache it works as expected with my configuration. I haven't set a value for ModPagespeedRespectVary, so it seems to be a weird behavior with the default setting.

jeffkaufman commented 8 years ago

Reproduced this locally:

$ PUZZLE_ORIG=localhost:8050/mod_pagespeed_example/images/Puzzle.jpg
$ curl $PUZZLE_ORIG > /dev/null ; sleep 1 ; curl $PUZZLE_ORIG > /dev/null ; sleep 1
$ curl $PUZZLE_ORIG > mod_pagespeed_test/Puzzle-optimized.jpg
$ echo '<img src=Puzzle-optimized.jpg>' > mod_pagespeed_test/Puzzle-optimized.html

Add to test/pagespeed_test.conf.template:

    location /mod_pagespeed_test/Puzzle-optimized.jpg {
      add_header Vary Origin;
    }

Now see that it's not optimized normally:

$ curl localhost:8050/mod_pagespeed_test/Puzzle-optimized.html
<head/><img src=Puzzle-optimized.jpg>

But we'll still turn it into a webp:

$ curl -H 'Accept: image/webp' localhost:8050/mod_pagespeed_test/Puzzle-optimized.html?PageSpeedFilters=+convert_jpeg_to_webp
<head/><img src=xPuzzle-optimized.jpg.pagespeed.ic.jjOWf4PvS4.webp>

And we don't actually have to start with an optimized image; it's enough to ask about cache-extension:

    location /mod_pagespeed_example/images/Puzzle.jpg {
      add_header Vary Origin;
    }

$ curl localhost:8050/mod_pagespeed_test/Puzzle.html
<head/><img src=http://localhost:8050/mod_pagespeed_example/images/xPuzzle.jpg.pagespeed.ic.YsUBdV-i_6.jpg>
$ curl localhost:8050/mod_pagespeed_test/Puzzle.html?PageSpeedFilters=extend_cache
<img src=../mod_pagespeed_example/images/Puzzle.jpg>

So: it looks like it we have a resource with a Vary header then we won't cache extend it, but if we can optimize it at all then we will do so, which includes cache-extending it.

jeffkaufman commented 8 years ago

Here's a unit test that captures this:

TEST_F(CacheExtenderTest, VaryOrigin) {
  options()->EnableExtendCacheFilters();
  rewrite_driver()->AddFilters();

  GoogleString url = AbsolutifyUrl("a.css");
  ResponseHeaders response_headers;
  DefaultResponseHeaders(kContentTypeCss, 100 /* ttl */, &response_headers);
  response_headers.Add("Vary", "Origin");
  SetFetchResponse(url, response_headers, "css file");

  GoogleString cache_extended_css = Encode("", kFilterId, "0", "a.css", "css");
  ValidateExpected("vary origin",
                   StringPrintf(kCssFormat, "a.css"),
                   StringPrintf(kCssFormat, cache_extended_css.c_str()));
}
jeffkaufman commented 8 years ago

To decide whether it's ok to cache-extend, cache_extender.cc checks headers->IsProxyCacheable(), which I see returning false if I set Vary: Origin. Our optimization filters probably do something else?

jeffkaufman commented 8 years ago

CacheableResourceBase::IsValidAndCacheableImpl calls headers->IsProxyCacheable(req_properties, respect_vary(), ResponseHeaders::kNoValidator) while cache_extender.cc calls headers->IsProxyCacheable(). And "The zero-arg version of IsProxyCacheable gives a pessimistic answer, assuming the request has cookies, there is no validator, and we respect Vary."

So the problem is that in cache extending we ignore the RespectVary configuration setting, and just assume the value is true, which is especially confusing since the default value is false.

jeffkaufman commented 8 years ago

All the calls to the zero-arg version of IsProxyCachable are suspect; I'll go through them.

jeffkaufman commented 8 years ago

Cases to fix:

jeffkaufman commented 8 years ago

@eljoest

I haven't set a value for ModPagespeedRespectVary, so it seems to be a weird behavior with the default setting.

Yes, it's a bug, and I'm working on a patch.

Thanks for reporting this!

jmarantz commented 8 years ago

quick question -- should we get rid of the zero-arg version of IsProxyCacheable()?

On Thu, Aug 25, 2016 at 4:04 PM, Jeff Kaufman notifications@github.com wrote:

Closed #1373 https://github.com/pagespeed/mod_pagespeed/issues/1373 via dfda935 https://github.com/pagespeed/mod_pagespeed/commit/dfda93522ded43e33d4eb641586515d647641b86 .

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1373#event-767994931, or mute the thread https://github.com/notifications/unsubscribe-auth/AB2kPbv_jac4PBOy4ixR4Aw6I02VugZRks5qjfVUgaJpZM4Jf5w6 .

jeffkaufman commented 8 years ago

@jmarantz The zero-arg version makes sense in the places it's still being used: (1) tests and (2) handling resources pagespeed creates (where respect-vary shouldn't apply).