apache / incubator-pagespeed-mod

Apache module for rewriting web pages to reduce latency and bandwidth.
http://modpagespeed.com
Apache License 2.0
696 stars 158 forks source link

Invalidate metadata cache entry on reconstruction failure. #1145

Open Dead2 opened 9 years ago

Dead2 commented 9 years ago

This problem occurred on a production server today, running Apache 2.4.16 and mod-pagespeed-stable-1.9.32.4-7251.x86_64.

Error log:

[Fri Sep 25 11:45:21.810672 2015] [pagespeed:error] [pid 874779] [mod_pagespeed 1.9.32.4-7251 @874779] Failed to stat /kunder/XX_682/YY_1250/public/gtl/blog/css/mediaquery.css: No such file or directory
[Fri Sep 25 11:45:21.810904 2015] [pagespeed:warn] [pid 874779] [mod_pagespeed 1.9.32.4-7251 @874779] [0925/114521:WARNING:resource_fetch.cc(197)] Fetch failed for resource url http://www.XYZ.no/skin/frontend/moero/clean/gtl,_blog,_css,_style.css+,_gtl,_blog,_css,_mediaquery.css.pagespeed.cc.1EUVqRsEkS.css
[Fri Sep 25 11:45:21.810986 2015] [pagespeed:warn] [pid 874779] [mod_pagespeed 1.9.32.4-7251 @874779] Fetch failed for http://www.XYZ.no/skin/frontend/moero/clean/gtl,_blog,_css,_style.css+,_gtl,_blog,_css,_mediaquery.css.pagespeed.cc.1EUVqRsEkS.css, status=404

Now, this is partially understandable since the file and path (from gtl) does not exist. As far as I can tell, the htaccess rewrite settings have changed, but I am not able to figure out exactly how to trigger this case.

The problem is that pagespeed has cached this combined resource and keeps referring to it in the html output, thus resulting in a seemingly permanently broken webpage. To resolve this I had to delete the entire pagespeed cache folder and restart apache, after that it works again.

The interesting thing is that the new combined file's filename (including hash) is identical to the old, so why did it fail before? Does pagespeed cache the source filepaths with the combined file? The css urls in the html did not change, neither did the combined file url, so something must be erronously cached somewhere.

I would expect that upon first failing the stat, it would make sure this combined resource is removed from the cache and that the combined css file will be re-generated with the up-to-date files. (If it returns a 404 on the first access would be ok, as long as it can recover for the following pageviews.)

Relevant settings deviating from the default:

<IfModule pagespeed_module>
    ModPagespeed on
    ModPagespeedRespectXForwardedProto on
    ModPagespeedInPlaceResourceOptimization on
    ModPagespeedRewriteLevel PassThrough

    ModPagespeedLRUCacheKbPerProcess   512
    ModPagespeedDefaultSharedMemoryCacheKB 125000
    ModPagespeedFileCachePath "/var/cache/mod_pagespeed/"

    ModPagespeedLoadFromFileRuleMatch disallow .*
    ModPagespeedLoadFromFileRuleMatch allow \.(css|js|png|jpg|webp|gif|ico|ttf|woff)$

    ModPagespeedNumRewriteThreads 3
    ModPagespeedNumExpensiveRewriteThreads 1
    ModPagespeedMaxSegmentLength 2048
    ModPagespeedFileCacheInodeLimit 250000
    ModPagespeedRewriteRandomDropPercentage 10
    ModPagespeedRewriteDeadlinePerFlushMs 16
</IfModule>

<IfModule pagespeed_module>
    ModPagespeedCombineAcrossPaths on
    ModPagespeedEnableFilters recompress_png,insert_dns_prefetch,convert_gif_to_png,convert_jpeg_to_progressive,convert_jpeg_to_webp,recompress_jpeg,make_google_analytics_async,recompress_webp,resize_images,flatten_css_imports,inline_css,inline_import_to_link,move_css_above_scripts,move_css_to_head,combine_css,combine_javascript,inline_javascript,rewrite_javascript,remove_comments,remove_quotes,prioritize_critical_css,elide_attributes,collapse_whitespace,convert_to_webp_lossless,inline_google_font_css
</IfModule>
jeffkaufman commented 8 years ago

I'm seeing LoadFromFileRuleMatch statements but no LoadFromFile statements; do you have one of those as well?

Did something happen with /kunder/XX_682/YY_1250/public/gtl/blog/css/mediaquery.css ?

Dead2 commented 8 years ago

Oh, sorry about that. I seem to have forgot to include a whole block of the config.

<IfModule pagespeed_module>
ModPagespeedFileCachePath '/kunder/XX_682/YY_1250/pagespeed-cache'
ModPagespeedLogDir '/kunder/XX_682/YY_1250/logs/'
ModPagespeedMapOriginDomain 'http://YY_1250.raskZZ.raskesider.no' 'http://YY_1250.raskZZ.raskesider.no'
ModPagespeedLoadFromFile 'http://YY_1250.raskZZ.raskesider.no/' '/kunder/XX_682/YY_1250/public'
</IfModule>

There are several more MapOriginDomain and LoadFromFile statements (several vhost aliases), but all of them point to the same url/path.

The file and folder /kunder/XX_682/YY_1250/public/gtl/blog/css/mediaquery.css does not exist. Whether it did exist at some point is unkown, but it might not have, or it might have been moved.

This was a while ago so my memory on this is a bit fuzzy, but the error did not make any sense to me. The file was gone along with the preceding folders, but pagespeed kept outputting the combined resource in the html despite failing the stat checking of the files when the resource was requested.

jeffkaufman commented 8 years ago

Yes, this doesn't sound like what we want to happen, but I think I see how it could happen.

PageSpeed has an "output cache" where it stores optimized resources. http://www.XYZ.no/skin/frontend/moero/clean/gtl,_blog,_css,_style.css+,_gtl,_blog,_css,_mediaquery.css.pagespeed.cc.1EUVqRsEkS.css , call it a+b+c, should have been in that cache, but it sounds like it was evicted. Which is ok.

PageSpeed also has a "metadata cache" where it stores things like "a, b, and c combine together and would be in the output cache under a+b+c". When generating the html we see that in the metadata cache and produce the combined resource. Then when a request actually comes in for a+b+c we see it's not in cache and try to reconstruct it. If we succeed, it's back in the "output cache" and we're in good shape again. Unfortunately, in this case we couldn't reconstruct a+b+c because in the meantime a file has been moved or removed and we can't find, say, b (stat fails). Now we're stuck.

Dead2 commented 8 years ago

That explains most of it, but not the part about deleting the cache and reloading the page a couple times makes it produce the exact same combined resource again, successfully.

That is why I was thinking the rewrite settings changed or some odd thing, since it seems to be able to actually find the resource through http but not when stat'ing directly. Does the output cache store where it got a file (http/direct)? Did it somehow assume it would exist in the filesystem and not fall back to http when it didn't? (BTW: mod_pagespeed settings did not change, that much I know)

I can look into the code of the website in a few days, when I am back at work. I do remember not understanding how that file was successfully re-cached after the cache-wipe, but the problem was gone and I had to move on to look into a problem on another server.

jeffkaufman commented 8 years ago

Some discussion on the dev mailing list: https://groups.google.com/forum/#!msg/pagespeed-dev/clp0OTLIZ8Q/F3HdmHkNAgAJ

jeffkaufman commented 8 years ago

Did it somehow assume it would exist in the filesystem and not fall back to http when it didn't?

Yes, that's a thing we'll do. If IPRO is on, which became on by default in 1.9, we'll store resources we see in cache as they flow through the server. But if we need to reconstruct something that's not in cache and you have LoadFromFile enabled for that path we'll check disk but not http.

So: is everything on http://YY_1250.raskZZ.raskesider.no/ matching \.css|js|png|jpg|webp|gif|ico|ttf|woff)$ stored under /kunder/XX_682/YY_1250/public?

Dead2 commented 8 years ago

Yes, those two should match perfectly, except the http one could have mod_rewrite doing some magic. I have checked the backups from before that date, and there was no rewrite that should affect those files.

Also, I might have figured out why it looks for /gtl/blog/css/mediaquery.css instead of /skin/frontend/moero/clean/gtl/blog/css/mediaquery.css

In the html source (php generated) it is "http://www.XYZ.no/skin/frontend/moero/clean//gtl/blog/css/mediaquery.css", notice the double slash. Possibly that makes something confused and the path becomes "http://www.XYZ.no/gtl/blog/css/mediaquery.css" (that has never existed).

That also explains why I could not find that path myself and thought it had been moved. It had not been moved, and I see no relevant changes in the files/paths in the days before the error.

I still do not understand what caused this error to suddenly occur and then work again after a cache wipe. This might have been related to changes of what mod_pagespeed filters were enabled, but I have no logs of what might have been changed/tested.

Dead2 commented 8 years ago

Another bit of info that might be relevant is that the html references the same file twice, one with the double slash and one without.

<link rel=stylesheet type="text/css" href="http://www.XYZ.no/skin/frontend/moero/clean/gtl/blog/css/mediaquery.css" media=all />
<link rel=stylesheet type="text/css" href="http://www.XYZ.no/skin/frontend/moero/clean//gtl/blog/css/mediaquery.css" media=screen />
jeffkaufman commented 8 years ago

Summary: this is a bug, and it's on our list of things to fix.

jeffkaufman commented 8 years ago

Thinking about how to test this:

1) make sure you're not running any sort of in-memory http cache 2) fetch a page with a resource until you see it optimized 3) manually delete the resource from disk and cache 4) fetch the page, observe the html still references the optimized version 5) fetch the optimized version, get 404 6) fetch the page, observe the html (if broken then "still", if fixed "no longer") references the optimized version

Might as well test this in the combining case, both because combining is more complex and because this is where it's likely to bite people (because a+b+c 404s until metadata cache expiry when c is missing, when without pagespeed they'd just see c 404ing)

jeffkaufman commented 8 years ago

I'm now thinking this is lower priority than I had thought: it only applies when a resource disappears from disk but the html continues to reference it, which should be very rare.

Dead2 commented 8 years ago

I think you need to re-read my two last comments. From what I could gather, the file it gets a 404 on never actually existed. And that seems to be because of a double / (/.../clean/gtl/...) in the url from the html, and mod_pagespeed thinks that means /gtl/... (stripping the path prior to the //).

Perhaps things get extra confused when it is referenced both with the correct path and the one with the double slash (at least it sure did confuse me). Whether that actually causes the error or just makes it harder to find/understand I do not know. Problem disappeared after the extra slash was removed.

Whether this was just one problem or two, I do not know. But I suspect some url parser bug at least.

jeffkaufman commented 8 years ago

That actually sounds like the bug we fixed in 1.9.32.14 / 1.10.33.7. Have you seen it on those or later versions?

Dead2 commented 8 years ago

No, we have not seen this problem again, but since the html code was also fixed that might not mean anything. In any case, good to hear. :+1:

You might close this issue, unless you want to keep track of the other potential bug.

jeffkaufman commented 7 years ago

I'm now thinking this is lower priority than I had thought: it only applies when a resource disappears from disk but the html continues to reference it, which should be very rare.

This is actually not as rare as I had thought. Here's a situation that could give you this problem:

1) Fetching isn't working, so resources only get into the cache via (ipro) recording 2) The pagespeed cache usage is large enough that the cache eventually gets cleaned 3) There's some sort of http cache between pagespeed and the users so that output resources aren't freshened

Now the cache cleaner removes the .pagespeed. resource for something, and the input resource. We can't reconstruct it because we don't have fetching. We'll still emit .pagespeed. resource references in the html, even though they'll 404, and we'll be stuck in this state until the cache expiration time of the input resource.

jeffkaufman commented 7 years ago

Things we could do when we have to 404 a resource:

jeffkaufman commented 7 years ago

Talking to @jmarantz here's a robust but slightly hackish way we could fix this. On failing to reconstruct a .pagespeed. resource either:

1) Issue a redirect to the underlying resource. The user will get what they want, and we'll ipro-record the resource in cache for next time.

2) if there are multiple underlying resources, or it's not a direct substitution, then take the shortest input resource url and add it to the purge set. This 404 will still go out (not ideal, but unavoidable) but the next time we're considering using the relevant metadata cache entry we'll see it as invalidated.

morlovich commented 7 years ago

We should also be able to just delete the metadata cache entry w/o touching the purge set, since we do have its key when serving a .pagespeed. resource (though that doesn't prevent the problem from reoccurring; OTOH I would be worried about the size of the invalidation set in the other case...)

On Thu, Nov 17, 2016 at 3:06 PM, Jeff Kaufman notifications@github.com wrote:

Talking to @jmarantz https://github.com/jmarantz here's a robust but slightly hackish way we could fix this. On failing to reconstruct a .pagespeed. resource either:

1) Issue a redirect to the underlying resource. The user will get what they want, and we'll ipro-record the resource in cache for next time.

2) if there are multiple underlying resources, or it's not a direct substitution, then take the shortest input resource url and add it to the purge set. This 404 will still go out (not ideal, but unavoidable) but the next time we're considering using the relevant metadata cache entry we'll see it as invalidated.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1145#issuecomment-261354523, or mute the thread https://github.com/notifications/unsubscribe-auth/ADl1RB8ysq4b6glwz2jgZqm5U0mPNUaDks5q_LO3gaJpZM4GDw6M .

jmarantz commented 7 years ago

Another issue is that cache purging is not enabled by default. Cache flushing is, but that is a big hammer. The reason for that is I didn't want to break people's scripts if they were already using cache flushing.

We could support both, but when we reviewed purging we thought that was too complex.

On Nov 17, 2016 3:12 PM, "Maks Orlovich" notifications@github.com wrote:

We should also be able to just delete the metadata cache entry w/o touching the purge set, since we do have its key when serving a .pagespeed. resource (though that doesn't prevent the problem from reoccurring; OTOH I would be worried about the size of the invalidation set in the other case...)

On Thu, Nov 17, 2016 at 3:06 PM, Jeff Kaufman notifications@github.com wrote:

Talking to @jmarantz https://github.com/jmarantz here's a robust but slightly hackish way we could fix this. On failing to reconstruct a .pagespeed. resource either:

1) Issue a redirect to the underlying resource. The user will get what they want, and we'll ipro-record the resource in cache for next time.

2) if there are multiple underlying resources, or it's not a direct substitution, then take the shortest input resource url and add it to the purge set. This 404 will still go out (not ideal, but unavoidable) but the next time we're considering using the relevant metadata cache entry we'll see it as invalidated.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1145# issuecomment-261354523, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADl1RB8ysq4b6glwz2jgZqm5U0mPNUaDks5q_LO3gaJpZM4GDw6M .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1145#issuecomment-261356352, or mute the thread https://github.com/notifications/unsubscribe-auth/AB2kPeAQsxQ_SFYPC5tdPV--6iW8D6u7ks5q_LVHgaJpZM4GDw6M .

jmarantz commented 7 years ago

The deletion of the metadata cache key seems like a good idea, though we'd need to delete it in every instance of the cache. If there was an LRUCache with each child process that would be hard. If we are going to go this route we want to eliminate the LRUCache as an option.

jmarantz commented 7 years ago

Another possible strategy for a fix: avoid pagespeed-rewrites for resources recorded via the ipro recorder. consider them a miss in that context.

E.g. we could put the ipro-recorded stuff in a different cache fragment.

morlovich commented 7 years ago

On Sun, Nov 20, 2016 at 12:42 PM, Joshua Marantz notifications@github.com wrote:

Another possible strategy for a fix: avoid pagespeed-rewrites for resources recorded via the ipro recorder. consider them a miss in that context.

Might be messy since we use a regular rewrite as a nested one in the ipro case....

jeffkaufman commented 7 years ago

avoid pagespeed-rewrites for resources recorded via the ipro recorder

That would be too bad: recording is more efficient than fetching, and right now with ipro on we may be able to avoid some fetching.

Actually, no. We'll initiate the fetch when processing the html, and initiate the recording when we receive the request for the resource, so ipro won't avoid fetches. Though we could add something that disabled fetching-in-response-to-html to get the efficiency gain I was thinking about.

jeffkaufman commented 7 years ago

Ok, here's another way of thinking about it. This is only a problem in cases where fetching is broken but recording is working. So what's better to do in this case?

jmarantz commented 7 years ago

The invalidation idea is not necessarily that easy. I was thinking of doing something expedient and correct. I am not sure if my proposal is expedient either though.

On Nov 21, 2016 8:50 AM, "Jeff Kaufman" notifications@github.com wrote:

Ok, here's another way of thinking about it. This is only a problem in cases where fetching is broken but recording is working. So what's better to do in this case?

  • Not produce any .pagespeed. resources: no longcaching, no combining, etc
  • Be at risk that .pagespeed. resources will fall out of cache: redirect for single resources, 404 once for combined resources

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pagespeed/mod_pagespeed/issues/1145#issuecomment-261943367, or mute the thread https://github.com/notifications/unsubscribe-auth/AB2kPUkY0S8o9ghtpMMDqHCCZhgP3vsEks5rAaG7gaJpZM4GDw6M .

jmarantz commented 7 years ago

I am now creating a repro testcase, not sure yet of the best way to fix it

morlovich commented 7 years ago

So some thought dump:

jmarantz commented 4 years ago

https://github.com/apache/incubator-pagespeed-ngx/issues/1686 is I think a great success-story for how to resolve this issue in a deployment. The punchline is in the last few messages in that issue.