apache / incubator-pagespeed-ngx

Automatic PageSpeed optimization module for Nginx
http://ngxpagespeed.com/
Apache License 2.0

Crashes on custom 404s for .pagespeed. resources #1081

Closed · liluxdev closed this issue 8 years ago

liluxdev commented 8 years ago

I recently upgraded to 1.10.33.1, but I had to switch back to 1.9.32.10-7423 (the last version I was using) because of the following errors, which caused my nginx to freeze after a few hours of high traffic (the nginx daemon didn't die, but it was unable to serve any requests):

2015/12/21 13:15:02 [error] 29209#0: [ngx_pagespeed 1.10.33.1-7580] Slow ReadFile operation on file /home/pagespeed_cache..omissis..t/it/_.pagespeed.jo.RAuz1oeDfS.js,: 68.457ms; configure SlowFileLatencyUs to change threshold

2015/12/21 13:15:02 [error] 29209#0: [ngx_pagespeed 1.10.33.1-7580] Slow write operation on file /home/pagespeed_cache/....omissis...

2015/12/21 13:16:52 [alert] 11520#0: worker process 29526 exited on signal 11 (core dumped)
2015/12/21 13:17:22 [alert] 11520#0: worker process 29491 exited on signal 11 (core dumped)
2015/12/21 13:20:48 [alert] 11520#0: worker process 29558 exited on signal 11 (core dumped)
2015/12/21 13:22:36 [alert] 11520#0: worker process 29209 exited on signal 11 (core dumped)
2015/12/21 13:23:04 [alert] 11520#0: worker process 29504 exited on signal 11 (core dumped)
2015/12/21 13:23:40 [alert] 11520#0: worker process 29501 exited on signal 11 (core dumped)
2015/12/21 13:33:48 [alert] 11520#0: worker process 29673 exited on signal 11 (core dumped)
2015/12/21 13:34:18 [alert] 11520#0: worker process 29812 exited on signal 11 (core dumped)
2015/12/21 13:34:18 [alert] 11520#0: worker process 29931 exited on signal 11 (core dumped)
2015/12/21 13:34:27 [alert] 11520#0: worker process 29917 exited on signal 11 (core dumped)
2015/12/21 13:34:29 [alert] 11520#0: worker process 30298 exited on signal 11 (core dumped)
2015/12/21 13:34:38 [alert] 11520#0: worker process 30327 exited on signal 11 (core dumped)

I'm using nginx 1.8.0, but I also tried 1.9.9 with no luck. I now think something new in pagespeed is causing the signal 11 issue, as I'm not getting it anymore after switching back to 1.9.32.10.
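[Editor's note: the slow read/write warnings above name their own tuning knob. A minimal sketch, assuming the option is exposed as a regular pagespeed directive in nginx.conf (check the ngx_pagespeed/PSOL documentation for the exact spelling; the 250000 value is an arbitrary example, not a recommendation):]

```nginx
# Hypothetical tuning, per the log message's own suggestion: raise the
# latency threshold (in microseconds) above which file cache operations
# are reported as "Slow ReadFile"/"Slow write" in the error log.
pagespeed SlowFileLatencyUs 250000;
```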

jeffkaufman commented 8 years ago

Yikes, sorry about that!

Were there any other ngx_pagespeed logs around when this happened?

Do you have the core dump it says it took?

Would you be up for running a 1.10 compiled with --with-debug so we can get cores and backtraces with symbols?

liluxdev commented 8 years ago

Yes, I can try to test it by compiling pagespeed --with-debug and send you the core dump, but please teach me how to get it, or send me a pointer to a how-to for obtaining the core dump file. I'm not a C developer: I just compile nginx and pagespeed to use on my server.

So, I don't have a core dump for this case, as I don't know how to get it saved to a file. I searched a bit and tried to put this in nginx.conf:

worker_rlimit_core 500M;
working_directory /usr/local/nginx-coredumps;

But the directory was empty when the issue happened (maybe because the main process didn't die but just became unable to serve requests?).

Is this enough? I still haven't compiled with --with-debug; was that the reason I was unable to see the core dumps?

After recompiling with debug, is debugging with gdb also a good way to get the data you need? What command should I issue to do this? Is gdb /path/to/binary/nginx enough?

Thanks Stefano

oschaaf commented 8 years ago

@creativeprogramming When you have a core dump, this should give you a backtrace:

gdb /path/to/nginx /path/to/cores/nginx.core

And then, when gdb has started:

bt

Having nginx compiled with --with-debug would help debugging a lot, as it will give us easier-to-understand backtraces. Seeing the messages in your logs, I would have expected a core dump to exist in the configured directory.
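[Editor's note: a sketch of collecting this output non-interactively, so it can be pasted into the issue in one go. The binary and core paths are the hypothetical ones used in this thread; the actual gdb run is left commented out since it requires the binary and a core file:]

```shell
# Write the gdb commands to a script file. "bt full" adds local variables
# to each frame, and "thread apply all bt" also covers PSOL's background
# threads (the extra LWPs visible in the gdb session below).
cat > /tmp/bt.gdb <<'EOF'
bt full
thread apply all bt
quit
EOF

# Then run gdb in batch mode against the binary and the core, capturing
# everything into one file (paths from this thread; adjust to your setup):
#   gdb -q -batch -x /tmp/bt.gdb /usr/local/nginx-pagespeed/nginx \
#       /usr/local/nginx-coredumps/core > backtrace.txt 2>&1
echo "gdb script written to /tmp/bt.gdb"
```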

If you want, we could try to figure this out together via instant messaging

liluxdev commented 8 years ago

I finally managed to get the core dump files. Before posting the backtrace details, I'll write down here how I solved my issue, for people coming from search engines and to keep a public note for myself too.

It was all due to ulimits:

tail -f /var/log/apport.log

ERROR: apport (pid 1443) Thu Dec 24 01:05:03 2015: executable: /usr/local/nginx-pagespeed/nginx (command line "nginx:\ worker\ process")
ERROR: apport (pid 1443) Thu Dec 24 01:05:03 2015: executable does not belong to a package, ignoring
ERROR: apport (pid 1443) Thu Dec 24 01:05:03 2015: writing core dump to /usr/local/nginx-coredumps/core (limit: 524288000)
ERROR: apport (pid 1443) Thu Dec 24 01:05:05 2015: aborting core dump writing, size exceeds current limit 524288000

so I had to increase those limits:

vim nginx.conf
worker_rlimit_core 1000M;
working_directory /usr/local/nginx-coredumps;
vim /etc/security/limits.conf
root hard core unlimited
* hard core unlimited
root soft core unlimited
* soft core unlimited
chown www-data:www-data -R  /usr/local/nginx-coredumps
chmod 777 -R  /usr/local/nginx-coredumps
reboot
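[Editor's note: after raising the limits it's worth verifying that a process actually inherited them, since /etc/security/limits.conf only applies to newly started sessions. A minimal check, run here against your own shell; substituting a worker's PID for "self" checks a running nginx process instead:]

```shell
# Show the core file size limit the kernel enforces for this process.
# For a running nginx worker, replace "self" with the worker's PID.
grep -i "core file size" /proc/self/limits

# The shell's own view of the same limit: "0" means core dumps are
# disabled; "unlimited" matches the limits.conf entries above.
ulimit -c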
liluxdev commented 8 years ago

Here are more details.

Let's start with the nginx error.log, just seconds after an nginx restart:

echo "" > /var/log/nginx/error.log && service nginx restart
root@j0:~# tail -n 200 -f /var/log/nginx/error.log

2015/12/24 01:36:39 [info] 2630#0: pagespeed: rollback gzip, explicit configuration in /etc/nginx/nginx.conf:123
2015/12/24 01:36:48 [alert] 2635#0: worker process 2636 exited on signal 11 (core dumped)
2015/12/24 01:36:49 [alert] 2635#0: worker process 2637 exited on signal 11 (core dumped)
2015/12/24 01:36:52 [alert] 2635#0: worker process 2639 exited on signal 11 (core dumped)
2015/12/24 01:36:53 [alert] 2635#0: worker process 2661 exited on signal 11 (core dumped)

Sometimes I get a 'no threading detected' message at startup, and the slow read/write messages I posted earlier, but not this time.

liluxdev commented 8 years ago

@jeffkaufman @oschaaf here's the backtrace:

gdb /usr/local/nginx-pagespeed/nginx /usr/local/nginx-coredumps/core                
GNU gdb (Ubuntu 7.9-1ubuntu1) 7.9
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/nginx-pagespeed/nginx...done.
[New LWP 1560]
[New LWP 1570]
[New LWP 1579]
[New LWP 1580]
[New LWP 1581]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `nginx: worker pr'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  net_instaweb::NgxBaseFetch::ReadCallback (data=...)
    at /root/nginx-source/ngx_pagespeed-release-1.10.33.1-beta//src/ngx_base_fetch.cc:143
143       CHECK(data.sender == ctx->base_fetch);
(gdb) bt
#0  net_instaweb::NgxBaseFetch::ReadCallback (data=...)
    at /root/nginx-source/ngx_pagespeed-release-1.10.33.1-beta//src/ngx_base_fetch.cc:143
#1  0x00000000004a1d47 in net_instaweb::NgxEventConnection::ReadAndNotify (
    fd=15)
    at /root/nginx-source/ngx_pagespeed-release-1.10.33.1-beta//src/ngx_event_connection.cc:122
#2  0x00000000004a1e0e in net_instaweb::NgxEventConnection::ReadEventHandler (
    ev=0x21bdc10)
    at /root/nginx-source/ngx_pagespeed-release-1.10.33.1-beta//src/ngx_event_connection.cc:83
#3  0x000000000042af32 in ngx_event_process_posted (
    cycle=cycle@entry=0x1f77bd0, posted=0x150af60 <ngx_posted_events>)
    at src/event/ngx_event_posted.c:33
#4  0x000000000042aa0c in ngx_process_events_and_timers (
    cycle=cycle@entry=0x1f77bd0) at src/event/ngx_event.c:265
#5  0x00000000004319a6 in ngx_worker_process_cycle (
    cycle=cycle@entry=0x1f77bd0, data=data@entry=0x0)
    at src/os/unix/ngx_process_cycle.c:767
#6  0x00000000004301e9 in ngx_spawn_process (cycle=cycle@entry=0x1f77bd0,
    proc=proc@entry=0x4318b2 <ngx_worker_process_cycle>, data=data@entry=0x0,
    name=name@entry=0xce2136 "worker process", respawn=respawn@entry=-3)
    at src/os/unix/ngx_process.c:198
#7  0x0000000000431b0a in ngx_start_worker_processes (
    cycle=cycle@entry=0x1f77bd0, n=6, type=type@entry=-3)
    at src/os/unix/ngx_process_cycle.c:357
#8  0x00000000004325fa in ngx_master_process_cycle (
    cycle=cycle@entry=0x1f77bd0) at src/os/unix/ngx_process_cycle.c:129
#9  0x0000000000411157 in main (argc=<optimized out>, argv=<optimized out>)
    at src/core/nginx.c:419
liluxdev commented 8 years ago

My nginx compile info:

/usr/local/nginx-pagespeed/nginx -V
nginx version: nginx/1.8.0
built by gcc 4.9.2 (Ubuntu 4.9.2-10ubuntu13)
built with OpenSSL 1.0.1f 6 Jan 2014
TLS SNI support enabled
configure arguments: --with-debug --add-module=/root/nginx-source/naxsi/naxsi-0.54/naxsi_src/ --error-log-path=/var/log/nginx/error.log --http-client-body-temp-path=/var/run/nginx/body --http-fastcgi-temp-path=/var/run/nginx-cache/fastcgi --http-log-path=/var/log/nginx/access.log --http-proxy-temp-path=/var/run/nginx-cache/proxy --lock-path=/var/lock/nginx.lock --with-http_stub_status_module --with-http_spdy_module --with-http_ssl_module --prefix=/usr/local/nginx-pagespeed --sbin-path=/usr/local/nginx-pagespeed/nginx --conf-path=/etc/nginx/nginx.conf --pid-path=/var/run/nginx.pid --add-module=/root/nginx-source/ngx_cache_purge/ngx_cache_purge/ --add-module=/root/nginx-source/ngx_pagespeed-release-1.10.33.1-beta/ --with-http_gzip_static_module --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --with-http_sub_module --without-mail_pop3_module --without-mail_smtp_module --without-mail_imap_module --without-http_uwsgi_module --without-http_scgi_module --with-ipv6 --prefix=/usr

Note: I also tried to compile nginx without any other third-party modules (e.g. naxsi, ngx_cache_purge); the problem persists.

liluxdev commented 8 years ago

Tell me if you need more

liluxdev commented 8 years ago

PS: no rush on my side. I'm running fine anyway with 1.9.32.10, so I have no urgency to upgrade. Keep up the good work, and thank you for it.

oschaaf commented 8 years ago

Thanks a lot for diving in and getting out the backtrace with gdb.

1.10 contains changes that make ngx_pagespeed use a single named pipe to communicate with PageSpeed, instead of a named pipe per request.

Looking at the backtrace, what happens is that an assumption I made in that change fails: we receive an event from the pipe originating from a base fetch that is no longer associated with the active request, while the nginx side has not yet released the base fetch associated with the event.

To handle that situation properly, I need to understand precisely what leads up to it. Would you be able to reproduce it one more time, but now with the logging level turned up to "debug"? So put something like this in nginx.conf:

error_log /path/to/log debug;

And subsequently run with a debug build? Note that this will probably generate a LOT of logging, so you'll have to keep an eye on that. Having your (anonymized) nginx.conf would also be helpful.

Meanwhile, I will try to reproduce this myself. After some thought, I have a suspicion about what may be leading up to it (nginx overwriting the complete request context in some situations). I'll update here if I succeed.

oschaaf commented 8 years ago

Cross linking related post in pagespeed-dev: https://groups.google.com/forum/#!topic/pagespeed-dev/FUjujbqH-hw

liluxdev commented 8 years ago

Here are some minutes of debug log: error.debug.log.txt

liluxdev commented 8 years ago

Here's my nginx.conf (never mind the comments, it's a bit messy):

#TODO: check buffers and other things described here: http://www.cyberciti.biz/tips/linux-unix-bsd-nginx-webserver-security.html
user www-data www-data;
worker_processes  6;

worker_rlimit_core  1M;
working_directory   /usr/local/nginx-coredumps;

#thread_pool default threads=64 max_queue=65536;

#server_tokens off
#error_log  /var/log/nginx/error.debug.log debug;
error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
    multi_accept on; # on 14.10.2014 see:http://seravo.fi/2013/optimizing-web-server-performance-with-nginx-and-php

    # Linux performance awesomeness on
    use epoll;

}

http {
 include                        /etc/nginx/naxsi_core.rules;

log_format vhostslogging '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent  "$http_referer" "$http_user_agent" $request_time for $server_name';

    log_format my_post_tracking $request_body;
    server_tokens off;
    include       /etc/nginx/mime.types;
    default_type application/octet-stream;

#include /etc/nginx/passenger.conf;

#added 6 sept 2013 as joomla isotopic send server seems to go in error: upstream sent too big header while reading response header from upstream 
#see also: http://stackoverflow.com/questions/13894386/upstream-too-big-nginx-codeigniter

#proxy_buffer_size   128k;
#proxy_buffers   4 256k;
#proxy_busy_buffers_size   256k;
proxy_buffering off;

#end added 6 sept 2013 as joomla isotopic send server seems to go in error: upstream sent too big header while reading response header from upstream 
#see also: http://stackoverflow.com/questions/13894386/upstream-too-big-nginx-codeigniter

#  sendfile        on;
# before 17 May 2013 was sendfile on, and aio line not present!!! i'm testing async io
# should improve cpu usage and troughput, see here: http://blog.lighttpd.net/articles/2006/11/09/async-io-on-linux/
#sendfile off; # on before 17 May 2013 18.20
#AIO should be good for large files but disables linux VFS cache TEMP EXPERIMENT
#On Linux, AIO is usable starting from kernel version 2.6.22; plus, it is also necessary to enable directio, otherwise reading will be blocking:
#directio on; # not present before 17 May 2013
#aio on; # not present before 17 May 2013

#experimental performance 17 May 2013
    client_body_buffer_size 128k; #http://nginx.org/en/docs/http/ngx_http_core_module.html#client_body_buffer_size 
#Sets buffer size for reading client request body. In case request body is larger than the buffer, the whole body or only its part is written to a temporary file.
    large_client_header_buffers 4 128k;
#end experimental performance 17 may 2013

#experimental protection 17 May 2013
 ## Start: Timeouts ##
  client_body_timeout   20s;
  client_header_timeout 20s;
  #send_timeout          10; #36s see below
## End: Timeouts ##
#end experimental protection 17 may 2013

#http://wiki.nginx.org/ReadMoreAboutTcpNopush
    #tcp_nopush off; #before 23 mar 2013 was on
#TODO: ON FOR WEBSOCKETS
    tcp_nodelay on; #before 23 mar 2013 was offi
#The tcp_nodelay disables the Nagle buffering algorithm. It is usable when the server doesn’t require a response from the client. 
#General web use does need a response, so this should be turned off, unless you need to send out small bursts of information, like tracking mouse movements.

#        server_names_hash_bucket_size 64;
    types_hash_max_size 2048;
    # server_tokens off;

 # Where to store the body of large client requests on disk
  # NGINX will stream this to disk before posting it to your Mongrels,
  # preventing slow clients tying up your app.
  client_body_temp_path /tmp/nginx-client-body 1 2;

  # Max size of a request from a client (usually a POST).  This will limit
  # the size of file uploads to your app
  client_max_body_size 32m;

  access_log    /var/log/nginx/access.log;

#Optional otpimization for avoid too much handshakes, 
#see: http://nginx.org/en/docs/http/configuring_https_servers.html
   ssl_session_cache    shared:SSL:128m; #1m = around 4000 ssl sessions
    # server_names_hash_bucket_size 64;
    # server_name_in_redirect off;

#added 23 mar 2013 open_file_cache to try improve wait time (first byte of static)
 open_file_cache max=30000 inactive=120s; 
 open_file_cache_valid    120s; 
 open_file_cache_min_uses 1;
 open_file_cache_errors   on;

 send_timeout 60s; #was 36s until 15.10.2014

#end 23 mar 2013

    #keepalive_timeout  0;
    #keepalive_timeout  65; #default
    #keepalive_timeout  120s; <-- was this until 17 may 2013
     keepalive_timeout  120s 60s; #13 may added header timeout to convince some browsers to close the keepalive
     keepalive_requests  9999; #nginx default is 100
#see: http://en.wikipedia.org/wiki/HTTP_persistent_connection

    #tcp_noauth on;
     gzip on; #turned off on 15.10.14 to improve TTFB
#sendfile on;
sendfile       on;
#sendfile_max_chunk 512k; 
tcp_nopush     on;
#seehttps://t37.net/nginx-optimization-understanding-sendfile-tcp_nodelay-and-tcp_nopush.html
    #aio            sendfile;
#directio on;
#aio on;
     #gzip_static off; #turned on 14.10.2014 see:http://www.media-division.com/generation-of-gzip-files-for-nginx/   
      gzip_http_version 1.1;
        gzip_comp_level 1; #was unusefully 9 until 23 mar 2013 (no gain cp wrost)
        gzip_min_length 20;
       # gzip_buffers 16 8k;
       gzip_types "application/x-javascript; charset=utf-8";
       gzip_types text/plain application/xhtml+xml text/css application/xml application/xml+rss text/javascript application/javascript application/x-javascript;
        gzip_proxied     any;
        gzip_disable     "MSIE [1-6]\.";
gzip_vary   on;

charset                  utf-8;
  source_charset          utf-8;

#gzip_proxied any;
#gzip_types      text/plain text/html text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;

#limit_req_zone $binary_remote_addr zone=antiddosspider:1m rate=2r/m;
#limit_req_zone $binary_remote_addr zone=antiddosphp:1m rate=3r/s;
#limit_req_zone $binary_remote_addr zone=antiddosstatic:1m rate=30r/s;

#To make an exception, you have to provide empty value for a 
#variable in limit_req_zone (see http://nginx.org/r/limit_req_zone).

#Correct config for exceptions based geo would be (involving 
#intermediate map as geo doesn't allow variables in a resulting 
#value):

    geo $limited {
        default           1;
        120.0.0.1/32  0;
    }

    map $limited $if_nonlocal_binary_remoteaddr {
        1        $binary_remote_addr;
        0        "";
    }

limit_req_zone $if_nonlocal_binary_remoteaddr zone=antiddosspider:1m rate=2r/m;
limit_req_zone $if_nonlocal_binary_remoteaddr zone=antiddosphp:1m rate=7r/s; #was 3r/s until 20 gen 2015
limit_req_zone $if_nonlocal_binary_remoteaddr zone=antiddosstatic:1m rate=500r/s; #was 30r/s until 20 gen 2015

  #  limit_req_zone $limit zone=foo:1m rate=10r/m;
  #  limit_req zone=foo burst=5;

#As you can see from the above config, limit_req_zone now works 
#based on a $limit variable, which is either client address, or an 
#empty string.  In a latter case client isn't limited.

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;

#pagespeed
#pagespeed Directives that can only be set globally:
#Global scope in Apache, http block in Nginx.
#DangerPermitFetchFromUnknownHosts

#default fetcher psol serp (threaded, native is evented)
pagespeed FileCachePath /home/pagespeed_cache/others;
pagespeed UseNativeFetcher on; #experimental
resolver 8.8.8.8;

#pagespeed FetchWithGzip on;
#pagespeed FetcherTimeoutMs 30000; #millis = 30s

#InheritVHostConfig
#MessageBufferSize
#NumRewriteThreads
#NumExpensiveRewriteThreads
#UsePerVHostStatistics
#end pagespeed see: https://developers.google.com/speed/docs/mod_pagespeed/configuration

}

# mail {
#     # See sample authentication script at:
#     # http://wiki.nginx.org/NginxImapAuthenticateWithApachePhpScript
# 
#     # auth_http localhost/auth.php;
#     # pop3_capabilities "TOP" "USER";
#     # imap_capabilities "IMAP4rev1" "UIDPLUS";
# 
#     server {
#         listen     localhost:110;
#         protocol   pop3;
#         proxy      on;
#     }
# 
#     server {
#         listen     localhost:143;
#         protocol   imap;
#         proxy      on;
#     }
# }
oschaaf commented 8 years ago

Thanks. The log contains 18 core dump messages, and they all consistently show this pattern:

2015/12/24 16:42:38 [warn] 27576#0: [ngx_pagespeed 1.10.33.1-7580] [1224/164238:WARNING:resource_fetch.cc(191)] Fetch failed for resource url https://mysite.it/path/to/a/content/replaced.not.always.the.same.and.also.ajax.requests
2015/12/24 16:42:38 [debug] 27576#0: epoll: fd:27 ev:0001 d:00000000035E8CE0
2015/12/24 16:42:38 [debug] 27576#0: pagespeed [00000000038B4460] event: H. bf:00000000037F0BB0 (ps resource) - refcnt:2 - det: N
2015/12/24 16:42:38 [debug] 27576#0: pagespeed [00000000038B4460] ps_base_fetch_handler() returned 404 for H
....
...
2015/12/24 16:42:38 [debug] 27576#0: pagespeed [00000000038B4460] event: D. bf:00000000037F0BB0 (ps resource) - refcnt:1 - det: N
2015/12/24 16:42:38 [alert] 27353#0: worker process 27576 exited on signal 11 (core dumped)
....

I can now consistently reproduce the core dump by setting up a custom 404 location like:

    error_page 404 /custom_404.html;
    location = /custom_404.html {
        root /usr/share/nginx/html;
        internal;
    }

And requesting a .pagespeed. resource I know doesn't exist:

 curl -vv http://127.0.0.1:8050/mod_pagespeed_test/A.doesnotexist.css.pagespeed.cf.0.css

I think this was triggered by https://github.com/pagespeed/ngx_pagespeed/commit/3efebb79489b2c949f4f0eda2ce97f5824b797f9, which forwards handling of error responses (like 404) for .pagespeed. resources to nginx. At the point where we do that, pagespeed has another event in flight for us, on which we will CHECK-fail once we get to it, because someone has changed (or zeroed) our request context.

There's also an internal redirect involved in the reproduction case, so I'm going to dig some more to make sure I understand and fix the root cause.

oschaaf commented 8 years ago

@creativeprogramming https://github.com/pagespeed/ngx_pagespeed/pull/1085 should fix this problem. Would you be willing to try it out and confirm? For easier patching, a raw diff: https://github.com/pagespeed/ngx_pagespeed/commit/8bde04461677eea5fd0fa106ef4aa2eecae8c451.diff
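[Editor's note: for anyone unfamiliar with applying a raw diff, it's a plain patch -p1 run from the module's source root. A toy demonstration of the mechanics with made-up file contents (the real diff URL is in the comment above):]

```shell
# Set up a throwaway tree with one file to patch.
mkdir -p /tmp/patchdemo/src
cd /tmp/patchdemo
printf 'old line\n' > src/file.txt

# A minimal unified diff in the same -p1 shape as a GitHub .diff file.
cat > fix.diff <<'EOF'
--- a/src/file.txt
+++ b/src/file.txt
@@ -1 +1 @@
-old line
+new line
EOF

# -p1 strips the leading a/ and b/ path components, matching git-style
# diffs; afterwards src/file.txt contains the patched content.
patch -p1 < fix.diff
cat src/file.txt
```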

liluxdev commented 8 years ago

@oschaaf many thanks. I compiled your branch (ngx_pagespeed 1.10.33.2-7600, oschaaf-trunk-tracking-internal-redirect-crasher) and I'm running it now. It all seems fine, with no errors. I'll confirm tomorrow, after leaving it running for some hours with more traffic on the site (it's very low right now).

liluxdev commented 8 years ago

I confirm: with https://github.com/pagespeed/ngx_pagespeed/pull/1085 everything is running fine, with no crashes and no errors in the log.

oschaaf commented 8 years ago

@creativeprogramming Great, thanks for confirming.

oschaaf commented 8 years ago

Re-opening; we will close this issue once the fix is reviewed and merged. Thanks!

jeffkaufman commented 8 years ago

Fix is merged.