FRiCKLE / ngx_postgres

Upstream module that allows nginx to communicate directly with a PostgreSQL database.
http://labs.frickle.com/nginx_ngx_postgres/
BSD 2-Clause "Simplified" License

segfault in nginx under high concurrency #21

Closed: buddy-ekb closed this issue 10 years ago

buddy-ekb commented 10 years ago

I suspect that there is a bug in ngx_postgres' connection cache code. While stressing nginx with "ab -c 20 ..." it occasionally segfaulted, so I ran it under valgrind. Here is the result:

==25581== Invalid read of size 8
==25581==    at 0x406DAB: ngx_destroy_pool (ngx_palloc.c:76)
==25581==    by 0x4B6457: ngx_postgres_upstream_free_connection (ngx_postgres_upstream.c:584)
==25581==    by 0x4B00FE: ngx_postgres_keepalive_free_peer (ngx_postgres_keepalive.c:218)
==25581==    by 0x4B60DC: ngx_postgres_upstream_free_peer (ngx_postgres_upstream.c:509)
==25581==    by 0x4B6674: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:79)
==25581==    by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581==    by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581==    by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581==    by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581==    by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581==    by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581==    by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581==    by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581==    by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581==    by 0x403DF4: main (nginx.c:404)
==25581==  Address 0x5641930 is 96 bytes inside a block of size 256 free'd
==25581==    at 0x4A063F0: free (vg_replace_malloc.c:446)
==25581==    by 0x406E39: ngx_destroy_pool (ngx_palloc.c:87)
==25581==    by 0x44E8D2: ngx_http_close_connection (ngx_http_request.c:3489)
==25581==    by 0x44E56D: ngx_http_close_request (ngx_http_request.c:3350)
==25581==    by 0x44E0C1: ngx_http_lingering_close_handler (ngx_http_request.c:3209)
==25581==    by 0x44DF6B: ngx_http_set_lingering_close (ngx_http_request.c:3171)
==25581==    by 0x44C858: ngx_http_finalize_connection (ngx_http_request.c:2493)
==25581==    by 0x44C443: ngx_http_finalize_request (ngx_http_request.c:2384)
==25581==    by 0x4B6800: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:140)
==25581==    by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581==    by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581==    by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581==    by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581==    by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581==    by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581==    by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581==    by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581==    by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581==    by 0x403DF4: main (nginx.c:404)
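
For context, the kind of configuration that exercises this code path looks roughly like the sketch below; the connection details and keepalive settings are illustrative rather than taken from this report, and the stress test is simply something like "ab -c 20 -n 10000 http://127.0.0.1:8080/query" against the location:

    http {
        upstream database {
            # illustrative credentials; any reachable PostgreSQL instance will do
            postgres_server     127.0.0.1 dbname=test user=test password=test;
            # connection caching, i.e. the code under suspicion here
            postgres_keepalive  max=10;
        }

        server {
            listen 8080;

            location /query {
                postgres_pass   database;
                postgres_query  "SELECT 1";
            }
        }
    }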

I believe that "ngx_postgres_keepalive_free_peer" (ngx_postgres_keepalive.c:164) somehow fails to properly isolate an nginx memory pool from the freeing that happens later at ngx_postgres_util.c:140, which eventually leads to an attempt to read from a memory area that has already been freed and reused.
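
To make the Valgrind report concrete: the pattern it describes is roughly the following use-after-free on nginx pools. This is only a minimal illustration of that hypothesis, not the module's actual code; the function and variable names are invented.

    #include <ngx_config.h>
    #include <ngx_core.h>

    /* Sketch of the suspected failure mode: a pool pointer needed by the
     * keepalive cleanup lives in memory owned by the request/connection pool,
     * which nginx destroys first, so the later cleanup reads freed memory. */
    static void
    stale_pool_sketch(ngx_log_t *log)
    {
        ngx_pool_t   *request_pool;  /* destroyed via ngx_http_close_request()  */
        ngx_pool_t   *pgsql_pool;    /* destroyed via the module's cleanup path */
        ngx_pool_t  **stored;

        request_pool = ngx_create_pool(256, log);
        pgsql_pool   = ngx_create_pool(256, log);

        /* the cached connection keeps its pool pointer in request-owned memory */
        stored = ngx_palloc(request_pool, sizeof(ngx_pool_t *));
        if (stored == NULL) {
            return;
        }
        *stored = pgsql_pool;

        ngx_destroy_pool(request_pool);  /* frees the block holding "stored" ...      */
        ngx_destroy_pool(*stored);       /* ... making this an invalid read of size 8 */
    }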

I'm happy to provide any further information to anyone concerned.

agentzh commented 10 years ago

@buddy-ekb Thank you for the report! We'll look into this. BTW, what versions of nginx and ngx_postgres are you using?

PiotrSikora commented 10 years ago

@buddy-ekb: are you running it under Valgrind with master_process off? If so, that's a known issue (and the reason 1.0 isn't released yet).

I'd be interested in seeing a core dump from the segfaults, though.

buddy-ekb commented 10 years ago

@agentzh The nginx version is 1.5.7; the binary is for Linux x86_64, compiled with ngx_postgres 1.0rc3 and linked against the PostgreSQL 9.3.1 client library from the PGDG RPM repository.

@PiotrSikora I put the necessary files here: https://www.dropbox.com/sh/m19ozujak1jucup/_9U4V5JXgY

Of course, I was running nginx under valgrind with master_process off, but the provided core was actually dumped without valgrind, and nginx had been dumping cores with master_process on as well, so it is very likely that the problem is not limited to runs under valgrind.
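
(For reference, the single-process debugging setup meant here is along the following lines; the paths and Valgrind options are only illustrative.)

    # nginx.conf fragment for single-process, foreground debugging
    daemon off;
    master_process off;

    # wrapper command; binary and config paths are placeholders
    valgrind --leak-check=full --track-origins=yes \
        /usr/local/nginx/sbin/nginx -c /usr/local/nginx/conf/nginx.conf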

agentzh commented 10 years ago

@buddy-ekb Thank you for the info! I'll try to look into this issue in the coming days :)

PiotrSikora commented 10 years ago

@buddy-ekb: the core dump you provided is from nginx running with master_process off; could you please provide one from nginx running normally? I tend to think that the core dumps in production are different from this one... Thanks!

buddy-ekb commented 10 years ago

@PiotrSikora here you are https://www.dropbox.com/sh/krmghdszj28bkkv/vc1uCpMsIu

A few remarks on them:

agentzh commented 10 years ago

@buddy-ekb Sorry for the delay on my side! I've just committed a fix for your problem to git master. Will you try it out on your side? Thank you!

buddy-ekb commented 10 years ago

With the patch the stress testing has been passing without a hitch. Thank you very much, @agentzh !

agentzh commented 10 years ago

@buddy-ekb BTW, I've noticed that you've enabled debug logging via the --with-debug option (and this segfault only happened with debug logging enabled). If you are serious about performance, it's better to disable this in your nginx (or OpenResty) build.
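
(For illustration, the difference is just one ./configure flag; the module path below is a placeholder.)

    # debug build, as used here: debug-level logging is compiled in
    ./configure --add-module=../ngx_postgres --with-debug

    # production build: drop --with-debug so the debug logging code is compiled out
    ./configure --add-module=../ngx_postgres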

buddy-ekb commented 10 years ago

@agentzh, thank you for the remark!

That said, I consider this performance penalty negligible compared to the cost of transaction handling inside PostgreSQL, even for a simple "SELECT 1". As a matter of fact, I did some benchmarks on localhost ("debug vs. nodebug", "ipv4 vs. unix sockets", "ipv4 vs. ipv6") and it seems to me there is no significant difference between these options.