apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0

ATS 8.0.3 segmentation fault related to compress plugin #5787

Open esmq2092 opened 5 years ago

esmq2092 commented 5 years ago

I have built ATS 8.0.3 with OpenSSL 1.1.1c and brotli-1.0.7.

I found that ATS crashes frequently when both of the following conditions are met: 1) the request comes in over HTTPS, and 2) the compress plugin is enabled with remove-accept-encoding set to true or with supported-algorithms set to "gzip,br".
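For reference, a minimal compress.config along the lines of the setup that triggers the crash would look roughly like this (the content-type glob is just illustrative; the key directives are the two mentioned above):

cache true
remove-accept-encoding true
supported-algorithms gzip,br
compressible-content-type *text*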


I suppose that some browsers advertise brotli support only over HTTPS, so the problem only happens for HTTPS requests.

I also checked HTTP + brotli, and it works fine.

With debug logging turned on, I found nothing useful.

randall commented 5 years ago

Can you post a stack trace from the crash?

esmq2092 commented 5 years ago

It crashes too often, but no core files are generated...

[Aug 8 10:34:53.359] {0x7f10895d7740} NOTE: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 11: Segmentation fault
[Aug 8 10:34:53.359] {0x7f10895d7740} NOTE: [Alarms::signalAlarm] Server Process was reset
[Aug 8 10:34:54.362] {0x7f10895d7740} NOTE: [ProxyStateSet] Traffic Server Args: ' -M'
[Aug 8 10:34:54.362] {0x7f10895d7740} NOTE: [LocalManager::listenForProxy] Listening on port: 80 (ipv4)
[Aug 8 10:34:54.362] {0x7f10895d7740} NOTE: [LocalManager::listenForProxy] Listening on port: 443 (ipv4)
[Aug 8 10:34:54.362] {0x7f10895d7740} NOTE: [LocalManager::startProxy] Launching ts process
[Aug 8 10:34:54.374] {0x7f10895d7740} NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '14'
[Aug 8 10:34:54.374] {0x7f10895d7740} NOTE: [Alarms::signalAlarm] Server Process born

bryancall commented 5 years ago

What does sysctl -a | grep kernel.core on your system show?
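For reference, on a typical Linux box that shows something like the following (the values here are only an example); if the core size ulimit for the traffic_server process is 0, or kernel.core_pattern points somewhere the process cannot write, no core files will be produced:

kernel.core_pattern = core
kernel.core_pipe_limit = 0
kernel.core_uses_pid = 1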

jvgutierrez commented 5 years ago

So I can confirm that compress.so is somehow broken and causing segfaults in ATS 8.0.3. We're running 42 instances, 39 of them healthy and 3 showing frequent segfaults; the main difference is that those 3 instances have compress.so enabled. In our scenario the ATS instances only receive plain-text requests (but we do have HTTPS-enabled origin servers), and we don't use libbrotli (it is not even linked against the compress plugin or traffic_server itself).

We've confirmed that disabling compress.so in the affected instances solves the issue. Here is the compress.so plugin configuration used:

$ sudo cat /etc/trafficserver/compress.config
cache true
remove-accept-encoding true
compressible-content-type *text*
compressible-content-type *json*
compressible-content-type *html*
compressible-content-type *script*
compressible-content-type *xml*
compressible-content-type *icon*
compressible-content-type *ms-fontobject*
compressible-content-type *x-font*
compressible-content-type *sla*
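
For completeness, disabling the plugin in the affected instances just means commenting out (or removing) its entry in plugin.config, roughly like this, assuming the config path shown above:

# compress.so /etc/trafficserver/compress.config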

and here is one of the stacktraces we're getting:

#0  0x000000000062064f in Cache::open_write(Continuation*, ats::CryptoHash const*, HTTPInfo*, long, ats::CryptoHash const*, CacheFragType, char const*, int) ()
#1  0x00000000005fd047 in CacheProcessor::open_write(Continuation*, int, HttpCacheKey const*, HTTPHdr*, HTTPInfo*, long, CacheFragType) ()
#2  0x000000000053dc29 in HttpCacheSM::state_cache_open_write(int, void*) ()
#3  0x00000000006aa351 in EThread::process_event(Event*, int) ()
#4  0x00000000006aab60 in EThread::execute_regular() ()
#5  0x00000000006a99a9 in ?? ()
#6  0x00002ad6714054a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#7  0x00002ad672016d0f in clone () from /lib/x86_64-linux-gnu/libc.so.6

We do have core dumps of traffic_server available for further analysis. Is there anything we should check that could help debug this issue?

jvgutierrez commented 5 years ago

After further analysis of one of our core dumps, we got the following information:

(gdb) bt
#0  0x0000000000620bef in HTTPInfo::object_key_get (this=<optimized out>, hash=0x2b1f200f5508) at ./proxy/hdrs/HTTP.h:1539
#1  Cache::open_write (this=<optimized out>, cont=0x2b1ef0135880, key=0x2b1ef0135948, info=0x2b1e540fd0e0, apin_in_cache=0, type=CACHE_FRAG_TYPE_HTTP, hostname=0x2b1e90089835 "upload.wikimedia.org",
    host_len=20) at CacheWrite.cc:1767
#2  0x00000000005fd5e7 in CacheProcessor::open_write (this=<optimized out>, cont=0x2b1ef0135948, expected_size=<optimized out>, key=0x0, request=<optimized out>, old_info=0xb5602d7aa8f69200,
    pin_in_cache=<optimized out>, type=<optimized out>) at Cache.cc:3257
#3  0x000000000053e0d9 in HttpCacheSM::open_write (key=0xb5602d7aa8f69200, url=<optimized out>, request=0x0, old_info=0x0, pin_in_cache=0, allow_multiple=false, this=<optimized out>,
    retry=<optimized out>) at HttpCacheSM.cc:343
#4  HttpCacheSM::state_cache_open_write (this=0x2b1ef0135880, event=<optimized out>, data=0x2b1eb806da20) at HttpCacheSM.cc:215
#5  0x00000000006ab461 in Continuation::handleEvent (this=<optimized out>, event=<optimized out>, data=<optimized out>) at ./I_Continuation.h:160
#6  EThread::process_event (this=0x2b1e0e3c6010, e=0x2b1eb806da20, calling_code=2) at UnixEThread.cc:131
#7  0x00000000006abc70 in EThread::execute_regular (this=0x2b1e0e3c6010) at UnixEThread.cc:244
#8  0x00000000006aaab9 in spawn_thread_internal (a=0x22084d0) at Thread.cc:85
#9  0x00002b1e0aec64a4 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#10 0x00002b1e0bad7d0f in clone () from /lib/x86_64-linux-gnu/libc.so.6

The crash happens in ./proxy/hdrs/HTTP.h:1539:

memcpy(pi, m_alt->m_object_key, CRYPTO_HASH_SIZE);
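
A minimal gdb sketch, assuming debug symbols are loaded, for checking in one of the core dumps whether the HTTPInfo's m_alt pointer is null or points at freed memory (this is <optimized out> in frame 0, so the info argument from frame 1 of the backtrace above can be inspected instead; the address is specific to that core):

(gdb) frame 1
(gdb) print *(HTTPInfo *) 0x2b1e540fd0e0
(gdb) print ((HTTPInfo *) 0x2b1e540fd0e0)->m_alt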

ema commented 5 years ago

On a production host, we have set cache false so that only one variant of the object is cached instead of both the compressed and uncompressed ones. With that configuration, traffic_server has been running without crashes for more than 4 days. It therefore seems likely that the crash is due to an interaction between compress.so and alternate handling.
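For anyone hitting this, the workaround amounts to flipping a single directive in the compress.config shown earlier (the compressible-content-type lines stay as they were):

cache false
remove-accept-encoding true
compressible-content-type *text*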

All our dashboards are public, so feel free to check out the ATS dashboards for the above-mentioned host to get an idea of the amount of traffic it serves, in case that's of interest.

bryancall commented 3 years ago

@ema Is this still an issue?

ema commented 3 years ago

@ema Is this still an issue?

I don't know! We disabled the compress plugin altogether in December 2019. Besides causing the segfaults described here, we found that it was introducing up to 30ms of TTFB slowdown, and we do not really need it given that client<->edge compression is handled by Varnish in our setup. So yeah, I can't confirm or deny. :-)