apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0
1.81k stars 800 forks source link

Using prewarmed cache of escalate host leading to server restart #8351

Closed VibhorGupta1991 closed 2 years ago

VibhorGupta1991 commented 3 years ago

Hi Guys,

Facing a weird problem when using escalate plugin in my Traffic server version 9.0.2

remap config:

    map / http://a.com @plugin=/usr/local/libexec/trafficserver/tslua.so @pparam=/usr/local/etc/trafficserver/script/L1/main.lua @plugin=/usr/local/libexec/trafficserver/escalate.so @pparam=500:b.com

main.lua

function cache_lookup()
    local cache_status = ts.http.get_cache_lookup_status()
    ts.debug("cache" .. cache_status) 
end

function send_request()
    **ts.server_request.header["Accept-Encoding"] = ""** //Not getting errors if I comment this line of code. 
end

function do_remap()
    ts.hook(TS_LUA_HOOK_CACHE_LOOKUP_COMPLETE, cache_lookup)
    ts.hook(TS_LUA_HOOK_SEND_REQUEST_HDR, send_request)
    return 0
end

Environment : http://a.com/test500.html : 500 (forced) http://b.com/test500.html : 200

Error Scenario:

Step 1) Hit http://b.com/test5xx.html
    Txn Flow: Cache Miss -> Origin[200] -> Cached in ATS
            Output of debug tags:   cache_status 0
Step 2) Hit http://a.com/test5xx.html 
    Txn Flow:  Cache Miss -> Origin [500] -> Escalate -> lookup  http://b.com/test5xx.html [Found in cache]
        -> returns a proper response
                    -> ATS restarted with error highlighted below
Output of debug tags: 
           cache_status 0
           cache_status 2

Server restarts with the following error in monitor.log

[Sep 23 16:07:43.688] traffic_manager NOTE: [LocalManager::pollMgmtProcessServer] **Server Process terminated due to Sig 6: Aborted**
[Sep 23 16:07:43.688] traffic_manager NOTE: [Alarms::signalAlarm] Server Process was reset
[Sep 23 16:07:44.690] traffic_manager NOTE: [ProxyStateSet] Traffic Server Args: ' -M'
[Sep 23 16:07:44.691] traffic_manager NOTE: [LocalManager::listenForProxy] Listening on port: 8080 (ipv4)
[Sep 23 16:07:44.691] traffic_manager NOTE: [LocalManager::listenForProxy] Listening on port: 8080 (ipv6)
[Sep 23 16:07:44.691] traffic_manager NOTE: [LocalManager::startProxy] Launching ts process
[Sep 23 16:07:44.697] traffic_manager NOTE: [LocalManager::pollMgmtProcessServer] New process connecting fd '14'
[Sep 23 16:07:44.697] traffic_manager NOTE: [Alarms::signalAlarm] Server Process born

This works properly if I remove the code inside send request function or if I disable pristine header in records.config.

Not able to find any valid reason for this. It might be something to do with the basics but what I am not able to understand is that this issues is solved just by not using ts.server_request.header variable in send_request function.

Please help if you guys can relate to something you have already seen or faced such problem.

Thanks in advance.

randall commented 3 years ago

@VibhorGupta1991 Any chance there was a core file or a stack trace another log file?

VibhorGupta1991 commented 3 years ago

Hi @randall

Yup attaching the bt full output of the core file.

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0 <repeats 16 times>}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007f72d4fad535 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 15 times>, 140131026296968}}, sa_flags = -938034176, sa_restorer = 0x7f72cc0b6088}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f72d5b2755b in ink_abort (message_format=message_format@entry=0x7f72d5b928a7 "%s:%d: failed assertion `%s`") at ink_error.cc:99
        ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7f72cf0cf960, reg_save_area = 0x7f72cf0cf8a0}}
#3  0x00007f72d5b24b35 in _ink_assert (expression=expression@entry=0x562a12e3d3a0 "sdk_sanity_check_mbuffer(bufp) == TS_SUCCESS", file=file@entry=0x562a12e3f16d "traffic_server/InkAPI.cc",
    line=line@entry=2058) at ink_assert.cc:37
No locals.
#4  0x0000562a12b532b0 in _TSReleaseAssert (line=2058, file=0x562a12e3f16d "traffic_server/InkAPI.cc", text=0x562a12e3d3a0 "sdk_sanity_check_mbuffer(bufp) == TS_SUCCESS") at traffic_server/InkAPI.cc:559
No locals.
#5  TSHandleMLocRelease (bufp=0x7f72c816bc00, parent=<optimized out>, mloc=0x7f72cc0b6088) at traffic_server/InkAPI.cc:2058
        field_handle = <optimized out>
        obj = 0x7f72cc0b6088
#6  0x00007f72cc0253b1 in ?? ()
No symbol table info available.
#7  0x000000000000ea6c in ?? ()
No symbol table info available.
#8  0x000000000000ea6c in ?? ()
No symbol table info available.
#9  0x00007f72c748eb98 in ?? ()
No symbol table info available.
#10 0x00007f72cc025b6a in ?? ()
No symbol table info available.
#11 0x0000562a145e0570 in ?? ()
No symbol table info available.
#12 0x0000562a146a0920 in ?? ()
No symbol table info available.
#13 0x000000000000ea6c in ?? ()
No symbol table info available.
#14 0x00007f72c816b3d0 in ?? ()
No symbol table info available.
#15 0xfffffffffffffff8 in ?? ()
No symbol table info available.
#16 0x0000000000000000 in ?? ()
No symbol table info available.
shukitchan commented 3 years ago

I think the crash is on lua. The TSHttpTxnRedirectUrlSet API used in escalate plugin can sometimes alter the buffer used (in this case for send request, it seems).

And so the lua plugin may fail to clear it in that case.

VibhorGupta1991 commented 3 years ago

Can we do anything about this? @shukitchan

shukitchan commented 3 years ago

I am planning to change this block in plugins/lua/ts_lua_util.c

  case TS_EVENT_HTTP_SEND_REQUEST_HDR:

    lua_getglobal(L, TS_LUA_FUNCTION_SEND_REQUEST);

    if (lua_type(L, -1) == LUA_TFUNCTION) {
      ret = lua_resume(L, 0);
    }

    break;

to


  case TS_EVENT_HTTP_SEND_REQUEST_HDR:

    lua_getglobal(L, TS_LUA_FUNCTION_SEND_REQUEST);

    if (lua_type(L, -1) == LUA_TFUNCTION) {
      ret = lua_resume(L, 0);
    }

    break;

    if (http_ctx->server_request_hdrp != NULL) {
      TSHandleMLocRelease(http_ctx->server_request_bufp, TS_NULL_MLOC, http_ctx->server_request_hdrp);
      http_ctx->server_request_bufp = NULL;
      http_ctx->server_request_hdrp = NULL;
    }

It will be great if you want to try it out. I do hope to have my setup much later on to try that.

shukitchan commented 2 years ago

I can reproduce this and my suggested fix can correct the problem. In general, this is the problem with lua plugin, we kept the TSMLoc/TSMBuffer of client request/client response/server request/server response within a http_ctx and it exists throughout a transaction. But there are things (e.g. the TSHttpTxnRedirectUrlSet API) that can alter them within a transaction. So saving TSMLoc/TSMBuffer and using them later should be avoided.

I will provide a fix/PR for this general lua plugin problem.