bigplum / lua-resty-mongol

An ngx_lua driver for MongoDB -- deprecated, as it is no longer updated for newer MongoDB versions

Uncatchable error occured: '/usr/local/openresty/lualib/resty/mongol/colmt.lua:78: assertion failed!' #4

Closed. adamkijak closed this issue 11 years ago.

adamkijak commented 12 years ago
Uncatchable error occured: '/usr/local/openresty/lualib/resty/mongol/colmt.lua:78: assertion failed!'

It seems to happen randomly so it's hard to reproduce the error.

Any ideas?

bigplum commented 12 years ago

It seems that the database returns a response id different from the original request id. I'm not sure whether it's related to the low-level cosocket layer.

So would you please add some logging code and capture the TCP packets to and from mongodb:

  1. vi /usr/local/openresty/lualib/resty/mongol/colmt.lua and add ngx.log(ngx.ERR, req_id.."==="..r_res_id.."========="..r_req_id) so that it looks like this:

 74 local function handle_reply ( conn , req_id , offset_i )
 75     offset_i = offset_i or 0
 76 
 77     local r_len , r_req_id , r_res_id , opcode = read_msg_header ( conn.sock )
 78     assert ( req_id == r_res_id )
 79     ngx.log(ngx.ERR, req_id.."==="..r_res_id.."========="..r_req_id)
 80     assert ( opcode == opcodes.REPLY )

  2. Run sudo tcpdump -X port 27017 > tcp.out and then run your test to reproduce the error.

If reproduced, email the error.log and tcp.out to me.

adamkijak commented 12 years ago

If reproduced, email the error.log and tcp.out to me.

I couldn't find your e-mail address, so I put it here: https://gist.github.com/3150587

bigplum commented 12 years ago

Thanks. I checked the tcpdump output; the request id is incremented by 1 before the response is handled.

I think the following modification would help. Would you try it?

vi colmt.lua

 local id = 0
 local function docmd ( conn , opcode , message ,  reponseTo )
-    id = id + 1
-    local requestID = num_to_le_uint ( id )
+    local req_id = id + 1
+    local requestID = num_to_le_uint ( req_id )
     reponseTo = reponseTo or "\255\255\255\255"
     opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )

     local m = compose_msg ( requestID , reponseTo , opcode , message )
     local sent = assert ( conn.sock:send ( m ) )

-    return id , sent
+    return req_id , sent
 end
adamkijak commented 12 years ago

The 'assertion failed' error is gone, but now I get:

2012/07/23 15:04:37 [error] 14722#0: *379 [main] Uncatchable error occured: '/usr/local/openresty/lualib/resty/mongol/colmt.lua:67: timeout' while sending to client,

 51 local function docmd ( conn , opcode , message ,  reponseTo )
 52     --id = id + 1
 53     --local requestID = num_to_le_uint ( id )
 54     local req_id = id + 1
 55     local requestID = num_to_le_uint ( req_id )
 56     reponseTo = reponseTo or "\255\255\255\255"
 57     opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )
 58 
 59     local m = compose_msg ( requestID , reponseTo , opcode , message )
 60     local sent = assert ( conn.sock:send ( m ) )
 61 
 62     --return id , sent
 63     return req_id , sent
 64 end
 65 
 66 local function read_msg_header ( sock )
 67     local header = assert ( sock:receive ( 16 ) )
 68 
 69     local length = le_uint_to_num ( header , 1 , 4 )
 70     local requestID = le_uint_to_num ( header , 5 , 8 )
 71     local reponseTo = le_uint_to_num ( header , 9 , 12 )
 72     local opcode = le_uint_to_num ( header , 13 , 16 )
 73 
 74     return length , requestID , reponseTo , opcode
 75 end
 76 
bigplum commented 12 years ago

It seems the connection between mongo and Lua is broken.

Does this issue still exist if you roll back the modification?

adamkijak commented 12 years ago

Yes.

bigplum commented 12 years ago

So it's another issue. Please recompile openresty with the --with-debug option and change nginx.conf to error_log logs/error.log debug to get a debug log. Then email it to bigplum@gmail.com and I'll check it.

adamkijak commented 12 years ago

I've done more debugging and I think I found where the problem is.

First of all, your patch removes the incrementing of id, so I don't think it's right.

local id = 0
 local function docmd ( conn , opcode , message ,  reponseTo )
-    id = id + 1
-    local requestID = num_to_le_uint ( id )
+    local req_id = id + 1
+    local requestID = num_to_le_uint ( req_id )
     reponseTo = reponseTo or "\255\255\255\255"
     opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )

     local m = compose_msg ( requestID , reponseTo , opcode , message )
     local sent = assert ( conn.sock:send ( m ) )

-    return id , sent
+    return req_id , sent
 end

To reproduce the error you have to:

It turns out that the id variable is shared between requests, and SOMETIMES the request id in the reply doesn't match (because of a race condition, I think).
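
The first point (the removed increment) can be seen in isolation. A minimal sketch, using a hypothetical helper rather than driver code: because id is never updated, every call computes the same request id.

local id = 0

local function next_req_id()
    local req_id = id + 1   -- id itself stays 0 forever
    return req_id           -- so this always returns 1
end

print(next_req_id(), next_req_id(), next_req_id())  -- prints 1 1 1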

bigplum commented 12 years ago

Hmm, the patch was incomplete; it should be modified to:

id = id + 1
local req_id = id

conn.sock:send ( m ) will call yield(), so if there are simultaneous requests the global var id can be changed in the meantime. Using req_id instead of id avoids having req_id changed by another request.
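
Putting the two pieces together, a sketch of how docmd would look with the complete fix (the increment kept, plus a local copy taken before the send can yield). This assumes the module-local helpers num_to_le_uint, compose_msg and opcodes from colmt.lua are in scope, as in the snippets above:

local id = 0

local function docmd ( conn , opcode , message , reponseTo )
    id = id + 1
    local req_id = id          -- local copy: a request that runs while
                               -- send() yields cannot change this value
    local requestID = num_to_le_uint ( req_id )
    reponseTo = reponseTo or "\255\255\255\255"
    opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )

    local m = compose_msg ( requestID , reponseTo , opcode , message )
    local sent = assert ( conn.sock:send ( m ) )   -- may yield to other requests

    return req_id , sent
end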

adamkijak commented 12 years ago

OK, but what about reading from the socket? Let's say we send two requests:

  1. Request with id=1 is sent.
  2. Request with id=2 is sent.
  3. Socket is waiting for response to request with id=1 but response to request with id=2 arrives quicker.

Result: assertion error.

bigplum commented 12 years ago

It cannot happen that the request with id=1 receives the response of the request with id=2. The nginx event model ensures that each response is received by its corresponding request.

For instance, request1 uses local port 20001 to connect to mongodb, and request2 uses local port 20002. Response2, received on port 20002, will trigger request2's handler even if it arrives before response1.

adamkijak commented 12 years ago

I think the problem with "assertion failed" is related to "lua_code_cache on". When it's off it works without any problems.

Did you test it with "lua_code_cache on"?

bigplum commented 12 years ago

If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.

I've tested with "lua_code_cache on" using ab; the result and test code follow:

Concurrency Level:      1000
Time taken for tests:   28.071 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Total transferred:      17000000 bytes
HTML transferred:       300000 bytes
Requests per second:    3562.43 [#/sec] (mean)
Time per request:       280.707 [ms] (mean)
Time per request:       0.281 [ms] (mean, across all concurrent requests)
Transfer rate:          591.42 [Kbytes/sec] received

local mongo = require "resty.mongol"
conn = mongo:new()
ok, err = conn:connect("10.6.2.51","27019")
if not ok then
    ngx.say("connect failed: "..err)
end

local db = conn:new_db_handle("test")
db:auth("admin","admin")
col = db:get_col("test")

r, err = col:insert({{name="dog"}}, nil, true)  -- hypothetical placeholder document; the one originally inserted here is not shown

if not r then
    ngx.status = 400
    ngx.say("not ok")
else
    ngx.say("ok")
end

local ok, err = conn:set_keepalive(0,1000)
if not ok then
    ngx.say("failed to set keepalive: ", err)
    return
end

I will set up more client hosts to do a performance test.

adamkijak commented 12 years ago

If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.

Yes, lua_code_cache should work like this, but I checked and the id var is still changing: colmt.lua

 82     Logger.warn(req_id .. ' == ' .. r_res_id)
 83     assert ( req_id == r_res_id )

2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 1 == 1, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 2 == 2, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 3 == 3, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 4 == 4 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 5 == 5 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 12 == 12, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 13 == 13, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 14 == 14, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 6 == 6 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 7 == 7 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 8 == 8 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 9 == 9 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 10 == 10 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 11 == 11 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 15 == 15, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 16 == 16, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 17 == 17, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 12 == 12 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 13 == 13 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 14 == 14 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 15 == 15 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 16 == 16 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 17 == 17 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 18 == 18, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 19 == 19, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 20 == 20, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 18 == 18 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 19 == 19 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 20 == 20 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:16 [warn] 28311#0: *517878 [main] 21 == 21 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
bigplum commented 12 years ago

It seems the Lua VM is not recreated; only the code is reloaded.

adamkijak commented 12 years ago

It seems the Lua VM is not recreated; only the code is reloaded.

Reload the code? I restart nginx every time I change nginx.conf.

bigplum commented 12 years ago

I mean the Lua code. If lua_code_cache is off, nginx will reload the code but not recreate the Lua VM:

    /*  load closure factory of script file to the top of lua stack, sp++ */
    rc = ngx_http_lua_clfactory_loadfile(L, (char *) script);

If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.

That statement was incorrect.
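
Whichever way the VM question is settled, the behaviour that matters for this bug can be sketched with a hypothetical module (counter.lua, not part of this driver): with the code cache on, require() caches the module in the worker's single Lua VM, so module-level state such as colmt.lua's id persists and is shared across all requests handled by that worker; with the cache off, the file is re-executed and the state starts over.

-- counter.lua (hypothetical): module-level state, like "local id = 0" in colmt.lua
local hits = 0

local _M = {}

function _M.bump()
    hits = hits + 1      -- with lua_code_cache on, keeps growing across
    return hits          -- every request served by this nginx worker
end

return _M

-- usage in a content_by_lua block (assuming counter.lua is on the package path):
-- ngx.say(require("counter").bump())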

adamkijak commented 12 years ago

So any ideas how to fix that?

bigplum commented 12 years ago

I've reproduced this issue. It seems related to the ngx_lua module: request 1 called request 2's functions to send socket data.

related issue: https://github.com/chaoslawful/lua-nginx-module/issues/150

bigplum commented 12 years ago

This issue may be related to global variables. This is my test case; changing col to a local makes it work.

local mongo = require "resty.mongol"
conn = mongo:new()
ok, err = conn:connect("10.6.2.51", 27000)
if not ok then
    ngx.say("connect failed: "..err)
end

local db = conn:new_db_handle("test")

local col = db:get_col("test")             -----------> the col should be local
if not col then ngx.say("get col failed") end 

ngx.say(col.col)

r,err = col:insert({{name="dog",n=20}},false,true)
if not r then ngx.say("insert failed: "..err) end 

ngx.say(col.col)

r = col:find_one({name="dog"})
ngx.say(r["name"])
adamkijak commented 12 years ago

I use only local variables in my code.

bigplum commented 12 years ago

Could you provide me with test code to reproduce the issue?

bigplum commented 12 years ago

I've pushed a patch to fix the issue with undefined variables. Please fetch the newest package and test it.