Closed adamkijak closed 11 years ago
It seems that the database returns a response ID different from the original request ID. I'm not sure if it's an issue in the low-level cosocket layer.
Could you please add some logging code and capture the TCP packets to/from mongodb?
vi /usr/local/openresty/lualib/resty/mongol/colmt.lua
add ngx.log(ngx.ERR, req_id.."==="..r_res_id.."========="..r_req_id)
74 local function handle_reply ( conn , req_id , offset_i )
75 offset_i = offset_i or 0
76
77 local r_len , r_req_id , r_res_id , opcode = read_msg_header ( conn.sock )
78 assert ( req_id == r_res_id )
79 ngx.log(ngx.ERR, req_id.."==="..r_res_id.."========="..r_req_id)
80 assert ( opcode == opcodes.REPLY )
sudo tcpdump -X port 27017 > tcp.out
and run your test to reproduce the error. If reproduced, email the error.log and tcp.out to me.
I couldn't find your e-mail address, so I put it here: https://gist.github.com/3150587
Thanks. I checked the tcpdump output; the request ID is incremented by 1 before the response is handled.
I think the following modification would help. Would you try it?
vi colmt.lua
local id = 0
local function docmd ( conn , opcode , message , reponseTo )
- id = id + 1
- local requestID = num_to_le_uint ( id )
+ local req_id = id + 1
+ local requestID = num_to_le_uint ( req_id )
reponseTo = reponseTo or "\255\255\255\255"
opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )
local m = compose_msg ( requestID , reponseTo , opcode , message )
local sent = assert ( conn.sock:send ( m ) )
- return id , sent
+ return req_id , sent
end
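A standalone sketch of the patched function's ID logic (with the socket I/O stripped out) shows a side effect of this diff: since id itself is never updated, every call now computes the same request ID.

```lua
-- The ID logic of the patched docmd, isolated from any socket I/O.
local id = 0

local function next_request_id()
  local req_id = id + 1   -- reads id but never writes it back
  return req_id
end

print(next_request_id())  -- 1
print(next_request_id())  -- 1: id is still 0, so every request gets ID 1
```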
The 'assertion failed' error is gone, but now I get:
2012/07/23 15:04:37 [error] 14722#0: *379 [main] Uncatchable error occured: '/usr/local/openresty/lualib/resty/mongol/colmt.lua:67: timeout' while sending to client,
51 local function docmd ( conn , opcode , message , reponseTo )
52 --id = id + 1
53 --local requestID = num_to_le_uint ( id )
54 local req_id = id + 1
55 local requestID = num_to_le_uint ( req_id )
56 reponseTo = reponseTo or "\255\255\255\255"
57 opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )
58
59 local m = compose_msg ( requestID , reponseTo , opcode , message )
60 local sent = assert ( conn.sock:send ( m ) )
61
62 --return id , sent
63 return req_id , sent
64 end
65
66 local function read_msg_header ( sock )
67 local header = assert ( sock:receive ( 16 ) )
68
69 local length = le_uint_to_num ( header , 1 , 4 )
70 local requestID = le_uint_to_num ( header , 5 , 8 )
71 local reponseTo = le_uint_to_num ( header , 9 , 12 )
72 local opcode = le_uint_to_num ( header , 13 , 16 )
73
74 return length , requestID , reponseTo , opcode
75 end
76
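For reference, read_msg_header above parses the 16-byte standard header of the MongoDB wire protocol (messageLength, requestID, responseTo, opCode, each a little-endian 32-bit integer). A minimal plain-Lua stand-in for the library's le_uint_to_num helper, decoding a hand-built header:

```lua
-- Decode an unsigned little-endian integer from bytes i..j of s.
-- A minimal stand-in for the library's le_uint_to_num helper.
local function le_uint_to_num(s, i, j)
  local n = 0
  for k = j, i, -1 do          -- walk from most to least significant byte
    n = n * 256 + s:byte(k)
  end
  return n
end

-- A hand-built 16-byte header: length=36, requestID=7, responseTo=3, opcode=1 (OP_REPLY)
local header = string.char(36,0,0,0, 7,0,0,0, 3,0,0,0, 1,0,0,0)
print(le_uint_to_num(header,  1,  4))  -- 36
print(le_uint_to_num(header,  5,  8))  -- 7
print(le_uint_to_num(header,  9, 12))  -- 3
print(le_uint_to_num(header, 13, 16))  -- 1
```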
It seems the connection between mongodb and Lua is broken.
Does this issue still occur if you roll back the modification?
Yes.
So it's another issue. Please recompile OpenResty with the --with-debug option, and set error_log logs/error.log debug; in nginx.conf to get the debug log. Then email it to bigplum@gmail.com and I'll check it.
I've done more debugging and I think I found where the problem is.
First of all, your patch removes the incrementation of id, so I don't think it's correct.
local id = 0
local function docmd ( conn , opcode , message , reponseTo )
- id = id + 1
- local requestID = num_to_le_uint ( id )
+ local req_id = id + 1
+ local requestID = num_to_le_uint ( req_id )
reponseTo = reponseTo or "\255\255\255\255"
opcode = num_to_le_uint ( assert ( opcodes [ opcode ] ) )
local m = compose_msg ( requestID , reponseTo , opcode , message )
local sent = assert ( conn.sock:send ( m ) )
- return id , sent
+ return req_id , sent
end
To reproduce the error you have to:
It turns out that the id variable is shared between requests, and SOMETIMES the request ID in the reply doesn't match (because of a race condition, I think).
Hmm, the patch is incomplete; it should be modified to:
id = id + 1
local req_id = id
conn.sock:send(m) will call yield(), so if there are simultaneous requests, the global var id can be changed. Using req_id instead of id prevents another request from changing the req_id.
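The race can be illustrated in plain Lua, with coroutines standing in for concurrent requests and coroutine.yield() standing in for the yield inside conn.sock:send() (a sketch, not the library code):

```lua
-- Two coroutines share an upvalue `id`, just as concurrent requests
-- share colmt.lua's `id`. Each allocates an ID, "sends" (yields),
-- then reads its ID back after resuming.
local id = 0

local function broken()
  id = id + 1
  coroutine.yield()   -- send() can yield to another request here
  return id           -- re-reads the shared var after resuming
end

local function fixed()
  id = id + 1
  local req_id = id   -- snapshot into a local before yielding
  coroutine.yield()
  return req_id
end

local function race(fn)
  id = 0
  local a, b = coroutine.wrap(fn), coroutine.wrap(fn)
  a(); b()            -- both requests run up to their send()
  return a(), b()     -- resume both: each reports its request ID
end

print(race(broken))   -- 2  2  (request 1 wrongly sees request 2's ID)
print(race(fixed))    -- 1  2  (each request keeps its own ID)
```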
OK, but what about reading from the socket? Let's say we send two requests:
Result: assertion error.
It would not happen that the request with id=1 receives the response of the request with id=2. Nginx's event model ensures that each response is received by its corresponding request.
For instance, request1 uses local port 20001 to connect to mongodb, and request2 uses local port 20002. Response2, received on port 20002, will trigger request2's handler, even if it arrives before response1.
I think the problem with "assertion failed" is related to "lua_code_cache on". When it's off, it works without any problems.
Did you test it with "lua_code_cache on"?
If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.
I've tested with "lua_code_cache on" using ab; the following is the result and test code:
Concurrency Level: 1000
Time taken for tests: 28.071 seconds
Complete requests: 100000
Failed requests: 0
Write errors: 0
Total transferred: 17000000 bytes
HTML transferred: 300000 bytes
Requests per second: 3562.43 [#/sec] (mean)
Time per request: 280.707 [ms] (mean)
Time per request: 0.281 [ms] (mean, across all concurrent requests)
Transfer rate: 591.42 [Kbytes/sec] received
local mongo = require "resty.mongol"
conn = mongo:new()
ok, err = conn:connect("10.6.2.51","27019")
if not ok then
ngx.say("connect failed: "..err)
end
local db = conn:new_db_handle("test")
db:auth("admin","admin")
col = db:get_col("test")
r, err = col:insert({{name="test"}}, nil, true) -- the document argument was lost in the original post; {{name="test"}} is a placeholder
if not r then
ngx.status = 400
ngx.say("not ok")
else
ngx.say("ok")
end
local ok, err = conn:set_keepalive(0,1000)
if not ok then
ngx.say("failed to set keepalive: ", err)
return
end
I will set up more client hosts to do performance testing.
If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.
Yes, lua_code_cache should work like this, but I checked and the id var is changing: colmt.lua
82 Logger.warn(req_id .. ' == ' .. r_res_id)
83 assert ( req_id == r_res_id )
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 1 == 1, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 2 == 2, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 3 == 3, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 4 == 4 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517878 [main] 5 == 5 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 12 == 12, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 13 == 13, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:13 [warn] 28311#0: *517865 [main] 14 == 14, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 6 == 6 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 7 == 7 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 8 == 8 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 9 == 9 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 10 == 10 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 11 == 11 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 15 == 15, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 16 == 16, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517865 [main] 17 == 17, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 12 == 12 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 13 == 13 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:14 [warn] 28311#0: *517878 [main] 14 == 14 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 15 == 15 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 16 == 16 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 17 == 17 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 18 == 18, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 19 == 19, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517865 [main] 20 == 20, client: XXX, server: , request: "PUT /eurenet/serveur/devis/kaspersky%20devis%20en%20ligne/2011/06.2011/29.06.11 HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 18 == 18 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 19 == 19 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:15 [warn] 28311#0: *517878 [main] 20 == 20 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
2012/08/03 10:08:16 [warn] 28311#0: *517878 [main] 21 == 21 while sending to client, client: XXX, server: , request: "PUT /johan/Photos/2008/2008_12_30%20fin%20d'anne%CC%81e/recup/CONVAR663.jpg HTTP/1.1", host: "XXX"
It seems the Lua VM is not recreated; it just reloads the code.
It seems the Lua VM is not recreated; it just reloads the code.
Reload the code? I restart nginx every time I change nginx.conf.
I mean the Lua code. If lua_code_cache is off, nginx will reload the code, but not recreate the Lua VM.
/* load closure factory of script file to the top of lua stack, sp++ */
rc = ngx_http_lua_clfactory_loadfile(L, (char *) script);
If lua_code_cache is set to off, every request will create a new Lua VM, so the global var would not be kept and changed.
That's incorrect.
So any ideas how to fix that?
I've reproduced this issue. It seems related to the ngx_lua module; request 1 called request 2's functions to send socket data.
related issue: https://github.com/chaoslawful/lua-nginx-module/issues/150
This issue may be related to global variables. This is my test case; changing col to local makes it work.
local mongo = require "resty.mongol"
conn = mongo:new()
ok, err = conn:connect("10.6.2.51", 27000)
if not ok then
ngx.say("connect failed: "..err)
end
local db = conn:new_db_handle("test")
local col = db:get_col("test") -- the col should be local
if not col then ngx.say("get col failed") end
ngx.say(col.col)
r,err = col:insert({{name="dog",n=20}},false,true)
if not r then ngx.say("insert failed: "..err) end
ngx.say(col.col)
r = col:find_one({name="dog"})
ngx.say(r["name"])
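The reason col should be local: with lua_code_cache on, all requests in a worker share one Lua VM, so anything assigned without local persists across requests and is visible to all of them. A plain-Lua sketch of the two lifetimes:

```lua
-- With lua_code_cache on, one Lua VM serves many requests, so a
-- variable assigned without `local` behaves like shared state.
local function handler()
  count = (count or 0) + 1   -- global: survives from call to call
  local n = 1                -- local: fresh on every call
  return count, n
end

print(handler())  -- 1  1
print(handler())  -- 2  1: the global kept its value, the local did not
```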
I use only local variables in my code.
Could you provide test code to reproduce the issue?
I've pushed a patch to fix the issue with undefined variables. Please fetch the newest package and test.
It seems to happen randomly so it's hard to reproduce the error.
Any ideas?