Jexocn closed 7 years ago
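I tested it and reproduced the crash. My first guess is that, under the memory limit, a memory allocation in the shared function prototype code fails without being checked, and the process then crashes.
Presumably all the memory allocations in the shared proto code that cloudwu added need such checks?
With debug symbols enabled, it crashes on this line: https://github.com/cloudwu/skynet/blob/master/3rd/lua/lapi.c#L1036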
f->p=luaM_newvector(L,n,struct Proto *);
for (i=0; i<n; i++) f->p[i]=NULL;
for (i=0; i<n; i++) {
f->p[i]=cloneproto(L, src->p[i]); // crashes here
}
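Thanks. This is a very subtle bug. Please take a look at my new commit and help review it :)
A brief description of the problem:
This code was written with allocation failure in mind, but because it was never actually tested under that condition, one case was missed.
A proto object has to be linked into its parent structure first, and only then can its internal data be filled in. Otherwise, when a memory allocation fails, the Lua GC runs a collection to try to reclaim unused memory. Because the proto object has not been attached yet, the object that was just allocated is collected immediately; that frees enough memory for the allocation to return normally. But the previously allocated object has already been freed, so the f->p pointer is null.
The fix is to separate the luaF_newproto call from cloneproto: assign first, then call cloneproto recursively.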
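The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Heap`, `Node`, `new_node`, `collect` and the two `survives_*` functions are hypothetical names invented for illustration, not part of Lua or skynet. The "collector" runs only when the live-object limit is hit, and it reclaims every object that is not reachable from the single root.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL 8

/* Hypothetical toy objects; not Lua's real Proto or GC structures. */
typedef struct Node {
    struct Node *child;  /* the only link the toy collector follows */
    int in_use;          /* 1 while the slot holds a live object */
    int marked;          /* scratch flag used during collect() */
    int id;              /* serial number, changes when a slot is reused */
} Node;

typedef struct Heap {
    Node pool[POOL];
    Node *root;          /* the only GC root */
    int limit;           /* max live objects before an emergency collection */
    int next_id;
} Heap;

static int live_count(const Heap *h) {
    int n = 0;
    for (int i = 0; i < POOL; i++) n += h->pool[i].in_use;
    return n;
}

/* Mark everything reachable from the root, then reclaim the rest. */
static void collect(Heap *h) {
    for (Node *n = h->root; n != NULL; n = n->child) n->marked = 1;
    for (int i = 0; i < POOL; i++) {
        if (!h->pool[i].marked) h->pool[i].in_use = 0;
        h->pool[i].marked = 0;
    }
}

/* Allocate one object; on "out of memory" run a collection and retry. */
static Node *new_node(Heap *h) {
    if (live_count(h) >= h->limit) collect(h);
    if (live_count(h) >= h->limit) return NULL;
    for (int i = 0; i < POOL; i++) {
        Node *n = &h->pool[i];
        if (!n->in_use) {
            n->in_use = 1;
            n->child = NULL;
            n->id = h->next_id++;
            return n;
        }
    }
    return NULL;
}

static void heap_init(Heap *h, int limit) {
    assert(limit >= 1 && limit <= POOL);
    memset(h, 0, sizeof *h);
    h->limit = limit;
    h->next_id = 1;
    h->root = new_node(h);
    assert(h->root != NULL);
}

/* Buggy order: allocate again while the new object is still unanchored. */
static int survives_when_linked_late(Heap *h) {
    Node *child = new_node(h);
    if (child == NULL) return 0;
    int id = child->id;
    (void)new_node(h);          /* may run collect(); child is unreachable */
    return child->in_use && child->id == id;
}

/* Fixed order: anchor the new object before any further allocation. */
static int survives_when_linked_first(Heap *h) {
    Node *child = new_node(h);
    if (child == NULL) return 0;
    int id = child->id;
    h->root->child = child;     /* reachable from the root from now on */
    (void)new_node(h);          /* collect() marks child, so it is kept */
    return child->in_use && child->id == id;
}
```

With a limit of 2, `survives_when_linked_late` returns 0 because the second allocation triggers a collection that reclaims the unanchored object and reuses its slot, while `survives_when_linked_first` returns 1 and the second allocation simply fails cleanly.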
Nice!
I need to get on with studying the Lua source code.
This code in the test script
for k,v in ipairs(names) do
libs[v] = require(v)
end
changed to
for k,v in ipairs(names) do
local ok, m = pcall(require, v)
if ok then
libs[v] = m
end
end
still produces a core dump. The backtrace is as follows:
Core was generated by `./skynet examples/config'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000415b08 in propagatemark ()
(gdb) bt
#0 0x0000000000415b08 in propagatemark ()
#1 0x0000000000416571 in singlestep ()
#2 0x0000000000416c38 in luaC_fullgc ()
#3 0x0000000000416cff in luaM_realloc_ ()
#4 0x00000000004117ce in cloneproto ()
#5 0x00000000004118fa in cloneproto ()
#6 0x0000000000411973 in lua_clonefunction ()
#7 0x000000000042330a in luaL_loadfilex ()
#8 0x000000000043063a in searcher_Lua ()
#9 0x0000000000413c2f in luaD_precall ()
#10 0x0000000000414003 in luaD_call ()
#11 0x0000000000414061 in luaD_callnoyield ()
#12 0x0000000000411589 in lua_callk ()
#13 0x000000000042fc7f in findloader ()
#14 0x000000000042fdb0 in ll_require ()
#15 0x0000000000413c2f in luaD_precall ()
#16 0x0000000000414003 in luaD_call ()
#17 0x00000000004116ce in lua_pcallk ()
#18 0x00000000004266f0 in luaB_pcall ()
#19 0x0000000000413c2f in luaD_precall ()
#20 0x000000000041f3ae in luaV_execute ()
#21 0x000000000041400f in luaD_call ()
#22 0x00000000004116ce in lua_pcallk ()
#23 0x000000000042661f in luaB_xpcall ()
#24 0x0000000000413c2f in luaD_precall ()
#25 0x000000000041f106 in luaV_execute ()
#26 0x000000000041344c in luaD_rawrunprotected ()
#27 0x00000000004140c0 in lua_resume ()
#28 0x00000000004274e7 in auxresume ()
#29 0x0000000000427817 in luaB_coresume ()
#30 0x0000000000413c2f in luaD_precall ()
#31 0x000000000041f3ae in luaV_execute ()
#32 0x000000000041400f in luaD_call ()
#33 0x0000000000414061 in luaD_callnoyield ()
#34 0x000000000041344c in luaD_rawrunprotected ()
#35 0x00000000004142ad in luaD_pcall ()
#36 0x000000000041164c in lua_pcallk ()
#37 0x00000000004266f0 in luaB_pcall ()
#38 0x0000000000413c2f in luaD_precall ()
#39 0x000000000041f3ae in luaV_execute ()
#40 0x000000000041400f in luaD_call ()
#41 0x0000000000414061 in luaD_callnoyield ()
#42 0x000000000041344c in luaD_rawrunprotected ()
#43 0x00000000004142ad in luaD_pcall ()
#44 0x000000000041164c in lua_pcallk ()
#45 0x00007f147adf5e29 in _cb (context=0x7f1478a28000, ud=0x7f1478a1e008,
type=1, session=1, source=0, msg=0x0, sz=0) at lualib-src/lua-skynet.c:50
#46 0x000000000040a038 in dispatch_message (ctx=0x7f1478a28000,
msg=0x7f147b7fae40) at skynet-src/skynet_server.c:259
#47 0x000000000040aad0 in skynet_context_message_dispatch (
sm=sm@entry=0x7f1481615920, q=q@entry=0x7f1478a131c0,
weight=weight@entry=0) at skynet-src/skynet_server.c:313
#48 0x000000000040b1cd in thread_worker (p=<optimized out>)
at skynet-src/skynet_start.c:133
#49 0x00007f14824ed182 in start_thread (arg=0x7f147b7fb700)
at pthread_create.c:312
#50 0x00007f1481b0847d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
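This time I looked into it, and it seems the bug is not caused by the modifications to the Lua VM but is a bug in the Lua GC itself :)
I'll first write a test case against stock Lua to reproduce it.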
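Sorry, it is my problem after all. Because the proto structure was changed to be shared, sizek (the number of constants) and sizep (the number of child function prototypes) are shared as well.
Stock Lua assigns f->sizek and f->sizep only after f->k and f->p have been allocated successfully, so during GC mark, if the pointer is null the loop length is also 0 and nothing goes wrong.
In the modified version, f->sp->sizek and f->sp->sizep are always nonzero, so an extra check is needed for whether f->k and f->p are null pointers.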
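The extra check can be sketched as follows. This is a simplified illustration, not the actual patch: the field names `sp`, `sizep` and `p` follow the ones mentioned above, but the struct layouts and the `mark_proto` function are assumptions made up for this example. The point is that once the size lives in a shared structure, it can be nonzero while the per-function array is still unallocated, so the traversal has to test the pointer itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the shared and per-function structs. */
typedef struct SharedProto {
    int sizep;              /* number of child prototypes, set up front */
} SharedProto;

typedef struct Proto {
    SharedProto *sp;        /* shared, read-only part */
    struct Proto **p;       /* child array; NULL until allocation succeeds */
    int marked;
} Proto;

/* Mark f and its children; returns how many prototypes were marked. */
static int mark_proto(Proto *f) {
    if (f == NULL) return 0;            /* slot not filled in yet */
    assert(f->sp != NULL);
    f->marked = 1;
    int count = 1;
    if (f->p == NULL) return count;     /* array missing: sizep cannot be trusted */
    for (int i = 0; i < f->sp->sizep; i++)
        count += mark_proto(f->p[i]);   /* entries may still be NULL */
    return count;
}
```

A half-built prototype with `sizep == 3` and `p == NULL` is marked without touching the missing array, and a prototype whose array holds one child and two NULL slots marks exactly two objects.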
Modifying test/testmemlimit.lua as follows reproduces the problem
The core backtrace is as follows: