Closed lukego closed 9 years ago
Any ideas about what the problem is or how best to report it to Mike/LuaJIT?
@lukego This might be a regression introduced by dd4ba88c (Improve optimizer performance). If you replace the new memoize
function for the former one, it works.
I pushed a patch that fixes the issue. Pflua team is not around these days so it might take a while until it lands in master. In the mean time you can revert dd4ba88c or apply this patch manually.
Thanks Diego. I will hold off on upgrading pflua for now. I am curious to know why the problem only happens when we upgrade both pflua and luajit together and whether this points to a bug in LuaJIT.
The pflang which causes the issue is
(icmp6 and
src net 3ffe:501:0:1001::2/128 and
dst net 3ffe:507:0:1:200:86ff:fe05:8000/116)
or
(ip6 and udp and
src net 3ffe:500::/28 and
dst net 3ffe:0501:4819::/64 and
src portrange 2397-2399 and
dst port 53)
When compiling it with LuaJIT 2.1.0 at commit b82fc3d I mostly get segfaults, sometimes Lua stack overflows in pf.utils.equals
. At commit 4da1bb6 (latest) I almost always get correct output, but sometimes I get one of two errors, sometimes with different stack length:
luajit-2.1.0-alpha: src/pf/ssa.lua:99: attempt to perform arithmetic on a nil value
stack traceback:
src/pf/ssa.lua:99: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:107: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
...
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:104: in function 'visit'
src/pf/ssa.lua:115: in function 'compute_use_counts'
src/pf/ssa.lua:146: in function 'f'
src/pf/utils.lua:158: in function 'fixpoint'
src/pf/ssa.lua:201: in function 'optimize_ssa'
src/pf/ssa.lua:302: in function 'convert_ssa'
src/pf.lua:40: in function 'compile_filter'
tools/pflua-compile:56: in main chunk
[C]: at 0x0804c170
luajit-2.1.0-alpha: src/pf/ssa.lua:72: bad argument #1 to 'insert' (table expected, got nil)
stack traceback:
[C]: in function 'insert'
src/pf/ssa.lua:72: in function 'compile_bool'
src/pf/ssa.lua:68: in function 'compile_bool'
src/pf/ssa.lua:68: in function 'compile_bool'
src/pf/ssa.lua:67: in function 'compile_bool'
src/pf/ssa.lua:69: in function 'compile_bool'
src/pf/ssa.lua:73: in function 'compile_bool'
src/pf/ssa.lua:68: in function 'compile_bool'
src/pf/ssa.lua:88: in function 'lower'
src/pf/ssa.lua:302: in function 'convert_ssa'
src/pf.lua:40: in function 'compile_filter'
tools/pflua-compile:56: in main chunk
[C]: at 0x0804c170
When I apply #226 I almost always get the "invalid value (table) at index 4 in table for 'concat'" error, sometimes with different index. Inserting print(parts[i])
after line 93 (https://github.com/Igalia/pflua/blob/fix-issue-225/src/pf/optimize.lua#L93) removes that error. I believe this proves that the issue is in LuaJIT. The SSA errors above still happen very rarely though.
Thank you for the analysis, @mpeterv ! I reproduced the results you mentioned with b82fc3d and 4da1bb6 (though I skipped the print step). The SSA errors above happen quite commonly for me on 4da1bb6; both pflua-match and pflua-compile on the pflang expression you provided have it occur around 50% of the time.
I think the next step is to minimize the test case as much as possible and give it to Mike; do you want to do this, or should I give it a go?
@kbara my progress so far. Surprisingly, reducing input table removes the bug somehow. https://gist.github.com/mpeterv/2362928a80d2b92cb17c
Looks like the zeros can be moved around sometimes, so that the table could probably be reduced to an array of zeros of some special size.
Curiouser and curiouser; thanks for sharing your progress.
Snabb Switch master is has been using LuaJIT 04dc64b since we added pflua support and we have never seen an issue with that version yet. Can it be a newly introduced bug in LuaJIT v2.1?
Reduced the failing case a little bit, the bug shows up as segfault now.
-- The following code segfaults with JIT on.
local function recurse(expr)
if type(expr) == 'table' then
local t = {0}
for i = 1, #expr do
t[i] = recurse(expr[i])
end
end
return
end
recurse {
{
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0,
{
{ 0 },
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
},
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
{ 0, 0, 0, 0 }
}
}
I get a segfault every time on 4da1bb6 and never on d8cfc37, which seems to be the last bug-less commit.
(Sorry, I am mistaken when I say we don't see a problem with the same LuaJIT on Snabb Switch master. That is exactly what is happening on our next
branch now.)
I built LuaJIT with debug symbols. This is the conflicting line:
Core was generated by `luajit recurse.lua'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 rec_idx_bump (J=J@entry=0x407ef558, ix=ix@entry=0x7fff82f53a90) at lj_record.c:1171
1171 if (tvisnil(o)) settabV(J->L, o, tpl);
Reverting b82fc3d (plus 60fb3fe and b82fc3d, fixes for b82fc3d), fixes the issue. I tried to investigate more on this but I couldn't figure out anything useful. I think the test case by @mpeterv is good enough to report to Mike, he might know what's going on.
I've sent a message to the LuaJIT mailing list.
The bug shown by the test case you provided to the LuaJIT mailing list has been fixed, but pflua-compile still crashes about 10% of the time with the pflang expression you provided, so it looks like there's another bug too.
Both of the above stack traces (ending in compile_bool and visit) still occur.
@kbara in fact the second, longer testcase I've sent as a link still fails.
@mpeterv Good point, it'd be good to email the list and point that out. Thank you!
@kbara I'll try to reduce the test a bit again and then send it.
458a40b7242aefe2f0893019f0451fb2f2deccd9 fixes the new test case, but pflua-compile keeps having both of the above backtraces, unfortunately.
Interesting. As before, the backtraces can be observed for a simple 'tcp' pflang, too, but now much less often - once in 100-200 runs vs. once in 5-10 runs for the original large pflang.
Yes. And when I start with the input to 'lower' (from the icmp6/etc pflang expression above), the backtraces can't be observed at all; thousands of runs complete successfully.
It seems that the root of the problem is somewhere in the parser, because if I replace parse.parse("tcp")
with {"tcp"}
and remove parser import, I can't observe the bug, but if the parser is imported (but even when not used), I get the backtraces.
Of course it is simpler than that, nothing wrong with the parser, simply removing the stack slot its local variable occupies gets rid of backtraces.
Indeed. I have a small test file, and it demonstrates the backtraces iff parse is being required, even though I'm not using it at all:
p = require("parse")
ssa = require("ssa")
data = assert(loadstring('anf = ' .. io.open("anf"):read("*a")))()
ssa.lower(anf)
Amusingly, thus far, if I try to minimize parse.lua (for instance, by deleting everything after ~line 500, keeping it syntactically valid), I no longer can reproduce the backtraces.
Simply changing length of a name of a local variable can get rid of the backtraces for me (one backtrace in 100-500 runs with one name, no backtraces in at least 10000 runs with a longer name).
I can believe it; it's clearly quite a sensitive bug to trigger. We're getting closer to tracking it down, but the symptoms in the meanwhile are quite interesting.
https://gist.github.com/kbara/5a93c82f83c62515cf84 reproduces the problem ~10% of the time, or a bit more; it uses the anf from the icmp6 pflang, is modified to not need to import the constants file, and is stand-alone. It's still way too big, though.
I halved the number of lines and sent it to the list, though it's still on the wrong side of 1000 lines.
I've managed to create some dumps from the minimized code, by varying -Ohotloop=N. The smallest dump is under 600 lines.
http://www.freelists.org/post/luajit/Segfault-introduced-in-b82fc3d-on-v21-branch,7 and the gist have the details.
This is now fixed in fe565222a1cbf28fbae266da35c8a703fdcfa0dd ; Mike Pall has disabled the optimization for now.
Reopening to unlink from 226, which turns out to be unrelated.
Howdy!
On the Snabb Switch
next
branch there is a regression in theapps.packet_filter.pcap_filter
selftest. This seems to be caused by the combination of upgrading pflua and luajit. Is there perhaps a new bug introduced in LuaJIT that affects pflua?Here is what happens:
and
git bisect
on luajit says this is the problem commit:I have tested with LuaJIT
4da1bb6
that does seem to contain a fix for the commit above, but the problem persists.