Igalia / pflua

Packet filtering in Lua
Other
313 stars 39 forks source link

Error with recent LuaJIT #225

Closed lukego closed 9 years ago

lukego commented 9 years ago

Howdy!

On the Snabb Switch next branch there is a regression in the apps.packet_filter.pcap_filter selftest. This seems to be caused by the combination of upgrading pflua and luajit. Is there perhaps a new bug introduced in LuaJIT that affects pflua?

Here is what happens:

$ sudo ./snabb snsh -t apps.packet_filter.pcap_filter
selftest: pcap_filter
../deps/pflua/src/pf/optimize.lua:78: invalid value (table) at index 4 in table for 'concat'
stack traceback:
        core/main.lua:116: in function <core/main.lua:114>
        [C]: in function 'f'
        ../deps/pflua/src/pf/optimize.lua:78: in function 'cfkey'
        ../deps/pflua/src/pf/optimize.lua:281: in function 'simplify'
        ../deps/pflua/src/pf/optimize.lua:217: in function 'simplify'
        ../deps/pflua/src/pf/optimize.lua:216: in function 'simplify'
        ../deps/pflua/src/pf/optimize.lua:216: in function 'simplify'
        ../deps/pflua/src/pf/optimize.lua:216: in function 'simplify'
        ../deps/pflua/src/pf/optimize.lua:740: in function 'f'
        ../deps/pflua/src/pf/utils.lua:158: in function 'fixpoint'
        ../deps/pflua/src/pf/optimize.lua:749: in function 'optimize'
        ...

and git bisect on luajit says this is the problem commit:

b82fc3d Bump table allocations retroactively if they grow later on.

I have tested with LuaJIT 4da1bb6 that does seem to contain a fix for the commit above, but the problem persists.

lukego commented 9 years ago

Any ideas about what the problem is or how best to report it to Mike/LuaJIT?

dpino commented 9 years ago

@lukego This might be a regression introduced by dd4ba88c (Improve optimizer performance). If you replace the new memoize function for the former one, it works.

I pushed a patch that fixes the issue. Pflua team is not around these days so it might take a while until it lands in master. In the mean time you can revert dd4ba88c or apply this patch manually.

lukego commented 9 years ago

Thanks Diego. I will hold off on upgrading pflua for now. I am curious to know why the problem only happens when we upgrade both pflua and luajit together and whether this points to a bug in LuaJIT.

mpeterv commented 9 years ago

The pflang which causes the issue is

(icmp6 and
 src net 3ffe:501:0:1001::2/128 and
 dst net 3ffe:507:0:1:200:86ff:fe05:8000/116)
or
(ip6 and udp and
 src net 3ffe:500::/28 and
 dst net 3ffe:0501:4819::/64 and
 src portrange 2397-2399 and
 dst port 53)

(https://github.com/SnabbCo/snabbswitch/blob/71064ea9c03c7b8d4fbb0283b3dc553740bf321f/src/apps/packet_filter/pcap_filter.lua#L66-L90)

When compiling it with LuaJIT 2.1.0 at commit b82fc3d I mostly get segfaults, sometimes Lua stack overflows in pf.utils.equals. At commit 4da1bb6 (latest) I almost always get correct output, but sometimes I get one of two errors, sometimes with different stack length:

luajit-2.1.0-alpha: src/pf/ssa.lua:99: attempt to perform arithmetic on a nil value
stack traceback:
    src/pf/ssa.lua:99: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:107: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    ...
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:104: in function 'visit'
    src/pf/ssa.lua:115: in function 'compute_use_counts'
    src/pf/ssa.lua:146: in function 'f'
    src/pf/utils.lua:158: in function 'fixpoint'
    src/pf/ssa.lua:201: in function 'optimize_ssa'
    src/pf/ssa.lua:302: in function 'convert_ssa'
    src/pf.lua:40: in function 'compile_filter'
    tools/pflua-compile:56: in main chunk
    [C]: at 0x0804c170
luajit-2.1.0-alpha: src/pf/ssa.lua:72: bad argument #1 to 'insert' (table expected, got nil)
stack traceback:
    [C]: in function 'insert'
    src/pf/ssa.lua:72: in function 'compile_bool'
    src/pf/ssa.lua:68: in function 'compile_bool'
    src/pf/ssa.lua:68: in function 'compile_bool'
    src/pf/ssa.lua:67: in function 'compile_bool'
    src/pf/ssa.lua:69: in function 'compile_bool'
    src/pf/ssa.lua:73: in function 'compile_bool'
    src/pf/ssa.lua:68: in function 'compile_bool'
    src/pf/ssa.lua:88: in function 'lower'
    src/pf/ssa.lua:302: in function 'convert_ssa'
    src/pf.lua:40: in function 'compile_filter'
    tools/pflua-compile:56: in main chunk
    [C]: at 0x0804c170

When I apply #226 I almost always get the "invalid value (table) at index 4 in table for 'concat'" error, sometimes with different index. Inserting print(parts[i]) after line 93 (https://github.com/Igalia/pflua/blob/fix-issue-225/src/pf/optimize.lua#L93) removes that error. I believe this proves that the issue is in LuaJIT. The SSA errors above still happen very rarely though.

kbara commented 9 years ago

Thank you for the analysis, @mpeterv ! I reproduced the results you mentioned with b82fc3d and 4da1bb6 (though I skipped the print step). The SSA errors above happen quite commonly for me on 4da1bb6; both pflua-match and pflua-compile on the pflang expression you provided have it occur around 50% of the time.

I think the next step is to minimize the test case as much as possible and give it to Mike; do you want to do this, or should I give it a go?

mpeterv commented 9 years ago

@kbara my progress so far. Surprisingly, reducing input table removes the bug somehow. https://gist.github.com/mpeterv/2362928a80d2b92cb17c

mpeterv commented 9 years ago

Looks like the zeros can be moved around sometimes, so that the table could probably be reduced to an array of zeros of some special size.

kbara commented 9 years ago

Curiouser and curiouser; thanks for sharing your progress.

lukego commented 9 years ago

Snabb Switch master is has been using LuaJIT 04dc64b since we added pflua support and we have never seen an issue with that version yet. Can it be a newly introduced bug in LuaJIT v2.1?

mpeterv commented 9 years ago

Reduced the failing case a little bit, the bug shows up as segfault now.

-- The following code segfaults with JIT on.

local function recurse(expr)
   if type(expr) == 'table' then
      local t = {0}
      for i = 1, #expr do
         t[i] = recurse(expr[i])
      end
   end

   return
end

recurse {
   {
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      0, 0, 0, 0, 0, 0, 0,
      {
         { 0 },
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
      },
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
      { 0, 0, 0, 0 }
   }
}

I get a segfault every time on 4da1bb6 and never on d8cfc37, which seems to be the last bug-less commit.

lukego commented 9 years ago

(Sorry, I am mistaken when I say we don't see a problem with the same LuaJIT on Snabb Switch master. That is exactly what is happening on our next branch now.)

dpino commented 9 years ago

I built LuaJIT with debug symbols. This is the conflicting line:

Core was generated by `luajit recurse.lua'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  rec_idx_bump (J=J@entry=0x407ef558, ix=ix@entry=0x7fff82f53a90) at lj_record.c:1171
1171          if (tvisnil(o)) settabV(J->L, o, tpl);

Reverting b82fc3d (plus 60fb3fe and b82fc3d, fixes for b82fc3d), fixes the issue. I tried to investigate more on this but I couldn't figure out anything useful. I think the test case by @mpeterv is good enough to report to Mike, he might know what's going on.

mpeterv commented 9 years ago

I've sent a message to the LuaJIT mailing list.

kbara commented 9 years ago

The bug shown by the test case you provided to the LuaJIT mailing list has been fixed, but pflua-compile still crashes about 10% of the time with the pflang expression you provided, so it looks like there's another bug too.

kbara commented 9 years ago

Both of the above stack traces (ending in compile_bool and visit) still occur.

mpeterv commented 9 years ago

@kbara in fact the second, longer testcase I've sent as a link still fails.

kbara commented 9 years ago

@mpeterv Good point, it'd be good to email the list and point that out. Thank you!

mpeterv commented 9 years ago

@kbara I'll try to reduce the test a bit again and then send it.

kbara commented 9 years ago

458a40b7242aefe2f0893019f0451fb2f2deccd9 fixes the new test case, but pflua-compile keeps having both of the above backtraces, unfortunately.

mpeterv commented 9 years ago

Interesting. As before, the backtraces can be observed for a simple 'tcp' pflang, too, but now much less often - once in 100-200 runs vs. once in 5-10 runs for the original large pflang.

kbara commented 9 years ago

Yes. And when I start with the input to 'lower' (from the icmp6/etc pflang expression above), the backtraces can't be observed at all; thousands of runs complete successfully.

mpeterv commented 9 years ago

It seems that the root of the problem is somewhere in the parser, because if I replace parse.parse("tcp") with {"tcp"} and remove parser import, I can't observe the bug, but if the parser is imported (but even when not used), I get the backtraces.

mpeterv commented 9 years ago

Of course it is simpler than that, nothing wrong with the parser, simply removing the stack slot its local variable occupies gets rid of backtraces.

kbara commented 9 years ago

Indeed. I have a small test file, and it demonstrates the backtraces iff parse is being required, even though I'm not using it at all:

p = require("parse")
ssa = require("ssa")
data = assert(loadstring('anf = ' .. io.open("anf"):read("*a")))()
ssa.lower(anf)

Amusingly, thus far, if I try to minimize parse.lua (for instance, by deleting everything after ~line 500, keeping it syntactically valid), I no longer can reproduce the backtraces.

mpeterv commented 9 years ago

Simply changing length of a name of a local variable can get rid of the backtraces for me (one backtrace in 100-500 runs with one name, no backtraces in at least 10000 runs with a longer name).

kbara commented 9 years ago

I can believe it; it's clearly quite a sensitive bug to trigger. We're getting closer to tracking it down, but the symptoms in the meanwhile are quite interesting.

kbara commented 9 years ago

https://gist.github.com/kbara/5a93c82f83c62515cf84 reproduces the problem ~10% of the time, or a bit more; it uses the anf from the icmp6 pflang, is modified to not need to import the constants file, and is stand-alone. It's still way too big, though.

kbara commented 9 years ago

I halved the number of lines and sent it to the list, though it's still on the wrong side of 1000 lines.

kbara commented 9 years ago

I've managed to create some dumps from the minimized code, by varying -Ohotloop=N. The smallest dump is under 600 lines.

http://www.freelists.org/post/luajit/Segfault-introduced-in-b82fc3d-on-v21-branch,7 and the gist have the details.

kbara commented 9 years ago

This is now fixed in fe565222a1cbf28fbae266da35c8a703fdcfa0dd ; Mike Pall has disabled the optimization for now.

kbara commented 9 years ago

Reopening to unlink from 226, which turns out to be unrelated.