LuaLanes / lanes

Lanes is a lightweight, native, lazy evaluating multithreading library for Lua 5.1 to 5.4.

Sometimes segmentation fault with Pallene? #214

Closed ewmailing closed 1 year ago

ewmailing commented 1 year ago

I sometimes experience segmentation faults when using LuaLanes. I suspect the cause is that I'm using the in-development version of Pallene, which uses a slightly modified version of Lua 5.4. https://github.com/pallene-lang/pallene https://www.youtube.com/watch?v=pGF2UFG7n6Y

Pallene has the ability to compile Pallene to regular Lua scripts (for testing/debugging/benchmarking), so as an experiment, I did this, and I still encountered crashes sometimes. So I believe it is not the generated Pallene/native code side of things that is the problem, but perhaps simply some code changes they made in the Lua interpreter code base.

To try to further confirm this, I installed the regular Lua 5.4.4 and ran my program against that. I have not been able to produce any crashes so far. (But as I said, the crashes happen only sometimes, so it is hard to be sure.)

My code that uses Lanes is pretty basic. I don't use Lindas. My problem set is embarrassingly parallel, so I use Lanes to run each item in parallel. The results of each lane are saved into a common array for serial processing at the end.

Also, I noticed that if I add calls to lanes.sleep, the longer the sleep duration, the lower the chance of a crash.

Also, exactly once, I got a termination that printed this error message: lua: src/keeper.c:251: keepercall_clear: Assertion `FALSE' failed.

I used LuaRocks to install Lanes, version 3.16.0-0. I am running Linux Ubuntu 22.04.1 LTS.

For some context, here is an excerpt of my main LuaLanes body/loop:

local MAX_LANES = 11
local options =
{
}
lanes.configure(options)
local lanes_sleep = lanes.sleep

local function lanes_available(my_lanes_state)
    local max_lanes <const> = my_lanes_state.maxLanes
    local lanes_in_use <const> = my_lanes_state.lanesInUse
    if max_lanes > lanes_in_use then
        return true
    else
        return false
    end
end
local function lanes_in_use(my_lanes_state)
    local lanes_in_use <const> = my_lanes_state.lanesInUse
    if lanes_in_use > 0 then
        return true
    else
        return false
    end
end
local function update_active_run_set(my_lanes_state, table_of_results, table_of_errors)
    local active_run_set = my_lanes_state.activeRunSet

    for k,v in pairs(active_run_set) do
        local k_status = k.status
        if k_status == "running"
            or k_status == "pending"
        then
            -- do nothing

        elseif k_status == "done" then
            -- process finished
            -- My function returns 3 values: boolean, table, string
            local did_succeed = k[1]
            if did_succeed then
                local entry_result = k[2]
                table_of_results[#table_of_results+1] = entry_result
            else
                local err_msg = k[3]
                table_of_errors[#table_of_errors+1] = err_msg
            end
            -- remove from list
            active_run_set[k] = nil
            my_lanes_state.lanesInUse = my_lanes_state.lanesInUse - 1

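        elseif k_status == "error" then
            print("Lanes error hit")
            -- Indexing k[1] on a failed lane re-raises its error in this state;
            -- join() returns nil plus the error value instead.
            local _, lane_err = k:join()
            print("Lanes error: ", lane_err)
            active_run_set[k] = nil
            my_lanes_state.lanesInUse = my_lanes_state.lanesInUse - 1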
        else
            print("didn't expect to get here", k_status)
        end
    end

end

function main(list_of_items)
    local table_of_results = {}
    local failed_list = {}
    local my_lanes_state =
    {
        activeRunSet = {},
        lanesInUse = 0,
        maxLanes = MAX_LANES,
    }
    local l_scanner_script_run = lanes.gen("*", scanner_script.run)

    for k,v in pairs(list_of_items) do
        local is_dispatched = false

        while not is_dispatched do
            update_active_run_set(my_lanes_state, table_of_results, failed_list)

            if lanes_available(my_lanes_state) then

                local proc = l_scanner_script_run(params)
                my_lanes_state.activeRunSet[proc] = proc
                my_lanes_state.lanesInUse = my_lanes_state.lanesInUse + 1

                is_dispatched = true

            else
                -- waiting for lanes to become available
                --lanes_sleep(.1)
            end
        end

    end

    -- let any remaining lanes finish running
    while lanes_in_use(my_lanes_state) do
        update_active_run_set(my_lanes_state, table_of_results, failed_list)
        --lanes_sleep(.1)
    end

end

-- To run, program calls:
-- main(list_of_items)
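
For isolating the problem, the loop above could be reduced to a standalone script that involves neither scanner_script nor any C module. The following is only a sketch under assumptions: `worker`, `spawn`, `run_all`, and the `MAX_LANES` value of 4 are hypothetical names and values chosen for illustration, and the worker merely mimics the boolean, table, string return shape described above. It uses `lanes.gen`, `lane.status`, `lane:join()`, and `lanes.sleep` as documented for Lanes 3.x, and it needs the Lanes module installed to run.

```lua
-- Sketch only: requires the Lanes module (e.g. installed via LuaRocks).
local lanes = require("lanes").configure()

local MAX_LANES = 4 -- arbitrary for the sketch; the original excerpt uses 11

-- Hypothetical stand-in for scanner_script.run: returns boolean, table, string.
local function worker(n)
    if n % 5 == 0 then
        return false, {}, "item " .. n .. " failed"
    end
    return true, { value = n * n }, ""
end

local spawn = lanes.gen("*", worker)

local function run_all(items)
    local results, errors = {}, {}
    local active, in_use = {}, 0

    -- Collect finished lanes and free their slots.
    -- Other statuses (e.g. "cancelled") are not handled in this sketch.
    local function reap()
        for h in pairs(active) do
            local status = h.status
            if status == "done" then
                local ok, entry, msg = h[1], h[2], h[3]
                if ok then
                    results[#results + 1] = entry
                else
                    errors[#errors + 1] = msg
                end
            elseif status == "error" then
                -- join() reports the error as a value instead of re-raising it
                local _, err = h:join()
                errors[#errors + 1] = tostring(err)
            end
            if status == "done" or status == "error" then
                active[h] = nil
                in_use = in_use - 1
            end
        end
    end

    for _, item in ipairs(items) do
        -- Wait for a free slot, sleeping briefly instead of spinning.
        while in_use >= MAX_LANES do
            reap()
            if in_use >= MAX_LANES then lanes.sleep(0.01) end
        end
        active[spawn(item)] = true
        in_use = in_use + 1
    end

    -- Let any remaining lanes finish.
    while in_use > 0 do
        reap()
        if in_use > 0 then lanes.sleep(0.01) end
    end
    return results, errors
end
```

Running `run_all` on a list of integers under both interpreters, with and without the `lanes.sleep` calls, would show whether the crash reproduces without the real worker code.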

I also manually ran the tests I found in the Lanes repo against both versions of Lua that I had:

for f in ./*.lua; do lua5.4 "$f" >> testrun54.txt 2>&1; done
for f in ./*.lua; do lua "$f" >> testrunpal.txt 2>&1; done

I visually compared the outputs with a diff tool and didn't see anything that stood out (or any segfaults). I'm attaching the results of both runs: testrun54.txt testrunpal.txt

Do you have any recommendations on how I can help isolate the problem so it can be fixed? Since my only synchronization point is where I look for when a Lane is finished, and put the results into an array, I'm hoping maybe there are some obvious places to look that need some kind of lock or something in the C code base, perhaps because some Lua internal got moved that Lanes was expecting.

Thank you

benoit-germain commented 1 year ago

The assert you mention in keeper.c is something that should never happen. It means that somehow the stack manipulation logic of the function is flawed, and the stack contents at the end of the function are not the same as what we started with. This is definitely not the case in the function itself, so the cause must be external. Unsynchronized multithreaded access comes to mind. However, I've checked the code, and as far as I can tell all accesses to the keeper state associated with a given linda are properly protected by the keeper's mutex, so I'm somewhat baffled.

Have you tried running your tests against a debug Pallene build? Maybe there are internal checks that you can activate, such as LUA_USE_APICHECK or similar?

ewmailing commented 1 year ago

Thank you for the response. I still don't know for sure if LuaLanes is the problem or not. You are right that it could be Pallene or something else. (I'm currently also deeply scrutinizing a C-module I'm using for possible thread-unsafe things. Although I can't explain why I haven't seen it crash under stock Lua 5.4, except for bad luck.)

FYI, last year we did use LUAI_ASSERT and caught a pretty devious Pallene bug with its code-generator that only happened in certain edge cases. But right now, it isn't catching anything.

Anyway, I appreciate you checking the code.