ThePhD / sol2

Sol3 (sol2 v3.0) - a C++ <-> Lua API wrapper with advanced features and top notch performance - is here, and it's great! Documentation:
http://sol2.rtfd.io/
MIT License
4.06k stars 493 forks source link

Bad design or natural sol/lua/C++ limitations? You decide! #1485

Closed vdweller84 closed 1 year ago

vdweller84 commented 1 year ago

Hi all,

Apologies for the sheet of words. I am very very stumped on a major performance bottleneck and would like some input on whether I am a complete moron or I've hit some kind of natural barrier.

I am making a 2D game engine(original!) with objects that run lua scripts when they are created, or every step of the game, etc. I am using luajit.

Instances have their own "built-in" variables like "x", "y", "spriteID", but users can assign new variables to them at runtime. Instances must somehow communicate their "built-in" state to C++ because the engine processes spatial hashing, graphics rendering etc.

I want people to be able to write easy-to-understand code in a lua script. So let's say you have obj_cat (a @ThePhD favorite example) instance that runs an event_step lua script every step. All of the above should be valid:

x = x + 1 --valid: The global table is rigged so that a "current instance"'s variables are edited

me.x = me.x + 1 --same here, "me" is the current instance

--some instance here tries to spawn a kitty. Meow!
kitty = InstanceAddDepth(obj_cat,100,245,0)
SetScope(kitty)
x = x + 1 --kitty's position is changed
y = 54 --same here
ResetScope()
--or: don't set/reset scope and just say kitty.x = kitty.x + 1 etc
y = y + 345 --now we are back to the instance that created the kitty

The issue here is, I made all these thigs work, and it's great. It's also pretty slow.

The first strategy I came up with to implement this functionality was...

Take 1 Make each instance a lua table. Run an init script in the beginning that rigs things like this:

tblNoScope = {__id = 0}
local noScopemt = {
    __index = function(t,key)
        error("Read instance variable: no instance in scope.")
        end,
    __newindex = function(t,key,val)
        error("Modify instance variable: no instance in scope.")
        end
}

setmetatable(tblNoScope, noScope)

me = tblNoScope
global = {}

local propertyFuncSet = {
    x = SetX,
    y = SetY,
    imageIndex = SetImageIndex,
    imageAngle = SetImageAngle,
    imageXscale = SetScaleX,
    imageYscale = SetScaleY,
    imageTint = SetImageTint,
    imageAlpha = SetImageAlpha,
    depth = SetDepth,
    visible = SetVisible,
    spriteID = SetSpriteID,
    maskID = SetMaskID
}

local propertyReadOnly = {
    spriteWidth = false,
    spriteHeight = false,
    spriteXoffset = false,
    spriteYoffset = false,
    objectID = false
}

local propertyReadModify = {
    spriteWidth = function() return __spriteWidth*imageXscale end,
    spriteHeight = function() return __spriteHeight*imageYscale end,
    spriteXoffset = function() return __spriteXoffset*imageXscale end,
    spriteYoffset = function() return __spriteYoffset*imageYscale end
}

-- create private index
local _index = {}
local actualT = {}

-- create metatable
local mt = {
  __index = function (t,k)
    actualT = t[_index] --get actual table through proxy
    if actualT["__deleted"] and k ~= "__id" then
        error("Reading Instance variable: Instance ID " .. tostring(actualT["id"]) .. " doesn't exist")
    end
    local rmf = propertyReadModify[k]
    return (rmf and rmf()) or actualT[k]
  end,
  __newindex = function (t,k,v) 
    actualT = t[_index] --get actual table through proxy
    if actualT["__deleted"] then
        error("Modifying Instance variable: Instance ID " .. tostring(actualT["id"]) .. " doesn't exist")
    end
    local readonly = propertyReadOnly[k]
    if not readonly or readonly() then
        local func = propertyFuncSet[k]
        if func then
            func(actualT.id, v)
        end
        actualT[k] = v
    end
  end
}

function Track(t)
  local proxy = {}
  proxy[_index] = t
  setmetatable(proxy, mt)
  return proxy
end

local globalmt = {
    __index = function(t,key)
        return me[key]
        end,
    __newindex = function(t,key,val)
        me[key] = val
        end
}

setmetatable(_G, globalmt)

So the strategy is the following:

  1. C++ has instance data as well (position, sprite ID, rotation etc) etc. So when a C++ function, say, find k nearest neighbors, needs instance positions, it just reads C++ data, no need to involve lua with this.
  2. Creating an instance: A C++ instance is created. A lua table corresponding to that instance is also created. This table also has some of the C++ instance's variables as keys. The user receives a proxy table rigged with a metatable, so that every instance variable read/write is checked beforehand.
  3. Running events: A "me" global table is assigned the current instance's lua table (the previous table is pushed on a stack, you get the idea). When users modify a variable that must also be updated in C++, a lookup table is consulted. If a C++ function must be called to update that variable too, it is done so. If users just want to read a variable, no problemo, they just read that variable from the lua table. Also read-only/modified read variables are supported through other table lookups. Duplicate state, yay! (because also calling functions to read from C++ data makes things stupidly slow).

So it's mostly tables, with some duplicate data and metatable shenanigans. It works fine, in pronciple! Really! But! On my Intel i5-10300H laptop, 5000 instances moving and shooting bullets, I get ~70 fps. That's... not very good. Half of that runtime is consumed in C++/lua/C++ round trips. Drawing all that stuff is like 3% of the total runtime.

Take 2

Let's try userdata/usertypes! Now all instance info is stored in a C++ class, and I define this usertype:

sol::usertype<XEROOM::Instance> instanceType = lua.new_usertype<XEROOM::Instance>("instance", sol::no_constructor,
    sol::meta_function::new_index, &XEROOM::Instance::Set,
    sol::meta_function::index, &XEROOM::Instance::Get,
    "x", sol::property(&XEROOM::Instance::GetX, &XEROOM::Instance::SetX),
    "y", sol::property(&XEROOM::Instance::GetY, &XEROOM::Instance::SetY),
    "depth", sol::property(&XEROOM::Instance::GetDepth, &XEROOM::Instance::SetDepth),
    "id", sol::readonly(&XEROOM::Instance::ID),
    "layerID", sol::readonly(&XEROOM::Instance::layerID),
    "spriteID", &XEROOM::Instance::sprite,
    "maskID", &XEROOM::Instance::collisionMask,
    "objectID", sol::readonly(&XEROOM::Instance::object),
    "xScale", sol::property(&XEROOM::Instance::GetXScale, &XEROOM::Instance::SetXScale),
    "yScale", sol::property(&XEROOM::Instance::GetYScale, &XEROOM::Instance::SetYScale),
    "rotation", &XEROOM::Instance::rotation,
    "tintRGB", &XEROOM::Instance::tintRGB,
    "alpha", &XEROOM::Instance::alpha,
    "imageIndex", &XEROOM::Instance::imageIndex,
    "imageSpeed", &XEROOM::Instance::imageSpeed
    );

Where Set and Get are:

void Instance::Set(sol::stack_object key, sol::stack_object value) {
this->userData[key] = value;
}

sol::object Instance::Get(sol::stack_object key, sol::this_state L) {
return sol::object(L, sol::in_place, this->userData[key]);
}

The usual scoping methods are applied here too. But instead of tables, C++ sends/receives pointers.

Despite having high hopes for this approach, it runs almost 50% slower than the first method. Well shit. Needless to say, I am stumped for good. So much for implementing bindless graphics/data oriented layouts and what have you.

My question here is whether I am missing something critical here from an architectural standpoint. Have I hit the natural bottom-line barrier between lua/C++ communication, or am I wasting too much time with the way I have designed things? Maybe the speed I get is actually pretty good for a C++/luajit program? Maybe I'm losing it?

One suggestion I wanted to make is whether it is possible to have a "mixed type" sol::property that allows for both member variables and functions, just to avoid calling functions on everything (I don't know, maybe functions are being called anyway behind the scenes). But again, even with this, I wouldn't expect any substantial gains.

Trying tricks like not #definingSOL_ALL_SAFETIES_ON maybe shaves of a few microseconds. I've also heard of separate environments per script, but can't see how they can offer any substantial speed gains.

I would really appreciate some insight on this. I'm willing to try any arcane trick/redesign/non-sentient creature sacrifice to make this work.

Rochet2 commented 1 year ago

I looked at my tests from 2019 and it looks like I was able to do around 2200 calls from C++ to lua every millisecond. The function I ran in lua in the test was the following, where m=5. I was using Lua 5.1. I was not using Luajit, so I expect luajit would give a better result possibly. On the face of it it seems that maybe I have a lot more calls than you do - but I cannot know what your setup is exactly.

local function Fibonacci ( m )
  if m < 2 then
    return m
  end
  return Fibonacci (m -1) + Fibonacci (m -2)
end

Maybe you should make tests where you benchmark how long it takes to call an empty function for example using your setup. Try find a minimal example of what is slow and find out what part of it is slow. In my setup I run the lua scripts once and they define functions. I then call on events that call the specific functions that are alraedy defined in lua. If you re-evaluate your script each time, maybe that is a bottleneck? Hard to know what is going on without code.

And on properties, there is a documented overhead when using any kind of non-function access using sol. You should always use only functions to do things when doing performance critical things. Behind the scenes everything is always a function. See this.

Note that performance for member function calls goes down by a fixed overhead if you also bind variables as well as member functions. This is purely a limitation of the Lua implementation and there is, unfortunately, nothing that can be done about it. If you bind only functions and no variables, however, sol will automatically optimize the Lua runtime and give you the maximum performance possible. Please consider ease of use and maintenance of code before you make everything into functions.

vdweller84 commented 1 year ago

Well, the main performance killer in my case is the following:

Suppose you have a C++ struct:

struct Instance {
  float x;
  float y;
  //blah...
  sol::table table; //this stores most of the struct's variables for use in lua
};

In C++, you store all these instances in some place, eg in a vector or whatever. Your C++ has to use data from these instances for spatial partitioning, rendering etc. It is not viable to just get these values from lua.

Now, at some point, an instance's lua script runs:

x = x + 1

When a value is changed, a table is consulted on whether this value update also has to make changes in C++ data:

local propertyFuncSet = {
    x = SetX,
    y = SetY,
        etc etc
}

(relevant part inside metatable):

local func = propertyFuncSet[k]
if func then
    func(actualT.id, v)
end
So when x is changed, lua also calls `SetX()`, which does a few things in C++:

void InstanceSetX(InstanceID instanceID, PositionType x) {
        Room* room = currentRoomPtr;
        InstanceIndex index = room->mapInstIDIndex[instanceID];
        Float2<PositionType>& cp = room->vecInstPos[index];
        if (cp.x != x) {
            cp.x = x;
            InstanceAddToCollisionUpdateSet(instanceID);
            InstanceAddToSpatialUpdateSet(instanceID);
        }
}

This C++ - lua - C++ round trip causes the most trouble - and this trouble is not alleviated by userdata/usertypes by any stretch - in fact, things are even slower then.

So I guess a question is: Can I do something to make this part much more faster? I am setting these C++ functions up like this: lua["SetX"] = FNCDEF(XEROOM::InstanceSetX); where FNCDEF is #define FNCDEF(x) sol::c_call<sol::wrap<decltype(&x), &x>> (shamelessly stolen)

Is there anything I can do to improve upon this scheme? All these C++ functions that lua calls have 2 arguments (numbers). A stack trick? Something else?

EDIT:: You may be tempted to think that what happens inside the InstanceSetX() function is what hampers speed. But commenting out both InstanceAddToCollisionUpdateSet(instanceID) and InstanceAddToSpatialUpdateSet(instanceID) shows they are largely inconsequential.

vdweller84 commented 1 year ago

A follow-up for the future reader: Using the plain lua C api in many cases helps quite a bit with performance. To give 2 examples:

    int InstanceSetX(lua_State* L) {
        InstanceID instanceID = (InstanceID)lua_tonumber(L, 1);
        PositionType x = (PositionType)lua_tonumber(L, 2);
        lua_pop(L, 2);
        Room* room = currentRoomPtr;
        InstanceIndex index = room->mapInstIDIndex[instanceID];
        Float2<PositionType>& cp = room->vecInstPos[index];
        if (cp.x != x) {
            cp.x = x;
            InstanceAddToCollisionUpdateSet(instanceID);
            InstanceAddToSpatialUpdateSet(instanceID);
        }
        return 0;
    }
    bool RunEventFullHierarchy(std::vector<LuaFuncRef>& vec, InstanceID ID, InstanceIndex index) {
        size_t size = vec.size();
        if (size > 0) {
            lua_State* L = XELUASTATE::lua.lua_state();
            XEROOM::SetScopeInternal(ID, index);
            for (size_t i = 0; i < size; ++i) {
                lua_rawgeti(L, LUA_REGISTRYINDEX, vec[i]);
                lua_pcall(L, 0, 0, LUA_ERRFUNC_STACKINDEX);
            }
            XEROOM::EXPOSED::ResetScope();
            return true;
        }
        return false;
    }

Especially using the lua registry to store references (in the above case functions) can net big gains. So there is space for improvement.