ThePhD / sol2

Sol3 (sol2 v3.0) - a C++ <-> Lua API wrapper with advanced features and top notch performance - is here, and it's great! Documentation:
http://sol2.rtfd.io/
MIT License

Accessing containers of usertypes seems very slow #1544

Closed thegabman closed 8 months ago

thegabman commented 8 months ago

Hi,

I am fighting like crazy to get better performance when accessing a container of a usertype. I wrote a small test case that simplifies and summarizes my findings (comparing Lua and C++ performance).

I tested this on Windows (MSVC 19) and on macOS (LLVM) using the latest versions of sol2 and LuaJIT. In this test, Lua appears to be about 700 times slower than C++. That seems kind of unreal to me, and I fear that I am doing something odd here.

Would be great to hear some opinions on this. Thank You!

Windows measurements in Release mode:

Lua elapsed time: 2.236091
C++ elapsed time: 0.003375

Code:

#include <sol/sol.hpp>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Vec3
{
    float x = 0.0f;
    float y = 0.0f;
    float z = 0.0f;
};

struct Transform
{
    Vec3 position;
    Vec3 scale;
};

Transform* p_transforms = nullptr;

std::vector<Transform*> GetTransformArray( int32_t count )
{
    std::vector<Transform*> transform_pointers( count );
    for( int i = 0; i < transform_pointers.size(); ++i )
        transform_pointers[ i ] = &p_transforms[ i ];

    return transform_pointers;
}

void Update( std::vector<Transform*> transforms )
{
    for( int i = 0; i < transforms.size(); ++i )
    {
        Transform* p_transform = transforms[ i ];

        p_transform->position.x += 0.01f;
        p_transform->scale.x += 0.01f;
    }
}

int main( int argc, char* argv[] )
{
    int32_t iterations = 1000;
    int32_t count      = 5000;

    p_transforms = new Transform[ count ];

    sol::state lua;
    lua.open_libraries( sol::lib::base, sol::lib::math );

    lua.new_usertype<Transform>( "Transform",
                                 "position", &Transform::position,
                                 "scale", &Transform::scale );

    lua.new_usertype<Vec3>( "Vec3",
                            "x", &Vec3::x,
                            "y", &Vec3::y,
                            "z", &Vec3::z );

    lua.script( R"(
        function Update( transforms )
            for i = 1, #transforms, 1 do
                local transform = transforms[i]

                transform.position.x = transform.position.x + 0.01
                transform.scale.x    = transform.scale.x + 0.01
            end
        end
    )" );

    sol::function update_func = lua[ "Update" ];

    std::vector<Transform*> transforms = GetTransformArray( count );

    // ====================
    // TEST LUA PERFORMANCE
    // ====================

    auto start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
        update_func( transforms );

    auto                          end             = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    double                        elapsed         = elapsed_seconds.count();

    printf( "Lua elapsed time: %f\n", elapsed );

    // ====================
    // TEST C++ PERFORMANCE
    // ====================

    start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
        Update( transforms );

    end             = std::chrono::high_resolution_clock::now();
    elapsed_seconds = end - start;
    elapsed         = elapsed_seconds.count();

    printf( "C++ elapsed time: %f\n", elapsed );

    delete[] p_transforms;
    return 0;
}

ricosolana commented 8 months ago

Is it possible that the Update() function code got compiled away? The test also might not be comparing the same things, because calling a sol2 function passes by reference (see function calling and forwarding), while the C++ Update() function accepts a copy. Manual new/delete...

thegabman commented 8 months ago

The update function accepts a copy, but it is a copy of a vector of pointers, so it should not be compiled away. Changing the count also changes the timings on both ends linearly. To be completely sure, I now pass a reference to the vector instead and print parts of the vector after running Update, so it cannot be compiled away. Same result.

thegabman commented 8 months ago

And the timings on the C++ side look reasonable to me. It's a contiguous block of memory, so it is as cache-friendly as it could be.

thegabman commented 8 months ago

By changing the Transform struct from

struct Transform
{
    Vec3 position;
    Vec3 scale;
};

to

struct Transform
{
    float position_x;
    float position_y;
    float position_z;
    float scale_x;
    float scale_y;
    float scale_z;
};

the time spent in Lua was cut in half. So my feeling holds true that it has something to do with accessing usertypes or tables in general.

Lua elapsed time: 0.962604
C++ elapsed time: 0.003291

thegabman commented 8 months ago

I dug a little further and added a test case that uses plain Lua and light userdata. That's faster, but still not as fast as I expected Lua to be. I also tested full userdata, which resulted in the same timings as Sol (Container), as expected.

New Timings:

C++                  elapsed time: 0.002736
Sol (Container)      elapsed time: 0.999166
Lua (Light Userdata) elapsed time: 0.338946

Extended Code:

#define SOL_ALL_SAFETIES_ON  0
#define SOL_USING_CXX_LUAJIT 1
#include <sol/sol.hpp>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct Transform
{
    float position_x;
    float position_y;
    float position_z;
    float scale_x;
    float scale_y;
    float scale_z;
};

Transform* p_transforms = nullptr;

std::vector<Transform*> GetTransformPointerArray( int32_t count )
{
    std::vector<Transform*> transform_pointers( count );
    for( int i = 0; i < transform_pointers.size(); ++i )
        transform_pointers[ i ] = &p_transforms[ i ];

    return transform_pointers;
}

void c_Update( std::vector<Transform*>& transforms )
{
    for( int i = 0; i < transforms.size(); ++i )
    {
        Transform* p_transform = transforms[ i ];

        p_transform->position_x += 0.01f;
        p_transform->scale_x += 0.01f;
    }
}

void c_perf_test( int32_t iterations, int32_t count )
{
    std::vector<Transform*> transform_pointers = GetTransformPointerArray( count );

    auto start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
        c_Update( transform_pointers );

    auto                          end             = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    double                        elapsed         = elapsed_seconds.count();

    printf( "C++                  elapsed time: %f\n", elapsed );
}

void sol_perf_test( int32_t iterations, int32_t count )
{
    sol::state lua;
    lua.open_libraries();

    lua.new_usertype<Transform>( "Transform",
                                 "position_x", &Transform::position_x,
                                 "position_y", &Transform::position_y,
                                 "position_z", &Transform::position_z,
                                 "scale_x", &Transform::scale_x,
                                 "scale_y", &Transform::scale_y,
                                 "scale_z", &Transform::scale_z );

    lua.script( R"(
        function Update( transforms )
            for i = 1, #transforms, 1 do
                local transform = transforms[i]

                local position_x = transform.position_x
                local scale_x    = transform.scale_x

                position_x = position_x + 0.01
                scale_x    = scale_x + 0.01

                transform.position_x = position_x
                transform.scale_x    = scale_x
            end
        end
    )" );

    sol::function update_func = lua[ "Update" ];

    std::vector<Transform*> transform_pointers = GetTransformPointerArray( count );

    auto start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
        update_func( transform_pointers );

    auto                          end             = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    double                        elapsed         = elapsed_seconds.count();

    printf( "Sol (Container)      elapsed time: %f\n", elapsed );
}

static int get_light_transform_array( lua_State* L )
{
    lua_pushlightuserdata( L, p_transforms );
    return 1;
}

static int get_light_transform( lua_State* L )
{
    Transform* p_transforms = (Transform*) lua_touserdata( L, 2 );
    int        index        = luaL_checkint( L, 3 );

    lua_pushlightuserdata( L, &p_transforms[ index - 1 ] );
    return 1;
}

static int get_position_x( lua_State* L )
{
    Transform* p_transform = (Transform*) lua_touserdata( L, 2 );
    lua_pushnumber( L, p_transform->position_x );
    return 1;
}

static int set_position_x( lua_State* L )
{
    Transform* p_transform  = (Transform*) lua_touserdata( L, 2 );
    p_transform->position_x = lua_tonumber( L, 3 );
    return 0;
}

static int get_scale_x( lua_State* L )
{
    Transform* p_transform = (Transform*) lua_touserdata( L, 2 );
    lua_pushnumber( L, p_transform->scale_x );
    return 1;
}

static int set_scale_x( lua_State* L )
{
    Transform* p_transform = (Transform*) lua_touserdata( L, 2 );
    p_transform->scale_x   = lua_tonumber( L, 3 );
    return 0;
}

static void create_transform_library( lua_State* L )
{
    static const struct luaL_Reg transform_library[] = {
        {"GetLightTransformArray", get_light_transform_array},
        {     "GetLightTransform",       get_light_transform},
        {          "GetPositionX",            get_position_x},
        {          "SetPositionX",            set_position_x},
        {             "GetScaleX",               get_scale_x},
        {             "SetScaleX",               set_scale_x},
        {                    NULL,                      NULL}
    };

    luaL_openlib( L, "Transform", transform_library, 0 );
}

void lightuserdata_perf_test( int32_t iterations, int32_t count )
{
    lua_State* p_lua = luaL_newstate();
    luaL_openlibs( p_lua );

    create_transform_library( p_lua );

    int status = luaL_dostring( p_lua, R"(
        function Update( count )
            local transforms = Transform:GetLightTransformArray()

            for i = 1, count, 1 do
                local light_transform = Transform:GetLightTransform( transforms, i )
                local position_x      = Transform:GetPositionX( light_transform )
                local scale_x         = Transform:GetScaleX( light_transform )

                position_x = position_x + 0.01
                scale_x    = scale_x + 0.01

                Transform:SetPositionX( light_transform, position_x )
                Transform:SetScaleX( light_transform, scale_x )
            end
        end
    )" );

    if( status != 0 )
    {
        printf( "Error: %s\n", lua_tostring( p_lua, -1 ) );
        return;
    }

    auto start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
    {
        lua_getglobal( p_lua, "Update" );
        lua_pushinteger( p_lua, count );
        lua_pcall( p_lua, 1, 0, 0 );
    }

    auto                          end             = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    double                        elapsed         = elapsed_seconds.count();

    printf( "Lua (Light Userdata) elapsed time: %f\n", elapsed );

    lua_close( p_lua );
}

int main( int argc, char* argv[] )
{
    int32_t iterations = 1000;
    int32_t count      = 5000;

    p_transforms = new Transform[ count ];
    memset( p_transforms, 0, sizeof( Transform ) * count );

    c_perf_test( iterations, count );
    sol_perf_test( iterations, count );
    lightuserdata_perf_test( iterations, count );

    delete[] p_transforms;
    return 0;
}

ThePhD commented 8 months ago

Every time you access a userdata through sol2, we do a typecheck. Every layer will typecheck: every variable access, every function call. All the time. Every time.

Every time you use a light userdata to do it, it just shunts in a void* and pulls out a void* and calls it a day.

The faster you want it, the more type checks you can set up and strip out of things. Ultimately, if this is critical for speed, you should probably consider inverting how the control works: put the whole update loop in C++, pull the data out of Lua once, and then operate in C++. It's up to you to design your application the way you'd like.

ThePhD commented 8 months ago

IIRC, how you call it (e.g. update_func( the_pointers )) should not trigger any amount of copying: if we detect that you give us a non-const reference to something, we try to just set up a pointer to that thing and not a full-blown copy. So at least that should be (marginally) cheaper, though it will still include type checks.

Not using Vec3 and using floats directly will save you a bunch of time because you don't have to go through the Vec3 typechecks before getting to the float type check.

thegabman commented 8 months ago

Thank you @ThePhD for pointing this out! I have now switched to the LuaJIT FFI for very performance-critical code within Lua.

void sol_ffi_perf_test( int32_t iterations, int32_t count )
{
    sol::state lua;
    lua.open_libraries( sol::lib::base, sol::lib::package, sol::lib::jit, sol::lib::ffi );

    lua.script( R"(
        local ffi = require( "ffi" )

        ffi.cdef[[
            typedef struct Transform
            {
                float position_x;
                float position_y;
                float position_z;
                float scale_x;
                float scale_y;
                float scale_z;
            } Transform;
        ]]

        function Update( count, transforms )
            local ffi_transforms = ffi.cast( "Transform*", transforms )

            for i = 0, count-1, 1 do
                local position_x = ffi_transforms[i].position_x
                local scale_x    = ffi_transforms[i].scale_x

                position_x = position_x + 0.01
                scale_x    = scale_x + 0.01

                ffi_transforms[i].position_x = position_x
                ffi_transforms[i].scale_x    = scale_x
            end
        end
    )" );

    sol::function update_func = lua[ "Update" ];

    auto start = std::chrono::high_resolution_clock::now();

    for( int i = 0; i < iterations; ++i )
        update_func( count, (void*) p_transforms );

    auto                          end             = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    double                        elapsed         = elapsed_seconds.count();

    printf( "Sol (ffi)            elapsed time: %fs\n", elapsed );
}

That way I get near-native performance.

C++                  elapsed time: 0.001683s
Sol (Container)      elapsed time: 1.020745s
Lua (Light Userdata) elapsed time: 0.337135s
LuaJit ffi           elapsed time: 0.004741s
Sol (ffi)            elapsed time: 0.004766s
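One caveat with the FFI approach: it only works if the struct declared in ffi.cdef has the same memory layout as the C++ Transform, since the pointer is reinterpreted without any checks. A minimal sketch of compile-time guards that could be placed next to the C++ definition follows; the static_asserts are an addition for illustration and were not part of the posted code.

```cpp
#include <cstddef>
#include <type_traits>

struct Transform
{
    float position_x;
    float position_y;
    float position_z;
    float scale_x;
    float scale_y;
    float scale_z;
};

// The FFI side sees a plain C struct, so the C++ type has to stay C-compatible.
static_assert( std::is_standard_layout<Transform>::value,
               "Transform must be standard-layout to match the ffi.cdef declaration" );
static_assert( std::is_trivially_copyable<Transform>::value,
               "Transform must be trivially copyable" );

// Six tightly packed floats: no padding and no hidden members.
static_assert( sizeof( Transform ) == 6 * sizeof( float ),
               "unexpected padding in Transform" );
static_assert( offsetof( Transform, position_x ) == 0,
               "position_x is expected at offset 0" );
static_assert( offsetof( Transform, scale_x ) == 3 * sizeof( float ),
               "scale_x is expected right after the three position floats" );
```

If a field is later added or reordered on only one side, checks like these fail at compile time instead of letting the Lua code silently read the wrong memory.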