cmajor-lang / cmajor

The Cmajor public repository
https://cmajor.dev

Graphs with 1000+ nodes start to become infeasible #44

Closed: MarcusTomlinson closed this issue 5 months ago

MarcusTomlinson commented 5 months ago

In testing graphs of various sizes, I observed more-or-less the following timings (2021 MacBook M1 Pro):

(x-axis = number of nodes, y-axis = number of seconds to compile*)

* or more specifically, the time it takes for engine.link() to return

[Screenshot 2024-03-17 at 10:02:36: plot of compile time vs. node count]

MarcusTomlinson commented 5 months ago

Test code: https://github.com/MarcusTomlinson/cmajor/blob/test/examples/native_apps/HelloCmajor/Main.cpp

julianstorer commented 5 months ago

Thanks Marcus - we should profile it and see if there's an obvious hotspot.

cesaref commented 5 months ago

What's going on here is that it's duplicating the processor for each of the passthrough instances in the graph, and then compiling this code that many times, and then attempting to inline it all. I'll look through this and see why the duplication is happening - I think we should be able to avoid this.

Interestingly, the resulting optimisation pass is succeeding in turning it into a sensible bit of code. Here's the output for 1000 nodes (100 parallel branches of 10 nodes):


```asm
    .build_version macos, 14, 0
    .globl  _initialise                     ; -- Begin function initialise
    .p2align    2
_initialise:                            ; @initialise
; %bb.0:
    ret
                                        ; -- End function
    .globl  _advanceBlock                   ; -- Begin function advanceBlock
    .p2align    2
_advanceBlock:                          ; @advanceBlock
; %bb.0:
    ldr w8, [x0]
    cmp w8, w2
    b.eq    LBB1_3
; %bb.1:                                ; %.lr.ph.preheader
    mov w9, #100
LBB1_2:                                 ; %.lr.ph
                                        ; =>This Inner Loop Header: Depth=1
    add x8, x1, w8, sxtw #2
    ldr w10, [x8]
    mul w10, w10, w9
    str w10, [x8, #4096]
    ldr w8, [x0]
    add w8, w8, #1
    str w8, [x0]
    cmp w8, w2
    b.ne    LBB1_2
LBB1_3:                                 ; %._crit_edge
    str wzr, [x0]
    ret
                                        ; -- End function
```

It's managed to work out that this is basically just a multiply-by-100...

MarcusTomlinson commented 5 months ago

Yeah, the resulting optimised graph is mad impressive. It's very fast - after waiting an hour for it to compile, that is ;)

cesaref commented 5 months ago

It turns out it's spending significant time in an O(N^2) algorithm to make names unique, and there are lots of similarly named functions, one per instance of this processor. So, try dropping this into cmaj_AST_Utilities.h to replace UniqueNameList:

```cpp
template <typename ObjectType, typename ParentType>
struct UniqueNameList
{
    UniqueNameList() = default;
    UniqueNameList (const UniqueNameList&) = delete;

    std::string getName (const ObjectType& o)
    {
        auto& name = names[std::addressof (o)];

        if (name.empty())
        {
            auto root = static_cast<ParentType&> (*this).getRootName (o);

            if (root.empty())
                root = "_";

            // Fallback linear scan: the per-root suffix counter below
            // means this rarely has to loop more than once
            auto exists = [this] (const std::string& nameToCheck) -> bool
            {
                for (auto& n : names)
                    if (n.second == nameToCheck)
                        return true;

                return false;
            };

            auto uniqueName = root;
            auto& suffix = suffixes[root];

            if (suffix != 0)
                uniqueName = root + "_" + std::to_string (suffix++);

            while (exists (uniqueName))
                uniqueName = root + "_" + std::to_string (suffix++);

            name = uniqueName;
        }

        return name;
    }

    void clear()
    {
        names.clear();
    }

    std::unordered_map<const ObjectType*, std::string> names;
    std::unordered_map<std::string, uint32_t> suffixes;
};
```

This takes the runtime for 4000 nodes (500 branches of 8) from 275 secs to 9 secs on my machine, and 10,000 nodes (500 branches of 20) with this algorithm takes 65 secs.

I'll see about getting a fix along these lines into the codebase.

cesaref commented 5 months ago

I've pushed another performance fix: the 500/8 case is now down to 1.8 secs, and the 10,000 node case to 9 secs on my machine. I think that's more reasonable given the size of the graph. The next issue to resolve is the duplication of the code for this situation.