cardillan / mindcode

A high level language for Mindustry Logic and Mindustry Schematics.
http://mindcode.herokuapp.com/
MIT License

Speed optimizations #102

Open cardillan opened 1 year ago

cardillan commented 1 year ago

The latest release of Mindcode comes with a new Loop Unrolling optimization!

Loop unrolling is an optimization aimed at speeding up program execution at the price of increased code size. A description of Mindcode's implementation of this optimization is available here, and the handling of the tradeoff between execution speedup and code size increase is explained here.

Implementing loop unrolling took quite some time, as the infrastructure supporting this kind of optimization had to be developed first (namely, multiple optimization passes and data flow analysis). The results are worth the work, I'd say :)

The opportunity for loop unrolling might not always arise, as only loops with a fixed number of iterations can be unrolled by Mindcode at this moment. (Please note that list iteration loops are, somewhat unexpectedly, not unrolled by the current version of Mindcode, despite their number of iterations always being fixed and known to the compiler. Support for unrolling list iteration loops will be added shortly.) Furthermore, using complex expressions in the loop condition or in loop control variable updates might make it impossible for the optimizer to recognize loops suitable for unrolling. In the right circumstances, though, the speedups can be significant. One good opportunity is clearing memory banks and memory cells. With loop unrolling, the following code:

for i in 0 ... 10
    cell1[i] = 0
end

compiles to this:

write 0 cell1 0
write 0 cell1 1
write 0 cell1 2
write 0 cell1 3
write 0 cell1 4
write 0 cell1 5
write 0 cell1 6
write 0 cell1 7
write 0 cell1 8
write 0 cell1 9

reducing the number of instructions executed to a third, as the instructions incrementing the i variable, as well as the conditional jump terminating the loop, are eliminated.

Some examples of speedups I've noticed in my Mindcode codebase:

  1. One of my programs contains a loop that clears the entire memory bank, all 512 entries. The time it took to clear the memory bank using a logic processor was clearly noticeable. As the program is not very large overall, there's enough space to unroll the entire loop, reducing the initialization time by two thirds.
  2. I've adapted a beautiful mlogjs Breakout implementation to Mindcode, for compiler benchmarking. There are two nested loops handling brick drawing/collision testing, and both of these can be unrolled. The speedup is hard to measure exactly, but is discernible just by watching both versions of the code running side by side.
  3. Included in the Mindcode test suite are a few small programs which are compiled and executed on an emulated processor. The run statistics, including the number of instructions executed while running the task to completion, are logged into a versioned text file, so that the gradual improvement (or deterioration) of the compiled code can be reviewed. While the test programs certainly aren't representative of typical in-game Mindustry Logic programs, they do demonstrate the evolution of the compiler. The files showcasing loop unrolling benefits are here and here (look at the history of these files).

Loop unrolling is quite new, and while I sought to test it well, issues with the new code are certainly possible. Any feedback about this feature, especially bug reports, is very welcome.

cardillan commented 1 year ago

Today's release added the missing list iteration loop unrolling, and automatic Function Inlining. Neither of these was at all difficult to implement once the necessary infrastructure was available. Give it a try!

There's a surprising problem with speed optimizations: both loop unrolling and function inlining might sometimes enable further dramatic reductions of code size (see the summing example here). However, when determining the costs of the optimization, it is not known how much the code could be reduced after the optimization is performed. Let's say that loop unrolling is expected to add 400 instructions, but further optimizations then reduce that code to 100 instructions. Such an optimization could be performed even if the code size is already 900 instructions, but the optimizer won't know that and will assume there isn't enough space.

A possible fix might be to let the optimization coordinator overshoot the available space, rolling back when it doesn't work out. This might increase the optimization complexity exponentially, though. Perhaps a compiler option to specify the target number of instructions for speed optimization could be added, but this just shifts the burden of trial and error onto the programmer. Still, perhaps better than not providing the option at all.

Another approach would be to add compiler directives allowing individual speed optimizations to be disabled or forced. The problem here is that code duplication (from function inlining or loop unrolling) might make it difficult to precisely specify which instance of the resulting code the directive applies to.

cardillan commented 1 year ago

The last two optimizations serving the speed goal were added today: Case Switching and the Return optimization.

Case Switching is useful for larger case expressions having several (at least 9) integer branches. Case expressions are normally compiled to a series of conditional jumps that compare the input value against individual alternatives until a match is found. This optimization replaces the conditional jumps with a jump table, which transfers control directly to the target branch by using the input value as an offset into a table constructed from jump instructions.
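
Mindcode's actual output may differ in details, but as a hand-written sketch of the principle, a case expression over the input values 0 to 2 might dispatch like this (instructions 4-6 form the jump table; the two initial jumps route out-of-range values to the else branch at instruction 13):

read t cell1 0
jump 13 lessThan t 0
jump 13 greaterThan t 2
op add @counter @counter t
jump 7 always 0 0
jump 9 always 0 0
jump 11 always 0 0
print "zero"
jump 14 always 0 0
print "one"
jump 14 always 0 0
print "two"
jump 14 always 0 0
print "other"
end

After the op instruction at index 3 executes, @counter points at index 4, so adding t lands execution on table slot 4 + t. Any in-range value thus reaches its branch in four executed instructions (two range checks, the op, and one table jump), regardless of the number of branches, instead of a chain of conditional jumps that grows with the branch count.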

Case Switching will still be improved a bit in the future, although that requires adding the ability to track possible variable values to the Data Flow Optimization, which will - probably - be a lot of work.

The Return optimization isn't that significant and applies only to recursive functions and only in some specific contexts.

The family of speed optimizations is complete for now - I have no ideas for a completely new kind of optimization. There's still a lot of room for improvements to the existing speed optimizations, though.

A new instruction-limit compiler option was added, which allows the speed optimizations to exceed the standard limit of 1000 instructions. It is only usable in cases where other optimizations reduce the code back under 1000 instructions, as mentioned in the previous comment. I've already encountered some cases where this hack allowed an optimization to be applied that would otherwise be unavailable.

I'm preparing a benchmark to demonstrate the effect of all the new optimizations on a real-world Mindustry Logic program. Stay tuned :)

cardillan commented 1 year ago

The promised benchmark is here: https://github.com/cardillan/mindcode/discussions/106

cardillan commented 1 year ago

A new idea about another kind of speed optimization (although it will require an extension of the Data Flow Analysis): if a variable is known to only attain a limited number of discrete values - which is what the Data Flow Analysis has to determine - it might be possible to split the code that uses that variable into several code paths, one for each possible value, in each of which the variable would be replaced by the constant value. There would be an overhead of splitting the code path (jumping to the right one on entry and jumping to a common end point on exit); the savings would come from constant folding and constant propagation on the individual code paths. Let's call this "Code Path Splitting".
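
As a hand-written Mindcode sketch of the idea (the variable names are made up), suppose Data Flow Analysis determines that dir only ever holds 1 or -1 in this code:

x += dir * step
y += dir * step

A hypothetical Code Path Splitting would branch once on dir and substitute the constant on each path, letting constant folding turn the multiplications into plain additions and subtractions:

if dir == 1
    x += step
    y += step
else
    x -= step
    y -= step
end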

This will probably require implementing at least a crude estimation of probable optimization gains, to avoid applying the optimization to code blocks where it wouldn't have led to any improvement at all (and incurring a costly subsequent rollback). Such an estimation would help other speed optimizations as well, though.

A similar, and probably easier, application of this idea is to use Data Flow Analysis to identify function calls that have a constant argument. A separate function might be created for each distinct constant argument value (or combination of constant argument values), saving the overhead of passing the argument(s) into the function and possibly enabling constant folding/propagation in the function body. Furthermore, the original function might be kept for calls that are not covered by a specialized function. Let's call this "Function Specialization".
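
A hand-written Mindcode sketch of what Function Specialization might do (the function and variable names are made up):

// original: the second argument is a constant at every call site
def scale(x, factor)
    x * factor
end
print(scale(a, 2))
print(scale(b, 2))
print(scale(c, 10))

// hypothetical specialized versions the optimizer could generate:
def scale2(x)
    x * 2
end
def scale10(x)
    x * 10
end
print(scale2(a))
print(scale2(b))
print(scale10(c))

Each specialized function saves the instruction that passes the constant argument in, and opens the body to further constant folding.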

limonovthesecond2 commented 7 months ago

Another optimization idea. Make the compiler optimize jumps to end. Mindcode often generates jump label-end <comparison>, which can be replaced by jump 0 <comparison> instead. Example:

t = cell1[0]
case t
    when 1 then print("1")
    when 2 then print("2")
    when 3 then print("3")
    else print("other")
end

This code compiles with three jump 11 always instructions; after each of them, the end instruction (equivalent to jump 0 always) is executed.

read t cell1 0
jump 4 notEqual t 1
print "1"
jump 11 always 0 0
jump 7 notEqual t 2
print "2"
jump 11 always 0 0
jump 10 notEqual t 3
print "3"
jump 11 always 0 0
print "other"
end

Currently, to force the compiler to optimize these jumps, we need to wrap the code in an infinite loop:

while true
    t = cell1[0]
    case t
        when 1 then print("1")
        when 2 then print("2")
        when 3 then print("3")
        else print("other")
    end
end

Compiles:

read t cell1 0
jump 4 notEqual t 1
print "1"
jump 0 always 0 0
jump 7 notEqual t 2
print "2"
jump 0 always 0 0
jump 10 notEqual t 3
print "3"
jump 0 always 0 0
print "other"
jump 0 always 0 0
end

limonovthesecond2 commented 7 months ago

Another optimization idea. Replace and and or expressions with a jump equivalent (not sure if this will work in all cases). Example:

t = cell1[0]
if t == 1 and t == 2
    print("ok")
end

When combining two comparisons, it takes four operations to reach the jump. Moreover, even if the first comparison is false, evaluation does not stop until everything is calculated.

read t cell1 0
op equal __tmp1 t 1
op equal __tmp2 t 2
op land __tmp3 __tmp1 __tmp2
jump 6 equal __tmp3 false
print "ok"
end

Currently we need to nest multiple if expressions to achieve better efficiency:

t = cell1[0]
if t == 1
    if t == 2
        print("ok")
    end
end

Compiles:

read t cell1 0
jump 4 notEqual t 1
jump 4 notEqual t 2
print "ok"
end

The or operator can also be done with something like this:

t = cell1[0]
if t != 1
    if t != 2
        print("not ok")
        // code
        end()
    end
end
print("ok")
end()

limonovthesecond2 commented 7 months ago

Another optimization idea. In loops containing expressions such as if-else and case, when a branch completes, a jump transfers control to the first instruction after the expression. However, we can inject the rest of the loop body there instead. Example:

i = 0
do
    block = getlink(i)
    case block
        when @unloader then print("unloader")
        when @sorter then print("sorter")
        else print("else")
    end
    i += 1
loop while i < @links

Compiles:

set i 0
getlink block i
jump 5 notEqual block @unloader
print "unloader"
jump 9 always 0 0
jump 8 notEqual block @sorter
print "sorter"
jump 9 always 0 0
print "else"
op add i i 1
jump 1 lessThan i @links
end

This code contains two jump 9 always instructions, which redirect to the counter increment and the loop condition. Instead of this redirection, we can duplicate that code at the end of each branch, which results in a larger size but greater efficiency. To avoid reaching the size limit, something like auto function inlining with a maximum operation count limit could be used.

i = 0
while true
    block = getlink(i)
    case block
        when @unloader then
            print("unloader")
            i += 1
            if i >= @links
                end()
            end
        when @sorter then
            print("sorter")
            i += 1
            if i >= @links
                end()
            end
        else
            print("else")
            i += 1
            if i >= @links
                end()
            end
    end
end

Compiles:

set i 0
getlink block i
jump 7 notEqual block @unloader
print "unloader"
op add i i 1
jump 1 lessThan i @links
end
jump 12 notEqual block @sorter
print "sorter"
op add i i 1
jump 1 lessThan i @links
end
print "else"
op add i i 1
jump 1 lessThan i @links
end
end

limonovthesecond2 commented 7 months ago

Remark: The last proposal is most useful in programs with one large loop, after which there is not much code and any delay is undesirable.

cardillan commented 7 months ago

Thanks for all your suggestions! I'll comment on all of them here:

Another optimization idea. Make the compiler optimize jumps to end. Mindcode often generates jump label-end <comparison>, which can be replaced by jump 0 <comparison> instead. (...)

Mindcode actually already does this, when the optimization level is set to aggressive. It is part of the Jump Threading optimization.

The web application uses the basic optimization level by default. The reason for this is two-fold:

Another optimization idea. Replace and and or expressions with a jump equivalent (not sure if this will work in all cases). (...)

That would indeed be a significant improvement. A short-circuit evaluation would achieve this, and I've been trying to figure out how to do it for quite some time. This is definitely on my list.

Another optimization idea. In loops with expressions such as if-else and case after the expression is completed, the jump forwards to the next command after the comparing. However, we can inject all the code there instead. (...)

This is something I haven't thought about yet, and it looks like a very useful optimization! I'll have to mull over this for a bit, to figure out how to conceptually handle it. Thanks a lot for this suggestion!

cardillan commented 7 months ago

@limonovthesecond2 I've created a separate issue for the new optimization you've suggested.