devedse opened this issue 5 years ago
If you need that to be fast, instead of having NoAction and HasAction, bake the action itself into a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). That way you avoid the delegate invoke. I showed an example in the talk at DotNext, I think.
- The targets can be stackalloc.
- Don't use the variable, use the constant. That way you are not stalling the processor with a Read-after-write hazard.
- You are paying for the edge comparison every time as it is a normal condition. Check on top if you are on the edge or not, and use an optimized routine where it is not (the most usual case).
- Don't make target sequential, use it as a 'long' (4 bytes). Each check always writes to the same one, then with a single AVX operation you check for 0 ;)
- You can optimize the layout of the data itself, all X together... all Y together and do that with AVX instructions for ultimate l33t :P

Don't sweat, you have at least 6x or so to optimize in that routine :)
@redknightlois, thanks a lot for this :). I don't think I quite understand everything you talked about, but I'll try to figure out as much as I can. I'll probably come back with some questions for you soon :). Thanks in advance.
Which part do you need more detail on?
I'll let you know soon, once I have some time to spend on my code again :).
@redknightlois,
The baseline for generating a maze is about 3.81 seconds, with the lowest being 3.72: https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs
I'm currently back in action again :). I played around with your advice a bit:
- The targets can be stackalloc.
I changed:
MazePoint[] targets = new MazePoint[4];
to:
Span<MazePoint> targets = stackalloc MazePoint[4];
This actually resulted in mazes being generated a little bit slower (generation time: 3.92 seconds).
- Don't use the variable, use the constant. That way you are not stalling the processor with a Read-after-write hazard.
What exactly do you mean by this? Every time I add a target I need to increase targetCount to put the next one in the next slot. How would you propose doing this? (Or am I missing what you mean? :smile:)
- You are paying for the edge comparison every time as it is a normal condition. Check on top if you are on the edge or not, and use an optimized routine where it is not (the most usual case).
You'd still have to check whether you're on an edge at every step, right? Right now I only check whether I'm on the left edge when I want to add the left target. If I did all of this beforehand, I'd just have to store the results in bools and then still do an if check like if (leftEdge || rightEdge || ....) before I can add the other targets.
Or would you have a better idea?
- Don't make target sequential, use it as a 'long' (4 bytes). Each check always writes to the same one, then with a single AVX operation you check for 0 ;)
Do you mean that I should bitshift the x and y into a long? Or make a union out of it?
In MazePointPos I've indeed set the struct layout to Sequential, but not in the default MazePoint. The reason for MazePointPos being sequential is to save memory when using it in a big array.
- You can optimize the layout of the data itself, all X together... all Y together and do that with AVX instructions for ultimate l33t :P
I'll see if I can do that. Would you have to do something special to get AVX working?
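A minimal sketch of the union idea, assuming a point made of two 32-bit coordinates; the type and field names below are illustrative and are not the repository's actual MazePoint:

using System.Runtime.InteropServices;

// Union-style point: X and Y overlap a single 64-bit field, so one
// read, write or compare covers both coordinates at once.
[StructLayout(LayoutKind.Explicit)]
public struct PackedMazePoint
{
    [FieldOffset(0)] public int X;
    [FieldOffset(4)] public int Y;
    [FieldOffset(0)] public long Combined;
}

On the AVX question: nothing special is needed at build time; managed code typically reaches SIMD through System.Numerics.Vector<T> or, on newer runtimes, the System.Runtime.Intrinsics.X86 APIs, and the JIT only emits the instructions when the hardware supports them.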
I did some more playing around and tried to store the x - 2 (which occurs three times) in a variable. However, this didn't improve the times but rather slowed them down by about 5%.
Before:
MazePoint cur = stackje.Peek();
x = cur.X;
y = cur.Y;
int targetCount = 0;
if (x - 2 > 0 && !map[x - 2, y])
{
targets[targetCount].X = x - 2;
targets[targetCount].Y = y;
targetCount++;
}
if (x + 2 < width - 1 && !map[x + 2, y])
{
targets[targetCount].X = x + 2;
targets[targetCount].Y = y;
targetCount++;
}
if (y - 2 > 0 && !map[x, y - 2])
{
targets[targetCount].X = x;
targets[targetCount].Y = y - 2;
targetCount++;
}
if (y + 2 < height - 1 && !map[x, y + 2])
{
targets[targetCount].X = x;
targets[targetCount].Y = y + 2;
targetCount++;
}
After:
MazePoint cur = stackje.Peek();
x = cur.X;
y = cur.Y;
int xLeft = x - 2;
int xRight = x + 2;
int yUp = y - 2;
int yDown = y + 2;
int targetCount = 0;
if (xLeft > 0 && !map[xLeft, y])
{
targets[targetCount].X = xLeft;
targets[targetCount].Y = y;
targetCount++;
}
if (xRight < width - 1 && !map[xRight, y])
{
targets[targetCount].X = xRight;
targets[targetCount].Y = y;
targetCount++;
}
if (yUp > 0 && !map[x, yUp])
{
targets[targetCount].X = x;
targets[targetCount].Y = yUp;
targetCount++;
}
if (yDown < height - 1 && !map[x, yDown])
{
targets[targetCount].X = x;
targets[targetCount].Y = yDown;
targetCount++;
}
Does anyone have any idea why this doesn't improve generation times?
Because you are using 4 extra registers, the JIT is probably doing a poorer allocation of registers and you end up paying twice for the stack store and retrieve. Post both assembler listings and I can tell you for sure.
@redknightlois, wow, that's some fast response time ^^. Thanks. If you do have some time: just above, I also posted some additional questions about your first reply.
Let me see if I can get the assembler code.
What exactly do you mean by this? Every time I add a target I need to increase targetCount to put the next one in the next slot. How would you propose doing this? (Or am I missing what you mean? 😄)
You can get away without the targetCount at all. This is coupled with another of the comments. Always store the Left on index 0, Right on index 1, Up on index 2 and so on.
If you now use a byte array, you can check for 0 in an unsafe way just reading it as a long and comparing with 0.
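As a hedged sketch of that idea, reusing the same x, y, width, height and map locals as the snippets above (the exact slot layout is illustrative): each check writes a 0 or 1 into its own fixed slot of an 8-byte buffer, and a single read of that buffer as a long tells you whether any direction is open.

using System;
using System.Runtime.InteropServices;

// Fixed slots: 0 = Left, 1 = Right, 2 = Up, 3 = Down; the remaining bytes are padding.
Span<byte> valid = stackalloc byte[8];
valid[0] = (byte)((x - 2 > 0 && !map[x - 2, y]) ? 1 : 0);
valid[1] = (byte)((x + 2 < width - 1 && !map[x + 2, y]) ? 1 : 0);
valid[2] = (byte)((y - 2 > 0 && !map[x, y - 2]) ? 1 : 0);
valid[3] = (byte)((y + 2 < height - 1 && !map[x, y + 2]) ? 1 : 0);

// One read-and-compare over all four slots instead of tracking a running count.
bool anyTarget = MemoryMarshal.Read<long>(valid) != 0;

Because every direction always lands in the same slot, the stores no longer depend on a just-incremented counter, which is what the 'use the constant' comment earlier is getting at.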
Try changing:
targets[targetCount].X = xLeft;
targets[targetCount].Y = y;
into
ref MazePoint t = ref targets[targetCount];
t.X = xLeft;
t.Y = y;
There shouldn't be any difference, but I am not entirely sure if the JIT of 2.0 is able to reuse the t memory location without using the ref.
EDIT: You can also do the same with the map statements... twice per get if you need to keep register use down (because you hit the threshold).
If you need that to be fast, instead of having NoAction and HasAction, bake the action itself into a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). That way you avoid the delegate invoke. I showed an example in the talk at DotNext, I think.
This is doing exactly this:
public readonly struct SomeObjectCallback : IAction
{
private readonly SomeObject _myObjectRef;
public SomeObjectCallback(SomeObject objectRef)
{
this._myObjectRef = objectRef;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public void Invoke(int step, int total, long x, long y)
{
this._myObjectRef.SomeMethodCallback(step, total, x, y);
}
}
I'll hold off on my further response until I've checked out all your advice. I did, however, create a SharpLab entry if you want to check out the emitted assembler code: Link
Again, your help is really appreciated; I'd love to learn more about this :)
About the ref, if you change the code you will see this:
With usage of ref:
L0174: lea rax, [r15+0x10]
L0178: mov [rax], ecx
L017a: mov [rax+0x4], r13d
L017e: mov dword [rsp+0x54], 0x1
Without ref:
L01dc: movsxd r8, eax
L01df: mov [r15+r8*8+0x10], ecx
L01e4: movsxd rcx, eax
L01e7: mov [r15+rcx*8+0x14], r13d
L01ec: inc eax
L01ee: mov [rsp+0x54], eax
See the difference? It's pretty difficult to get all that information from SharpLab, mainly because there is no marker for which part of the code each assembler line belongs to... hunting for the proper line is hard. This version uses far fewer physical resources, which, if you saw the Scratched Metal talk, is key for high-performance tight loops.
Then again, that would depend on the targeted JIT and there are bigger offenders, but you can help older JITs with the ref...
Thanks for the elaborate explanation.
What exactly do you mean by this? Every time I add a target I need to increase targetCount to put the next one in the next slot. How would you propose doing this? (Or am I missing what you mean? 😄)
You can get away without the targetCount at all. This is coupled with another of the comments. Always store the Left on index 0, Right on index 1, Up on index 2 and so on. If you now use a byte array, you can check for 0 in an unsafe way just reading it as a long and comparing with 0.
The reason I need the targetCount is because I need to randomly choose one of the possible valid directions:
MazePoint target = targets[random.Next(targetCount)];
If I don't know which directions are valid ones, I also can't randomly choose a valid direction.
The only thing I can think of is predetermining all valid directions with 4 bools and then simply coding everything out. E.g.:
if (leftValid && topValid && bottomValid && rightValid)
{
var randomDirection = rnd.Next(4);
if (randomDirection == 0)
{
stackje.Push(new Target(x - 2, y));
}
....
}
else if (leftValid && topValid && !bottomValid && !rightValid)
{
var randomDirection = rnd.Next(2);
if (randomDirection == 0)
{
stackje.Push(new Target(x - 2, y));
}
....
}
else if (.....)
.....
However, this would become a humongous if/else chain because you'd have to write it out for every random scenario and for all the different scenarios where, for example, Left and Top are valid but Bottom and Right are not, because then the rnd.Next(4) should be rnd.Next(2) instead.
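A hedged alternative, sketched purely as an illustration rather than something proposed in this thread: fold the four validity checks into a 4-bit mask and precompute, for each of the 16 possible masks, which directions are valid. A single random.Next over that small array then picks uniformly among the valid directions without writing out every branch combination. The class and member names are made up, and System.Random stands in for the XorShiftRandom used in the repository.

using System;
using System.Collections.Generic;

static class DirectionTable
{
    // Bit 0 = Left, bit 1 = Right, bit 2 = Up, bit 3 = Down.
    // ByMask[mask] lists the direction indices that are valid for that mask.
    private static readonly byte[][] ByMask = Build();

    private static byte[][] Build()
    {
        var table = new byte[16][];
        for (int mask = 0; mask < 16; mask++)
        {
            var dirs = new List<byte>();
            for (byte d = 0; d < 4; d++)
                if ((mask & (1 << d)) != 0) dirs.Add(d);
            table[mask] = dirs.ToArray();
        }
        return table;
    }

    // The caller builds the mask from the same four edge/map checks as before and
    // only calls this when mask != 0; the result maps back to (x - 2, y), (x + 2, y), etc.
    public static byte Pick(int mask, Random random)
    {
        byte[] options = ByMask[mask];
        return options[random.Next(options.Length)];
    }
}

Whether the table lookup actually beats the targetCount version is an open question; it trades branches for an extra indirection, so it would need the same kind of benchmarking as the other changes in this thread.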
Try changing:
targets[targetCount].X = xLeft; targets[targetCount].Y = y;
into
ref MazePoint t = ref targets[targetCount]; t.X = xLeft; t.Y = y;
There shouldn't be any difference, but I am not entirely sure if the JIT of 2.0 is able to reuse the t memory location without using the ref.
EDIT: You can also do the same with the map statements... twice per get if you need to keep register use down (because you hit the threshold).
I'll definitely look into implementing this. Map statements will probably be harder because the map is implemented using a BitArray.
If you need that to be fast, instead of having NoAction and HasAction, bake the action itself into a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). That way you avoid the delegate invoke. I showed an example in the talk at DotNext, I think.
This is doing exactly this: ....
Thanks for the advice. In my case I don't really need the performance when an action is provided, though. But if I do for some other scenario, I'd definitely implement it like this.
Thanks a lot for the help.
This is currently the fastest I've managed to get: https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs It takes about 3.6 seconds.
I did try to rewrite the whole algorithm to do fewer if checks: https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs For some reason this code seems to be a tiny bit slower than AlgorithmBacktrack2, though. Probably because the compiled method is now becoming huge? (If I disable inlining for that SetMap method it actually seems to become a tiny bit faster, but it's still slower than AlgorithmBacktrack2.) It takes about 3.7 seconds.
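For reference, disabling inlining of a single method, as tried for SetMap above, is done with a MethodImpl attribute. A minimal hedged sketch with a made-up method; the real SetMap has its own signature:

using System.Runtime.CompilerServices;

public static class InliningExample
{
    // Asks the JIT not to inline this method, which keeps the calling method smaller.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void SetMapLike(int x, int y)
    {
        // body elided
    }
}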
switch statements are killing you, the cost of those is high for tight loops.
This is quite a clever use of signaling interfaces, good job there ;) https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs#L34
switch statements are killing you, the cost of those is high for tight loops. This is quite a clever use of signaling interfaces, good job there ;) https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs#L34
Ah, I thought switch statements would be better. If I changed this to if checks, would that be better?
I could either write out all cases in an else-if chain or write them with some bitwise checks, e.g. doing a bitwise check whether Left is valid, and if so, then checking whether Up is also valid, etc.
Which one would you suggest?
First, remove the exceptions; creating exceptions (even if they are not in a try-catch) is a good way to disrupt the code layout, unless you hit one of the JIT's optimizations for cold code. You will have to look at the generated assembler to know for sure whether the code is being moved away.
Mhhh, definitely not switch, but I would rework the whole process instead... the amount of repeated code you are writing (even if cleverly encapsulated in the function) is a hint that the flow itself is potentially at fault here.
Do you have an example I can run with a known (deterministic) output that I can check, in order to avoid making mistakes? I want to try a few things myself which are probably too convoluted to explain from scratch.
@redknightlois , I'll try to see what removing the exceptions does for performance, thanks for the tip.
As for trying it yourself, you can simply clone the repository and run the included ConsoleApplication. This piece of code will be executed with a deterministic seed:
public static void ActualBenchmark2()
{
int size = 16384;
var alg = new AlgorithmBacktrack3();
Console.WriteLine($"Generating mazes using {alg.GetType().Name}...");
int seed = 1337;
while (true)
{
var w = Stopwatch.StartNew();
var innerMapFactory = new InnerMapFactory<BitArreintjeFastInnerMap>(size, size);
var randomFactory = new RandomFactory<XorShiftRandom>(seed);
var actionThing = new NoAction();
var maze = alg.GoGenerate2(innerMapFactory, randomFactory, actionThing);
w.Stop();
Console.WriteLine($"Generation time: {w.Elapsed}");
seed++;
//var result = MazeVerifier.IsPerfectMaze(maze);
//Console.WriteLine($"Perfect maze verification time: {w.Elapsed}");
//Console.WriteLine($"Is our maze perfect?: {result}");
}
}
@redknightlois, removing the exceptions didn't really seem to do much. However, rewriting everything as an if/else chain did seem to improve the speed somewhat: https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack4.cs
Still, the problem there is an overly long method with lots of register spilling.
Yeah, I understand. V2 had the shorter method but uses more checks instead, so that's what's causing the slowness there.
I'd like to track the progress being made on improving the performance of the AlgorithmBacktrack generator.
I've created a new class that uses generics + structs to remove the logging statements at compile/JIT time to improve performance, but I'm definitely looking for more.
Class with work in progress: https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs