Akuli / asda

My attempt at making a statically typed programming language
MIT License

Investigation into reported performance issues #6

Closed. PurpleMyst closed this issue 5 years ago.

PurpleMyst commented 5 years ago

Hey @Akuli! :man_dancing:

I've been looking into the performance issues you reported today in our IRC channel.

I started by running this simple asda example based on the one you posted:

# assumes that x and y are positive, and something else that i'm not sure about
let divisible = (Int x, Int y) -> Bool:
    let x_copy = x    # this is a bug in asdac, this shouldn't be required

    #       x divisible by y
    # <=>   there exists an integer n so that x = n*y
    # <=>   if x = n*y + r, where 0 <= r < y, then r=0
    #
    # trying to find the r
    while TRUE:
        for let r = 0; r != y; r = r+1:
            if x_copy == y + r:
                return (r == 0)

        # exists n so that x = n*y  <=>  exists m so that x-y = m*y
        x_copy = x_copy - y

let and = (Bool a, Bool b) -> Bool:
    if a:
        if b:
            return TRUE
    return FALSE

let fizzbuzz = (Int n) -> Str:
    if (n `divisible` 3) `and` (n `divisible` 5):
        return "Fizzbuzz"
    if n `divisible` 3:
        return "Fizz"
    if n `divisible` 5:
        return "Buzz"
    return n.to_string()

print(fizzbuzz(5))
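For comparison, here's what `divisible` computes, as a plain C modulo check (a hypothetical helper, just to pin down the semantics; asda doesn't have a modulo operator yet, which is why the example searches for the remainder by hand):

```c
#include <stdbool.h>

/* Equivalent of the asda divisible function above: x is divisible by y
 * exactly when the remainder of x / y is zero, which is what the
 * while/for search above computes one subtraction step at a time. */
static bool divisible(int x, int y)
{
    return x % y == 0;
}
```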

I used this shell script to create a performance graph using gprof2dot and callgrind:

#!/usr/bin/env bash

set -exo pipefail

file=asdar/fizzbuzz.asda

pushd ..
python3 -m asdac "$file"
mv "asda-compiled/${file}c" asdar/
popd

rm -fv ./*callgrind*

make -j8

valgrind --tool=callgrind ./asdar ./fizzbuzz.asdac
gprof2dot callgrind.out.* --format=callgrind --output=callgrind.dot
dot -Tpng ./callgrind.dot -o graph.png
feh graph.png &

I haven't looked too much into the resulting graph due to time constraints; however, I have identified one thing: this codebase allocates a lot.

It seems that bytecode parsing is allocation-heavy, and the program spends 25% of execution time just parsing. I've identified struct Link as a linked list that allocates a lot, and I'm not sure we really need it.

I will look further into the graph and the code and start creating PRs to improve performance.

One other idea I had is a "peephole" optimizer, i.e. optimizing the bytecode itself by recognizing patterns and introducing specialized opcodes (e.g. jump if not, assign add, ...)

PurpleMyst commented 5 years ago

Update 1

I've converted read_body to use a dynamically-sized array instead of a linked list, which avoids many allocations; however, I haven't benchmarked the speed increase.

I'd appreciate it if you could run a small benchmark for this :smiley_cat:

Here's the patch for what I've done:

From 58a811e794cb891bf4c8ad0cbca79d058b15824e Mon Sep 17 00:00:00 2001
From: Purple Myst <PurpleMyst@users.noreply.github.com>
Date: Sat, 6 Jul 2019 03:34:20 +0200
Subject: [PATCH] improvement(bcreader): avoid allocations in read_body

from ~250 allocs to ~150 allocs
---
 asdar/src/bcreader.c | 45 ++++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/asdar/src/bcreader.c b/asdar/src/bcreader.c
index 4448078..d7a2074 100644
--- a/asdar/src/bcreader.c
+++ b/asdar/src/bcreader.c
@@ -460,19 +460,16 @@ static bool read_op(struct BcReader *bcr, unsigned char opbyte, struct CodeOp *r
 }

 // this is for a temporary linked list of CodeOps
-struct Link {
-   struct CodeOp op;
-   struct Link *prev;
-};
-
 static bool read_body(struct BcReader *bcr, struct Code *code)
 {
    if (!read_uint16(bcr, &code->nlocalvars))
        return false;

-   struct Link *last = NULL;
    code->nops = 0;

+   struct CodeOp *arr = NULL;
+   size_t capacity = 0;
+
    while(true) {
        unsigned char ob;
        if (!read_opbyte(bcr, &ob))
@@ -487,38 +484,36 @@ static bool read_body(struct BcReader *bcr, struct Code *code)
        if (!read_op(bcr, ob, &val))
            goto error;

-       struct Link *lnk = malloc(sizeof *lnk);
-       if (!lnk) {
-           codeop_destroy(&val);
-           interp_errstr_nomem(bcr->interp);
-           goto error;
+       if (code->nops >= capacity) {
+           capacity = capacity == 0 ? 1 : capacity * 2;
+           arr = realloc(arr, capacity * sizeof *arr);
+
+           if (!arr) {
+               codeop_destroy(&val);
+               interp_errstr_nomem(bcr->interp);
+               goto error;
+           }
        }

-       lnk->op = val;
-       lnk->prev = last;
-       last = lnk;
+       arr[code->nops] = val;
        code->nops++;
    }

-   if(!( code->ops = malloc(sizeof(struct CodeOp) * code->nops) )) {
+   /* shrink arr to fit */
+   arr = realloc(arr, code->nops * sizeof *arr);
+   if(!arr) {
        interp_errstr_nomem(bcr->interp);
        goto error;
    }

-   size_t i = code->nops;
-   for (struct Link *lnk = last, *prev; lnk; lnk = prev) {
-       prev = lnk->prev;
-       code->ops[--i] = lnk->op;
-       free(lnk);
-   }
+   code->ops = arr;
    return true;

 error:
-   for (struct Link *lnk = last, *prev; lnk; lnk = prev) {
-       prev = lnk->prev;
-       codeop_destroy(&lnk->op);
-       free(lnk);
+   for (size_t i = 0; i < code->nops; ++i) {
+       codeop_destroy(&arr[i]);
    }
+   free(arr);
    return false;
 }
Akuli commented 5 years ago

Bytecode reading is not performance critical, because it is only done once on startup and the time spent doing it does not increase if you do fizzbuzz(100) or something instead of fizzbuzz(5). Can you instead optimize fizzbuzz(100)?

Edit: I meant a loop with fizzbuzz called for values 5,6,7,...,100

Akuli commented 5 years ago

I think the problem is likely with the integer operations in divisible. Here are some ideas that might help:

Akuli commented 5 years ago

One other idea I had is a "peephole" optimizer, i.e. optimizing the bytecode itself by recognizing patterns and introducing specialized opcodes (e.g. jump if not, assign add, ...)

This is probably a good idea, but it means more different opcodes to maintain and therefore needs to be benchmarked before merging to master.

When I experimented with implementing yes in asda, I got noticeably better results by optimizing the bytecode by hand. I deleted things like !! (boolean negation of the topmost object on the stack, twice) and replaced T! with F (instead of pushing TRUE to the stack and negating it, just push FALSE). A special "always jump" opcode instead of TJ (push TRUE to the stack, then pop an item from the stack and jump if it's TRUE) could help too.
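To sketch the idea, using the single-letter opcode names from above as a stand-in encoding (a real pass would also have to fix up jump targets, which this deliberately ignores):

```c
#include <stddef.h>

/* Minimal peephole pass over a flat opcode array. Stand-in encoding:
 * 'T' = push TRUE, 'F' = push FALSE, '!' = boolean negate,
 * 'J' = pop and jump if TRUE, 'G' = the proposed "always jump" opcode.
 * Rewrites in place and returns the new length. Jump offsets are NOT
 * adjusted here; a real optimizer must relocate them. */
static size_t peephole(unsigned char *ops, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n && ops[i] == '!' && ops[i + 1] == '!') {
            i++;                 /* drop "!!": double negation is a no-op */
        } else if (i + 1 < n && ops[i] == 'T' && ops[i + 1] == '!') {
            ops[out++] = 'F';    /* "T!" -> "F": push FALSE directly */
            i++;
        } else if (i + 1 < n && ops[i] == 'T' && ops[i + 1] == 'J') {
            ops[out++] = 'G';    /* "TJ" -> "G": always jump */
            i++;
        } else {
            ops[out++] = ops[i]; /* no pattern matched, keep the opcode */
        }
    }
    return out;
}
```

Writing `out` never overtakes the read index `i` (each match consumes two opcodes and emits at most one), so the rewrite is safe in place.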

Akuli commented 5 years ago

The patch looks good anyway, applying :D

It took me quite a while to figure out how to apply it, because it's missing a line with just a space on it at the end. Really: it wouldn't apply without adding a space and another newline character at the end of the file, and the error message was a very unhelpful "error: corrupt patch at line 83".

Akuli commented 5 years ago

Cache commonly used int objects like 1

I added a cache in b0ddb2e4dced173a824e473208f3b1501870aa11, and a for loop over the first 2000 fizzbuzz values now runs 20% faster (but I want to make it even faster, of course)
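For anyone following along, a small-int cache looks roughly like this (hypothetical IntObject and refcounting, not asdar's actual types):

```c
#include <stdlib.h>

/* Values 0..CACHE_SIZE-1 share one preallocated object each, so loops
 * that keep creating 0 and 1 (like the remainder search in divisible)
 * never hit malloc for them. The cached objects are never freed. */
#define CACHE_SIZE 256

typedef struct { long value; int refcount; } IntObject;

static IntObject small_ints[CACHE_SIZE];

static IntObject *intobj_new(long value)
{
    if (value >= 0 && value < CACHE_SIZE) {
        small_ints[value].value = value;
        small_ints[value].refcount++;   /* shared: bump, don't allocate */
        return &small_ints[value];
    }
    IntObject *obj = malloc(sizeof *obj);
    if (!obj)
        return NULL;                    /* caller reports out of memory */
    obj->value = value;
    obj->refcount = 1;
    return obj;
}
```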

PurpleMyst commented 5 years ago

Replace GMP with long long or similar. If performance increases a lot, then the integer objects should use GMP only for numbers that don't fit in long long

This has been implemented in #7; read the commit messages for more details. That PR also includes an .editorconfig file (useful! Porcupine should support that) and a change to src/runner.c with better DEBUG_PRINT logic
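The core of the idea (hypothetical names, and the GMP path stubbed out): do arithmetic on a plain long long, and detect overflow before it happens, only then promoting to a GMP value:

```c
#include <limits.h>
#include <stdbool.h>

/* Tagged integer: long long for the common case, a big-int
 * representation (GMP in asdar) only when the value doesn't fit. */
typedef struct {
    bool is_small;
    long long small;
    /* mpz_t big;  -- GMP value, used only when is_small is false */
} Int;

/* Add two small values. Returns false on overflow, in which case the
 * caller should redo the addition with GMP. The check happens before
 * the addition because signed overflow is undefined behavior in C. */
static bool add_small(long long a, long long b, long long *res)
{
    if ((b > 0 && a > LLONG_MAX - b) || (b < 0 && a < LLONG_MIN - b))
        return false;
    *res = a + b;
    return true;
}
```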

PurpleMyst commented 5 years ago

This is probably a good idea, but it means more different opcodes to maintain and therefore needs to be benchmarked before merging to master.

This is something CPython implements, and some of the optimizations it does are even about removing opcodes

PurpleMyst commented 5 years ago

Looking into asda_function_cfunc_ret, it looks like we allocate for every function call. Is that normal? Couldn't we just use the same stack?

static Object *asda_function_cfunc_ret(Interp *interp, struct ObjData data, Object *const *args, size_t nargs)
{
    struct Runner rnr;
    switch (run(interp, data.val, &rnr, args, nargs)) {
    // compiler adds a didn't return error to end of returning functions, so RUNNER_DIDNTRETURN can't happen here
    case RUNNER_ERROR:
        return NULL;
    case RUNNER_VALUERETURN:
        return rnr.retval;
    default:
        assert(0);
    }
}
Akuli commented 5 years ago

Porcupine should support that

Yes. Most of its filetype-specific things should be completely replaced with editorconfig files.

Looking into asda_function_cfunc_ret it looks like we allocate for every function call?

Do you mean allocating rnr.stack? I have thought about reusing rnr.stack for different function calls, and I think you have attempted to implement something like that too. If you want to work on it more, that should be a separate issue or pull request IMO. Performance is quite good now :)
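For reference, the reuse idea could look roughly like this (hypothetical types and names, not asdar's actual Runner API): keep one growable stack per interpreter, and have each call record the stack length on entry and truncate back to it on return, so nested calls share one allocation.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    void **items;
    size_t len;
    size_t cap;
} Stack;

/* Mark the stack depth on function entry... */
static size_t stack_enter(const Stack *st) { return st->len; }

/* ...and pop everything the call pushed when it returns. */
static void stack_leave(Stack *st, size_t saved) { st->len = saved; }

static bool stack_push(Stack *st, void *obj)
{
    if (st->len >= st->cap) {
        size_t newcap = st->cap ? st->cap * 2 : 8;
        void **tmp = realloc(st->items, newcap * sizeof *tmp);
        if (!tmp)
            return false;   /* original items buffer still valid */
        st->items = tmp;
        st->cap = newcap;
    }
    st->items[st->len++] = obj;
    return true;
}
```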