Exponential increase in compilation time and size with switch cases (+ extractors)

Aurel300 commented 6 years ago

class Test {
  @:keep
  static var rules:Array<Array<Token>->Null<Token>> = [
    s -> switch (s) {
        case _.slice(-1) => [Foo]: Baz("a");
        case _.slice(-2) => [Foo, Baz(a)]: Chuck(a);
        case _.slice(-3) => [Foo, Chuck(a), Chuck(b)]: Terry(a);
        case _.slice(-4) => [Foo, Terry(a), Terry(b), Terry(c)]: Perry(a);
        case _.slice(-5) => [Foo, Perry(a), Perry(b), Perry(c), Perry(d)]: Larry(a);
        case _.slice(-6) => [Foo, Larry(a), Larry(b), Larry(c), Larry(d), Larry(e)]: Maria(a);
        case _.slice(-7) => [Foo, Maria(a), Maria(b), Maria(c), Maria(d), Maria(e), Maria(f)]: Trevor(a);
        case _: null;
      }
  ];
  static function main() {}
}

enum Token {
  Foo;
  Bar(a:String);
  Baz(a:String);
  Chuck(a:String);
  Terry(a:String);
  Perry(a:String);
  Larry(a:String);
  Maria(a:String);
  Trevor(a:String);
}

(also available on try-haxe here)

This script seems to give the compiler a lot of trouble. The above verbatim fails to compile on try-haxe without commenting out the last two cases – probably hitting a time / memory limit. I tested a similar script (where I first encountered this issue) and the time increase with each added case was ~tenfold (2 seconds, 15 seconds, 210 seconds). Tested targets:

neko
js
swf

The generated code easily goes into megabytes. Examining the JS and PHP code, it generates stupidly big if-elseif-else chains, spanning thousands of lines. I assume something similar is happening in the neko output.

Simn commented 5 years ago

I'm not sure if this can be fixed because this is a bit of an anti-pattern. The problem is that the compiler has to assume that for every case that a match failure may still match another case.

If we look at just the first pattern case _.slice(-1) => [Foo]:, there are two ways the matching can fail after the extractor:

The array length can be wrong (array.length != 1).
The element can not match (array[0] != Foo).

For both failures, the compiler has to retry the second case case _.slice(-2) => [Foo, Baz(a)], which now has three ways to fail: The two from the first case plus array[1] != Baz(a). So we end up with something like this:

if(firstCaseLengthMatch) {
    if(firstCaseElement0Match) {
        firstResult;
    } else {
        if(secondCaseLengthMatch) {
            if(secondCaseElement0Match) {
                if(secondCaseElement1Match) {
                    secondResult;
                } else {
                    defaultResult;
                }
            } else {
                defaultResult;
            }
        } else {
            defaultResult;
        }
    }
} else {
    if(secondCaseLengthMatch) {
        if(secondCaseElement0Match) {
            if(secondCaseElement1Match) {
                secondResult;
            } else {
                defaultResult;
            }
        } else {
            defaultResult;
        }
    } else {
        defaultResult;
    }
}

So with just two of these cases, we're already at 8 if branches. This increases to 32,152, 872, 5912 and at that point my computer gives up.

We might be able to improve on this by generating some && instead of nested ifs, but I think the fundamental problem is going to remain. An alternative would be to generate a set of functions instead of a straightforward decision tree, but that comes with its own set of problems and challenges.

Aurel300 commented 5 years ago

You say there are two ways the case can fail:

The array length can be wrong (array.length != 1).

The element can not match (array[0] != Foo).

What I don't understand is why these then lead to separate paths for the remaining cases. Surely if the first case fails, the remaining cases still have to be checked the same way? Why could the switch statement not be essentially the same as this Haxe code:

// var s:Array<Token>;
var temp2:Token;
if ({
    var temp = s.slice(-1);
    temp.length == 1
    && temp[0] == Foo;
  }) Baz("a");
else if ({
    var temp = s.slice(-2);
    temp.length == 2
    && temp[0] == Foo
    && (switch (temp[1]) {
        case Baz(a): temp2 = a; true;
        case _: false;
      })
  }) Chuck(temp2);
// etc

A bit stilted I admit but it would not expand exponentially. Or is there something I'm missing with pattern matching that makes this impossible? Would it be possible to compile the code as the above IF the compiler detects the branches are too numerous?

Simn commented 5 years ago

Yes that's what I mean when I say "generating some &&". But that also only works in situations where we can actually generate an if as opposed to a switch, i.e. if there are multiple array patterns of the same length.

I'll look into it though because it might fix your concrete example.

Simn commented 5 years ago

By simply folding the if expressions I'm now getting this:

Main.rules = [function(s) {
    var _hx_tmp;
    var _hx_tmp1;
    var _hx_tmp2;
    var _hx_tmp3;
    var _hx_tmp4;
    var _hx_tmp5;
    var _hx_tmp6 = s.slice(-1);
    if(_hx_tmp6.length == 1 && _hx_tmp6[0]._hx_index == 0) {
        return Token.Baz("a");
    } else {
        _hx_tmp = s.slice(-2);
        if(_hx_tmp.length == 2 && (_hx_tmp[0]._hx_index == 0 && _hx_tmp[1]._hx_index == 2)) {
            var a = _hx_tmp[1].a;
            return Token.Chuck(a);
        } else {
            _hx_tmp1 = s.slice(-3);
            if(_hx_tmp1.length == 3 && (_hx_tmp1[0]._hx_index == 0 && (_hx_tmp1[1]._hx_index == 3 && _hx_tmp1[2]._hx_index == 3))) {
                var a1 = _hx_tmp1[1].a;
                return Token.Terry(a1);
            } else {
                _hx_tmp2 = s.slice(-4);
                if(_hx_tmp2.length == 4 && (_hx_tmp2[0]._hx_index == 0 && (_hx_tmp2[1]._hx_index == 4 && (_hx_tmp2[2]._hx_index == 4 && _hx_tmp2[3]._hx_index == 4)))) {
                    var a2 = _hx_tmp2[1].a;
                    return Token.Perry(a2);
                } else {
                    _hx_tmp3 = s.slice(-5);
                    if(_hx_tmp3.length == 5 && (_hx_tmp3[0]._hx_index == 0 && (_hx_tmp3[1]._hx_index == 5 && (_hx_tmp3[2]._hx_index == 5 && (_hx_tmp3[3]._hx_index == 5 && _hx_tmp3[4]._hx_index == 5))))) {
                        var a3 = _hx_tmp3[1].a;
                        return Token.Larry(a3);
                    } else {
                        _hx_tmp4 = s.slice(-6);
                        if(_hx_tmp4.length == 6 && (_hx_tmp4[0]._hx_index == 0 && (_hx_tmp4[1]._hx_index == 6 && (_hx_tmp4[2]._hx_index == 6 && (_hx_tmp4[3]._hx_index == 6 && (_hx_tmp4[4]._hx_index == 6 && _hx_tmp4[5]._hx_index == 6)))))) {
                            var a4 = _hx_tmp4[1].a;
                            return Token.Maria(a4);
                        } else {
                            _hx_tmp5 = s.slice(-7);
                            if(_hx_tmp5.length == 7 && (_hx_tmp5[0]._hx_index == 0 && (_hx_tmp5[1]._hx_index == 7 && (_hx_tmp5[2]._hx_index == 7 && (_hx_tmp5[3]._hx_index == 7 && (_hx_tmp5[4]._hx_index == 7 && (_hx_tmp5[5]._hx_index == 7 && _hx_tmp5[6]._hx_index == 7))))))) {
                                var a5 = _hx_tmp5[1].a;
                                return Token.Trevor(a5);
                            } else {
                                return null;
                            }
                        }
                    }
                }
            }
        }
    }
}];

I'm pretty sure there are still cases where you get exponential output, but I struggle to come up with one.

Aurel300 commented 5 years ago

It already looks much better. For context, by the way, I was using a similar switch function in a shift-reduce parser. Maybe not a super common use case for Haxe but FP-like pattern matching seemed a very natural fit.

Simn commented 5 years ago

It's still not a good pattern because slice allocates.

Aurel300 commented 5 years ago

Yep, I understand that. Unfortunately Haxe has no support (yet?) for pattern matching parts of an array, something like

case _:[Foo, Bar(a)]:

Not sure if a cons operator makes sense …

Simn commented 5 years ago

Not sure about cons because that's about, well, cons lists in FP languages. You also couldn't match the end of the list without reversing it first.

But yes, I was also thinking about supporting prefix/suffix matching for Array (and perhaps String), but I can't think of a good syntax for it. Plus I'm not sure if it would fit Haxe.

Aurel300 commented 5 years ago

Perhaps add support for ... as a unary prefix / suffix operator?

case ...[Foo, Bar(a)]:
case [Foo, Bar(a)]...:
case "prefix"...:

I don't know. It would be a nice feature to have, but clearly not crucial. The slice allocations in my example can be avoided with some helper functions probably.

kevinresol commented 5 years ago

Maybe tink_lang is a good reference:

For array, it has Array Rest Pattern: case ['foo', @rest rest, 'bar']:

For string it has regex pattern: case ~/.+@.+\..+/": (but seems can't extract a match) A improved version maybe support extracting matched groups into an array: case ~/([a-z]*)\s(\d*)/ => [letters, digits]:

HaxeFoundation / haxe

Exponential increase in compilation time and size with switch cases (+ extractors) #6731