Code generation optimizations

erkyrath commented 3 years ago

(As suggested in this thread: https://intfiction.org/t/next-steps-for-inform-6-compiler/51008)

For lines like

if (false) { return 97; }

...the compiler generates an unconditional jump, but then generates the return opcode anyway.

    3  +00011 <*> jump         to L0 
    4  +00017 <*> return       byte_97 
    5  +0001a    .L0

Ideally, the author would use #ifdef instead of if for this sort of code. However, I7 generates trivial conditions sometimes. E.g.

Understand "count [number]" as counting.

...generates the stanza:

    if (v == GPR_NUMBER) {
        n = noun; noun = parsed_number;
        if (~~(((true) && (true)))) v = GPR_FAIL;
        noun = n;
    }

The compiler successfully collapses (~~(((true) && (true)))) to (false) but still generates the v = GPR_FAIL; assignment.

This can almost be addressed with the execution_never_reaches_here flag, except that that flag is currently used to generate the "statement can never be reached" warning, which is not exactly the same thing. (See assembleg_1_branch(), which turns off that flag to avoid generating a warning on "if (false)..." because the author probably did it on purpose.) (I7 didn't do it on purpose, but I7 suppresses all warnings anyhow.)

I suspect the fix here is for execution_never_reaches_here to have three states: false, true, and "true but the author did it on purpose". If non-false, we can skip generating code entirely in assembleg_instruction() / assemblez_instruction().

(Is it safe to just bail out of those functions? What about inline strings in Z-code? Sequence points? Symbols used only in dead code? So many corner cases to check!)

Note that the compiler assumes that every jump label is live code, even if no jump statement goes there. (See assemble_label_no().) We should probably keep this. It allows the author to avoid dead-branch optimization on a case-by-case basis:

    if (false) { .KeepLive; return; }

(This might be needed for occult code generation techniques like inter-function jumps. Which we don't support, but maybe someone is trying to do it anyway.)

erkyrath commented 3 years ago

Also note that the Z-code compiler uses a @jz opcode unconditionally for these cases. Could be a @jump instead.

Also from that forum thread:

Better optimized for-loops would be nice, using Z-code @inc_chk or @dec_chk where possible to save one instruction.

erkyrath commented 3 years ago

General note:

My usual rule is to not change code generation without a good reason. There's some value in being able to rebuild an old game file from old source exactly. There's a lot of value in being able to rebuild an old game file at all. (We really don't want changes that make game files larger -- that might bump a game over a Z-code size limit.)

Optimizations like this count as a good reason. (They make game files shorter!) However, we might want an optimization setting, whose values are "better code" or "compile like it's 2010", defaulting to "better code". On the flip side, this switch might make the compiler source pretty murky! We'll have to see what the implementation looks like.

curiousdannii commented 3 years ago

A branch gets emitted if there's nothing inside:

if (something) {}

something could have side effects, so it still needs to be emitted, but the branch doesn't.

erkyrath commented 3 years ago

Another example I just noticed:

print (string) foo, "^";

This generates @print_paddr "^" for the newline, even though @new_line is available. In Glulx, it uses @streamstr instead of @streamchar.

curiousdannii commented 3 years ago

while (true) loops (possibly all loops) with an unconditional return at their end still generate a final jump instruction.

curiousdannii commented 3 years ago

I7 makes a whole lot of dead code. Here's another example:

[ R_1078 ;
    if (debug_rules) DB_Rule(R_1078, 1078);
    ! [2: fill status bar with table of fancy status]
    (PHR_1076_r126 (T6_fancy_status));
    ! [3: rule succeeds]
    RulebookSucceeds(); rtrue;
    RulebookSucceeds(); rtrue;
    rfalse;
];

erkyrath commented 3 years ago

What I7 code generated that?

curiousdannii commented 3 years ago

From Counterfeit Monkey:

Rule for constructing the status line:
    fill status bar with Table of Fancy Status;
    rule succeeds

I suspect it happens any time the rulebook has a default outcome of success and the rule manually succeeds as well.

erkyrath commented 3 years ago

Probably, yeah. Thanks.

fredrikr commented 2 years ago

This code:

                        if(_w == 'n//') jump match_n;
                        if(_w == 's//') jump match_s;
                        if(_w == 'w//') jump match_w;
                        if(_w == 'e//') jump match_e;

gets compiled to this:

 1576:  c1 8f 03 11 e4 45       JE              L02,"n" [FALSE] 157f
 157c:  8c 01 97                JUMP            1714
 157f:  c1 8f 03 12 f5 45       JE              L02,"s" [FALSE] 1588
 1585:  8c 01 94                JUMP            171a
 1588:  c1 8f 03 14 8b 45       JE              L02,"w" [FALSE] 1591
 158e:  8c 01 97                JUMP            1726
 1591:  c1 8f 03 10 a2 45       JE              L02,"e" [FALSE] 159a
 1597:  8c 01 88                JUMP            1720

When an if-clause only holds a jump, it would be smart to perform that jump immediatiely in the compare+jump instruction.

erkyrath commented 2 years ago

I took a stab at the execution_never_reaches_here optimization, but it's trickier than it looks.

Say you have lines like

if (true) return;
if (expr) {...} else {...};

You'd expect the entire second line to be optimized out. But the else is a label, and labels are assumed to be reachable. So code generation starts up again.

We're going to need a way to "this statement is unreachable to begin with, so all labels within it are phantoms." But what if there's an explicit label inside the else clause? Then we really do want code generation to restart, and maybe some of those phantom labels are needed after all.

For a concrete example: there are two places in I6 lib 6/11 where this kind of optimization kicks in.

[ FullScoreSub i;
    ScoreSub();
    if (score == 0 || TASKS_PROVIDED == 1) rfalse;
    !...
];

    if (k >= ValueOrRun(player, capacity)) {
        if (SACK_OBJECT ~= 0) {                                                 
            if (parent(SACK_OBJECT) ~= player)                                  
                return L__M(##Take, 12);                                        
            !...

TASKS_PROVIDED and SACK_OBJECT may be known at compile time, so we want to skip the entire following block. But with the change I've got, only a few opcodes get skipped before we hit a label.

erkyrath commented 2 years ago

Okay, I have working implementation. This turned out to be a difficult problem!

Discussion: https://intfiction.org/t/i6-compiler-code-generation-improvements/54739 Branch: https://github.com/erkyrath/Inform6/tree/strip-dead-branch

I'd like more testing from more people before I open a PR.

This change does the following:

Discards unreachable opcodes and strings
Minimizes generated code for dead branches (if (0), if (1))
Detects if and switch statements where every branch returns
Detects loops that never exit (or always return)
Discards @jz constant opcodes where the constant is nonzero
Converts @jz 0 to an unconditional @jump
Discards a @jump to the very next instruction

It does not try to:

Minimize if (test) jump...
Minimize if (test()) {}
Make use of @inc_chk, @dec_chk

erkyrath commented 2 years ago

You can see a diff of the inform -a output when compiling Advent.inf here: https://gist.github.com/erkyrath/a2ab185e4f90096374cf5396332cf0fe

The majority of the changes are dropping unused labels; this makes no difference to the generated code.

In a few places (see lines 128, 223, 572, 623, 634, 812, 833, 874) we drop an unused label after a jump/return and a following instruction, because that instruction can never be reached.

On lines 520 and 532, we drop a @jz 1 branch, since that is impossible.

The two big chunks of code which get eliminated are associated with TASKS_PROVIDED and SACK_OBJECT, as noted above.

erkyrath commented 2 years ago

PR: https://github.com/DavidKinder/Inform6/pull/164 . This is largely the same as what I had last month, but I've improved the "statement never reached" warning logic.

DavidKinder commented 2 years ago

@erkyrath Can this issue be closed, or is there still more to be done?

erkyrath commented 2 years ago

This issue turned into a grab-bag. Ideas still open:

print "^"; generates @print_paddr "^" for the newline, even though @new_line is available. In Glulx, it uses @streamstr instead of @streamchar.
if (cond) {}; A branch gets emitted when there's nothing inside. The condition expression could have side effects, so it still needs to be emitted, but the branch doesn't.
'if (cond) jump label;` generates two branches when it could have only one.

Up to you whether those should be split out into separate issues. The first one is easy. The other two are hard and may never get done.

DavidKinder commented 2 years ago

I don't think there's much to be gained in splitting the issues out, happy to leave this open while there are still further optimizations possible.

heasm66 commented 1 year ago

Adding this ...

Looking at output generated with -a there's a lot of lines like this (not always get_prop or get_prop_addr, but for other opcodes too):

 3800  +03265 <*> get_prop     thing short_50 (each_turn) -> TEMP1 
 3800  +03269     push         TEMP1

and

 3479  +02ec9 <*> get_prop_addr o1 short_1 -> TEMP1 
 3479  +02ecd     store        p1 TEMP1

Couldn't this be simplified to (a tleast for zcode):

 3800  +03265 <*> get_prop     thing short_50 (each_turn) -> sp 

and

 3479  +02ec9 <*> get_prop_addr o1 short_1 -> p1

respectivly, and save 4 bytes per occurance?

erkyrath commented 1 year ago

I think this would boil down to adding a stack-pointer case to all the places in expressc.c that look like

# assemble_code_writing_to(..., temp_var1);
if (!void_flag) write_result_z(Result, temp_var1);

(Pretty sure none of these cases go on to access temp_var1 after that. It's always used as a strict temporary.)

erkyrath commented 1 year ago

Z-code could optimize x=x+1 to x++ (an @inc opcode).

Expressions like (x+0), (x-0), (x*1), (x/1) could be simplified at constant-folding time.

(They wouldn't be common in hand-written code, but I7-generated code does all sorts of awkward stuff. Or you might write (x*DICT_CHAR_SIZE), where DICT_CHAR_SIZE is a constant that could be 1, depending on compiler settings.)

erkyrath commented 1 year ago

print (char) ch generates roughly this in Glulx non-strict mode:

           if (c>=0 && c<256)
             @streamchar c;
           else
             @streamunichar c;

I was trying to protect very old interpreters that lacked the @streamunichar opcode, but I think it's safe to say we're past that now.

We can just generate @streamunichar ch;, unless ch is a compile-time constant known to be in the 0-255 range, in which case @streamchar ch; is better.

I do not plan to change the strict-mode RT__ChPrintC() routine, which does extra error checking.

heasm66 commented 9 months ago

Probably not trivial, but...

Code like below (in the CheapScenery object in PunyInform)

default:
    #ifdef SceneryReply;
    if(SceneryReply ofclass string)
        print_ret (string) SceneryReply;
    i = location.&cheap_scenery;
    w1 = self.number;
    if(SceneryReply(i-->w1, i-->(w1 + 1)))
        rtrue;
    #endif;
    "No need to concern yourself with that.";

The SceneryReply can both be a string or a routine, which makes the compiler to compile this unreachable code print_ret (string) SceneryReply; in the case it is a routine. When TXD tries to decompile this it throw a message to stderr, Warning: printing of nonexistent string. The decompiled snippet looks like below:

Routine 43b4, 2 locals

 43b5:  41 f9 01 43             JE              Ge9,#01 [FALSE] 43ba
 43b9:  b1                      RFALSE
 43ba:  e0 07 42 6c 10 bf 04 00 CALL_VS         109b0 (#10bf,#04) -> -(SP)
 43c2:  a0 00 c7                JZ              (SP)+ [TRUE] 43ca
 43c5:  8d 10 bf                PRINT_PADDR     42fc
 43c8:  bb                      NEW_LINE
 43c9:  b0                      RTRUE
 43ca:  e0 23 41 f1 10 00 4a 01 CALL_VS         107c4 (G00,#004a) -> L00
 43d2:  51 fb 08 ff             GET_PROP        Geb,#08 -> Gef
 43d6:  2d 02 ff                STORE           L01,Gef
 43d9:  54 02 01 00             ADD             L01,#01 -> -(SP)
 43dd:  6f 01 00 00             LOADW           L00,(SP)+ -> -(SP)
 43e1:  6f 01 02 00             LOADW           L00,L01 -> -(SP)
 43e5:  e0 2b 10 bf 00 00 00    CALL_VS         42fc ((SP)+,(SP)+) -> -(SP)
 43ec:  a0 00 41                JZ              (SP)+ [FALSE] RTRUE
 43ef:  b3 ...                  PRINT_RET       "No need to concern yourself
with that."

The compiler knows at compile time if SceneryReply is a routine or a string and could exclude the unreachable part.

erkyrath commented 9 months ago

That boils down to "constant-fold the ofclass operator", which seems like a reasonable idea.

heasm66 commented 2 months ago

Revisting this issue I have created a branch (https://github.com/heasm66/Inform6/tree/code_generating_optimizations) that changes the code generated for get_prop and get_prop_addr so that it writes the result directly to the stack_pointer instead of using TEMP1 for intermediate storage.

Change patterns like:

 3800  +03265 <*> get_prop     thing short_50 (each_turn) -> TEMP1 
 3800  +03269     push         TEMP1

-->

 3800  +03265 <*> get_prop     thing short_50 (each_turn) -> sp

and

 3479  +02ec9 <*> get_prop_addr o1 short_1 -> TEMP1 
 3479  +02ecd     store        p1 TEMP1

-->

 3479  +02ec9 <*> get_prop_addr o1 short_1 -> p1

This was pretty common patterns and quite easy to identify. It saves ~250 bytes on Advent and ~400 bytes on Curses.

Example code in expressc.c

        case PROP_ADD_OP:
             {   assembly_operand AO = ET[below].value;
                 check_warn_symbol_type(&ET[below].value, OBJECT_T, CLASS_T, "\".&\" expression");
                 check_warn_symbol_type(&ET[ET[below].right].value, PROPERTY_T, INDIVIDUAL_PROPERTY_T, "\".&\" expression");
                 if (runtime_error_checking_switch && (!veneer_mode))
                     AO = check_nonzero_at_runtime(AO, -1, PROP_ADD_RTE);
                 //assemblez_2_to(get_prop_addr_zc, AO,
                 //    ET[ET[below].right].value, temp_var1);
                 //if (!void_flag) write_result_z(Result, temp_var1);
                 if ((!void_flag) && (Result.value == 0)) {  // can be optimized and written directly to stack_pointer
                     assemblez_2_to(get_prop_addr_zc, AO,
                         ET[ET[below].right].value, Result);
                 }
                 else {
                     assemblez_2_to(get_prop_addr_zc, AO,
                         ET[ET[below].right].value, temp_var1);
                     if (!void_flag) write_result_z(Result, temp_var1);
                 }

             }
             break;
.
.
.
        case PROPERTY_OP:
             {
                 check_warn_symbol_type(&ET[below].value, OBJECT_T, CLASS_T, "\".\" expression");
                 check_warn_symbol_type(&ET[ET[below].right].value, PROPERTY_T, INDIVIDUAL_PROPERTY_T, "\".\" expression");
                 //if (runtime_error_checking_switch && (!veneer_mode))
                 //      assemblez_3_to(call_vs_zc, veneer_routine(RT__ChPR_VR),
                 //        ET[below].value, ET[ET[below].right].value, temp_var1);
                 //else
                 //assemblez_2_to(get_prop_zc, ET[below].value,
                 //    ET[ET[below].right].value, temp_var1);
                 //if (!void_flag) write_result_z(Result, temp_var1);
                 if (runtime_error_checking_switch && (!veneer_mode)) {
                     assemblez_3_to(call_vs_zc, veneer_routine(RT__ChPR_VR),
                         ET[below].value, ET[ET[below].right].value, temp_var1);
                     if (!void_flag) write_result_z(Result, temp_var1);
                 }
                 else {
                     if ((!void_flag) && (Result.value == 0)) {  // can be optimized and written directly to stack_pointer
                         assemblez_2_to(get_prop_zc, ET[below].value,
                             ET[ET[below].right].value, Result);
                     }
                     else {
                         assemblez_2_to(get_prop_zc, ET[below].value,
                             ET[ET[below].right].value, temp_var1);
                         if (!void_flag) write_result_z(Result, temp_var1);
                     }
                 }
             }
             break;

erkyrath commented 2 months ago

Thanks for looking at this.

As usual, any change that affects code generation (in a non-opt-in way) is a very large testing burden. (We have to test game file behavior rather than just checking that the game file is identical.) So I'd rather accumulate a bunch of these optimizations and put them in a future I6 release so we can test them all at once.

Before we get there, I'd also like to add some game-file-behavior tests in the inform6-testing suite! It currently only has "game file is identical" tests. So that needs some attention. It may be a few weeks before I have a chance to look at it.

heasm66 commented 2 months ago

I understand what you are saying...

How about introducing an option ($OPTIMIZE_CODE or something) that would allow optimizations to be introduced on an "use at your own risk" basis and improvements can be introduced in smaller chunks and over a longer period?

Doing it this way would lessen the burden to testing that no old gamecode will get broken. Newly constructed games that is in development will probable be more extensively tested and developer would be aware that it is a more of an "experimental" option.

We still need some testing, of course. I ran through Curses with a walkthrough (without issues) but I don't know how to automate this process.

erkyrath commented 2 months ago

We could do that, but it doesn't decrease the testing burden because then we have to test with the option both on and off. :)

We could also maintain an "optimization" branch in the repo for people who want to live on the bleeding edge.

heasm66 commented 2 months ago

In the same branch as above I've added suggestion-code for these simplifications:

x=x+1 ==> x++
x=1+x ==> x++
x=x+0 ==> skip
x=0+x ==> skip
x=x-1 ==> x--
x=x-0 ==> skip
x=x*1 ==> skip
x=1*x ==> skip
x=x/1 ==> skip

I've also tested a playthrough of Curses without issues.

Current code optimization in branch:

get_prop, get_prop_addr directly to sp without using intermediate temp-var
   Curses    -408 bytes
   Advent    -244 bytes

Simplify x=x+1 to x++ and other
   Curses     -52 bytes
   Advent     -40 bytes

Maintaing an "optimization" branch that the "in" people can run through extensive testing could work.

heasm66 commented 2 months ago

Adding this, may not be trivial to change.

switch(expr) {...}. If expr:

is a variable. The variable is stored in TEMP1
needs to be evaluated the result is stored on stack then immediately pulled back into TEMP1 before statements are executed.

Example 1:

[ ClearScreen window;
    switch (window) {
      WIN_ALL:    @erase_window -1;
      WIN_STATUS: @erase_window 1;
      WIN_MAIN:   @erase_window 0;
    }
];

Compiles to:

 5495  +049bc  [ ClearScreen window 

 5496  +049bd <*> store        TEMP1 window 
 5497  +049c0     je           TEMP1 short_0 (WIN_ALL) to L1 if FALSE
 5497  +049c5 <*> erase_window long_65535 
 5498  +049c9     jump         L0 
 5498  +049cc    .L1
 5498  +049cc     je           TEMP1 short_1 (WIN_STATUS) to L2 if FALSE
 5498  +049d1 <*> erase_window short_1 
 5499  +049d4     jump         L0 
 5499  +049d7    .L2
 5499  +049d7     je           TEMP1 short_2 (WIN_MAIN) to L3 if FALSE
 5499  +049dc <*> erase_window short_0 
 5500  +049df    .L3
 5500  +049df    .L0
 5501  +049df <*> rtrue

Example 2:

[ PrintOrRunVar var flag;
    switch (metaclass(var)) {
      Object:
        print (name) var;
      String:
        print (string) var;
        if (flag == 0) new_line;
      Routine:
        return var();
      default:
        print (char) '(', var, (char) ')';
    }
    rtrue;
];

Compiles to:

 5032  +045f4  [ PrintOrRunVar var flag 

 5033  +045f5 <*> call_vs      long_25 (veneer routine: Meta__class) var -> sp 
 5033  +045fb     pull         TEMP1 
 5034  +045fe     je           TEMP1 short_2 (Object) to L1 if FALSE
 5035  +04603 <*> call_2n      long_6 (veneer routine: PrintShortName) var 
 5036  +04608     jump         L0 
 5036  +0460b    .L1
 5036  +0460b     je           TEMP1 short_4 (String) to L2 if FALSE
 5037  +04610 <*> print_paddr  var 
 5038  +04612 <*> jz           flag to L3 if FALSE
 5038  +04616 <*> new_line     
 5038  +04617    .L3
 5039  +04617     jump         L0 
 5039  +0461a    .L2
 5039  +0461a     je           TEMP1 short_3 (Routine) to L4 if FALSE
 5040  +0461f <*> call_1s      var -> sp 
 5040  +04622     ret_popped   
 5041  +04623    .L4
 5042  +04623 <*> print_char   short_40 
 5042  +04626     print_num    var 
 5042  +04629     print_char   short_41 
 5043  +0462c    .L0
 5044  +0462c <*> rtrue

It seems that either the expr could be stored directly in TEMP1 or the checks for jumps could be done against the original variable.

The generating code is in states.c:

        case SWITCH_CODE:
                 match_open_bracket();
                 AO = code_generate(parse_expression(QUANTITY_CONTEXT),
                     QUANTITY_CONTEXT, -1);
                 match_close_bracket();

                 INITAOTV(&AO2, VARIABLE_OT, 255);
                 assemblez_store(AO2, AO);

                 parse_code_block(ln = next_label++, continue_label, 1);
                 assemble_forward_label_no(ln);
                 return;

Either code_generate could take a parameter where to store the result or parse_code_block could take a parameter on what to switch.

The special switch-statements that switches on verbs don't use the TEMP1 variable:


[ LanguageLM n x1;
  Answer,Ask:
            "There is no reply.";
  Attack:   "Violence isn't the answer to this one.";
  Blow:     "You can't usefully blow ", (thatorthose) x1, ".";
  Burn:     "This dangerous act would achieve little.";
  Buy:      "Nothing is on sale.";
...

Compiles to:

  414  +0044c  [ LanguageLM n x1 

  415  +0044d     je           sw__var short_1 (action) short_2 (action) to L0 if FALSE
  416  +00454 <*> print_ret    "There is no reply."
  418  +0045f    .L0
  418  +0045f     je           sw__var short_3 (action) to L1 if FALSE
  418  +00464 <*> print_ret    "Violence isn't the answer to this one"
  419  +0047b    .L1
  419  +0047b     je           sw__var short_4 (action) to L2 if FALSE
  419  +00480 <*> print        "You can't usefully blow "
  419  +0048f     call_2n      long_221 (routine: ThatorThose) x1 
  419  +00494     print_ret    "."
  420  +00497    .L2
  420  +00497     je           sw__var short_5 (action) to L3 if FALSE
  420  +0049c <*> print_ret    "This dangerous act would achieve little."
  421  +004b5    .L3
  421  +004b5     je           sw__var short_6 (action) to L4 if FALSE
  421  +004ba <*> print_ret    "Nothing is on sale."
  422  +004c5    .L4
...

erkyrath commented 1 month ago

I've made progress on your suggestions.

The inform6-testing repo now has support for running regtest scripts on the compiled game files. I'm using a complete Advent.inf run as the big test, but I also check Library of Horror, and all the unit tests that print "All passed", and a few other cases. See directory: https://github.com/erkyrath/Inform6-Testing/tree/master/reg

I have a branch with your optimizations (edited for code style): https://github.com/erkyrath/Inform6/tree/optimization

This passes all the regtest scripts. I don't want to push it in just yet though. There are more cases which could benefit from similar code. For example, @get_prop storing directly to a variable.

I might also want to port some of these fixes to the Glulx side.

Feel free to give the optimization branch a ride around the paddock in your Z-code work and see how it works out.

heasm66 commented 1 month ago

Sorry for the delay... (Feels like I mostly struggled with cherry-picking your commits into my own fork, guess I'm not fluent yet in git/github parlor.)

I have tested the optimization branch with a couple of files and found no issues.

These patterns are fairly common (I guess these are the ones you mean):

    1  +2169c     get_prop     o1 short_3 -> TEMP1 
    1  +216a0     store        m TEMP1 
...
 3516  +033f1 <*> get_prop_addr o1 short_1 -> TEMP1 
 3516  +033f5     store        p1 TEMP1

There're also a couple of places where a result is stored on the stack then immediately pulled and tested on. These are maybe harder to identify when they probably a spread out over multiple calls to generate_code_from.

 3780  +036f4 <*> get_parent   actor -> sp 
 3780  +036f7     pull         TEMP1 
...
 6140  +04aa9 <*> call_vs      long_25 (veneer routine: Meta__class) a -> sp 
 6140  +04aaf     pull         TEMP1

erkyrath commented 1 month ago

These patterns are fairly common (I guess these are the ones you mean)

Yeah, those. This change handles those cases neatly.

I couldn't find any other cases subject to this sort of pinhole optimization.

erkyrath commented 1 month ago

Whoops, found a bunch more! y = x+0; and y = x*1; and similar expressions can be compiled down into @store opcodes.

And yes, such lines turn up in I7-generated code all the time. Even the I6 parser has the line

artval=(o.&articles)-->(acode+short_name_case*LanguageCases);

...and LanguageCases is 1 in English.

heasm66 commented 1 month ago

Took the latest changes out for a testrun. It saved an additonal 132 bytes on Curses (down to 241.720 bytes for those that's counting...) and run through the walkthroughs without issues.

Nice!

heasm66 commented 4 weeks ago

A routine like this:

[ LanguageDirection d;
    switch (d) {
      n_to:    print "north";
      s_to:    print "south";
      e_to:    print "east";
      w_to:    print "west";
      ne_to:   print "northeast";
      nw_to:   print "northwest";
      se_to:   print "southeast";
      sw_to:   print "southwest";
      u_to:    print "up";
      d_to:    print "down";
      in_to:   print "in";
      out_to:  print "out";
      default: return RunTimeError(9,d);
    }
];

Compiles to:

  201  +00068  [ LanguageDirection d 
  202  +00069 <*> store        TEMP1 d 
  203  +0006c     je           TEMP1 short_13 (n_to) to L1 if FALSE
  203  +00071 <*> print        ""
  204  +00074     jump         L0 
  204  +00077    .L1
  204  +00077     je           TEMP1 short_14 (s_to) to L2 if FALSE
  204  +0007c <*> print        ""
  205  +0007f     jump         L0 
  205  +00082    .L2
  205  +00082     je           TEMP1 short_15 (e_to) to L3 if FALSE
  205  +00087 <*> print        "e"
  206  +0008a     jump         L0 
  206  +0008d    .L3
  206  +0008d     je           TEMP1 short_16 (w_to) to L4 if FALSE
  206  +00092 <*> print        "w"
  207  +00095     jump         L0 
  207  +00098    .L4
  207  +00098     je           TEMP1 short_17 (ne_to) to L5 if FALSE
  207  +0009d <*> print        "e"
  208  +000a2     jump         L0 
  208  +000a5    .L5
  208  +000a5     je           TEMP1 short_18 (nw_to) to L6 if FALSE
  208  +000aa <*> print        "w"
  209  +000af     jump         L0 
  209  +000b2    .L6
  209  +000b2     je           TEMP1 short_19 (se_to) to L7 if FALSE
  209  +000b7 <*> print        "e"
  210  +000bc     jump         L0 
  210  +000bf    .L7
  210  +000bf     je           TEMP1 short_20 (sw_to) to L8 if FALSE
  210  +000c4 <*> print        "w"
  211  +000c9     jump         L0 
  211  +000cc    .L8
  211  +000cc     je           TEMP1 short_21 (u_to) to L9 if FALSE
  211  +000d1 <*> print        "up"
  212  +000d4     jump         L0 
  212  +000d7    .L9
  212  +000d7     je           TEMP1 short_22 (d_to) to L10 if FALSE
  212  +000dc <*> print        "down"
  213  +000e1     jump         L0 
  213  +000e4    .L10
  213  +000e4     je           TEMP1 short_23 (in_to) to L11 if FALSE
  213  +000e9 <*> print        "in"
  214  +000ec     jump         L0 
  214  +000ef    .L11
  214  +000ef     je           TEMP1 short_24 (out_to) to L12 if FALSE
  214  +000f4 <*> print        "out"
  215  +000f7     jump         L0 
  215  +000fa    .L12
  215  +000fa <*> call_vs      long_714 (ref to symbol value: RunTimeError) short_9 d -> sp 
  215  +00101     ret_popped   
  216  +00102    .L0
  217  +00102 <*> rtrue

The instruction label L0 is only a @rtrue so it would be more effective to return immediately without the @jump first. The logic is the same if the instruction at the label is @rfalse or @ret_popped too.

I have a first code suggestion that does this in the branch https://github.com/heasm66/Inform6/tree/optimization_replacing_jump (currently one commit ahead of https://github.com/erkyrath/Inform6/tree/optimization) that maybe need a bit of refactoring. It saved about 300 bytes on my test project.

erkyrath commented 4 weeks ago

Nifty. I'll take a look.

erkyrath commented 4 weeks ago

Imported and tidied up: https://github.com/erkyrath/Inform6/compare/8916ebb..ae74890

I note that this doesn't eliminate the destination @rtrue, even if every jump to it gets replaced. Figuring out if we should do that is a bunch of extra work, which I think is not worthwhile.

curiousdannii commented 4 weeks ago

Well if a more thorough DCE ever gets implemented (as requested above https://github.com/DavidKinder/Inform6/issues/87#issuecomment-862348221) it would presumably handle that too.

heasm66 commented 4 weeks ago

Yeah, I thought about the possibility that the destination label most likely gets orphaned but couldn't find any easy solution. As I understand it the label currently have a flag if it's used or not (true/false). Maybe this could be a counter instead that keeps track on how many times it's used. If you remove a reference to a label you decrease the counter and when it reaches 0 it's unused.

I'm gonna make a stab at a jump immediately following a branch (above) because the fix probably also will be in the transfer_code function. I thought this was a pretty uncommon code pattern but my test project contained 66 occurrences, mostly in veneer and library code.

heasm66 commented 4 weeks ago

A suggestion for optimization when a jump immediately follows a branch is in this commit: https://github.com/heasm66/Inform6/commit/893c5314f298a2e3f1d57fa1ace7e9a3925a9204

Some current benchmarks (including this change):

Advent (Standard Library):
   Inform 6.42 GV2: 118.180 bytes
   Inform 6.43 GV2: 117.400 bytes (-780 bytes in z-code optimization)

Curses (Standard library):
   Inform 6.42 GV2: 242.964 bytes
   Inform 6.43 GV2: 241.924 bytes (-1.040 bytes in z-code optimization)

Dorm Game (PunyInform Library):
   Inform 6.42 GV2: 117.078 bytes
   Inform 6.43 GV2: 116.712 bytes (-366 bytes in z-code optimization)

Hibernated 1 PunyInform Library):
   Inform 6.42 GV2:  87.940 bytes
   Inform 6.43 GV2:  87.684 bytes (-256 bytes in z-code optimization)

erkyrath commented 4 weeks ago

Well if a more thorough DCE ever gets implemented

There is solid dead-code elimination now. But it happens during parse_routine(), whereas this change is in assemble_routine_end(), which is a later phase!

Maybe this whole question needs to be addressed earlier, at parse_routine() time. But that's tricky because when you're originally generating the (forward) jump, you don't know the destination opcode yet.

heasm66 commented 3 weeks ago

I've sifted through the thread and see these remaing issues:

Make use of @inc_chk and @dec_chk in for-loops (mentioned here and here).
Replace @print_paddr "^" with @newline (mentioned here and here).
Minimize if (test()) {} (mentioned here and here).
Constant-fold the ofclass operator (mentioned here and here).
Orphaned labels when optimizing branching (mentioned here).
Avoid TEMP1 with switch(expr) {...} (mentioned here).

Any other?

erkyrath commented 3 weeks ago

We already catch print "^"; although the mechanism could be more general. (See TODO comment on compile_string().)

DavidKinder / Inform6

Code generation optimizations #87