KSP-KOS / KOS

Fully programmable autopilot mod for KSP. Originally By Nivekk
Other
697 stars 230 forks source link

.ksm file optimization (experiments and results) #1732

Open chippydip opened 8 years ago

chippydip commented 8 years ago

Sorry for the long post, I wanted to brain dump this stuff in case anyone else is interested.

1712 got me thinking about .ksm files in general and why they are so gosh-darn big. In my experience they are always larger than a minified text version, and often bigger than the text version with whitespace and comments. This has always seemed ridiculous to me so I decided to start investigating a couple days ago and I now have numbers and suggestions!

All my testing so far as been done with boot.ks which I've heavily hand-optimized in the past. This may not be totally representative of scripts in general, but being able to compete with a hand-optimized and minified version seems like as good a place as any to start:

I started by writing a disassembler for ksm files so I could see what was going on under the hood. The output below is from the summary section of this tool (I also dump the full symbol table and instruction listings).

Starting Point

File Size: 2593 bytes

Arguments: 147 in 922 bytes (6.3 avg)
  Strings: 818 bytes
  Label: 26 in 182 bytes
  Magic: Args 33 Ops 33

Code: 365 ops w/ 324 args in 1013 bytes
  PUSH                      x150       450
  CALL                      x31        155
  PUSHSCOPE                 x14         70
  LABELRESET                x19         57
  RETURN                    x17         51
  PUSHDELEGATERELOCATELATER x8          40
  POPSCOPE                  x13         39
  BRANCHFALSE               x11         33

Debug: 600 bytes

First Pass

The debug section takes a ton of space. It's like bundling a .pdb and functionally isn't required (this is no different than minimizing a text file to be all on one line). I also noticed that there were 33 separate copies of the "magic" stack marking value in the arguments table. Apparently KOSArgMarkerType doesn't have an Equals/GetHashCode override so a new value is added for pretty much every function call in the code.

File Size: 1971 bytes

Arguments: 115 in 890 bytes (7.7 avg)
  Strings: 818 bytes
  Label: 26 in 182 bytes
  Magic: Args 1 Ops 33

Code: 365 ops w/ 324 args in 1013 bytes
  PUSH                      x150       450
  CALL                      x31        155
  PUSHSCOPE                 x14         70
  LABELRESET                x19         57
  RETURN                    x17         51
  PUSHDELEGATERELOCATELATER x8          40
  POPSCOPE                  x13         39
  BRANCHFALSE               x11         33

Debug: 10 bytes

Optimizer

Looking at the contents of the arguments section, ~34 of the values were compiler-generated labels (mostly due to LABELRESET instructions where function entry points had been labeled with something other than the normal @NNNN labels). I went through and changed all the labels back to default in-order values which got rid of all but the first LABELRESET instruction. I also updated all jump instructions to use relative offsets rather than labels.

File Size: 1700 bytes

Arguments: 98 in 673 bytes (6.8 avg)
  Strings: 551 bytes
  Label: 8 in 56 bytes
  Magic: Args 1 Ops 33

Code: 347 ops w/ 306 args in 959 bytes
  PUSH                      x150       450
  CALL                      x31        155
  PUSHSCOPE                 x14         70
  RETURN                    x17         51
  PUSHDELEGATERELOCATELATER x8          40
  POPSCOPE                  x13         39
  BRANCHFALSE               x11         33
  POP                       x20         20

Debug: 10 bytes

This was about the best I could do with an external tool. I'm toying with the idea of extending this to be an optimizing compiler with at least a list of peephole optimizations which should shave off a bit more space (and a few IPU when running the script). I hand-optimized the assembly for one of the functions and was able get a ~20% reduction in both instructions and bytes, but most of those gains were from eliminating the scope push/pop/local and managing the single parameter just on the stack with DUP and SWAP as needed. I suspect that's probably very near the upper bound of what's even theoretically possible with an advanced optimizer.

Serialization changes

Much easier than instruction gymnastics was taking a look at tweaks to the kOS code. The biggest issue at this point was that I only had 98 values in the arguments section (down from 147 at the start), but references to these arg were all 2 bytes long since the section itself was 673 bytes long. Switching to ordinal indexes rather than byte-offsets was a simple change that saves a bunch of space. (I also fixed the stack marker properly).

File Size: 1394 bytes

Arguments: 98 in 673 bytes (6.8 avg)
  Strings: 551 bytes
  Label: 8 in 56 bytes
  Magic: Args 1 Ops 33

Code: 347 ops w/ 306 args in 653 bytes
  PUSH                      x150       300
  CALL                      x31         93
  PUSHSCOPE                 x14         42
  RETURN                    x17         34
  POPSCOPE                  x13         26
  PUSHDELEGATERELOCATELATER x8          24
  BRANCHFALSE               x11         22
  POP                       x20         20

Debug: 10 bytes

New Opcodes

At this point most of the low-hanging fruit had been addressed so gains slowed down. Still, there are plenty of common patterns that can benefit from smarter encoding in the instruction stream. One of the most obvious at this point was duplicate string values for essentially the same variable (x vs $x vs $x*). I addedPushVarandPushFunc` opcodes to generate the proper version of a given identifier without having to pollute the arguments table. Adding a PushStackMark variant also seemed reasonable as that saves a byte per function call. Speaking of function calls, as far as I can tell the first MLField of the Call instruction isn't actually used anymore. Furthermore, the "" calls don't really need a dummy marker value if they had their own dedicated instruction, so I added CallIndirect and CallDirect. I also stripped all decorators off the identifiers for call operations since they are re-added if missing by the Execute method anyway.

File Size: 1242 bytes

Arguments: 88 in 593 bytes (6.7 avg)
  Strings: 472 bytes
  Label: 8 in 56 bytes
  Magic: Args 0 Ops 0

Code: 346 ops w/ 235 args in 581 bytes
  PUSH                      x71        142
  PushVar                   x38         76
  CallDirect                x25         50
  PUSHSCOPE                 x14         42
  RETURN                    x17         34
  PushStackMark             x33         33
  POPSCOPE                  x13         26
  PUSHDELEGATERELOCATELATER x8          24
  BRANCHFALSE               x11         22
  POP                       x20         20
  PushFunc                  x8          16
  ADD                       x13         13

Debug: 10 bytes

Immediate Values

The next big thing cluttering the arguments table was constant values for various instructions. Most of these values are only used once and some are duplicated in the table since scope ids are shorts while branch offsets and return depth are integers which prevents them from de-duping. Branch offsets are typically small signed values so I added BranchIfFalseImm, BranchIfTrueImm, and JumpImm to use a signed-byte after the opcode to store the destination offset when possible. Scope IDs and return depth are usually small non-negative numbers so I did something similar with PushScopeImm, PopScopeImm, and ReturnImm using an unsigned byte in these cases.

File Size: 1142 bytes

Arguments: 62 in 493 bytes (7.9 avg)
  Strings: 472 bytes
  Label: 8 in 56 bytes
  Magic: Args 0 Ops 0

Code: 346 ops w/ 158 args in 581 bytes
  PUSH                      x71        142
  PushVar                   x38         76
  CallDirect                x25         50
  PushScopeImm              x14         42
  ReturnImm                 x17         34
  PushStackMark             x33         33
  PopScopeImm               x13         26
  PUSHDELEGATERELOCATELATER x8          24
  BranchIfFalseImm          x11         22
  POP                       x20         20
  PushFunc                  x8          16
  ADD                       x13         13

Debug: 10 bytes

Immediate Labels

At this point there were still a few labels left in PUSHDELEGATERELOCATELATER instructions. Really these aren't much different from relative jump offsets, but since they cross code block CodePart boundaries I decided to use an absolute instruction offset in the ksm file for the immediate version (using a UInt16). I also baked the WithClosure flag into the instructions themselves. Finally, I changed the starting "previousLabel" value to "@0000" to make the first "LABELRESET" opcode unnecessary in most cases.

File Size: 1086 bytes

Arguments: 54 in 437 bytes (8.0 avg)
  Strings: 416 bytes
  Label: 0 in 0 bytes
  Magic: Args 0 Ops 0

Code: 346 ops w/ 142 args in 581 bytes
  PUSH                      x71        142
  PushVar                   x38         76
  CallDirect                x25         50
  PushScopeImm              x14         42
  ReturnImm                 x17         34
  PushArgBottom             x33         33
  PopScopeImm               x13         26
  PushRelocateDelCaptureImm x8          24
  BranchIfFalseImm          x11         22
  POP                       x20         20
  PushFunc                  x8          16
  ADD                       x13         13
  STORELOCAL                x11         11

Debug: 10 bytes

(These were the only instructions that I didn't fully implement since they require translation. The rest have proper Execute methods and should in theory just work, though I was focused on what's possible in a .ksm file and didn't try to run any of the modified versions)

Ideas

chippydip commented 8 years ago

PR submitted for the simple changes: https://github.com/KSP-KOS/KOS/pull/1733

Experimental opcodes for the curious: https://github.com/chippydip/KOS/commit/7310edfe1dc2b3145968aaad994ea655cd712898

Dunbaratu commented 8 years ago

The debug section takes a ton of space. It's like bundling a .pdb and functionally isn't required (this is no different than minimizing a text file to be all on one line).

I consider it required. I don't want to have to field the complaints from users who find that runtime errors no longer tell them where in the source script the error happened. This is also why when I run a minimizer, I don't eliminate the line endings. I still want to preserve line count for the sake of error messages, and that information is totally worth an extra \n per line.

If you remove the debug info, I'd insist that it is only removed by use of a command option, rather than being removed unilaterally for everybody always. I could possibly still generate a stack trace with the error, like C# does, but it wouldn't narrow you down to the exact line - just the function.

Dunbaratu commented 8 years ago

I'll look at the rest of this in detail after the weekend. My time is sort of committed until then. It looks interesting. And yes, using immediate jumps for the relative branches does shave a lot of arguments out, and the fact that the arg marker didn't compact down to 1 is a big find. I think the last time I looked at it in detail, the arg marker was just a string, not a separate type of its own. That change came later and must have caused the duplication of the arg marker.

Another minimizing option is to create a few "hardcoded" arg pointers for args we know always tend to exist in every program no matter what - like we know every program is going to have a zero, a one, and an argbottom mark operand in it - so why store them at all? Why not just have magic codes like "when you see an address of -1, that means the arg is 0, when you see an address of -2, that means the arg is really the argbottom mark...." etc.

chippydip commented 8 years ago

Take your time, like I said I just started playing with this last week and figured it might be interesting to other folks. (And, man, did this post get a lot longer than I anticipated)

Debug info is definitely worth having access to. Toward the end I suggest storing it as a separate .kdb file or having a config option to turn it off. Another idea would be to continue to include it but not count the debug info against the disk space limit.

Duplicate arg markers was clearly just an oversight. PseudoNull has the Equals/GetHashCode overrides needed to prevent that so I just applied the same fix to the arg marker type. It doesn't save a ton of space relative to some of the other things I looked at, but it's the simplest change of the bunch.

I thought about adding hard coded arguments, but one of the biggest gains in file size comes when you can keep numArgIndexBytes at 1 instead of 2, which means keeping the argument table smaller is better (combined with switching to ordinal instead of byte offsets).

Instead, I went with variations of some existing opcodes that use immediate mode arguments (encoded directly in the byte stream) rather than referencing the arguments table at all. Small values will fit in a single immediate byte anyway, and larger values are unlikely to be used more than once. This does require new opcode variants and a new MLField parameter to flag how the value is to be encoded, but currently only 54 out of a possible 256 opcodes are in use so adding a few more didn't seem like a big deal.

Once I got to the end of the work outlined above and had dropped the arg table size down from 147 to 54 I realized that nearly everything that remained was now a string (and mostly only things that appeared in the program text). So now I'm playing with the idea of using immediate-mode arguments for all non-string values. I'm also going a bit crazy and thinking about what things would look like with a CISC rather than RISC instruction set. Consider the following sample function:

PUSHSCOPE                 1 0
PUSH                      "f"
SWAP
STORELOCAL
ARGBOTTOM
PUSH                      magic
PUSH
PUSH                      "$f"
CALL                      "" "logfile()"
POP
PUSH                      magic
PUSH                      "$f"
PUSH                      null
CALL                      "" "delete_deprecated()"
POP
POPSCOPE                  1
PUSH                      0
RETURN                    0

18 ops, 34 bytes

That function is really only calling two other functions, the rest of the code is just shuffling things around on the stack. Assuming per-instruction overhead dominates execution time (think book-keeping code vs cpu.PushStack(Argument) to execute each of the PUSHs), then an optimal instruction set would probably try to minimize the number of opcodes with a minimal amount of extra complexity. Such an instruction set might look something like this:

PUSHSCOPE                 1 0
STORELOCAL                "f"
ARGBOTTOM
CALLVOID                  "logfile()" "" "$f"
CALLVOID                  "delete_deprecated()" "$f" null
RETURNVOID                1

6 ops, 18 bytes

That's a big reduction in both opcodes and .ksm file size. The main ideas here are to have instructions that cover the common cases more efficiently (CALLVOID pops the return value for you at the end and RETURNVOID pushes a 0 before returning) and to use the stack only when needed. A huge number of opcodes are saved by including arguments as part of the call.

For the actual encoding I'm thinking about include a 0 terminated list of nibbles (half bytes, padded to a byte boundary as needed) to specify the encoding for each argument, followed by the encoded data. It's not shown in this example, but arguments could still be popped (or peeked) from the stack if needed, it just wouldn't be required.

Most of the additional work here would be in (de)serialization and even then it wouldn't be that much more work. OpcodeCall.Execute would likely just push all the arguments before running it's existing code, but it would also be possible to pass them around as a separate list if that made sense.

An optimizing compiler could do even better:

CALLVOID                  logfile() "" [0]
CALLVOID                  delete_deprecated() @ null
ARGBOTTOM
RETURNVOID                0

4 ops, 11 bytes

This is realizing that nothing references this scope as it's parent so the scope can be elided and the local variable (actually parameter) can just be managed directly on the stack. This also demonstrates the ideas of indexed peeking (the "[0]" arg reference) and popping ("@"). This is just under 1/3 of the original code and 2/9ths the number of opcodes.

So, yeah...I'm mostly playing with this for fun and personal use, but if any of this looks like stuff that's you'd like to incorporate in the official release at some future date let me know!

(I also put together a spreadsheet documenting the current instruction set: https://docs.google.com/spreadsheets/d/12Xo3-4Ur-jkc3pqgRTD5jUFXmL3k0mjS97Mp53eNU7g )

dewiniaid commented 8 years ago

I consider [the debugging section] required. I don't want to have to field the complaints from users who find that runtime errors no longer tell them where in the source script the error happened.

Other than some issues like "What happens if you compile a script on a probe with no RT connection?" (unlikely, but possible if the compilation itself is scripted), is there any reason the debugging section can't be a separate file on the archive somewhere that kOS references to get line numbers? (Or potentially even original source code)

hvacengi commented 8 years ago

There is no guarantee that either file on the archive would be in sync with the ksm file you are running. So the attempt to read the source code could be gibberish, pointing to the wrong line or even implying that kOS is executing the code out of order.

chippydip commented 8 years ago

To separate out the debug info you'd need to include a source hash code as part of the .ksm file and then write the debug info to something like 0:/debug/<hash_code>.kdb.

hvacengi commented 8 years ago

I know that this issue has been out there for a while, and that you have a pending pull request implementing a few of these changes. I just wanted to start up the conversation again because I came across an easy improvement while working on #1857. Since kOS ships with a C# implementation of gzip, we can have the ksm file store itself using that compression. In my initial testing, a 14,972B ksm file (11,967B ks source) was reduced to 8,426B by enabling the compression. I think that we will still want to implement some of your changes, specifically the ones in your PR look particularly easy to take advantage of. I'm going to edit the documentation for CompiledObject while I'm making the modification to support gzip. @chippydip if you're still around and interested in working on this optimization, I'd be interested in looking at trying to work on your PR for one of the next couple of releases.

chippydip commented 8 years ago

Compression sounds like a great idea. I actually considered trying something like that in KOS code after the newer file APIs were added but decided to look into opcode optimization instead. Anything that helps reduce compiled size to help incentivize storing binary instead of source versions is a good change in my book.

I'd be happy to answer any questions and/or make fixes to that PR, but I'm pretty busy with work right now so probably won't have time for too much beyond that.