Raku / problem-solving

🦋 Problem Solving, a repo for handling problems that require review, deliberation and possibly debate
Artistic License 2.0

Will we ever need more than 64K unique strings at compile time? #424

Closed lizmat closed 2 months ago

lizmat commented 2 months ago

Currently, MoarVM uses a 32-bit unsigned int as an index into the string heap.

The largest bytecode file that we know of is the bytecode file for the core C setting, which clocks in at 38K+ strings, of which 21K+ are used for cuids (but that's another matter).

It appears unlikely to me that we will ever need to have more than 64K unique interned strings (determined at compile time!) in a compunit.

So what would happen if we reduced the string heap indices to a 16-bit unsigned int?
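For illustration only (this is not MoarVM's actual serialization code), the size difference per index is easy to see with Python's struct module: a 32-bit unsigned index (`<I`) takes 4 bytes, a 16-bit one (`<H`) takes 2, so every stored index shrinks by 2 bytes, at the cost of capping the heap at 65536 entries.

```python
import struct

# Hypothetical sketch: pack the same string-heap index as a
# 32-bit vs a 16-bit unsigned little-endian integer.
idx = 38100  # roughly the string count of the core C setting

as_u32 = struct.pack("<I", idx)  # 4 bytes
as_u16 = struct.pack("<H", idx)  # 2 bytes

print(len(as_u32), len(as_u16))   # 4 2
print(len(as_u32) - len(as_u16))  # 2 bytes saved per stored index

# The trade-off: a 16-bit index cannot address more than 65536 strings.
struct.pack("<H", 65535)          # fits
try:
    struct.pack("<H", 65536)
except struct.error:
    print("index 65536 does not fit in 16 bits")
```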

Con

Max 64K of unique strings in a compilation unit

For reference: a simple program with say "a" ~ "b" ~ "c" ~ "d" ... ~ 'crxq' produces this output on my M1 mini:

$ time raku --stagestats 64Kstrings
Stage start      :   0.000
Stage parse      :   7.691
Stage syntaxcheck:   0.000
Stage ast        :   0.000
Stage optimize   :   2.819
Stage mast       :   0.004
Stage mbc        :   0.001
Stage moar       :   0.000
real    10.58s
user    10.53s
sys 0.13s

Trying to precompile that in a module doesn't work: I killed the process after it had been running for 15 minutes and eating 8GB of RAM. I think that shows another deficiency, but that'll be for another day.

So I'd say it is unlikely we will see that in actual use today. In any case, this does not prevent you from having more than 64K unique strings in a program: the limitation is per compunit!

Pro

Overall bytecode file size reduced by 10+%

Calculated from the core C setting using the tools provided by MoarVM::Bytecode, as shown in the responses below.

Frame sizes reduced

Depending on the number of string indices used in a frame. Remember that each dispatch_x and const_s op has one string index in it. The core C setting has 108447 dispatch_o and 61392 const_s opcodes in it. Given that the core C setting's frames currently take 11983692 bytes, this in itself would mean a saving of 2 * (108447 + 61392) = 339678 bytes, or 339678 / 11983692 ≈ 2.8%.
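As a quick sanity check on the figures above (my arithmetic, using only the counts quoted in this issue): each index shrinks from 4 to 2 bytes, so these two opcodes alone save:

```python
# Counts quoted above for the core C setting.
dispatch_o = 108447
const_s = 61392
total_bytes = 11983692  # the 11983692-byte figure quoted above

# Each of these opcodes carries one string index; shrinking it
# from 32 to 16 bits saves 2 bytes per occurrence.
saved = 2 * (dispatch_o + const_s)
print(saved)                                # 339678
print(round(100 * saved / total_bytes, 1))  # 2.8
```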

Smaller frame sizes mean more opportunities for inlining, and thus faster execution!

FWIW, there are other indices in the bytecode that currently use a 32-bit unsigned index, that could probably be reduced to 16-bit unsigned. But that's for another day.

lizmat commented 2 months ago

General information about C setting:

$ bcinfo c            
File: /Users/liz/Github/rakudo/blib/CORE.c.setting.moarvm (28946040 bytes)
Opcodes: 7934130 bytes
String Heap: 38100 different strings, 543292 bytes
Frames: 24949 frames, 11983692 bytes
  276382 local variables
  552403 lexical variables
    6518 handlers
  537714 static lexical values
   31048 local debug names

Number of opcodes in the C setting that have a string index in them:

$ bceval c '.grep(*.operands.first("str")).elems'
228568

So this would mean roughly (2 * 228568) / 7934130 ≈ 5.8% saving in opcode size, with some frames obviously saving more than others.

Each frame has 2 string indices: one for the cuid, and one for any name associated with the frame. At 2 bytes saved per index, that would mean 2 * 2 * 24949 = 99796 bytes.

Each lexical variable has a string index, so 2 * 552403 = 1104806 bytes.

Each local debug name has a string index, so 2 * 31048 = 62096 bytes.

Total savings in frames: 99796 + 1104806 + 62096 = 1266698 bytes. So that means 1266698 / 11983692 ≈ 10.6% saving.

lizmat commented 2 months ago

10 most common opcodes:

% bceval c '.map(*.name).Bag.sort(-*.value).head(10)' 
set => 121552
dispatch_o => 108459
decont => 97639
wval => 79694
const_s => 61445
getcode => 60327
getlex => 32942
assertparamcheck => 25811
wval_wide => 25769
checkarity => 24949

10 most occurring consecutive opcode pairs

$ bceval c '.rotor(2 => -1).map(*>>.name.join(",")).Bag.sort(-*.value).head(10)'
dispatch_o,set => 35882
set,decont => 35252
dispatch_o,dispatch_o => 34717
decont,const_s => 34636
const_s,dispatch_o => 22742
decont,set => 22351
getcode,push_o => 20096
push_o,getcode => 20095
param_rp_o,dispatch_o => 19685
set,wval => 19567

Yeah, I know this is showing off, but it may be an inspiration for future optimizations.

niner commented 2 months ago

Your test code is rather taxing on the compiler as that chained concatenation will lead to an extremely deep AST. Even the length of the line may be a problem by itself.

This generates a module that has > 106496 compile-time known strings:

$ rakudo -e 'for "a".."z" -> $prefix {say "say <" ~ (0 .. 4095).map({"$prefix$_"}).list.Str ~ ">;\n"}' > lib/Large.rakumod

Precompiling this large module takes about a second or two, consumes some 2 GB of memory, and the result works just fine.
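As a side check (my arithmetic, not from the thread): 26 prefixes times 4096 numbered words gives 106496 distinct string literals, which indeed exceeds what a 16-bit index can address:

```python
# Each of the 26 prefixes "a".."z" gets the words "<prefix>0" .. "<prefix>4095",
# mirroring the generator one-liner above.
prefixes = [chr(c) for c in range(ord("a"), ord("z") + 1)]
strings = {f"{p}{n}" for p in prefixes for n in range(4096)}

print(len(strings))           # 106496
print(len(strings) > 0xFFFF)  # True: would not fit in a 16-bit index
```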

This also shows how you would end up with such a large number of strings: it's most likely in generated code. 64K isn't all that many. Think of some Unicode-related module that just contains all characters as individual strings somewhere. RakuAST makes no difference here, btw: these strings would still be literals that get stored in the string heap. Early on there were attempts to use precomp files as the storage format for installed modules' metadata. This would still be a reasonable idea even today, so it may well be that someone will try that with different data.

Considering how often we have run into such arbitrary limits with the JVM I'd say let's not inflict this kind of pain onto others.

lizmat commented 2 months ago

@niner That's a good point. My proposal would actually break my uniname-words module:

uniname-words $ bcinfo .precomp/DF77E765162CF1798065DEE15C0FD8CA729B512F/14/143DC7A07AAE61D6F77EB4832D9CDA89D8CC5CA4
File: /Users/liz/Github/RAKU/uniname-words/.precomp/DF77E765162CF1798065DEE15C0FD8CA729B512F/14/143DC7A07AAE61D6F77EB4832D9CDA89D8CC5CA4 (15577504 bytes)
Opcodes: 16010 bytes
String Heap: 268132 different strings, 4666300 bytes
Frames: 59 frames, 49332 bytes
     554 local variables
    2500 lexical variables
      22 handlers
    2453 static lexical values
      27 local debug names

268132 different strings.

So I guess the question in the title can be answered with a resounding YES.