faster-cpython / ideas


Shrinking inline cache sizes #533

Open markshannon opened 1 year ago

markshannon commented 1 year ago

See https://github.com/faster-cpython/ideas/discussions/396 for earlier discussion predating the merge of LOAD_METHOD and LOAD_ATTR

Currently, the following instructions have caches with multiple entries:

| Instruction | Cache size (in 16-bit units) |
| --- | --- |
| BINARY_SUBSCR | 4 |
| STORE_ATTR | 4 |
| LOAD_ATTR | 9 |
| LOAD_GLOBAL | 5 |
| CALL | 4 |

STORE_ATTR is only about 1% of instructions (dynamically) so let's not worry about that for now. BINARY_SUBSCR is also uncommon, at about 2% of instructions (dynamically), but it would be nice to merge it with BINARY_OP so it is worth a look. Calls are complicated, so the cache use of CALL is probably OK, but CALLs do represent about 6% of instructions (dynamically) so it wouldn't do any harm to shrink the cache use a bit.

That makes LOAD_GLOBAL and LOAD_ATTR the main candidates for shrinking.

There are two main consumers of space in the caches: version numbers and object references. We should look to shrink these to 16-bit, or even 8-bit, values.

Shrinking version numbers

The only way to shrink the space for a version is to reduce the range of possible versions. We can do this either by limiting the allowed number of versions, which makes it more likely that we run out of versions, or by partitioning the number space so that more commonly used versions inhabit a smaller range.

Limiting consumption of versions

If we are using very large numbers of versions it is likely that a few misbehaved objects are consuming the majority of the versions. We can fix that by limiting the number of versions that a single object can consume. E.g. we could say that an individual class can only consume one hundred versions, thereafter it gets no new version number. Rapidly changing classes are difficult to specialize for, so this should be no loss. Similar reasoning applies to function and dictionary key versions.
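The rationing scheme above can be sketched with a toy allocator (an illustrative model, not CPython source; `assign_version` and the cap constant are hypothetical names):

```python
# Toy model of version rationing: a single global counter, with each
# object limited to MAX_VERSIONS_PER_OBJECT versions in total.
MAX_VERSIONS_PER_OBJECT = 100  # the "one hundred versions" cap from the text

_next_version = 1   # 0 is reserved to mean "no valid version"
_consumed = {}      # id(obj) -> number of versions handed out so far

def assign_version(obj):
    """Return a fresh version for obj, or 0 once obj exceeds its ration."""
    global _next_version
    used = _consumed.get(id(obj), 0)
    if used >= MAX_VERSIONS_PER_OBJECT:
        return 0    # rapidly changing object: no new versions
    _consumed[id(obj)] = used + 1
    version = _next_version
    _next_version += 1
    return version
```

A class mutated 150 times would receive versions only for its first 100 mutations; specialization simply gives up on it afterwards, which is no loss for such a class.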

Dictionary key versions

The most heavily used keys are those of the builtins module. Depending on whether it is feasible to restrict all keys versions to 64k, we can partition the key versions as follows:

| Key | Fits in 16 bits | Needs 32 bits |
| --- | --- | --- |
| Builtins keys | 1-255 | 1-255 |
| Other module keys | 256+ | 256-64k |
| All others | 256+ | 64k+ |
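The partitioning could be modelled like this (a hypothetical sketch; the range bounds come from the table above):

```python
# Builtins keys draw from the reserved range 1-255; every other keys
# object draws from 256 upward. A 0 return means the range is exhausted.
_builtins_versions = iter(range(1, 256))
_other_versions = iter(range(256, 2**32))

def new_keys_version(is_builtins_keys):
    if is_builtins_keys:
        return next(_builtins_versions, 0)
    return next(_other_versions, 0)
```

Because builtins keys versions never exceed 255, a cache entry that only ever refers to the builtins can store the version in a single byte.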

Type versions

For specialization of binary operators, it is probably worth reserving some fixed versions for common types like int and reserving a few more numbers for immutable classes with custom operators, like decimal or ndarray. Other than that, our best approach is to try to restrict all versions to 64k.

First of all, we want to avoid allocating version numbers to classes that are still being constructed, to avoid consuming versions at that point. I don't think this is currently an issue, but it is something that we should guard against. Next, we want to prevent a few classes from consuming all the versions, which would stop other classes from getting valid (or good) version numbers. We can do that by rationing the number of versions that an individual class can have (to 50?). That way, if we have a program with 10,000 well-behaved classes (consuming 3 versions each) and 100 misbehaving classes (capped at 50 versions each), we only need 35,000 version numbers, fitting in 64k.
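The budget arithmetic works out as follows:

```python
# 10,000 well behaved classes at 3 versions each, plus 100 misbehaving
# classes capped at the 50-version ration.
well_behaved = 10_000 * 3   # 30,000
misbehaving = 100 * 50      # 5,000
total = well_behaved + misbehaving
assert total == 35_000 and total < 2**16   # fits comfortably in 16 bits
```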

Function versions

We already restrict the function version usable for BINARY_SUBSCR to 16 bits. The problem with functions is that there could be many of them. A very large program might include tens of thousands of functions. Functions almost never change (change is defined as having __code__ or defaults changed), so we could limit the number of versions per function to a very low number (say 5), and assume that we won't specialize calls to more than ~60,000 functions.

We currently allocate a version number to each code object we create, so it might be better to allocate code object versions lazily when required.
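Lazy allocation could look something like this (a hypothetical sketch; `Code` is a stand-in for a code object, with 0 as a "not yet assigned" sentinel):

```python
_next_func_version = 1

class Code:
    """Stand-in for a code object; version 0 means 'not yet assigned'."""
    def __init__(self):
        self.version = 0

def get_version(code):
    """Allocate a version on first use, so never-specialized code costs nothing."""
    global _next_func_version
    if code.version == 0:
        code.version = _next_func_version
        _next_func_version += 1
    return code.version
```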

Summary

We can restrict the version numbers to 16 bits, by limiting the number of versions that a single object can consume.

| Version number | Reserved values | Versions per object (these are guesses) |
| --- | --- | --- |
| Class | 1-255 | 50 |
| Dictionary keys | 1-255 | 1000 for modules, 20 for class cached keys |
| Functions | --- | 5 |

Shrinking object caches

By caching objects on the class, instead of inline, we can shrink the size of the object cache from 8 bytes to 1 byte. See https://github.com/faster-cpython/ideas/issues/532
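The idea in gh-532 can be modelled roughly as follows (hypothetical names; a real implementation would live in C and handle cache invalidation):

```python
class Klass:
    """Stand-in for a type object holding a small per-class object cache."""
    def __init__(self):
        self.cached_objects = []   # at most 256 entries, addressed by 1 byte

def cache_object(klass, obj):
    """Return a 1-byte index for obj in klass's cache, or -1 if full."""
    for i, cached in enumerate(klass.cached_objects):
        if cached is obj:
            return i
    if len(klass.cached_objects) >= 256:
        return -1
    klass.cached_objects.append(obj)
    return len(klass.cached_objects) - 1

def load_cached(klass, index):
    """What the specialized instruction would do: one indexed load."""
    return klass.cached_objects[index]
```

The instruction cache then stores only the 1-byte index; the 8-byte pointer lives on the class.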

LOAD_GLOBAL

LOAD_GLOBAL has two specializations, one for lookups on the current module and one for the builtins. It is LOAD_GLOBAL_BUILTIN that needs the most space. Currently it uses 10 bytes for its cache: 2 for the counter, 4 for the current module's dict key version, 2 for the builtins module's dict key version, and 2 for the index into the builtins dict's values.

We can shrink this to 6 bytes, as follows:

LOAD_GLOBAL_MODULE only needs one version, so it can use a 16-bit index.
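The byte counts can be checked with `struct` (the 10-byte layout is from the text above; the 6-byte layout shown is one possible arrangement under the partitioning scheme, an assumption rather than a settled design):

```python
import struct

# Current LOAD_GLOBAL_BUILTIN cache: 2-byte counter, 4-byte module keys
# version, 2-byte builtins keys version, 2-byte index.
current = struct.calcsize("<HIHH")   # "<" disables padding
assert current == 10

# One possible 6-byte layout: 2-byte counter, 2-byte module keys version,
# 1-byte builtins keys version (reserved range 1-255), 1-byte index.
proposed = struct.calcsize("<HHBB")
assert proposed == 6
```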

LOAD_ATTR

The 18-byte cache of LOAD_ATTR consists of a 2-byte counter, a 4-byte type version, a 4-byte dict key version, and an 8-byte pointer (chopped into 2-byte chunks for alignment).

Using 16-bit versions and a 1-byte index into a per-class object cache, we can shrink the cache from 18 to 7 (aligned to 8) bytes. In theory we could allow 24-bit type version numbers, although I don't know how practical that would be.
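Again the arithmetic can be checked with `struct` (the `<` prefix disables padding, matching the hand-packed layouts described above):

```python
import struct

# Current LOAD_ATTR cache: 2-byte counter, 4-byte type version,
# 4-byte dict key version, 8-byte object pointer.
assert struct.calcsize("<HIIQ") == 18

# Proposed: 16-bit counter and versions plus a 1-byte per-class cache index.
assert struct.calcsize("<HHHB") == 7   # padded to 8 in a real cache
```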

markshannon commented 1 year ago

@carljm @Fidget-Spinner

gvanrossum commented 1 year ago

Note that the code generator currently measures cache entries in units of 16 bit words, e.g. counter/1, type_version/2. Of course we can do further unpacking in the code, but if we're going to access single bytes we might switch this to byte counts (the generator can insist that the total count is even).

Fidget-Spinner commented 1 year ago

I'm working on the per-class object cache. Current status is that everything compiles but continuously segfaults :).

ericsnowcurrently commented 1 year ago

The only way to shrink the space for a version is to reduce the range of possible versions.

Another possibility (or related concept), if the very slight overhead isn't too much: variable-length ints (https://developers.google.com/protocol-buffers/docs/encoding#varints)
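For reference, a protocol-buffers-style varint uses 7 payload bits per byte with a continuation bit in the high bit, so small values occupy fewer bytes:

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    """Decode the varint at the start of data."""
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value
        shift += 7
```

Values up to 127 take one byte; larger version numbers grow as needed.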

gvanrossum commented 1 year ago

Since the cache entry should be able to store every possible value of a given version, I don't see how variable-length ints would help.

carljm commented 1 year ago

Type versions ... Other than that, our best approach is to try to restrict all versions to 64k.

IG server has >65k modules imported in production. I'm not sure what a reasonable assumption for types-per-module is (there may be a lot of empty __init__.py modules, but also lots of modules will have many more than one type), and I don't have data on exactly how many types we have, but it seems realistic that we have (or could in future reach) >64k types. How would this behave in a world where type versions are restricted to 64k?

Edited to add: if I find time, I'll modify a build to report type version tags and run it in prod, to see what maximum type version we currently reach.

markshannon commented 1 year ago

IG server has >65k modules imported in production.

In practice it would mean that we couldn't specialize access to attributes of those classes (and instances of those classes) with a type version exceeding 2**16 in the tier 1 interpreter.

We might be able to handle them in the tier 2 optimizer, but it will be tricky without the profiling data from the tier 1 interpreter.

64k is quite a low limit, but a more sensible limit of 1M (20 bits) or 16M (24 bits) is impractical to implement, and 32 bits is quite wasteful. I don't know what the best solution is.

There is no requirement that each instruction cache is broken up into 16-bit chunks, merely that it is a multiple of 16 bits in total. By breaking it up into bytes, we could potentially use 1-byte and 3-byte values.
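Byte-level packing could, for example, combine a 1-byte index with a 3-byte (24-bit) type version into one even 4-byte entry (a sketch of the packing arithmetic, not a CPython API):

```python
def pack_entry(index, version):
    """Pack a 1-byte index and a 24-bit version into 4 bytes."""
    assert 0 <= index < 2**8 and 0 <= version < 2**24
    return bytes([index]) + version.to_bytes(3, "little")

def unpack_entry(entry):
    """Recover the index and 24-bit version from a 4-byte entry."""
    return entry[0], int.from_bytes(entry[1:4], "little")
```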