**markshannon** opened this issue 1 year ago
@carljm @Fidget-Spinner
Note that the code generator currently measures cache entries in units of 16 bit words, e.g. `counter/1`, `type_version/2`. Of course we can do further unpacking in the code, but if we're going to access single bytes we might switch this to byte counts (the generator can insist that the total count is even).
I'm working on the per-class object cache. Current status is that everything compiles but continuously segfaults :).
> The only way to shrink the space for a version is to reduce the range of possible versions.

Another possibility (or related concept), if the very slight overhead isn't too much: variable-length ints (https://developers.google.com/protocol-buffers/docs/encoding#varints).
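For reference, a minimal sketch of that varint encoding (7 payload bits per byte, with the high bit as a continuation flag); this is purely illustrative, not CPython code:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of protocol-buffers-style varint encoding:
   7 payload bits per byte, high bit set on every byte except the last.
   Small values take 1 byte; a full 32-bit value takes up to 5. */
static size_t
encode_varint(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)(value & 0x7F) | 0x80;
        value >>= 7;
    }
    out[n++] = (uint8_t)value;
    return n;  /* number of bytes written */
}
```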
Since the cache entry should be able to store every possible value of a given version, I don't see how variable-length ints would help.
> Type versions ... Other than that, our best approach is to try to restrict all versions to 64k.
IG server has >65k modules imported in production. I'm not sure what a reasonable assumption for types-per-module is (there may be a lot of empty `__init__.py` modules, but also lots of modules will have many more than one type), and I don't have data on exactly how many types we have, but it seems realistic that we have (or could in future reach) >64k types. How would this behave in a world where type versions are restricted to 64k?
Edited to add: if I find time, I'll modify a build to report type version tags and run it in prod, to see what maximum type version we currently reach.
> IG server has >65k modules imported in production.
In practice it would mean that we couldn't specialize access to attributes of those classes (and instances of those classes) with a type version exceeding `2**16` in the tier 1 interpreter.
We might be able to handle them in the tier 2 optimizer, but it will be tricky without the profiling data from the tier 1 interpreter.
64k is quite a low limit, but a more sensible limit of 1M (20 bits) or 16M (24 bits) is impractical to implement, and 32 bits is quite wasteful. I don't know what the best solution is.
There is no requirement that each instruction cache is broken up into 16-bit chunks, merely that it is a multiple of 16 bits in total. By breaking it up into bytes, we could potentially use 1 byte and 3 byte values.
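As a hypothetical illustration (struct and field names are made up, not CPython's), a cache could mix 1 byte and 3 byte fields while keeping the total a multiple of 16 bits:

```c
#include <stdint.h>

/* Hypothetical byte-granular cache layout: a 3-byte version and a
   1-byte index alongside the 16-bit counter. Total is 6 bytes, i.e.
   three 16-bit code units, so the generator's evenness rule holds. */
struct byte_granular_cache {
    uint16_t counter;     /* existing 16-bit counter word */
    uint8_t  version[3];  /* 24-bit version, little-endian bytes */
    uint8_t  index;       /* 1-byte index */
};

/* Reassemble the 24-bit version for comparison against a type's tag. */
static inline uint32_t
read_version24(const uint8_t v[3])
{
    return (uint32_t)v[0] | ((uint32_t)v[1] << 8) | ((uint32_t)v[2] << 16);
}
```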
See https://github.com/faster-cpython/ideas/discussions/396 for earlier discussion, predating the merge of `LOAD_METHOD` and `LOAD_ATTR`.
Currently the following instructions have multi-entry caches: `STORE_ATTR`, `BINARY_SUBSCR`, `CALL`, `LOAD_GLOBAL`, and `LOAD_ATTR`.

`STORE_ATTR` is only about 1% of instructions (dynamically), so let's not worry about that for now. `BINARY_SUBSCR` is also uncommon, at about 2% of instructions (dynamically), but it would be nice to merge it with `BINARY_OP`, so it is worth a look. Calls are complicated, so the cache use of `CALL` is probably OK, but `CALL`s do represent about 6% of instructions (dynamically), so it wouldn't do any harm to shrink the cache use a bit.

Which makes `LOAD_GLOBAL` and `LOAD_ATTR` the main candidates for shrinking.

There are two main consumers of space in the caches: version numbers and object references. We should look to shrink these to 16 bit, or even 8 bit, values.
## Shrinking version numbers
The only way to shrink the space for a version is to reduce the range of possible versions. We can do this by limiting the allowed number of versions, which makes it more likely that we run out of versions, or by partitioning the number space so that more commonly used versions inhabit a smaller range.
### Limiting consumption of versions
If we are using very large numbers of versions, it is likely that a few misbehaving objects are consuming the majority of them. We can fix that by limiting the number of versions that a single object can consume. E.g. we could say that an individual class can only consume one hundred versions; thereafter it gets no new version number. Rapidly changing classes are difficult to specialize for, so this should be no loss. Similar reasoning applies to function and dictionary key versions.
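A minimal sketch of what such rationing could look like, assuming a per-object count of versions consumed (names are illustrative, not actual CPython fields):

```c
#include <stdint.h>

#define MAX_VERSIONS_PER_OBJECT 100  /* the ration discussed above */

/* Stand-in for any versioned object: a class, function, or dict keys. */
struct versioned {
    uint32_t version;        /* current tag; 0 means "no valid version" */
    uint16_t versions_used;  /* versions this object has consumed so far */
};

static uint32_t next_version = 1;  /* global allocator; 0 is reserved */

static void
bump_version(struct versioned *obj)
{
    if (obj->versions_used >= MAX_VERSIONS_PER_OBJECT) {
        obj->version = 0;  /* over the ration: specialization gives up */
        return;
    }
    obj->versions_used++;
    obj->version = next_version++;
}
```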
### Dictionary key versions
The most heavily used keys are those of the builtins module. Depending on whether it is feasible to restrict all keys versions to 64k, we can partition the key versions so that the heavily used keys (above all, the builtins) inhabit the smaller range.
### Type versions
For specialization of binary operators, it is probably worth reserving some fixed versions for common types like `int`, and reserving a few more numbers for immutable classes with custom operators, like `decimal` or `ndarray`. Other than that, our best approach is to try to restrict all versions to 64k.

First of all, we want to avoid allocating version numbers to classes that are being constructed, so that we don't consume versions at that point. I don't think this is currently an issue, but it is something that we should guard against. Next, we want to prevent a few classes from consuming all the versions and denying other classes valid (or good) version numbers. We can do that by rationing the number of versions that an individual class can have (to 50?). That way, if we have a program with 10,000 well-behaved classes (consuming 3 versions each) and 100 misbehaving classes, we only need 10,000 × 3 + 100 × 50 = 35,000 version numbers, fitting comfortably in 64k.
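For instance, the reserved versions could sit at the bottom of the number space; the specific values below are purely illustrative, not a real CPython scheme:

```c
/* Hypothetical reserved type-version numbers: fixed tags for common
   built-in types, a few slots for immutable extension classes, then
   dynamically allocated versions above. Values are illustrative. */
enum type_version {
    TV_UNASSIGNED    = 0,   /* no valid version */
    TV_INT           = 1,
    TV_FLOAT         = 2,
    TV_STR           = 3,
    /* ... slots for types like decimal or ndarray ... */
    TV_FIRST_DYNAMIC = 16,  /* normal allocation starts here */
};
```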
### Function versions
We already restrict the function version usable for `BINARY_SUBSCR` to 16 bits. The problem with functions is that there could be many of them: a very large program might include tens of thousands of functions. Functions almost never change (change is defined as having `__code__` or defaults changed), so we could limit the number of versions per function to a very low number (say 5) and assume that we won't specialize calls to more than ~60,000 functions.

We currently allocate a version number to each code object we create, so it might be better to allocate code object versions lazily, when required.
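A sketch of lazy allocation, assuming a 16-bit version field that stays 0 until a call site first asks for it (names are illustrative, not CPython's):

```c
#include <stdint.h>

/* Stand-in for a code object with a lazily assigned 16-bit version. */
struct code_like {
    uint16_t version;  /* 0 until first requested */
};

static uint16_t next_func_version = 1;  /* 0 is reserved for "none" */

static uint16_t
get_function_version(struct code_like *co)
{
    if (co->version == 0 && next_func_version != 0) {
        co->version = next_func_version++;  /* wraps to 0 when exhausted */
    }
    return co->version;  /* 0 means "do not specialize this call" */
}
```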
### Summary
We can restrict the version numbers to 16 bits, by limiting the number of versions that a single object can consume.
## Shrinking object caches
By caching objects on the class, instead of inline, we can shrink the size of the object cache from 8 bytes to 1 byte. See https://github.com/faster-cpython/ideas/issues/532
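A rough sketch of the idea, under the assumption that each class owns a small array of cached objects and the inline cache stores only an index into it (names are illustrative; see the linked issue for the real design discussion):

```c
#include <stdint.h>

/* Hypothetical per-class object cache: the class keeps strong
   references in a small array, so an inline cache entry needs only a
   1-byte slot index instead of an 8-byte object pointer. */
struct class_object_cache {
    void    *objects[256];  /* cached objects, owned by the class */
    uint16_t used;          /* slots filled so far (0..256) */
};

/* The inline cache entry shrinks to a single byte. */
struct inline_object_ref {
    uint8_t index;  /* index into the owning class's object cache */
};
```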
### `LOAD_GLOBAL`

`LOAD_GLOBAL` has two specializations, one for lookups on the current module and one for the builtins. It is `LOAD_GLOBAL_BUILTIN` that needs the most space. Currently it uses 10 bytes for its cache: 2 for the counter, 4 for the current module's dict key version, 2 for the builtins module's dict key version, and 2 for the index into the builtins dict's values. We can shrink this to 6 bytes, as follows:
* `len(builtins.__dict__) == 157`, and it is unlikely that a program will add 100 values to the builtins, so a 1 byte index into the builtins values is enough.
* `LOAD_GLOBAL_MODULE` only needs one version, so it can use a 16 bit index.
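Put together, a 6-byte layout could look like the following sketch; the exact field split (16-bit module keys version, 8-bit builtins keys version, 8-bit index) is an assumption consistent with the numbers above, not a settled design:

```c
#include <stdint.h>

/* Hypothetical 6-byte LOAD_GLOBAL_BUILTIN cache: three 16-bit code
   units. Field names and widths are assumptions, not CPython's. */
struct load_global_builtin_cache {
    uint16_t counter;                /* adaptive counter */
    uint16_t module_keys_version;    /* shrunk from 32 to 16 bits */
    uint8_t  builtins_keys_version;  /* builtins keys rarely change */
    uint8_t  index;                  /* < 256 entries in builtins */
};
```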
### `LOAD_ATTR`

The 18 byte cache of `LOAD_ATTR` consists of a 2 byte counter, a 4 byte type version, a 4 byte dict key version, and an 8 byte pointer (chopped into 2 byte chunks for alignment). Using 16 bit versions and a 1 byte index into a per-class object cache, we can shrink the cache from 18 to 7 (aligned to 8) bytes. In theory we could allow 24 bit type version numbers, although I don't know how practical that would be.
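A sketch of the shrunk layout, assuming 16-bit versions and the per-class object cache above (field names are illustrative, not CPython's):

```c
#include <stdint.h>

/* Hypothetical 7-byte LOAD_ATTR cache, padded to four 16-bit code
   units. Field names and widths are assumptions, not CPython's. */
struct load_attr_cache {
    uint16_t counter;            /* adaptive counter */
    uint16_t type_version;       /* shrunk from 32 to 16 bits */
    uint16_t dict_keys_version;  /* shrunk from 32 to 16 bits */
    uint8_t  index;              /* slot in the per-class object cache */
    uint8_t  padding;            /* rounds the total up to 8 bytes */
};
```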