markshannon opened 6 months ago
I performed an experiment to see how well binary ops specialize for reference count and immortality. They specialize quite well, both for the fraction specialized and the hit rates. Refcount stability seems quite good.
Numbers are in millions of instructions. Full stats
`1` means refcount of 1, `I` means immortal, `X` means anything.
Op | Count | Miss |
---|---|---|
BINARY_OP_II | 2310 | 1.9% |
BINARY_OP_XI | 2265 | 0.5% |
BINARY_OP_XX | 1596 | -- |
BINARY_OP_X1 | 623 | 0.3% |
BINARY_OP_11 | 550 | 2.4% |
BINARY_OP_1I | 425 | 5.4% |
BINARY_OP_1X | 354 | 0.6% |
BINARY_OP_IX | 155 | 13.6% |
BINARY_OP_I1 | 113 | 16.6% |
So if we specialize just for refcounts, we will want to specialize for `II`, `XI`, and `X1`, as they are common and have low miss rates.
There is little to be gained from the `11` form, as we can't reuse both operands, so we might as well merge it into `X1`.
`1X` allows the left hand side to be reused, and has a low miss rate. This is probably worth specializing for despite its relative rarity.
We might want to implement `1I` as well, despite its miss rate and rarity, just because it is so efficient.
All the others probably aren't worth it, so we will want four or five specializations: `II`, `XI`, `X1`, `1X`, and maybe `1I`.
If we are specializing for refcount and type, we will obviously need to use some sort of table approach to handle the many combinations. We will need to guard on the refcounts, and on the types. Guarding on refcount or immortality is tricky to do with table lookup, without some sort of switch, so we will need to do that at the bytecode level. Type checks can be done without excessive branching using versions.
Both types of guards can be potentially removed in the tier 2 optimizer.
We can split specialized `BINARY_OP`s into two parts: the operation and the cleanup. The operation needs to be aware of refcount == 1, so it can reuse the object, but doesn't care about immortal objects. The cleanup does care about immortal objects, so it can avoid decrefing them.
For the operations, we care about the types and whether the left or right side has a refcount of 1.
Here's the data for all `BINARY_OP`s when running the benchmarks:
Key: `Any` is some type other than the ones listed.
Top 100 most common type/op/type/refcount quads, including indexing and comparisons:
type(left) op type(right) (refcounts) | Count | Percentage |
---|---|---|
int + int (XX) | 542,718 | 8.7% |
int += int (XX) | 517,690 | 8.3% |
int cmp int (XX) | 474,510 | 7.6% |
list[int] (XX) | 328,388 | 5.3% |
int - int (XX) | 327,856 | 5.3% |
int & int (XX) | 317,742 | 5.1% |
float * float (XX) | 299,114 | 4.8% |
str + str (XX) | 262,912 | 4.2% |
str cmp str (XX) | 259,269 | 4.2% |
array[int] (XX) | 157,788 | 2.5% |
tuple[int] (XX) | 154,954 | 2.5% |
Any cmp Any (XX) | 125,920 | 2.0% |
int * int (XX) | 116,567 | 1.9% |
dict[str] (XX) | 116,100 | 1.9% |
Any[tuple] (XX) | 108,601 | 1.7% |
Any[str] (XX) | 87,382 | 1.4% |
float + float (1X) | 79,610 | 1.3% |
str[int] (XX) | 79,145 | 1.3% |
float += float (X1) | 69,860 | 1.1% |
int -= int (XX) | 68,934 | 1.1% |
Any[int] (XX) | 60,122 | 1.0% |
int + int (1X) | 57,471 | 0.9% |
float - float (1X) | 51,345 | 0.8% |
str + str (1X) | 49,942 | 0.8% |
str += str (XX) | 47,826 | 0.8% |
list cmp list (XX) | 44,856 | 0.7% |
float * float (1X) | 43,470 | 0.7% |
list + list (1X) | 41,348 | 0.7% |
list[slice] (XX) | 41,247 | 0.7% |
float - float (XX) | 40,492 | 0.6% |
int // int (XX) | 37,583 | 0.6% |
int >> int (XX) | 37,153 | 0.6% |
float cmp int (XX) | 36,250 | 0.6% |
tuple cmp tuple (XX) | 32,277 | 0.5% |
str + str (X1) | 31,917 | 0.5% |
bytes[int] (XX) | 31,823 | 0.5% |
Any cmp str (XX) | 31,383 | 0.5% |
int << int (XX) | 31,056 | 0.5% |
int \| int (XX) | 30,969 | 0.5% |
str += str (X1) | 30,450 | 0.5% |
str % str (XX) | 27,202 | 0.4% |
str % str (X1) | 26,063 | 0.4% |
bytes cmp bytes (XX) | 25,275 | 0.4% |
Any += Any (XX) | 24,653 | 0.4% |
float -= float (X1) | 24,300 | 0.4% |
int - int (1X) | 23,753 | 0.4% |
int // int (1X) | 21,177 | 0.3% |
int \|= int (XX) | 20,921 | 0.3% |
float += float (1X) | 20,760 | 0.3% |
int & int (1X) | 20,517 | 0.3% |
int % int (XX) | 19,170 | 0.3% |
int * int (X1) | 18,809 | 0.3% |
int ^ int (XX) | 18,492 | 0.3% |
list += list (XX) | 17,259 | 0.3% |
bytes + bytes (XX) | 17,208 | 0.3% |
Any cmp list (XX) | 17,080 | 0.3% |
list * int (1X) | 16,994 | 0.3% |
Any[Any] (XX) | 16,261 | 0.3% |
str cmp Any (XX) | 15,498 | 0.2% |
dict[int] (XX) | 15,133 | 0.2% |
str * int (XX) | 14,804 | 0.2% |
int ^ int (1X) | 14,544 | 0.2% |
Any - Any (XX) | 13,385 | 0.2% |
float / int (X1) | 13,340 | 0.2% |
float /= float (XX) | 12,300 | 0.2% |
int cmp Any (XX) | 11,997 | 0.2% |
tuple += tuple (X1) | 11,174 | 0.2% |
str % int (XX) | 11,043 | 0.2% |
tuple + tuple (XX) | 10,989 | 0.2% |
float / int (1X) | 10,631 | 0.2% |
int - float (XX) | 10,560 | 0.2% |
float * int (1X) | 9,734 | 0.2% |
int &= int (XX) | 9,660 | 0.2% |
frozen_set cmp frozen_set (XX) | 9,021 | 0.1% |
int <<= int (XX) | 8,910 | 0.1% |
dict[Any] (XX) | 8,673 | 0.1% |
list + list (X1) | 8,639 | 0.1% |
Any \|= Any (XX) | 8,512 | 0.1% |
float * float (X1) | 8,480 | 0.1% |
list + list (XX) | 8,325 | 0.1% |
float ** float (1X) | 8,120 | 0.1% |
Any[slice] (XX) | 7,946 | 0.1% |
float cmp float (XX) | 7,831 | 0.1% |
Any * Any (XX) | 7,740 | 0.1% |
str % tuple (X1) | 7,317 | 0.1% |
tuple + tuple (1X) | 7,235 | 0.1% |
float -= float (1X) | 6,600 | 0.1% |
Any cmp int (XX) | 6,411 | 0.1% |
int[int] (XX) | 6,269 | 0.1% |
int * float (XX) | 5,789 | 0.1% |
tuple + tuple (X1) | 5,725 | 0.1% |
float + float (XX) | 5,709 | 0.1% |
str[slice] (XX) | 5,644 | 0.1% |
bytearray += bytes (XX) | 5,643 | 0.1% |
list += list (X1) | 5,615 | 0.1% |
int ^= int (XX) | 5,591 | 0.1% |
dict[bytes] (XX) | 5,479 | 0.1% |
set cmp set (XX) | 5,388 | 0.1% |
float * int (XX) | 5,012 | 0.1% |
int + int (X1) | 4,980 | 0.1% |
By specializing `BINARY_OP` by reference count as well as by type, we can potentially speed up tier 2 jitted binary operations quite a lot. It is quite common for one of the operands of `BINARY_OP` to have a reference count of 1, so it is profitable to avoid object allocation and reuse that object. We already do that for `float` specializations of `BINARY_OP`. The problem is that the resulting code is quite branchy, and thus bulky, which is a problem for the jit.

If we specialize by refcount, the code size shrinks dramatically. For example, the x86-64 code size for `_BINARY_OP_MULTIPLY_FLOAT` is 232 bytes. If we specialize for the left hand side having a refcount of 1, the equivalent instruction is 143 bytes* with a guard for the refcount. If we assume that the optimizer can remove the guard, then the size drops to 74 bytes. If we also specialize for the right hand side having a refcount > 1, then the code size (without guards) drops to 41 bytes, less than 20% of the original code size.
Pyston performs this optimization. It is, I believe, the only specialization that Pyston does that we don't.
There isn't much room for more tier 1 instructions, but by combining this approach with https://github.com/faster-cpython/ideas/issues/660 we can implement this without excessive tier 1 specialization.
Currently, `BINARY_OP` specializes to one of:

We can replace these with:

The `L1` and `R1` suffixes mean "left has a refcount of one" and "right has a refcount of one" respectively. `int` is still worth treating separately because it is both common and more awkward due to the caching of small values.
(ignoring `BINARY_OP_INPLACE_ADD_UNICODE` as a special case)
*This will go down once we have hot/cold splitting