faster-cpython / ideas


Specialize binary ops according to reference count as well as type. #662

Open markshannon opened 6 months ago

markshannon commented 6 months ago

By specializing BINARY_OP by reference count as well as by type we can potentially speed up tier 2 jitted binary operations quite a lot.

It is quite common that one of the operands of BINARY_OP has a reference count of 1, so it is profitable to avoid object allocation and reuse that object. We already do that for float specializations of BINARY_OP. The problem is that the resulting code is quite branchy, and thus bulky, which is a problem for the jit.

If we specialize by refcount, the code size shrinks dramatically. For example, the x86-64 code size for _BINARY_OP_MULTIPLY_FLOAT is 232 bytes. If we specialize for the left hand side having a refcount of 1, the equivalent instruction is 143 bytes* with a guard for the refcount. If we assume that the optimizer can remove the guard, then the size drops to 74 bytes.

If we also specialize for the right hand side having a refcount > 1, then the code size (without guards) drops to 41 bytes, less than 20% of the original code size.

Pyston performs this optimization. It is, I believe, the only specialization that Pyston does that we don't.

There isn't much room for more tier 1 instructions, but by combining this approach with https://github.com/faster-cpython/ideas/issues/660 we can implement this without excessive tier 1 specialization.

Currently, BINARY_OP specializes to one of:

We can replace these with:

The L1 and R1 suffixes mean "left has refcount of one" and "right has refcount of one" respectively.

int is still worth treating separately because it is both common and more awkward due to the caching of small values.

(ignoring BINARY_OP_INPLACE_ADD_UNICODE as a special case)


*This will go down once we have hot/cold splitting

markshannon commented 6 months ago

I performed an experiment to see how well binary ops specialize for reference count and immortality. They specialize quite well, both for the fraction specialized and the hit rates. Refcount stability seems quite good.

Numbers are in millions of instructions. Full stats

1 means refcount of 1, I means immortal, X means anything.

| Op | Count | Miss |
| --- | ---: | ---: |
| BINARY_OP_II | 2310 | 1.9% |
| BINARY_OP_XI | 2265 | 0.5% |
| BINARY_OP_XX | 1596 | -- |
| BINARY_OP_X1 | 623 | 0.3% |
| BINARY_OP_11 | 550 | 2.4% |
| BINARY_OP_1I | 425 | 5.4% |
| BINARY_OP_1X | 354 | 0.6% |
| BINARY_OP_IX | 155 | 13.6% |
| BINARY_OP_I1 | 113 | 16.6% |

So if we specialize just for refcounts, we will want to specialize for II, XI, and X1, as they are common and have low miss rates. There is little to be gained from the 11 form, since we can't reuse both operands, so we might as well merge it into X1. 1X allows the left hand side to be reused and has a low miss rate; it is probably worth specializing for despite its relative rarity. We might want to implement 1I as well, despite its miss rate and rarity, just because it is so efficient. All the others probably aren't worth it. So we will want four or five specializations: II, XI, X1, 1X, and maybe 1I.

Specializing for type and refcount

If we are specializing for both refcount and type, we will obviously need some sort of table approach to handle the many combinations. We will need to guard on the refcounts and on the types. Guarding on refcount or immortality is tricky to do with a table lookup, short of some sort of switch, so we will need to do that at the bytecode level. Type checks can be done without excessive branching using type version tags.

Both types of guards can potentially be removed by the tier 2 optimizer.

markshannon commented 6 months ago

We can split specialized BINARY_OPs into two parts: the operation and the cleanup. The operation needs to be aware of refcount == 1, so it can reuse the object, but doesn't care about immortal objects. The cleanup does care about immortal objects, so it can avoid decrefing them.

For the operations, we care about the types and whether the left or right operand has a refcount of 1.

Here's the data for all BINARY_OPs when running the benchmarks.

Key: Any is some type other than the ones listed.

markshannon commented 6 months ago

Top 100 most common type/op/type/refcount quads, including indexing and comparisons:

| type(left) op type(right) | Refcounts | Count | Percentage |
| --- | --- | ---: | ---: |
| int + int | XX | 542,718 | 8.7% |
| int += int | XX | 517,690 | 8.3% |
| int cmp int | XX | 474,510 | 7.6% |
| list[int] | XX | 328,388 | 5.3% |
| int - int | XX | 327,856 | 5.3% |
| int & int | XX | 317,742 | 5.1% |
| float * float | XX | 299,114 | 4.8% |
| str + str | XX | 262,912 | 4.2% |
| str cmp str | XX | 259,269 | 4.2% |
| array[int] | XX | 157,788 | 2.5% |
| tuple[int] | XX | 154,954 | 2.5% |
| Any cmp Any | XX | 125,920 | 2.0% |
| int * int | XX | 116,567 | 1.9% |
| dict[str] | XX | 116,100 | 1.9% |
| Any[tuple] | XX | 108,601 | 1.7% |
| Any[str] | XX | 87,382 | 1.4% |
| float + float | 1X | 79,610 | 1.3% |
| str[int] | XX | 79,145 | 1.3% |
| float += float | X1 | 69,860 | 1.1% |
| int -= int | XX | 68,934 | 1.1% |
| Any[int] | XX | 60,122 | 1.0% |
| int + int | 1X | 57,471 | 0.9% |
| float - float | 1X | 51,345 | 0.8% |
| str + str | 1X | 49,942 | 0.8% |
| str += str | XX | 47,826 | 0.8% |
| list cmp list | XX | 44,856 | 0.7% |
| float * float | 1X | 43,470 | 0.7% |
| list + list | 1X | 41,348 | 0.7% |
| list[slice] | XX | 41,247 | 0.7% |
| float - float | XX | 40,492 | 0.6% |
| int // int | XX | 37,583 | 0.6% |
| int >> int | XX | 37,153 | 0.6% |
| float cmp int | XX | 36,250 | 0.6% |
| tuple cmp tuple | XX | 32,277 | 0.5% |
| str + str | X1 | 31,917 | 0.5% |
| bytes[int] | XX | 31,823 | 0.5% |
| Any cmp str | XX | 31,383 | 0.5% |
| int << int | XX | 31,056 | 0.5% |
| int \| int | XX | 30,969 | 0.5% |
| str += str | X1 | 30,450 | 0.5% |
| str % str | XX | 27,202 | 0.4% |
| str % str | X1 | 26,063 | 0.4% |
| bytes cmp bytes | XX | 25,275 | 0.4% |
| Any += Any | XX | 24,653 | 0.4% |
| float -= float | X1 | 24,300 | 0.4% |
| int - int | 1X | 23,753 | 0.4% |
| int // int | 1X | 21,177 | 0.3% |
| int \|= int | XX | 20,921 | 0.3% |
| float += float | 1X | 20,760 | 0.3% |
| int & int | 1X | 20,517 | 0.3% |
| int % int | XX | 19,170 | 0.3% |
| int * int | X1 | 18,809 | 0.3% |
| int ^ int | XX | 18,492 | 0.3% |
| list += list | XX | 17,259 | 0.3% |
| bytes + bytes | XX | 17,208 | 0.3% |
| Any cmp list | XX | 17,080 | 0.3% |
| list * int | 1X | 16,994 | 0.3% |
| Any[Any] | XX | 16,261 | 0.3% |
| str cmp Any | XX | 15,498 | 0.2% |
| dict[int] | XX | 15,133 | 0.2% |
| str * int | XX | 14,804 | 0.2% |
| int ^ int | 1X | 14,544 | 0.2% |
| Any - Any | XX | 13,385 | 0.2% |
| float / int | X1 | 13,340 | 0.2% |
| float /= float | XX | 12,300 | 0.2% |
| int cmp Any | XX | 11,997 | 0.2% |
| tuple += tuple | X1 | 11,174 | 0.2% |
| str % int | XX | 11,043 | 0.2% |
| tuple + tuple | XX | 10,989 | 0.2% |
| float / int | 1X | 10,631 | 0.2% |
| int - float | XX | 10,560 | 0.2% |
| float * int | 1X | 9,734 | 0.2% |
| int &= int | XX | 9,660 | 0.2% |
| frozen_set cmp frozen_set | XX | 9,021 | 0.1% |
| int <<= int | XX | 8,910 | 0.1% |
| dict[Any] | XX | 8,673 | 0.1% |
| list + list | X1 | 8,639 | 0.1% |
| Any \|= Any | XX | 8,512 | 0.1% |
| float * float | X1 | 8,480 | 0.1% |
| list + list | XX | 8,325 | 0.1% |
| float ** float | 1X | 8,120 | 0.1% |
| Any[slice] | XX | 7,946 | 0.1% |
| float cmp float | XX | 7,831 | 0.1% |
| Any * Any | XX | 7,740 | 0.1% |
| str % tuple | X1 | 7,317 | 0.1% |
| tuple + tuple | 1X | 7,235 | 0.1% |
| float -= float | 1X | 6,600 | 0.1% |
| Any cmp int | XX | 6,411 | 0.1% |
| int[int] | XX | 6,269 | 0.1% |
| int * float | XX | 5,789 | 0.1% |
| tuple + tuple | X1 | 5,725 | 0.1% |
| float + float | XX | 5,709 | 0.1% |
| str[slice] | XX | 5,644 | 0.1% |
| bytearray += bytes | XX | 5,643 | 0.1% |
| list += list | X1 | 5,615 | 0.1% |
| int ^= int | XX | 5,591 | 0.1% |
| dict[bytes] | XX | 5,479 | 0.1% |
| set cmp set | XX | 5,388 | 0.1% |
| float * int | XX | 5,012 | 0.1% |
| int + int | X1 | 4,980 | 0.1% |