faster-cpython / ideas

1.67k stars 49 forks source link

Optimization of valid index calculation #498

Open eendebakpt opened 1 year ago

eendebakpt commented 1 year ago

In listobject.c there is an optimization to check whether an index is valid (e.g. 0 <= index < N) using a single comparison. The same optimization is not used in other files such as tupleobject.c (except for _collectionsmodule.c)

I created a branch with the idea for the tupleobject.c since that might be the most performance critical:

https://github.com/python/cpython/compare/main...eendebakpt:cpython:list_tuple

I could not measure significant performance improvements on my system, but perhaps there are other systems/compilers where there is a measurable difference.

brandtbucher commented 1 year ago

If we do this, we should just define a nice, readable macro for it in pymacro.h and stop reinventing the wheel everywhere. We do similar moves, for example, all over bytecodes.c.

eendebakpt commented 1 year ago

If we do this, we should just define a nice, readable macro for it in pymacro.h and stop reinventing the wheel everywhere. We do similar moves, for example, all over bytecodes.c.

@brandtbucher Do you mean bytesarray.c or bytesobject.c? In those I can find index checks, but not in bytecodes.c.

Putting a macro in pymacro.h seems like a good idea. A real macro, or an inline function then? Such a change would make the code more consistent, but it would also mean many (very small) changes in the codebase.

gvanrossum commented 1 year ago

I think Brandt is referring to things like

            Py_ssize_t signed_magnitude = Py_SIZE(sub);
            DEOPT_IF(((size_t)signed_magnitude) > 1, BINARY_SUBSCR);

which I found at line 386 in Python/bytecodes.c.

eendebakpt commented 1 year ago

That is indeed a similar trick! I found 3 occurrences in bytecodes.c, the code then changes from:

DEOPT_IF(((size_t)signed_magnitude) > 1, BINARY_SUBSCR);

to

DEOPT_IF(invalid_index(signed_magnitude, 2), BINARY_SUBSCR);
JelleZijlstra commented 1 year ago

Does this actually make a difference in practice? Can't the compiler already do this optimization?

eendebakpt commented 1 year ago

Does this actually make a difference in practice? Can't the compiler already do this optimization?

The PR is created is more about consistency than optimization. With optimizations on there is a difference in the generated code (an extra test %rsi,%rsi, see details below), but I do not think it makes a large difference (perhaps only in the tuple/list indexing or the usage in bytecodes.c)

Output of `gdb -batch -ex "disassemble/rs tupleitem" ./python` Main build with optimizations on: ``` Dump of assembler code for function tupleitem: Objects/tupleobject.c: 363 { 0x0000000000244e50 <+0>: f3 0f 1e fa endbr64 364 if (i < 0 || i >= Py_SIZE(a)) { 0x0000000000244e54 <+4>: 48 85 f6 test %rsi,%rsi 0x0000000000244e57 <+7>: 78 10 js 0x244e69 ./Include/object.h: 144 return var_ob->ob_size; 0x0000000000244e59 <+9>: 48 3b 77 10 cmp 0x10(%rdi),%rsi 0x0000000000244e5d <+13>: 7d 0a jge 0x244e69 Objects/tupleobject.c: 368 return Py_NewRef(a->ob_item[i]); 0x0000000000244e5f <+15>: 48 8b 44 f7 18 mov 0x18(%rdi,%rsi,8),%rax ./Include/object.h: 641 Py_INCREF(obj); 0x0000000000244e64 <+20>: 48 83 00 01 addq $0x1,(%rax) 642 return obj; 0x0000000000244e68 <+24>: c3 retq Objects/tupleobject.c: 365 PyErr_SetString(PyExc_IndexError, "tuple index out of range"); 0x0000000000244e69 <+25>: 50 push %rax 0x0000000000244e6a <+26>: 48 8b 3d 8f 52 28 00 mov 0x28528f(%rip),%rdi # 0x4ca100 0x0000000000244e71 <+33>: 48 8d 35 fa e5 12 00 lea 0x12e5fa(%rip),%rsi # 0x373472 0x0000000000244e78 <+40>: e8 13 17 05 00 callq 0x296590 366 return NULL; 0x0000000000244e7d <+45>: 31 c0 xor %eax,%eax 0x0000000000244e7f <+47>: 5a pop %rdx 0x0000000000244e80 <+48>: c3 retq End of assembler dump. ``` PR build with optimizations on: ``` Dump of assembler code for function tupleitem: Objects/tupleobject.c: 363 { 0x0000000000245820 <+0>: f3 0f 1e fa endbr64 ./Include/pymacro.h: 171 return (size_t)i >= (size_t)limit; 0x0000000000245824 <+4>: 48 39 77 10 cmp %rsi,0x10(%rdi) 0x0000000000245828 <+8>: 76 0a jbe 0x245834 Objects/tupleobject.c: 368 return Py_NewRef(a->ob_item[i]); 0x000000000024582a <+10>: 48 8b 44 f7 18 mov 0x18(%rdi,%rsi,8),%rax ./Include/object.h: 646 Py_INCREF(obj); 0x000000000024582f <+15>: 48 83 00 01 addq $0x1,(%rax) 647 return obj; 0x0000000000245833 <+19>: c3 retq Objects/tupleobject.c: 365 PyErr_SetString(PyExc_IndexError, "tuple index out of range"); 0x0000000000245834 <+20>: 50 push %rax 0x0000000000245835 <+21>: 48 8b 3d c4 48 28 00 mov 0x2848c4(%rip),%rdi # 0x4ca100 0x000000000024583c <+28>: 48 8d 35 2f dc 12 00 lea 0x12dc2f(%rip),%rsi # 0x373472 0x0000000000245843 <+35>: e8 68 17 05 00 callq 0x296fb0 366 return NULL; 0x0000000000245848 <+40>: 31 c0 xor %eax,%eax 0x000000000024584a <+42>: 5a pop %rdx 0x000000000024584b <+43>: c3 retq End of assembler dump. ``` System: linux, gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)