Open markshannon opened 6 months ago
Could we look at other VMs to see what they do? (Even if it's unspecified, what does the implementation do?) What does the JVM do? Or v8? Or, perhaps WASM?
This what I was suggesting in https://github.com/faster-cpython/ideas/issues/657. But I guess I didn't do a good job explaining it, also had no alignment tricks there.
Note, the two call stack approach is critical to getting true function inlining deopts working without breaking C profilers. I need a way to reconstruct the stack without inconsistency in VM state. If we can get this to work without perf loss on tier 1, it would be ideal.
The easiest layout for reconstruction would be anything outside of localsplus
in the control stack, including f_globals
and friends. While the data stack should purely be operand stack entries (localsplus and stack).
I'm assigning this to myself to work on.
I take it you discussed this with Mark and he approves? (He should, he likes tricks like this. :-)
I take it you discussed this with Mark and he approves? (He should, he likes tricks like this. :-)
No I haven't discussed this with Mark, but I will implement it according to Mark's proposal here, so I hope he likes it.
I thought about this a bit more and I realised I don't need to implement this for out-of-memory profilers to work. I thought of yet another hack to make them work with full function inlining.
I'm intrigued about your new approach...
There are two common-ish ways to lay out call stacks in programming languages.
A single stack is by far the most common and is often faster as it only needs a single stack pointer.
So why consider a two stack approach?
Adding a control stack without losing performance
Pointers
Currently, the interpreter and JIT maintain three pointers in registers:
We can add a control stack pointer without using another register with a memory layout trick. By placing the control stack at the end of the thread state, and ensuring that the thread state is aligned on a power of two boundary, the thread pointer can be found by zeroing the low bits of the control stack pointer.
tstate = control_pointer & ALIGNMENT_MASK
Calls and returns.
To make a call, VM needs to fill in all the fields of a new
_PyInterpreterFrame
. That will mostly not change, except:previous
pointer, as the control stack is a contiguous array.Overall we save 1 write for calls and a read and a write for returns.
The control stack
The simplest control stack entry is:
Adding the code object to the control stack makes life easier for profilers, especially out of process profilers, but slows down entry into generators a tad as the
code
pointer will need to be copied.Another possibility is to move all the control data onto the control stack. Having nothing but (possibly NULL) object references on the data stack should help simplify the code, and simple code is often faster code. Although, it is extra copying when entering a generator.
Note: I'm ignoring the C stack in the above discussion. If you want to consider that as well, then we are moving from two to three stacks.