Direct support for entangled untagged pointers in the IL

dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.

https://dart.dev

BSD 3-Clause "New" or "Revised" License

10.08k stars 1.56k forks source link

Direct support for entangled untagged pointers in the IL #54884

Open mraleph opened 7 months ago

mraleph commented 7 months ago

I am going to call a raw untagged pointer entangled with a tagged pointer if it is either:

derived from the tagged pointer (i.e. it is an inner pointer into a Dart object) or
is pointing to a region in native memory which has its lifetime tied to lifetime of the Dart object.

Currently Dart VM compiler IL has no support for expressing entanglement of pointers, which makes it fragile:

Inner pointers can't be live across GCs, because GC is unable to update them.
We must manually ensure that that compile time lifetime of the GC managed object encompasses compile-time lifetime of the entangled raw pointer.

This causes issues like https://github.com/dart-lang/sdk/issues/54710 - which we attempt to fix by tweaking optimizations to try respect these invariants - but this does not solve the underlying fragility. Every time we add an optimization we must consider the implications on these invariants - and I think this is bound to repeatedly cause us troubles down the line. Thus, in my opinion, a better approach is to make changes to the pipeline which will eliminate the need for us to actively care about these invariants and make compiler and/or runtime maintain these invariants for us.

Let us take a look at the general shape of the problem:

// We have a pointer to the Dart object managed by GC
objectPointer = /* ... */

// We have a raw untagged pointer derived from Dart
// object. For example, we take a TypedData object
// and load it's data pointer (which gives us an inner
// pointer or an external pointer - but with lifetime
// tied to lifetime of TypedData object), or we simply
// erase object tag and add some offset.
untaggedPtr = DeriveRawPointerOp(objectPointer)

// We have a bunch of arithmetic operations which 
// derive another pointer from the base |untaggedPtr|
// this includes adding offsets and indexing. 
derivedPtr = DoSomeArithmetic(untaggedPtr, v0, ..., vN)

// We have an instruction which can cause a GC
PossibleGC()

// We use derived pointer. This can be a load or store operation
// or it can be an FFI call which uses derivedPtr. It can even be 
// some more arithmetic to produce another derived pointer from it. 
UseDerivedPointerOp(derivedPtr, ...)

One obvious way to address the problem here is to do what we would do manually when writing code in our runtime by hand: we could write a compiler pass which finds entangled pointers which are invalidated by possible GC points and then duplicates DeriveRawPointerOp + DoSomeArithmetic after each GC point. This however would bloat the code - especially when PossibleGC instruction is something that only causes GC on slow path (e.g. consider the case of our stack overflow guard in the loop header).

So instead I propose the following approach:

We write a compiler pass which ensures that compile time lifetimes of entangled pointers are encompassed by compile time lifetimes of the corresponding Dart object. I think this can be achieved simply by inserting ReachabilityFence(objectPointer) after each operation which uses derivedPtr.
For inner pointers:
1. We extend stack maps to allow us to mark entangled pointers on the stack and describe which object pointer they are entangled with
2. We extend GC to update entangled pointers on the stack

These two changes not only future proof our IL removing the need to worry about maintaining invariants related to entangled pointers in all optimizations, but will enable new classes of optimizations, consider for example the code that iterates a list:

final list = Float64List(10);
var sum = 0.0;
for (var i = 0; i < list.length; i++) {
  final v = list[i];
  sum += v;
}
print(sum);

Currently list[i] produces the following code on ARM64:

        ;; v42 <- LoadIndexed(v28 T{_Float64List}, v7 T{int}) double
0x10593329c    8b010c10               add tmp, r0, r1 lsl #3
0x1059332a0    fc417201               fldrd v1, [tmp, #23]

This is caused by our inability to fold indexing and load together - because ARMv8 instruction set does not have addressing mode expressive enough for that. Instead, if we had proper support for derived pointers in the IL, we could generate code which uses derived inner pointer as an iteration variable.

mraleph commented 7 months ago

/cc @mkustermann @alexmarkov @sstrickl

mkustermann commented 7 months ago

@alexmarkov and I were talking about this as well recently. There's a question whether we believe making this infrastructure is worthwhile given the very limited use of inner pointer atm and conversely if we had the infrastructure where else we could make use of it. i.e. is it worthwhile to build this?

sstrickl commented 7 months ago

We saw major improvements in TypedData (especially external typed data) and Pointer benchmarks when we made the original changes that allowed loads of untagged values that are not Dart pointers to be optimized across basic blocks, lifted out of loops, etc., and also allowed removal of intermediate external typed data and Pointer allocations. However, for untagged pointers to managed memory, like the data fields of internal typed data objects and typed data views, we can't do those optimizations (and do not, currently, as each load of a untagged pointer to possibly managed memory right now is not considered a valid candidate for load optimization).

If there was no way to turn untagged pointers to managed memory into unboxed integers, then we'd be fine. (It's fine to convert untagged pointers to unmananged memory to and from unboxed integers, like when the stored address in a Pointer object is retrieved or a Pointer object is constructed.) However, the problem is that in the IL we support converting arbitrary untagged pointers to unboxed integers in order to, say, initialize the data field of a typed data view by retrieving the untagged pointer to the original typed data (which may or may not be external), converting it to an unboxed integer, perform arithmetic on it, and then convert it back to untagged and then initialize the data field with the untagged value.

Ignoring the FFI for now, this is the only use case for performing arithmetic on untagged pointers to managed memory, so it'd be easy enough to abstract out into an instruction (either a general one or a view-specific one) that converts an untagged pointer to another untagged pointer, and thus guarantee that the output untagged pointer is either not to managed memory (if the original untagged pointer was retrieved from a guaranteed external typed data object or Pointer) or possibly to managed memory (if the original object's specific class isn't known or is known to be an internal typed data object or view), and then we can guarantee that untagged pointers to managed memory don't escape into unboxed integers that may flow arbitrarily, and can have the same restrictions (or optimizations) for those untagged pointers depending on their original provenance, and this doesn't require changes to the GC, just changes to the optimizing passes that may insert or move GC-triggering instructions (like DelayAllocations (allocations) and SelectRepresentations (boxing)).

However, it might be the case that the FFI uses untagged pointers to memory that is possibly (or definitely) managed by the Dart VM in ways that can't guarantee they don't escape, and we'd need to figure out that's necessary or could be fixed so that they don't there either. (For example, I know the FFI currently converts untagged pointers to unboxed integers, but if those unboxed integers aren't being used to perform arithmetic and the like, there's no reason to convert them to unboxed integers in the first place, now that we allow untagged values on the stack, just leave them as untagged and change instructions like CCall to expect untagged pointers where appropriate.) /cc @dcharkes

sstrickl commented 7 months ago

Oh, and the reason we started allowing untagged values on the stack was to explicitly allow load optimization of untagged pointers to unmanaged memory, which also meant exposing the untagged pointers in places, like changing the MemoryCopy to take untagged pointers as dest and src instead of the original tagged typed data objects, so those loads (if to unmanaged memory) could have their lifetimes extended appropriately by replacing later loads of the same data field with the previous load.

sstrickl commented 7 months ago

Mostly just adding those comments to bring in the history of why this happened.

I like the two-prong approach, both in making untagged values safer to flow (if the GC can handle untagged pointers to managed memory, then we can even optimize loads of the data field from views and internal typed data objects without worry, which means we no longer need a way to distinguished untagged pointers to managed memory and ones to unmanaged memory) and in limiting what can be done with an untagged pointer (by making an instruction which can perform the arithmetic operations needed to convert one untagged pointer to another untagged pointer without having to convert temporarily to unboxed integers).

If we decide to go this route, then I'm happy to volunteer for doing it.

Re: the GC, do we even need to know which pointer they're entangled with? That is, if we have an untagged pointer on the stack (which would still be rare), wouldn't it be enough to know that it's untagged? If so, then the GC would just have to keep a (small) table from objects to sets of stack locations and then if an object is moved, you check the table and if the object has an entry, you adjust the untagged pointers at those stack locations correspondingly (since they're just offsets from the start of the object). So we don't need untagged pointer provenance information in the GC, just extra information in the stackmaps to say "these positions are untagged" (and, since this is a rare case, a way of encoding it that limits the size change for stackmaps that don't involve untagged pointers).

alexmarkov commented 7 months ago

I see the following options to handle untagged inner pointers in the IL:

A. Avoid inner pointers to Dart heap objects in the IL entirely. Add corresponding canonicalization rules which would replace high-level operations taking Dart objects with low-level operations on raw pointers when it is safe (for FFI Pointer and external typed data cases):

v0 <- ... {Pointer}
v2 <- LoadIndexed(v0, ...)

v0 <- ... {Pointer}
v1 <- LoadField(v0, TypedDataBase.data) // Untagged non-heap pointer. This load can be forwarded and Pointer allocation can be eliminated.
v2 <- LoadIndexed(v1, ...)

We might also need to introduce additional high-level instructions (e.g. AllocateTypedDataView) to avoid manipulations with inner pointers. As we would never create inner pointers in the IL, all optimizations don't need to care about them. No GC and stack map changes are needed. At the same time, when we can prove that derived pointer would be out of Dart heap, we can still do extra optimizations.

B. Allow inner pointers in the IL, but only after certain pass. At first, IL would only contain high level instructions taking Dart objects. After certain high-level optimization passes which can move instructions (such as LICM , DelayAllocations), we can lower IL to low-level instructions which can operate with inner pointers. But moving arbitrary instructions would be disallowed at this point. Note that we already have passes like EliminateWriteBarriers which assume that instructions cannot be moved freely and GC cannot occur at any moment.

This option is more fragile than (A), but still maintainable with enough assertions and flow graph checker code. The advantage is that we would be able to represent inner pointer accesses even for Dart heap objects. However, I'm not sure what it would give us in terms of additional optimizations, maybe not much.

C. Allow inner pointers in the IL at any moment, but still disallow GC between def and use of an inner pointer. This is even more fragile option. We can make it less fragile by adding special EnterNoGCRegion and LeaveNoGCRegion instructions, limiting inner pointer defs and uses to no-GC regions and only allowing those regions to be within basic block. All optimizations should still care about moving instructions into no-GC regions, but at least those regions would be explicitly represented in the IL.

D. Allow inner pointers in the IL at any moment, and allow GC at any moment. This is the option which @mraleph described above in more details. We would need to collect and add extra metadata to stack maps, which may increase AOT snapshot size and would add a lot of complexity to the VM. In my opinion, current optimizations involving untagged pointers don't justify this additional complexity. However, this option may be appealing in order to support future optimizations.

I'm currently leaning towards option (A). Options (B) and (C) are inherently fragile so they are probably fine only as short-term solutions (but I could be wrong and they might work fine in practice). Option (D) looks too complicated and requires more ideas for possible optimizations to outweigh additional complexity and increased AOT snapshot size; maybe it is worth drafting a prototype which would prove that it can provide substantial performance improvements.

dcharkes commented 7 months ago

A. Avoid inner pointers to Dart heap objects in the IL entirely. Add corresponding canonicalization rules which would replace high-level operations taking Dart objects with low-level operations on raw pointers when it is safe (for FFI Pointer and external typed data cases): [...] We might also need to introduce additional high-level instructions (e.g. AllocateTypedDataView) to avoid manipulations with inner pointers. As we would never create inner pointers in the IL, all optimizations don't need to care about them. No GC and stack map changes are needed. At the same time, when we can prove that derived pointer would be out of Dart heap, we can still do extra optimizations.

I think this could work. We would need make the instructions that operate on unboxed addresses or typed-data-base objects to be able to work on both untagged address and tagged object (RequiredInputRepresentation, EmitNativeCode, etc.). The instructions that pop to mind for FFI are load/store indexed and the FFI call instruction (which can take either pointers or typed datas to pointer-arguments in leaf calls).

B. Allow inner pointers in the IL, but only after certain pass.

I'm not sure that this would enable us to optimize the FFI cases where the optimizer can prove that we have a pointer or external typed data. Because we would always need to write the IL in kernel_to_il.cc in a way in which something could have been an internal typed data. Could you elaborate?

C. [...] adding special EnterNoGCRegion and LeaveNoGCRegion instructions,

👍 to adding explicit regions instead of inferring them from whether an untagged value is on the stack if we were to go this route.

maybe it is worth drafting a prototype which would prove that it can provide substantial performance improvements.

For option A. one of my questions would be indeed what the performance drawbacks are. One thing I could think of is that if an internal typed data is used multiple times in a no-gc-block in IL, we end up having to load the data address field multiple times. This would hurt performance when for example writing a loop that would fill such a typed data with data. But how big the performance difference would be, we'd need to measure.

sstrickl commented 7 months ago

A. Avoid inner pointers to Dart heap objects in the IL entirely. Add corresponding canonicalization rules which would replace high-level operations taking Dart objects with low-level operations on raw pointers when it is safe (for FFI Pointer and external typed data cases)...

We might also need to introduce additional high-level instructions (e.g. AllocateTypedDataView) to avoid manipulations with inner pointers. As we would never create inner pointers in the IL, all optimizations don't need to care about them. No GC and stack map changes are needed. At the same time, when we can prove that derived pointer would be out of Dart heap, we can still do extra optimizations.

I like this idea as well, but with a caveat: there may be places in the various FlowGraphBuilders that we know that only Pointers and/or external typed data objects flow into certain places (e.g., certain FFI operations), so it still makes sense at the FlowGraphBuilder level to write the separate load untagged field outside of the use[^1]. Otherwise yeah, the VM doesn't write the separate load, just passing the tagged value into the instruction instead, and relies on canonicalization to notice that the incoming tagged object is guaranteed to be a Pointer or external typed data object and thus it's safe to extract that out at the IL level.

[^1]: Note that all Pointer objects share the same cid, so this would only be really necessary (in the usual case where we use the CompileType being a specific cid to prove the canonicalization safe) if we had an operation that could take external typed data objects of arbitrary element type, but was guaranteed to never get internal typed data objects or views. I don't know if we have such an operation right now, as I think on the FFI side it's either Pointer-only or arbitrary PointerBase subclasses, including views and internal typed data objects, so maybe needing to write this explicitly in the FlowGraphBuilders is moot.

If we do that, then no untagged value on the stack points to managed memory, and so we could load optimize any load of untagged values, including LoadThread and LoadUntagged. Well, for functions where the generated code is guaranteed to be run on the same thread (for LoadUntagged as well, since after the introduction of a slot for PointerBase.data, all LoadUntagged uses are on untagged pointers nowadays, and the initial untagged pointer used in all cases, I think, is retrieved via LoadThread. (Unless we have a way to take a currently executing non-async function and move it from one thread to another, in which case we can't load optimize either instruction in any case.)

mraleph commented 7 months ago

Let us reduce the choice space to just to options: A and D because I think we all acknowledge that B and C don't really cut it for various reasons and are presented just to map out the territory.

Option A is indeed simplest to implement. Rephrasing it as I understand it (so that we are on the same page):

When building IL always initially hide pointers inside macro-instructions
Lower macro-instructions when it is safe to do so:
- early for out of heap pointers
- late for entangled pointers

I think this will work and it is indeed simpler than option D. However D:

Enables the new class of optimizations;
Resolves the problem fully. With A we can still make a mistake if we lower at the wrong moment.

A is simpler - but I don't believe it is significantly simpler. That's why I believe D is the path forward.

dcharkes commented 7 months ago

@sstrickl regarding Pointer and ExternalTypedData being optimized but InternalTypedData not, can we also optimize TypedDataViews on ExternalTypedData? Or is our current machinery not clever enough to deal with kMayBeInnerPointer and a view on external typed data?

As @mraleph is saying, if we were to do option D, then all of that doesn't matter, we can optimize everything. This would also take my worry away of hitting a performance cliff when FFI types are both used with pointers and internal typed datas (e.g. structs by value).

sstrickl commented 7 months ago

I'm happy with both approaches. In fact, I'd suggest doing A (which is implementable in a way that shouldn't degrade performance from the status quo, since we can still just extract the data field manually in the cases where we know it's safe already instead of leaving that to the canonicalizer) to fix the issues we see in https://github.com/dart-lang/sdk/issues/54710 in the short term, with D being a longer-term, more involved followup that enables better optimization of all typed data usage in the long run, not just Pointer/external typed data objects, and also allow us to remove the tagged/untagged distinction in the LoadIndexed/StoreIndexed and MemoryCopy (after A) instructions and thus simplify their implementations.

How does that sound?

sstrickl commented 7 months ago

@sstrickl regarding Pointer and ExternalTypedData being optimized but InternalTypedData not, can we also optimize TypedDataViews on ExternalTypedData? Or is our current machinery not clever enough to deal with kMayBeInnerPointer and a view on external typed data?

The current machinery is not, and it would be tricky to do so–if the view isn't allocated locally, then we don't know if the backing typed data of the view. If it is allocated locally, then the data field is calculated in a known way that we could look for, but that'd be brittle.

However, if we had an instruction for abstracting the calculation of the view's data field from the backing typed data's data field, then we could work with that (if the backing typed data is known to be external, then we could generate an appropriate place for the converted data field that aliases the portion of the external typed data being viewed for load optimization purposes, and if it's not, then we disallow load optimization).

sstrickl commented 7 months ago

With A we can still make a mistake if we lower at the wrong moment.

This I'm not worried about, because the compiler would only lower when the definition of the tagged value has a CompileType that states it is a Pointer or external typed data object, to ensure that only untagged pointers to unmanaged memory can flow between instructions. If it turns out not to be one of those, then that's an unrelated bug in our optimizing compiler for stating the definition had a known concrete type when it turns out it didn't.

sstrickl commented 7 months ago

Oh, I see, you were talking about pulling out the tagged input to the instructions for untagged pointers late. I was just assuming that we'd have the instructions in question take either tagged or untagged inputs and let their assembly generation handle doing the right thing, like how LoadIndexed and StoreIndexed currently operate, so you'd never pull out the untagged loads for entangled pointers.

sstrickl commented 7 months ago

... oh, and all the existing instructions that take the data field as an untagged pointer already handle tagged objects or untagged pointers fine right now. So A would just be

adding the abstraction for TypedDataView data field calculation (either as an allocation or a separate instruction for field initialization)
adding the canonicalization step that pulls out the load of the untagged pointer when we know the target is a Pointer or external typed data, and
changing the FlowGraphBuilder IL generation to use LoadNativeField(Slot::PointerBase_data()) only when we know there is no possibility of an internal typed data object or a view.

So yeah, I think A then D is the right way to go.

sstrickl commented 7 months ago

However, if we had an instruction for abstracting the calculation of the view's data field from the backing typed data's data field, then we could work with that (if the backing typed data is known to be external, then we could generate an appropriate place for the converted data field that aliases the portion of the external typed data being viewed for load optimization purposes, and if it's not, then we disallow load optimization).

Okay, not quite true: it would mean that loads/stores using this untagged pointer would be allocated a place that would be ensure they were considered as affecting the same memory as corresponding loads/stores from the original untagged pointer with adjusted offset/size. (Even if you can't do load optimization because, say, the element sizes are different, you'd still need to know that stores to one would kill load(s) of the corresponding element(s) from the other since they'd no longer be valid.)

alexmarkov commented 7 months ago

@mraleph

Option A is indeed simplest to implement. Rephrasing it as I understand it (so that we are on the same page):

When building IL always initially hide pointers inside macro-instructions

Lower macro-instructions when it is safe to do so:

early for out of heap pointers

late for entangled pointers

In option A, we don't need to lower instructions at all if they operate with inner pointers into Dart heap. We would just keep those instructions in the high-level form (e.g. LoadIndexed taking a Dart object) and handle access to inner pointers during code generation. So we would never have raw inner pointers into Dart heap objects in the IL, only in the generated code within a single IL instruction.

I still think that option A is significantly simpler then D. It is almost the current state minus instructions operating on inner pointers plus new high-level instructions to cover existing cases where we previously used inner pointers, so it would be around changing LoadIndexed, StoreIndexed, LoadField and adding AllocateTypedDataView (this list may be incomplete, maybe I'm missing something). Option A also doesn't add more complexity to other parts of the VM (e.g. GC and metadata/snapshot writing).

mraleph commented 7 months ago

In option A, we don't need to lower instructions at all if they operate with inner pointers into Dart heap.

This means we would have duplicated code and macro instructions living all the way to the code generation - I think this is opposite of where we need to be taking our backend, which should be doing code generation from smaller instructions.

I am not against implementing A - but I do believe that we really want to have D long term.

sstrickl commented 5 months ago

So with all the CLs that landed today, A should be done.

Until intra-block code motion is done, we don't expose untagged GC-movable addresses except in a couple of very specific cases (TypedDataView factories and the call to C memmove for large MemoryCopy operations).
A new pass right before FinalizeGraph extracts the data field as an untagged address and rebinds the appropriate input for LoadIndexed, StoreIndexed, and MemoryCopy unless the array is known at compile time to be an internal typed data object (in which case the address of the payload is an offset from the object address and the data field does not need to be loaded).
We have a new method on definitions, MayCreateUnsafeUntaggedPointer, that defaults to true for Definitions with untagged representation() and false otherwise (a very conservative default).
LoadField of PointerBase.data returns true for MayCreateUnsafeUntaggedPointer unless the array has a CompileType of a external array class id or the LoadField was created with an explicit InnerPointerAccess::kCannotBeInnerPointer because the flow graph builder was constructing code for a recognized method that is known to only be called on external arrays. (There is a separate MayCreateUntaggedAlias method on LoadField for use in the redundancy eliminator which is similar, but also returns false if the array is known at compile time to be a view, since views can never alias themselves.)
We no longer convert untagged GC-movable addresses to unboxed integers at any point, thanks to a new instruction CalculateElementAddress that takes an untagged address, index, index scale, and offset and returns a new untagged address. The result of CalculateElementAddress is an unsafe pointer iff the base address is.
To ensure we don't add any new conversions,IntConverter now asserts that if from is kUntagged,
```
!value()->definition()->MayCreateUnsafeUntaggedPointer()
```
We no longer allow LoadUntagged on tagged values, only on untagged values, to ensure proper use of LoadField with an appropriate Slot for retrieving untagged address fields from tagged objects. We also require that the input to LoadUntagged is not an unsafe untagged pointer (and thus its result also is not one).
We no longer add a slot/input to MaterializeObject for the data field of typed data views, and instead recompute the data field in DeferredObject::Fill() after the other fields have been filled during deoptimization. (We still do for Pointers/external typed data arrays, of course.)
Untagged values returned from FfiCall are never considered unsafe, since they are external pointers returned from C that are then stored in FFI Pointer objects, which are assumed to contain only external pointers. (We could come up with a bad C function that uses the embedder to do bad things and return a GC-movable address, but our compiler should be allowed a little optimism.) Similarly, NativeParameters that are untagged are also considered safe.
Untagged values returned from CCall are only considered unsafe if one of the inputs was an unsafe untagged pointer. (In theory, a C function called with CCall could cache an input and reuse it later, but since only functions we write are called with it, that's on us.)
The FlowGraphChecker now checks that unsafe untagged pointers are only used in the same basic block as their definition, are never returned (either via Return or NativeReturn), and that no GC-triggering instructions come between the definition and use. (Abstracting untagged pointer manipulation via CalculateElementAddress removed the last barriers to doing this.)

I think those are the major points of what has landed, there might be some other minor details.

dcharkes commented 5 months ago

Thanks for the write up @sstrickl! 🚀

Untagged values returned from FfiCall are never considered unsafe, since they are external pointers returned from C that are then stored in FFI Pointer objects, which are assumed to contain only external pointers. (We could come up with a bad C function that uses the embedder to do bad things and return a GC-movable address, but our compiler should be allowed a little optimism.) Similarly, NativeParameters that are untagged are also considered safe.

👍 I don't believe we can achieve such things with the embedder API. Leaf FFI calls cannot use the embedder API, and non-leaf calls don't prevent the GC from running and can only do things with handles, not with untagged addresses.