ffi: Support memory allocations in the current stack frame

I think there should be a way to allocate memory on the stack with dart:ffi, for instance with a builtin function like alloca.

As a motivation for this, let's say we had a function int load_some_string(int index, char **nameOut, int *lengthOut). When calling that in C, we'd might do

int main() {
  char *name;
  int length;
  load_some_string(0, &name, &length);
  fwrite(name, length, 1, stdout);
}

Obviously that doesn't work in Dart, so we have to use

import 'package:ffi/ffi.dart';

void main() {
  final function = library.lookupFunction<...>('load_some_string');
  final namePtr = allocate<Pointer<Uint8>>();
  final lengthPtr = allocate<Int>();

  function(0, namePtr, lengthPtr);
  print(utf8.decode(namePtr.asTypedList(lenghtPtr.value)));

  free(namePtr);
  free(lengthPtr);
}

I think this is rather annoying because

It makes the code harder to read (it quickly gets much worse with control flow or proper error handling)
It's easy to forget the free and leak memory
I would also assume that reserving memory in the current frame is much faster than malloc/free. Obviously we can re-use those pointers but that also amplifies the other problems.

This would be much simpler if we could allocate memory directly on the stack:

void main() {
  final function = library.lookupFunction<...>('load_some_string');
  final namePtr = alloca<Pointer<Uint8>>();
  final lengthPtr = alloca<Int>();

  function(0, namePtr, lengthPtr);
  print(utf8.decode(namePtr.asTypedList(lenghtPtr.value)));
}

One of the C libraries I'm considering using require that you pass a Pointer when calling certain functions that create C objects and as far as I know, the only way to get such a pointer is to allocate it on the heap, where C could just allocate it on the stack. Not having to free it is also a nice bonus.

cc @dcharkes

I took a look at whether I could this in a local branch but I realized that suspending functions (async, async*, sync*) and OSR makes this practically impossible as stack addresses are not stable.

Maybe in & out params could work for the most common cases where stack allocation is desired.

Like:

typedef QueryInterface_Native = Int32 Function(Pointer<Void> self, Pointer<Guid> riid, Out<Pointer<Void>> object);

typedef QueryInterface_Dart = (int result, Pointer<Void> object) Function(Pointer<Void> self, Pointer<Guid> riid);

Two new FFI marker types Out and InOut that causes the compiler to emit a stack allocation for the native call. Out parameters in the returned record appear in the order specified. If the original return result is void then it is omitted.

I took a look at whether I could this in a local branch but I realized that suspending functions (async, async*, sync*) and OSR makes this practically impossible as stack addresses are not stable.

Good observation.

We can force-optimize compile in JIT, but even then discovering new classes in the class hierarchy can lead to OSR I believe.

Maybe in & out params could work for the most common cases where stack allocation is desired.

We have logic for this for struct-by-value returns that have the ABI where a pointer to memory is passed in. This allocates on the stack inside the FfiCallInstr. (And the value is copied from the stack to a TypedData on return.)

I'd probably only want to commit to adding a feature such as this if using the arena from package:ffi to malloc and free the out-pointers is prohibitively slow. And, I'd want to land https://dart-review.googlesource.com/c/sdk/+/284300 and convert everything to @Natives first, as the leaf @Native calls are quite fast after we land that CL.

We should already be able to get some benefit by converting the current package:ffi stuff to leaf-calls https://github.com/dart-lang/native/issues/922.

Are you currently running into performance issues?

Not hitting any performance issues yet (though I'm using a custom zone-based allocator). It would be great to avoid depending on an allocator for these simple cases since they're quite frequent. Practically every C API uses this pattern to pass back pointers since they reserve the return value for error codes, etc.

Experimented with adding InOut and In parameters locally. I have it working for integer types at the moment.

It's pretty hacky. I ended up introducing two IL instructions to launder a value through the stack.

Uninitialized IL instr is a no-op Definition that introduces a value.
FfiCall passes the pointer to the uninitialized stack value (using the same code path for Handles).
Initialized IL instr transfers the value from the stack to a register (RequiresStack for the input, RequiresRegister for the output, EmitNativeCode just calls EmitMove)

Ideally support for explicit stack IR instructions would make this easy. Could create a fixed-size stack resource, then use LoadIndexed and StoreIndexed to access it, and pass it by address into FFI functions. Could also use it to allocation sink fixed-size typed data. One barrier is how locations, the register allocator, and the stack interact. The stack resources are fixed size, and you can't allocate them contiguously. I debated trying to extend linear scan to allow wider locations, but it felt like it would change a lot of code.

@pragma('vm:testing:print-flow-graph')
@pragma('vm:never-inline')
int bar(int iUnknownPtr) {
  Pointer<Pointer<Pointer<NativeFunction<Void Function(IntPtr, Out<IntPtr> pp)>>>> ptr =
      Pointer.fromAddress(iUnknownPtr);
  return ptr.value.value.asFunction<int Function(int, ())>(isLeaf: true)(0,
      ());
}

*** BEGIN CFG
After AllocateRegisters
==== <snip>/out_param_single_no_return.dart_::_bar (RegularFunction)
  0: B0[graph]:0 {
      v58 <- UnboxedConstant(#0) [0, 0] int64
}
  2: B1[function entry]:2 {
      v2 <- Parameter(0) [-9223372036854775808, 9223372036854775807] int64
}
  4:     CheckStackOverflow:8(stack=0, loop=0)
  6:     ParallelMove rsi <- S+2
  6:     v69 <- IntConverter(int64->untagged[tr], v2 T{int}) untagged
  8:     v83 <- LoadIndexed(v69 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
 10:     v88 <- IntConverter(int64->untagged[tr], v83 T{int}) untagged
 12:     v103 <- LoadIndexed(v88 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
 14:     v53 <- Uninitialized() int64
 15:     ParallelMove rcx <- C, rax <- rax, S-1 <- rsi
 16:     v55 <- FfiCall:28( pointer=v103 T{int}, v58 T{_Smi} (@rcx int64), v53 (@rdx int64)) int64
 18:     v56 <- Initialized(v53) int64
 20:     Return:32(v56 T{*?})
*** END CFG
Code for optimized function '<snip>/out_param_single_no_return.dart_::_bar' (RegularFunction) {
        ;; B0
        ;; B1
        ;; Enter frame
        ;; PrologueOffset = 0
0x1e5c0519390    55                     push rbp
0x1e5c0519391    4889e5                 movq rbp,rsp
0x1e5c0519394    4883ec08               subq rsp,8
        ;; CheckStackOverflow:8(stack=0, loop=0)
0x1e5c0519398    493b6638               cmpq rsp,[thr+0x38]
0x1e5c051939c    0f8663000000           jna 0x000001e5c0519405
        ;; ParallelMove rsi <- S+2
0x1e5c05193a2    488b7510               movq rsi,[rbp+0x10]
        ;; v69 <- IntConverter(int64->untagged[tr], v2 T{int}) untagged
        ;; v83 <- LoadIndexed(v69 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1e5c05193a6    488b3e                 movq rdi,[rsi]
        ;; ParallelMove rdi <- rdi
        ;; v88 <- IntConverter(int64->untagged[tr], v83 T{int}) untagged
        ;; v103 <- LoadIndexed(v88 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1e5c05193a9    488b07                 movq rax,[rdi]
        ;; v53 <- Uninitialized() int64
        ;; ParallelMove rcx <- C, rax <- rax, S-1 <- rsi
0x1e5c05193ac    488975f8               movq [rbp-0x8],rsi
0x1e5c05193b0    33c9                   xorl rcx,rcx
        ;; v55 <- FfiCall:28( pointer=v103 T{int}, v58 T{_Smi} (@rcx int64), v53 (@rdx int64)) int64
0x1e5c05193b2    4989e4                 movq r12,rsp
0x1e5c05193b5    4883e4f0               andq rsp,-0xffffff10
        ;; EmitParamMoves
        ;; arg_index 0 arg_target rcx int64
        ;;   def_index 0
        ;; def_target rcx int64 <- origin rcx int64
        ;; arg_index 1 arg_target rdx int64
        ;;   def_index 1
        ;; marshaller_.IsHandle(arg_index)
0x1e5c05193b9    488d55f8               leaq rdx,[rbp-0x8]
        ;; EmitParamMovesEnd
        ;; Leaf Call
0x1e5c05193bd    4989aeb0060000         movq [thr+0x6b0],rbp
0x1e5c05193c4    498986d0060000         movq [thr+0x6d0],rax
0x1e5c05193cb    4883ec20               subq rsp,0x20
0x1e5c05193cf    ffd0                   call rax
0x1e5c05193d1    4883c420               addq rsp,0x20
0x1e5c05193d5    49c786d006000008000000 movq [thr+0x6d0],8
0x1e5c05193e0    49c786b006000000000000 movq [thr+0x6b0],0
0x1e5c05193eb    4c89e4                 movq rsp,r12
        ;; v56 <- Initialized(v53) int64
0x1e5c05193ee    488b45f8               movq rax,[rbp-0x8]
        ;; ParallelMove rax <- rax
        ;; Return:32(v56 T{*?})
        ;; Stack Check
0x1e5c05193f2    4889e7                 movq rdi,rsp
0x1e5c05193f5    482bfd                 subq rdi,rbp
0x1e5c05193f8    4883fff8               cmpq rdi,-8
0x1e5c05193fc    7401                   jz 0x000001e5c05193ff
0x1e5c05193fe    cc                     int3
0x1e5c05193ff    4889ec                 movq rsp,rbp
0x1e5c0519402    5d                     pop rbp
0x1e5c0519403    c3                     ret
0x1e5c0519404    cc                     int3
        ;; CheckStackOverflowSlowPath
0x1e5c0519405    41ff9620020000         call [thr+0x220]
0x1e5c051940c    eb94                   jmp 0x000001e5c05193a2
}

Interesting!

Experimented with adding InOut and In parameters locally. I have it working for integer types at the moment.

It's pretty hacky. I ended up introducing two IL instructions to launder a value through the stack.

Uninitialized IL instr is a no-op Definition that introduces a value.

FfiCall passes the pointer to the uninitialized stack value (using the same code path for Handles).

Initialized IL instr transfers the value from the stack to a register (RequiresStack for the input, RequiresRegister for the output, EmitNativeCode just calls EmitMove)

Ideally support for explicit stack IR instructions would make this easy. Could create a fixed-size stack resource, then use LoadIndexed and StoreIndexed to access it, and pass it by address into FFI functions. Could also use it to allocation sink fixed-size typed data. One barrier is how locations, the register allocator, and the stack interact. The stack resources are fixed size, and you can't allocate them contiguously. I debated trying to extend linear scan to allow wider locations, but it felt like it would change a lot of code.

Yeah that's a lot of technical complexity added. Maybe an alternative implementation would be to go with the struct-by-value approach where we allocate space by use of RequiredStackSpaceInBytes in the FFI call instruction and change the FfiCallInstruction to return a record of the return value and all the out params.

But before we commit to added technical complexity, we should asses the performance and API benefits. We should set up some benchmarks that would benefit from having this. And benchmark against alternatives. (And assess the API differences for alternatives.)

One performance oriented alternative I can think about to reduce allocations is to introduce a new BumpAllocator in package:ffi that allocates lets say a megabyte of data, and holds on to that. Allocating would be only pointer arithmetic, and freeing would also only be bookkeeping.

import 'package:ffi/ffi.dart';

void main() {
  final function = library.lookupFunction<...>('load_some_string');
  final namePtr = bump<Pointer<Uint8>>();
  final lengthPtr = bump<Int>();

  function(0, namePtr, lengthPtr);
  print(utf8.decode(namePtr.asTypedList(lenghtPtr.value)));

  bump.free(namePtr);
  bump.free(lengthPtr);
}

(The internal book-keeping of the bump allocator needs to save size for each pointer, and keep some efficient data structure of which pointers have not been freed yet.)

For an API, maybe we also want to have some arena-style API so that you can free multiple pointers at the same time.

Or a completely different approach if performance is not of a concern but only API, is to have a CFE transform that basically uses the arena under the hood (the only issue is that the Arena allocator lives in package:ffi at the moment).

But before we commit to added technical complexity, we should asses the performance and API benefits. We should set up some benchmarks that would benefit from having this. And benchmark against alternatives. (And assess the API differences for alternatives.)

Understandable, I primarily wanted to do some experimentation on my own.

Once Struct.create is finalized this'll become more prominent since developers will want to pass a created struct into a C API by reference. This is common in COM (incl DirectX), Vulkan, Win32 APIs, etc.

Maybe an alternative implementation would be to go with the struct-by-value approach where we allocate space by use of RequiredStackSpaceInBytes in the FFI call instruction and change the FfiCallInstruction to return a record of the return value and all the out params.

Looks like there's three approaches here:

Initializing the output record within FfiCallInstr
Stack allocation with a hypothetical AllocateStackInstr and using LoadIndexed/StoreIndexed to read and write to that allocation
Extending IL to support multiple output locations

Initializing the record inside of FfiCall would add even more complexity to FfiCallInstr::EmitNativeCode and the record would no longer be subject to allocation sinking. But managing the stack solely within the call instruction would avoid widespread changes.

Multiple IL outputs would be similar to the record approach, but without the need to allocate one up front. But its a very very heavy lift since everything assumes there's only a single output from each definition instruction.

Stack allocation requires adding a new representation and extending the register allocator to support allocating multiple contiguous stack slots.

Stack allocation vs multiple outputs would most likely generate the same code.

Once Struct.create is finalized this'll become more prominent since developers will want to pass a created struct into a C API by reference.

The TypedData that's allocated to back the struct is already bump-pointer allocated in the Dart heap. I don't believe stack allocating is going to be much faster. (The objects should not outlive the new space, so the only effect is marginally more often new space collection.) We should definitely include this pattern in benchmarks. Even more so, even if we have APIs with pointers, as long as the FFI calls are marked leaf-calls, we can used TypedDatas. So that approach would work for non-struct arguments.

Understandable, I primarily wanted to do some experimentation on my own.

P.S. I really appreciate having an extra set of eyes on the FFI internals! ❤️

dart-lang / sdk

ffi: Support memory allocations in the current stack frame #42509