Open simolus3 opened 4 years ago
One of the C libraries I'm considering using requires that you pass a `Pointer`
cc @dcharkes
I took a look at whether I could do this in a local branch, but I realized that suspending functions (`async`, `async*`, `sync*`) and OSR make this practically impossible, as stack addresses are not stable.
Maybe in & out params could work for the most common cases where stack allocation is desired.
Like:
```dart
typedef QueryInterface_Native = Int32 Function(
    Pointer<Void> self, Pointer<Guid> riid, Out<Pointer<Void>> object);
typedef QueryInterface_Dart = (int result, Pointer<Void> object) Function(
    Pointer<Void> self, Pointer<Guid> riid);
```
Two new FFI marker types, `Out` and `InOut`, that cause the compiler to emit a stack allocation for the native call. Out parameters in the returned record appear in the order specified. If the original return result is `void`, then it is omitted.
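For reference, the native side of such a call follows the usual C out-parameter convention: the return value carries a status code and the real result comes back through a pointer. A hedged sketch (`query_interface` and the string-based `riid` are illustrative stand-ins, not real COM):

```c
#include <stdint.h>
#include <string.h>

// The callee writes its result through the out-pointer and keeps the
// return value free for a status code -- the shape Out<Pointer<Void>>
// is meant to model.
int32_t query_interface(void *self, const char *riid, void **object) {
    if (strcmp(riid, "IUnknown") != 0) {
        *object = NULL;
        return -1;  // E_NOINTERFACE-style failure
    }
    *object = self;  // result delivered via the out-parameter
    return 0;        // S_OK-style success
}
```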
> I took a look at whether I could do this in a local branch, but I realized that suspending functions (`async`, `async*`, `sync*`) and OSR make this practically impossible, as stack addresses are not stable.
Good observation.
We can force optimized compilation in the JIT, but even then, discovering new classes in the class hierarchy can lead to OSR, I believe.
> Maybe in & out params could work for the most common cases where stack allocation is desired.
We have logic for this for struct-by-value returns that have the ABI where a pointer to memory is passed in. This allocates on the stack inside the `FfiCallInstr`. (And the value is copied from the stack to a `TypedData` on return.)
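That struct-by-value return ABI can be illustrated in plain C (a sketch; `Triple` and `make_triple` are illustrative names, not VM code):

```c
#include <stdint.h>

// Triple is too large for the return registers on System V x86-64, so the
// caller reserves stack space and passes a hidden pointer to it -- the
// "pointer to memory is passed in" ABI referred to above.
typedef struct {
    int64_t a, b, c;  // 24 bytes: returned via memory, not registers
} Triple;

Triple make_triple(int64_t base) {
    // Compiles roughly to: write the fields through the hidden out-pointer.
    Triple t = { base, base + 1, base + 2 };
    return t;
}
```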
I'd probably only want to commit to adding a feature such as this if using the arena from `package:ffi` to `malloc` and `free` the out-pointers is prohibitively slow. And I'd want to land https://dart-review.googlesource.com/c/sdk/+/284300 and convert everything to `@Native`s first, as the leaf `@Native` calls are quite fast after we land that CL.
We should already be able to get some benefit by converting the current `package:ffi` stuff to leaf calls: https://github.com/dart-lang/native/issues/922.
Are you currently running into performance issues?
Not hitting any performance issues yet (though I'm using a custom zone-based allocator). It would be great to avoid depending on an allocator for these simple cases since they're quite frequent. Practically every C API uses this pattern to pass back pointers since they reserve the return value for error codes, etc.
Experimented with adding `InOut` and `In` parameters locally. I have it working for integer types at the moment.

It's pretty hacky. I ended up introducing two IL instructions to launder a value through the stack:

- `Uninitialized` IL instr is a no-op `Definition` that introduces a value.
- `FfiCall` passes the pointer to the uninitialized stack value (using the same code path as for Handles).
- `Initialized` IL instr transfers the value from the stack to a register (RequiresStack for the input, RequiresRegister for the output; `EmitNativeCode` just calls `EmitMove`).

Ideally, support for explicit stack IR instructions would make this easy. Could create a fixed-size stack resource, then use `LoadIndexed` and `StoreIndexed` to access it, and pass it by address into FFI functions. Could also use it to allocation-sink fixed-size typed data. One barrier is how locations, the register allocator, and the stack interact. The stack resources are fixed size, and you can't allocate them contiguously. I debated trying to extend linear scan to allow wider locations, but it felt like it would change a lot of code.
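A rough C analogy for the three IL steps (everything here is illustrative; `native_call` stands in for the FFI callee):

```c
#include <stdint.h>

// Stand-in for the native callee that writes its out-parameter.
int64_t native_call(int64_t in, int64_t *out) {
    *out = in * 2;
    return 0;
}

int64_t laundered(int64_t x) {
    int64_t slot;                        // "Uninitialized": stack slot, no init code
    int64_t rc = native_call(x, &slot);  // "FfiCall": pass the slot's address
    (void)rc;
    return slot;                         // "Initialized": stack value back to a register
}
```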
```dart
@pragma('vm:testing:print-flow-graph')
@pragma('vm:never-inline')
int bar(int iUnknownPtr) {
  Pointer<Pointer<Pointer<NativeFunction<Void Function(IntPtr, Out<IntPtr> pp)>>>> ptr =
      Pointer.fromAddress(iUnknownPtr);
  return ptr.value.value.asFunction<int Function(int, ())>(isLeaf: true)(0, ());
}
```
```
*** BEGIN CFG
After AllocateRegisters
==== <snip>/out_param_single_no_return.dart_::_bar (RegularFunction)
0: B0[graph]:0 {
v58 <- UnboxedConstant(#0) [0, 0] int64
}
2: B1[function entry]:2 {
v2 <- Parameter(0) [-9223372036854775808, 9223372036854775807] int64
}
4: CheckStackOverflow:8(stack=0, loop=0)
6: ParallelMove rsi <- S+2
6: v69 <- IntConverter(int64->untagged[tr], v2 T{int}) untagged
8: v83 <- LoadIndexed(v69 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
10: v88 <- IntConverter(int64->untagged[tr], v83 T{int}) untagged
12: v103 <- LoadIndexed(v88 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
14: v53 <- Uninitialized() int64
15: ParallelMove rcx <- C, rax <- rax, S-1 <- rsi
16: v55 <- FfiCall:28( pointer=v103 T{int}, v58 T{_Smi} (@rcx int64), v53 (@rdx int64)) int64
18: v56 <- Initialized(v53) int64
20: Return:32(v56 T{*?})
*** END CFG
Code for optimized function '<snip>/out_param_single_no_return.dart_::_bar' (RegularFunction) {
;; B0
;; B1
;; Enter frame
;; PrologueOffset = 0
0x1e5c0519390 55 push rbp
0x1e5c0519391 4889e5 movq rbp,rsp
0x1e5c0519394 4883ec08 subq rsp,8
;; CheckStackOverflow:8(stack=0, loop=0)
0x1e5c0519398 493b6638 cmpq rsp,[thr+0x38]
0x1e5c051939c 0f8663000000 jna 0x000001e5c0519405
;; ParallelMove rsi <- S+2
0x1e5c05193a2 488b7510 movq rsi,[rbp+0x10]
;; v69 <- IntConverter(int64->untagged[tr], v2 T{int}) untagged
;; v83 <- LoadIndexed(v69 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1e5c05193a6 488b3e movq rdi,[rsi]
;; ParallelMove rdi <- rdi
;; v88 <- IntConverter(int64->untagged[tr], v83 T{int}) untagged
;; v103 <- LoadIndexed(v88 T{Object}, v58 T{_Smi}) [-9223372036854775808, 9223372036854775807] int64
0x1e5c05193a9 488b07 movq rax,[rdi]
;; v53 <- Uninitialized() int64
;; ParallelMove rcx <- C, rax <- rax, S-1 <- rsi
0x1e5c05193ac 488975f8 movq [rbp-0x8],rsi
0x1e5c05193b0 33c9 xorl rcx,rcx
;; v55 <- FfiCall:28( pointer=v103 T{int}, v58 T{_Smi} (@rcx int64), v53 (@rdx int64)) int64
0x1e5c05193b2 4989e4 movq r12,rsp
0x1e5c05193b5 4883e4f0 andq rsp,-0xffffff10
;; EmitParamMoves
;; arg_index 0 arg_target rcx int64
;; def_index 0
;; def_target rcx int64 <- origin rcx int64
;; arg_index 1 arg_target rdx int64
;; def_index 1
;; marshaller_.IsHandle(arg_index)
0x1e5c05193b9 488d55f8 leaq rdx,[rbp-0x8]
;; EmitParamMovesEnd
;; Leaf Call
0x1e5c05193bd 4989aeb0060000 movq [thr+0x6b0],rbp
0x1e5c05193c4 498986d0060000 movq [thr+0x6d0],rax
0x1e5c05193cb 4883ec20 subq rsp,0x20
0x1e5c05193cf ffd0 call rax
0x1e5c05193d1 4883c420 addq rsp,0x20
0x1e5c05193d5 49c786d006000008000000 movq [thr+0x6d0],8
0x1e5c05193e0 49c786b006000000000000 movq [thr+0x6b0],0
0x1e5c05193eb 4c89e4 movq rsp,r12
;; v56 <- Initialized(v53) int64
0x1e5c05193ee 488b45f8 movq rax,[rbp-0x8]
;; ParallelMove rax <- rax
;; Return:32(v56 T{*?})
;; Stack Check
0x1e5c05193f2 4889e7 movq rdi,rsp
0x1e5c05193f5 482bfd subq rdi,rbp
0x1e5c05193f8 4883fff8 cmpq rdi,-8
0x1e5c05193fc 7401 jz 0x000001e5c05193ff
0x1e5c05193fe cc int3
0x1e5c05193ff 4889ec movq rsp,rbp
0x1e5c0519402 5d pop rbp
0x1e5c0519403 c3 ret
0x1e5c0519404 cc int3
;; CheckStackOverflowSlowPath
0x1e5c0519405 41ff9620020000 call [thr+0x220]
0x1e5c051940c eb94 jmp 0x000001e5c05193a2
}
```
Interesting!
> Experimented with adding `InOut` and `In` parameters locally. I have it working for integer types at the moment. It's pretty hacky. I ended up introducing two IL instructions to launder a value through the stack: `Uninitialized` IL instr is a no-op `Definition` that introduces a value; `FfiCall` passes the pointer to the uninitialized stack value (using the same code path as for Handles); `Initialized` IL instr transfers the value from the stack to a register (RequiresStack for the input, RequiresRegister for the output; `EmitNativeCode` just calls `EmitMove`).
>
> Ideally, support for explicit stack IR instructions would make this easy. Could create a fixed-size stack resource, then use `LoadIndexed` and `StoreIndexed` to access it, and pass it by address into FFI functions. Could also use it to allocation-sink fixed-size typed data. One barrier is how locations, the register allocator, and the stack interact. The stack resources are fixed size, and you can't allocate them contiguously. I debated trying to extend linear scan to allow wider locations, but it felt like it would change a lot of code.
Yeah, that's a lot of technical complexity added.

Maybe an alternative implementation would be to go with the struct-by-value approach, where we allocate space by use of `RequiredStackSpaceInBytes` in the FFI call instruction and change the `FfiCallInstr` to return a record of the return value and all the out params.
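In C terms, the record-returning shape bundles the status and the out-value into one by-value struct instead of writing through a caller-provided pointer, mirroring the Dart record `(int result, Pointer<Void> object)` from earlier. A sketch with illustrative names:

```c
#include <stddef.h>
#include <stdint.h>

// One by-value result combining the status code and the out-value, so the
// caller never has to allocate an out-slot at all.
typedef struct {
    int32_t status;
    void *object;
} CallResult;

CallResult query_object(void *self, int32_t riid) {
    CallResult r;
    r.status = (riid == 0) ? 0 : -1;      // 0 stands in for a supported id
    r.object = (riid == 0) ? self : NULL;
    return r;
}
```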
But before we commit to added technical complexity, we should assess the performance and API benefits. We should set up some benchmarks that would benefit from having this, and benchmark against alternatives. (And assess the API differences for alternatives.)
One performance-oriented alternative I can think of to reduce allocations is to introduce a new `BumpAllocator` in `package:ffi` that allocates, let's say, a megabyte of data and holds on to that. Allocating would be only pointer arithmetic, and freeing would also only be bookkeeping.
```dart
import 'dart:convert';
import 'dart:ffi';

import 'package:ffi/ffi.dart';

void main() {
  final function = library.lookupFunction<...>('load_some_string');
  final namePtr = bump<Pointer<Uint8>>();
  final lengthPtr = bump<Int>();
  function(0, namePtr, lengthPtr);
  print(utf8.decode(namePtr.value.asTypedList(lengthPtr.value)));
  bump.free(namePtr);
  bump.free(lengthPtr);
}
```
(The internal bookkeeping of the bump allocator needs to save the size for each pointer, and keep some efficient data structure of which pointers have not been freed yet.)
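The mechanics of such an allocator can be sketched in C (an illustrative toy, not the proposed `package:ffi` API; this version only tracks a live count rather than per-pointer sizes):

```c
#include <stddef.h>
#include <stdint.h>

#define BUMP_CAPACITY (1 << 20)  // "a megabyte of data", held onto

static uint8_t bump_buffer[BUMP_CAPACITY];
static size_t bump_top = 0;   // next free offset: allocating just advances it
static size_t bump_live = 0;  // outstanding allocations not yet freed

// Allocation is only pointer arithmetic.
void *bump_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;  // keep allocations 8-byte aligned
    if (bump_top + size > BUMP_CAPACITY) return NULL;
    void *p = bump_buffer + bump_top;
    bump_top += size;
    bump_live++;
    return p;
}

// Freeing is only bookkeeping; once everything has been freed, the whole
// buffer can be reused from the start.
void bump_free(void *p) {
    (void)p;  // real bookkeeping would record a size per pointer
    if (--bump_live == 0) bump_top = 0;
}
```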
For an API, maybe we also want to have some arena-style API so that you can free multiple pointers at the same time.
Or a completely different approach, if performance is not a concern but only the API, is to have a CFE transform that basically uses the arena under the hood (the only issue is that the Arena allocator lives in `package:ffi` at the moment).
> But before we commit to added technical complexity, we should assess the performance and API benefits. We should set up some benchmarks that would benefit from having this, and benchmark against alternatives. (And assess the API differences for alternatives.)
Understandable, I primarily wanted to do some experimentation on my own.
Once `Struct.create` is finalized, this'll become more prominent, since developers will want to pass a created struct into a C API by reference. This is common in COM (incl. DirectX), Vulkan, Win32 APIs, etc.
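The fill-in-a-caller-provided-struct pattern referred to here looks like this in C (illustrative names in the Vulkan/Win32 style, not a real API):

```c
#include <stdint.h>

typedef struct {
    uint32_t width;
    uint32_t height;
} Extent2D;

// The caller creates the struct and passes it by reference; the callee
// fills it in and keeps the return value free for a status code.
int32_t get_extent(Extent2D *out) {
    out->width = 640;
    out->height = 480;
    return 0;
}
```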
> Maybe an alternative implementation would be to go with the struct-by-value approach, where we allocate space by use of `RequiredStackSpaceInBytes` in the FFI call instruction and change the `FfiCallInstr` to return a record of the return value and all the out params.
Looks like there are three approaches here:

1. Returning a record from `FfiCallInstr`
2. Multiple IL outputs from `FfiCallInstr`
3. An `AllocateStackInstr`, using `LoadIndexed`/`StoreIndexed` to read and write to that allocation

Initializing the record inside of `FfiCall` would add even more complexity to `FfiCallInstr::EmitNativeCode`, and the record would no longer be subject to allocation sinking. But managing the stack solely within the call instruction would avoid widespread changes.

Multiple IL outputs would be similar to the record approach, but without the need to allocate one up front. But it's a very, very heavy lift, since everything assumes there's only a single output from each definition instruction.

Stack allocation requires adding a new representation and extending the register allocator to support allocating multiple contiguous stack slots.

Stack allocation vs. multiple outputs would most likely generate the same code.
> Once `Struct.create` is finalized, this'll become more prominent, since developers will want to pass a created struct into a C API by reference.
The `TypedData` that's allocated to back the struct is already bump-pointer allocated in the Dart heap. I don't believe stack allocating is going to be much faster. (The objects should not outlive the new space, so the only effect is marginally more frequent new-space collection.) We should definitely include this pattern in benchmarks. Even more so: even if we have APIs with pointers, as long as the FFI calls are marked as leaf calls, we can use `TypedData`s. So that approach would work for non-struct arguments.
> Understandable, I primarily wanted to do some experimentation on my own.
P.S. I really appreciate having an extra set of eyes on the FFI internals! ❤️
I think there should be a way to allocate memory on the stack with `dart:ffi`, for instance with a builtin function like `alloca`.

As a motivation for this, let's say we had a function `int load_some_string(int index, char **nameOut, int *lengthOut)`. When calling that in C, we'd declare the out-variables on the stack and pass their addresses. Obviously that doesn't work in Dart, so we have to allocate the pointers with an allocator instead.
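The C-side calling pattern might look like this (a sketch; `load_some_string` is stubbed out here to stand in for the real library function):

```c
#include <string.h>

// Stub for the hypothetical library function from the example above.
int load_some_string(int index, char **nameOut, int *lengthOut) {
    static char name[] = "hello";
    (void)index;
    *nameOut = name;
    *lengthOut = (int)strlen(name);
    return 0;  // return value reserved for an error code
}

void call_from_c(void) {
    // Out-variables live on the stack and their addresses are passed in:
    // no heap allocation, nothing to free afterwards.
    char *name;
    int length;
    int rc = load_some_string(0, &name, &length);
    (void)rc;
    (void)name;
    (void)length;
}
```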
I think this is rather annoying because it's easy to forget to call `free` and leak memory, and because of the overhead of `malloc`/`free`. Obviously we can re-use those pointers, but that also amplifies the other problems. This would be much simpler if we could allocate memory directly on the stack.
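For contrast, the heap-based workaround sketched in C (again with `load_some_string` stubbed out): every call site allocates the out-slots and must remember to free both, which is exactly the bookkeeping a stack allocation primitive would remove.

```c
#include <stdlib.h>
#include <string.h>

// Stub for the same hypothetical library function.
int load_some_string(int index, char **nameOut, int *lengthOut) {
    static char name[] = "hello";
    (void)index;
    *nameOut = name;
    *lengthOut = (int)strlen(name);
    return 0;
}

int call_with_heap_slots(void) {
    char **namePtr = malloc(sizeof *namePtr);    // out-slot on the heap
    int *lengthPtr = malloc(sizeof *lengthPtr);  // out-slot on the heap
    if (namePtr == NULL || lengthPtr == NULL) {
        free(namePtr);
        free(lengthPtr);
        return -1;
    }
    int rc = load_some_string(0, namePtr, lengthPtr);
    int length = *lengthPtr;
    free(namePtr);    // forgetting either free leaks memory
    free(lengthPtr);
    return rc == 0 ? length : -1;
}
```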