dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

Allow `LibraryImport` `StringMarshalling.Utf8` to provide UTF-8 byte count to native code #92361

Open bgrainger opened 1 year ago

bgrainger commented 1 year ago

Many C APIs that process UTF-8 string data take parameters of the form const char * text, const size_t textLength and assume that the caller will provide the length of the string (in UTF-8 bytes).

An example is sqlite3_bind_text, which might be declared as:

// int sqlite3_bind_text(sqlite3_stmt*,int,const char*,int,void(*)(void*));

[LibraryImport("sqlite3.dll", StringMarshalling = StringMarshalling.Utf8)]
private static partial int sqlite3_bind_text(IntPtr stmt, int index, string text, int textLength, IntPtr destructor);

When [LibraryImport("X.dll", StringMarshalling.Utf8)] is used with a string parameter, the string is efficiently converted to UTF-8 bytes (by the generated marshalling code), but the length of the converted string is not available to the application (to pass as the textLength parameter).

To pass the byte length of the UTF-8 data passed to native code, the application either has to:

  1. Call Encoding.UTF8.GetByteCount(text) and recalculate the length of the converted bytes (which is inefficient)
  2. Declare the parameter type as Span<byte> and write custom marshalling code that performs the conversion in one pass (which is a lot of extra code to write and get right, and defeats the point of using [LibraryImport] to do the heavy lifting); example below
  3. Pass -1 as the value for textLength if the C API supports that as a sentinel value meaning to calculate the length of NULL-terminated input with strlen or similar (which is inefficient)

An example of (2) might be:

byte[]? temp = null;
try
{
    const int maxStackLength = 1024;
    Span<byte> utf8Bytes = text.Length < maxStackLength ? stackalloc byte[maxStackLength * 3] : (temp = ArrayPool<byte>.Shared.Rent(text.Length * 3)).AsSpan();
    utf8Bytes = utf8Bytes[..Encoding.UTF8.GetBytes(text, utf8Bytes)];

    return sqlite3_bind_text(_stmt, 0, utf8Bytes, utf8Bytes.Length, IntPtr.Zero);
}
finally
{
    if (temp is not null)
        ArrayPool<byte>.Shared.Return(temp);
}

[LibraryImport("sqlite3.dll")]
private static partial int sqlite3_bind_text(IntPtr stmt, int index, Span<byte> text, int textLength, IntPtr destructor);

It would be nice if there was some way in the native method signature (MarshalAs.SizeParamIndex??) to indicate that an integer parameter should be set to the value of Encoding.UTF8.GetBytes that is already calculated by Utf8StringMarshaller.ManagedToUnmanagedIn.FromManaged for an associated string.

It might then be odd that the [LibraryImport] native method signature declares an int textLength parameter that the application doesn't use (because the marshalling code always provides a value for it). Perhaps one might want to have an attribute on the string parameter that indicates it will be marshalled as two values: a pointer to data and its length. Such a thing could also be generally useful for marshalling Span<T> to native code. However, if changing the number of parameters of a native method is outside the design scope of [LibraryImport] then I can understand not wanting to do this.

One final reason to support this is that it could avoid subtle bugs:

// !! WRONG Passes the number of UTF-16 code units, not UTF-8 bytes !!
sqlite3_bind_text(_stmt, 0, text, text.Length, IntPtr.Zero);

[LibraryImport("sqlite3.dll", StringMarshalling = StringMarshalling.Utf8)]
private static partial int sqlite3_bind_text(IntPtr stmt, int index, string text, int textLength, IntPtr destructor);
ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/interop-contrib See info in area-owners.md if you want to be subscribed.

Issue Details
Many C APIs (that process UTF-8) string data take parameters of the form `const char * text, const size_t textLength` and assume that the caller will provide the length of the string (in UTF-8 bytes). An example is [`sqlite3_bind_text`](https://www.sqlite.org/capi3ref.html#sqlite3_bind_blob), which might be declared as: ```csharp // int sqlite3_bind_text(sqlite3_stmt*,int,const char*,int,void(*)(void*)); [LibraryImport("sqlite3.dll", StringMarshalling = StringMarshalling.Utf8)] private static partial int sqlite3_bind_text(IntPtr stmt, int index, string text, int textLength, IntPtr destructor); ``` When `[LibraryImport("cld2.dll", StringMarshalling.Utf8)]` is used with a `string` parameter, the `string` is efficiently converted to UTF-8 bytes (by the generated marshalling code), but the length of the converted string is not available to the application (to pass as the `textLength` parameter). To pass the byte length of the UTF-8 data passed to native code, the application either has to: 1. Call `Encoding.UTF8.GetByteCount(text)` and recalculate the length of the converted bytes (which is inefficient) 2. Declare the parameter type as `Span` and write custom marshalling code that performs the conversion in one pass (which is a lot of extra code to write and get right, and defeats the point of using `[LibraryImport]` to do the heavy lifting); example below 3. Pass `-1` as the value for `textLength` if the C API supports that as a sentinel value meaning to calculate the length of `NULL`-terminated input with `strlen` or similar (which is inefficient) An example of (2) might be: ```csharp byte[]? temp = null; try { const int maxStackLength = 1024; Span utf8Bytes = text.Length < maxStackLength ? stackalloc byte[maxStackLength * 3] : (temp = ArrayPool.Shared.Rent(text.Length * 3)).AsSpan(); utf8Bytes = utf8Bytes[..Encoding.UTF8.GetBytes(text, utf8Bytes)]; return sqlite3_bind_text(_stmt, 0, utf8Bytes, utf8Bytes.Length, IntPtr.Zero); } finally { if (temp is not null) ArrayPool.Shared.Return(temp); } [LibraryImport("sqlite3.dll")] private static partial int sqlite3_bind_text(IntPtr stmt, int index, Span text, int textLength, IntPtr destructor); ``` It would be nice if there was some way in the native method signature (`MarshalAs.SizeParamIndex`??) to indicate that an integer parameter should be set to the value of `Encoding.UTF8.GetBytes` that is already calculated by `Utf8StringMarshaller.ManagedToUnmanagedIn.FromManaged` for an associated `string`. It might then be odd that the `[LibraryImport]` native method signature declares an `int textLength` parameter that the application doesn't use (because the marshalling code always provides a value for it). Perhaps one might want to have an attribute on the `string` parameter that indicates it will be marshalled as two values: a pointer to data and its length. Such a thing could also be generally useful for marshalling `Span` to native code. However, if changing the number of parameters of a native method is outside the design scope of `[LibraryImport]` then I can understand not wanting to do this. One final reason to support this is that it could avoid subtle bugs: ```csharp // !! WRONG Passes the number of UTF-16 code units, not UTF-8 bytes !! sqlite3_bind_text(_stmt, 0, text, text.Length, IntPtr.Zero); [LibraryImport("sqlite3.dll", StringMarshalling = StringMarshalling.Utf8)] private static partial int sqlite3_bind_text(IntPtr stmt, int index, string text, int textLength, IntPtr destructor); ```
Author: bgrainger
Assignees: -
Labels: `area-System.Runtime.InteropServices`
Milestone: -
AaronRobinsonMSFT commented 1 year ago

@jkoritzinsky This is similar to our thoughts on making Span<T> nice in the common pointer + length arguments convention.

jkoritzinsky commented 1 year ago

I agree. @bgrainger thanks for filing this issue! Having external customer requests makes it easier to justify the work.

bgrainger commented 1 year ago

As an extension of the above idea, it would also be nice to marshal a single ReadOnlySpan<char> parameter to native code as two values: a pointer to automatically-converted UTF-8 bytes, and the size of those converted bytes.

huoyaoyuan commented 1 year ago

The major issue is that we can't declare the length parameter in managed signature, namely for its order in parameters and its type (int vs nint).

A possible approach is introducing an attribute on the span/string parameter to specify and index and type.

bgrainger commented 11 months ago

A somewhat related issue from earlier this year: https://github.com/dotnet/runtime/issues/81193.

ThadHouse commented 10 months ago

Adding my hand to upvote this. Our library has a mix of null terminated vs data and length string data (Yes bad design, but hard to change), and it'd be nice to get automated marshalling for all of these cases.

ThadHouse commented 10 months ago

To add on, the return scenario would also be extremely nice to support.

We have the following native API.

char* NT_GetEntryName(NT_Entry entry, size_t* name_len);

I have to do a very convoluted source generator to make this work, it'd be nice to be built in supported.