ericsink / SQLitePCL.raw

A Portable Class Library (PCL) for low-level (raw) access to SQLite
Apache License 2.0
533 stars 109 forks source link

What to do about the zero terminator of a string in a Span? #273

Closed ericsink closed 5 years ago

ericsink commented 5 years ago

The use of Span has raised some interesting questions.

In the SQLite API, when a parameter is a string, it is usually just a pointer, with the (documented) expectation that the string ends with a zero terminator byte.

So, at a higher layer, if we use a ReadOnlySpan<byte> for that string, the Length or the span is ignored. By the time it gets to SQLite, it's just a pointer. Is that a signal that we shouldn't be using span?

Suppose the string is "hello". That's 5 bytes of actual string plus one byte for the zero terminator. We could pass a span of length 5 or a span of length 6. It doesn't matter to SQLite, because it's just going to look for that zero byte anyway.

Heck, we could pass a span of length 1, and it still wouldn't matter. But I mention that just to illustrate the problem. Clearly, if we're going to use a span, we should have the length be correct.

But which is more correct? Should the Length include the zero terminator or not?

Including the zero byte in the length seems wrong because the zero is not part of the string data. If we called strlen() on the pointer, it would return a length which did not include the zero. If we passed this pointer to System.Text methods to convert it from UTF-8 to System.String, the length we provide should not include the zero byte. The length of the string "hello" is 5.

However, it also seems wrong to not include the zero byte in the length. The zero byte may not be part of the string, but it is a documented and required part of the block of bytes the function expects. And if we don't include the zero byte, then we are passing a span to a function which is going to walk beyond the end of that span's length, which seems wrong. The length of the zero-terminated block of memory that represents "hello" is 6.

At the moment, the solution I have chosen is to include the zero byte in the span length. And the provider code checks to make sure the last byte of the span is a zero.

But this is only half the story. We have a similar but slightly different set of problems for cases where SQLite is returning a string or passing one in a callback.

In this case, SQLite itself is responsible for putting that zero byte in place, and it does. But when we construct a ReadOnlySpan<byte> for that piece of memory, again, do we include the zero byte in the length or not?

In this case, a big difference is that the code that receives that span is likely to use the Length and ignore the presence of any zero terminator For example, using System.Text functions to convert from UTF-8 to a string, the span Length must be correct and must not include the zero terminator.

If we did include the zero terminator in this upward-moving set of strings, then most cases that use the spans would need to subtract one from the Length before using it. And should it check to make sure the last byte actually is a zero before doing so?

So for the moment, the solution I have chosen is to NOT include the zero terminator when constructing a span for a string that is coming from SQLite.

Neither of these decisions seems correct. Each one merely seems to be the least incorrect option.

And then there is the inconsistency. I have two very similar questions, but I came up with two different answers. That too seems wrong. For example, as things stand right now, if you receive a string from SQLite as a span and then pass that span directly back to another SQLite function, it won't work.

ericsink commented 5 years ago

@bricelam Have you folks at Microsoft developed any recommended guidelines for using Span with unmanaged libraries that deal in zero-terminated strings?

bricelam commented 5 years ago

@divega might know who to ask...

divega commented 5 years ago

I would start with @jkotas.

jkotas commented 5 years ago

For public .NET Core APIs that take or return Span<byte> or Span<char> :

The internal .NET Core implementation has a few places that create and operate on Spans that are zero terminated. This is local internal implementation detail that does not leak out through public APIs.

ericsink commented 5 years ago

@jkotas That makes sense. For a .NET-only API.

But what if you were passing a Span through P/Invoke to a C function that expects the zero terminator? Would you include that zero byte in the span length?

jkotas commented 5 years ago

P/Invoke methods cannot take Spans directly today.

I think you are really asking whether you should include zero byte in the Span length in your internal implementation details that prepare string passed to PInvoke. It is up to you and what makes sense for the type of code that you are writing.

For public APIs, I would recommend to be on the same plan as public .NET Core APIs.

ericsink commented 5 years ago

Okay. I admit this issue is arcane and my explanation of it above is tedious.

Anyway, at some point I will eventually cross paths with somebody else who is weighing the pros and cons of this.

Thanks for the replies.

ericsink commented 5 years ago

For others who might visit this issue, I want to add a bit more explanation for why "be on the same plan as public .NET Core APIs" doesn't seem helpful in this context.

The underlying C function requires a zero terminated string. That's just the way it is.

So let's say I expose a C# wrapper for that function which takes a Span, and to be consistent with .NET practices, the caller is not expected to include a zero terminator.

What that means is that I have to make a copy of the data in order to append the zero terminator before passing it down to the unmanaged function.

And this would defeat the whole point of using Span.

tmds commented 5 years ago

Maybe you can write a ref struct that encapsulates Span<byte> and represents a zero-terminated string?

ericsink commented 5 years ago

@tmds I've been wondering about that possibility myself.

ericsink commented 5 years ago

A tidbit of good news noticed by sgjennings on Twitter:

This writeup on the upcoming Utf8String type:

dotnet/corefxlab#2368

contains this bit:

"ensuring a null terminator (important for p/invoke scenarios)"

which indicates an awareness of this issue.

I was aware of Utf8String, but had not yet seen any indication that dealing with zero terminator cases was (or was not) being considered as part of the design.

ericsink commented 5 years ago

@tmds So yeah, as a first attempt, that idea was easier than I thought. See 17785ea

So sz (which needs a better name) is a ref struct type that simply encapsulates (and hides) a ReadOnlySpan<byte>, with the additional property of knowing for certain that it includes a zero terminator, because its only constructor ensures that. It also supports the fixed pattern. In fact, right now that's its whole API, a constructor and GetPinnableReference().

The notion here is that sz would be used in cases where the utf8 string is required to have a zero terminator, leaving ReadOnlySpan<byte> to be used in places where the full expectations associated with span are appropriate.

As mentioned in the commit message, this is an experimental idea. I'm just throwing some code around to see what works.

ericsink commented 5 years ago

After continuing the sz experiment for a return value and a callback parameter, it continues to work well.

I'm starting to believe the best practice here is to NOT publicly use ReadOnlySpan<byte> to describe a pointer that needs a zero terminator. Using ReadOnlySpan<byte> sets certain expectations, such as that slicing works, and there may not be a zero terminator, and "if there is a zero terminator it is certainly not included in the Length", and so on.

As an alternative, the sz concept (or something like it) gives the benefits of a span (it's still a ref struct) but gives type safety for the use cases where the underlying C-ish API requires a zero terminator. It is basically a constrained span with very limited functionality.

This may in fact be what @jkotas meant when he said "in your internal implementation details that prepare string passed to PInvoke", but I initially didn't understand it because I had not yet implemented my own ref struct type.

jkotas commented 5 years ago

I agree that the type safe wrapper makes in your case since you are passing the value around. I have not thought about it, but you have figure it out. It looks pretty neat.

tmds commented 5 years ago

as a first attempt, that idea was easier than I thought.

cool!

This writeup on the upcoming Utf8String type:

Compared to the Span approach, afaik, Utf8String won't allow you to wrap native memory. Nor can you stackalloc the string. The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string). With the Span approach, you can also avoid memory allocations in some cases. It makes sense to use the Span approach where those allocations can be avoided, and Utf8String otherwise. Last I heard, Utf8String isn't happening for 3.0.

After continuing the sz experiment for a return value and a callback parameter

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

ericsink commented 5 years ago

Utf8String won't allow you to wrap native memory. The advantage of Utf8String over string is you no longer convert between between UTF-8 (SQLLite) and UTF-16 (.NET string).

Yeah, Utf8String (someday) will be complementary, but occupies a different place in the world. I'm trying to make it possible to use SQLitePCLRaw in a situation where everything is UTF-8, with minimal conversion and copying. For now, if the enclosing app or library is built around System.String, it won't be able to take full advantage, but I'll be more ready for the day when Utf8String is available.

In a callback the sz will be valid during the callback. But with a return value, there is an issue of scope when the sz is wrapping native memory.

Yeah, the method I experimentally converted over was already returning ReadOnlySpan<byte>, so that same limitation applied. The caller is expected to copy it if needed. For example, sqlite3_column_text() returns char *. If C# could overload on return type, I would provide one overload returning sz and one returning string to do the conversion. But since I can only have one method with that name, I choose the lower-level approach and let the caller do the extra work.

Thanks. Cheers.