dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

UTF8 pointer + nuint overloads #68256

Closed SupinePandora43 closed 2 years ago

SupinePandora43 commented 2 years ago

Background and motivation

As noted in #67649, `Encoding` can't be changed for the sake of compatibility, but `Utf8` is a static class, so it can be extended.

API Proposal

namespace System.Text.Unicode
{
    public static class Utf8
    {
        public static System.Buffers.OperationStatus FromUtf16 (char* source, nuint sourceLength, byte* destination, nuint destinationLength, out nuint charsRead, out nuint bytesWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true);
        public static System.Buffers.OperationStatus ToUtf16 (byte* source, nuint sourceLength, char* destination, nuint destinationLength, out nuint bytesRead, out nuint charsWritten, bool replaceInvalidSequences = true, bool isFinalBlock = true);
    }
}

API Usage

(string conversion)
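A sketch of how the proposed overloads might be used for string conversion. Note this calls the *proposed* (non-existent) pointer overloads, so it cannot compile today; the helper name `FromNativeUtf8` and the error handling are illustrative only.

```csharp
// Hypothetical usage of the proposed Utf8.ToUtf16 pointer overload:
// converting a native UTF-8 buffer to a string, e.g. from interop.
unsafe string FromNativeUtf8(byte* ptr, nuint byteLength)
{
    // A UTF-16 representation never has more code units than the
    // UTF-8 representation has bytes, so this buffer is large enough.
    char[] chars = new char[checked((int)byteLength)];
    fixed (char* dest = chars)
    {
        var status = Utf8.ToUtf16(ptr, byteLength, dest, (nuint)chars.Length,
            out nuint bytesRead, out nuint charsWritten);
        if (status != OperationStatus.Done)
            throw new InvalidOperationException(status.ToString());
        return new string(chars, 0, checked((int)charsWritten));
    }
}
```

For inputs larger than 2GB the `checked` casts above would throw, which is exactly the case the `nuint` overloads are meant to unlock; a real implementation would need a chunked path there.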

Alternative Designs

67649 (edit Encoding)

Risks

None identified.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-buffers See info in area-owners.md if you want to be subscribed.

Author: SupinePandora43
Assignees: -
Labels: `api-suggestion`, `area-System.Buffers`, `untriaged`
Milestone: -
SupinePandora43 commented 2 years ago

@GrabYourPitchforks no breaking changes here

huoyaoyuan commented 2 years ago

Generally, pointers are riskier for buffer overflows compared to spans. Using a single pointer to represent the destination is a well-known security threat.

SupinePandora43 commented 2 years ago

> Generally, pointer is risky for buffer overflow comparing to span. Using a single pointer to represent destination is a well known security threat.

Forgot to add the nuint length arguments 😅 It would be interesting to know how it would have worked without them.

huoyaoyuan commented 2 years ago

Well, Span should really be using nint instead of int in the first place. This is another pain with using int.

SupinePandora43 commented 2 years ago

> Well, Span should really be using nint instead of int in the first place. This is another pain with using int.

nuint, but yeah, it's true.

FiniteReality commented 2 years ago

> nuint but yeah. it's true

I disagree. A signed integer is often better because it makes the error case, which is always lurking there, obvious, even if you do have to include more parameter checks.

SupinePandora43 commented 2 years ago

> I disagree. A signed integer is often better because it makes the error case which is always hiding there obvious, even if you do have to include more parameter checks.

What do you mean?

FiniteReality commented 2 years ago

> What do you mean?

As an example, let's look at this (pseudo-)code which reads N bytes from a stream:

byte[] ReadNBytes(nint count)
{
    byte[] result = new byte[count];
    byte[] buffer = new byte[1024];
    var bytesToRead = count;
    while (bytesToRead > 0)
    {
        var bytesRead = stream.Read(buffer);
        buffer.AsSpan(0, bytesRead).CopyTo(result.AsSpan((int)(count - bytesToRead)));
        bytesToRead -= bytesRead;
    }
    return result;
}

If you're using unsigned integers to store the number of bytes to read, you're hiding overflows in bytesToRead -= bytesRead and count - bytesToRead which may mean this loop may end up running forever. With signed integers, bytesToRead will simply go negative and will always end. This is the case even with large integers, for example N - int.MaxValue is always a representable integer without any overflows, for any positive N.
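The wraparound behavior described above can be shown with a minimal standalone sketch (not from the thread): the same subtraction on a signed native int goes negative and terminates a `> 0` loop, while on an unsigned one it silently wraps to a huge value.

```csharp
using System;

// With nint the counter goes negative, so a `> 0` loop condition
// becomes false and the loop exits.
nint signedCounter = 5;
signedCounter -= 10;

// With nuint the same subtraction wraps around (arithmetic is
// unchecked by default), so `> 0` stays true and the loop would
// spin forever.
nuint unsignedCounter = 5;
unsignedCounter -= 10u;

Console.WriteLine(signedCounter);                         // -5
Console.WriteLine(unsignedCounter == nuint.MaxValue - 4); // True
```

The `nuint.MaxValue - 4` result holds on both 32-bit and 64-bit, since the wraparound is modulo the native word size either way.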

SupinePandora43 commented 2 years ago

> If you're using unsigned integers to store the number of bytes to read, you're hiding overflows in bytesToRead -= bytesRead and count - bytesToRead which may mean this loop may end up running forever.

The same code style works with nuint too; the cost of signed is just one bit spent on the sign instead.

byte[] ReadNBytes(nuint count)
{
    byte[] result = new byte[(int)count];
    byte[] buffer = new byte[1024];
    nuint bytesRead = 0;
    while (bytesRead < count)
    {
        int readThisTime = stream.Read(buffer);
        buffer.AsSpan(0, readThisTime).CopyTo(result.AsSpan((int)bytesRead));
        bytesRead += (nuint)readThisTime;
    }
    return result;
}

byte[] doesn't accept nint/nuint, making it literally useless here.

gfoidl commented 2 years ago

> byte[] doesn't accept nint/nuint making it literally useless

nint/nuint can be used as indexer for T[], or what did you mean?

PS: you don't need all the full quotes of previous comments; they're just bloat to read through.
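A small sketch of the point gfoidl is making: C# already accepts nint and nuint as array indexers (and as array lengths, via their implicit widening to long/ulong). This snippet is illustrative, not from the thread.

```csharp
using System;

byte[] data = new byte[16];

// nuint is a valid array indexer: it widens implicitly to ulong.
nuint i = 3;
data[i] = 0xFF;

// nint works too, widening to long.
nint j = (nint)data.LongLength - 1;
data[j] = 0x7F;

Console.WriteLine(data[3]);  // 255
Console.WriteLine(data[15]); // 127
```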

tannergooding commented 2 years ago

As for the proposed API: an API dealing with two allocations more than 2GB in size isn't going to work or behave as well as you might think.

It's definitely convenient to have a single call, but the actual perf here is likely going to be much worse than you're expecting, and it wouldn't be how you'd want to handle this in practice.

It would be helpful if you could provide more context as to what kind of inputs you're dealing with that are this big and why you need to convert them in one go, rather than chunking them or parallelizing the processing for better perf.


Let's also not turn this into a discussion around whether signed vs unsigned is better.

There are pros and cons to each. Signed tends to be more intuitive and less error-prone when handling overflow and other failure cases. Unsigned tends to have better optimization support.

The JIT and internal implementation can reasonably optimize never-negative signed when that information is present and so signed ends up being a "better default" from a UX perspective.

The main downside is also not really a downside. You lose a bit that you will, realistically speaking, never need. On a 32-bit machine, you have 4GB of memory and you will practically never be able to do a single allocation that is near or greater than 2GB in size. It's impractical because that's more than half the address space available to the entire machine. You'll in the best case cause some pretty major contention and slow down the entire machine.

On a 64-bit machine, no modern computer supports that much memory anyway, and if you did ever get near 2^63 bytes of memory, you'd hit the same contention and other issues that you'd have hit on 32-bit.

GrabYourPitchforks commented 2 years ago

Agree with Tanner - it would be strange to expose pointer-based overloads here without more context as to why they're necessary for extremely large data. I'd much rather see us focus on making Span<T> (or equivalent) able to support larger buffers, which will also involve the creation of an OperationStatus-based pattern for dealing with these inputs. Once this pattern is created, we can consider which APIs (such as this) are candidates for enlightening.

As was mentioned earlier, it's possible to implement this today at the call site via a loop.

Unless there's a strong motivation to keep this around I'm going to let the bot auto-close it.
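The call-site loop mentioned above can be sketched with the existing span-based `Utf8.FromUtf16`: feed the source to the transcoder in int-sized chunks, passing `isFinalBlock: false` for every chunk except the last so a surrogate pair split across a chunk boundary is retried on the next iteration instead of being treated as invalid. The helper name `TranscodeChunked` and the chunk size are ours, not a runtime API.

```csharp
using System;
using System.Buffers;
using System.Text.Unicode;

static class ChunkedTranscoder
{
    // Transcodes UTF-16 to UTF-8 via fixed-size chunks, so the total
    // input may exceed int.MaxValue even though each call does not.
    public static int TranscodeChunked(ReadOnlySpan<char> source, Span<byte> destination)
    {
        const int ChunkSize = 1 << 20; // 1M chars per call, chosen arbitrarily
        int totalBytes = 0;
        while (!source.IsEmpty)
        {
            bool isFinal = source.Length <= ChunkSize;
            ReadOnlySpan<char> chunk = isFinal ? source : source.Slice(0, ChunkSize);

            OperationStatus status = Utf8.FromUtf16(
                chunk, destination, out int charsRead, out int bytesWritten,
                replaceInvalidSequences: true, isFinalBlock: isFinal);

            if (status == OperationStatus.DestinationTooSmall)
                throw new ArgumentException("destination buffer too small");

            totalBytes += bytesWritten;
            destination = destination.Slice(bytesWritten);
            // NeedMoreData means the chunk ended mid surrogate pair; the
            // unread chars stay at the front of source and are retried.
            source = source.Slice(charsRead);
        }
        return totalBytes;
    }
}
```

This is roughly what the runtime would have to do internally anyway, which is part of the argument that a dedicated nuint overload buys little over a loop at the call site.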

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

ghost commented 2 years ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

SupinePandora43 commented 2 years ago

@GrabYourPitchforks

public ref struct Span<T>
{
    private ref T _reference;
    private nuint _length;

    // kept for binary compatibility only
    [Obsolete(BinaryCompatibilityOnly)]
    public int Length => checked((int)_length);

    // members can't differ by return type alone, so the
    // native-sized property would need another name, e.g.:
    public nuint NativeLength => _length;
}

Then?

GrabYourPitchforks commented 2 years ago

This isn't the right issue to discuss changing Span. I'd suggest taking that discussion to https://github.com/dotnet/runtime/issues/12221.