Add initial Span/Buffer-based APIs across corefx

stephentoub commented 7 years ago

With the advent of Span<T> and Buffer<T>, there are a multitude of improvements we’ll want to make to types across coreclr/corefx. Some of these changes include exposing Span<T>/Buffer<T>/ReadOnlySpan<T>/ReadOnlyBuffer<T> themselves and the associated operations on them. Other changes involve using those types internally to improve memory usage of existing code. This issue isn’t about either of those. Rather, this issue is about what new APIs we want to expose across corefx (with some implementations in coreclr/corert).

A good approximation of a starting point is looking at existing APIs that work with arrays or pointers (and to some extent strings), and determining which of these should have Span/Buffer-based overloads (there will almost certainly be “new” APIs we’ll want to add that don’t currently have overloads, but for the most part those can be dealt with separately and one-off). There are ~3000 such methods that exist today in the corefx reference assemblies. We’re obviously not going to add ~3000 new overloads that work with Span/Buffer, nor should we. But many of these aren’t relevant for one reason or another, e.g. they’re in components considered to be legacy, they’re very unlikely to be used on hot paths where a span/buffer->array conversion would matter at all, etc.

I’ve gone through the framework and identified a relatively small set of methods I believe we should start with and add together for the next release. This is dialable, of course, but I believe we need a solid set of these to represent enough mass that span/buffer permeate the stack and make sense to use in an application. All of these are cases where using a span or a buffer instead of an array makes a quantifiable difference in allocation, and thus can have a measurable impact on the overall memory profile of a consuming application, contributing to an overall improvement in performance.

System.BitConverter

BitConverter is used to convert between primitive types and bytes, but the current APIs force an unfortunate amount of allocation due to working with byte[]s. We can help to avoid much of that by adding overloads that work with Span<byte>, adding CopyBytes methods instead of GetBytes methods, and adding To* overloads that accept ReadOnlySpan<byte> instead of accepting byte[].

namespace System
{
    public static class BitConverter
    {
        // Matching overloads for GetBytes. Copy to the destination span
        // rather than allocating a new byte[].  Return false if the destination
        // span isn't large enough, which can be determined before any copying.
        public static bool TryWriteBytes(Span<byte> destination, bool value);
        public static bool TryWriteBytes(Span<byte> destination, char value);
        public static bool TryWriteBytes(Span<byte> destination, double value);
        public static bool TryWriteBytes(Span<byte> destination, short value);
        public static bool TryWriteBytes(Span<byte> destination, int value);
        public static bool TryWriteBytes(Span<byte> destination, long value);
        public static bool TryWriteBytes(Span<byte> destination, float value);
        public static bool TryWriteBytes(Span<byte> destination, ushort value);
        public static bool TryWriteBytes(Span<byte> destination, uint value);
        public static bool TryWriteBytes(Span<byte> destination, ulong value);

        // matching overloads for To*
        public static bool ToBoolean(ReadOnlySpan<byte> value);
        public static char ToChar(ReadOnlySpan<byte> value);
        public static double ToDouble(ReadOnlySpan<byte> value);
        public static short ToInt16(ReadOnlySpan<byte> value);
        public static int ToInt32(ReadOnlySpan<byte> value);
        public static long ToInt64(ReadOnlySpan<byte> value);
        public static float ToSingle(ReadOnlySpan<byte> value);
        public static ushort ToUInt16(ReadOnlySpan<byte> value);
        public static uint ToUInt32(ReadOnlySpan<byte> value);
        public static ulong ToUInt64(ReadOnlySpan<byte> value);

        …
    }
}

EDIT 7/18/2017: Updated based on API review. Separated out into https://github.com/dotnet/corefx/issues/22355 for implementation.

System.Convert

As with BitConverter, the Convert class is also used to convert arrays. Most of the members are about converting from individual primitives to other individual primitives, but several work with arrays, in particular those for working with Base64 data. We should add the following methods:

namespace System
{
    public class Convert
    {
        public static string ToBase64String(ReadOnlySpan<byte> bytes, Base64FormattingOptions options = Base64FormattingOptions.None);
        public static bool TryToBase64Chars(ReadOnlySpan<byte> bytes, Span<char> chars, out int bytesConsumed, out int charsWritten, Base64FormattingOptions options = Base64FormattingOptions.None);
        public static bool TryFromBase64String(string s, Span<byte> bytes, out int charsConsumed, out int bytesWritten);
        public static bool TryFromBase64Chars(ReadOnlySpan<char> chars, Span<byte> bytes, out int charsConsumed, out int bytesWritten);
        …
    }
}

Separated out as https://github.com/dotnet/corefx/issues/22417 for implementation.

System.Random

The Random class provides a NextBytes method that takes a byte[]. In many situations, that’s fine, but in some you’d like to be able to get an arbitrary amount of random data without having to allocate such an array, and we can do that with spans. For example, here’s a case where we’re getting some random data only to then want a Base64 string from it: https://referencesource.microsoft.com/#System/net/System/Net/WebSockets/WebSocketHelpers.cs,366

namespace System
{
    public class Random
    {
        public virtual void NextBytes(Span<byte> buffer);
        …
    }
}

Separated out into https://github.com/dotnet/corefx/issues/22356 for implementation.

Primitive Parse methods

Related to BitConverter and Convert, it’s very common to want to parse primitive values out of strings. Today, that unfortunately often involves creating substrings representing the exact piece of text, and then passing it to a string-based TryParse method. Similarly, it's common to convert primitives to strings via a method like ToString. Instead, I suggest we add the following:

namespace System
{
    public struct Int16
    {
        public short Parse(ReadOnlySpan<char> chars, NumberStyles style = NumberStyles.Integer, NumberFormatInfo info = null);
        public bool TryParse(ReadOnlySpan<char> chars, out short result, NumberStyles style = NumberStyles.Integer, NumberFormatInfo info = null);
        public bool TryFormat(Span<char> destination, out int charsWritten, string format = null, IFormatProvider provider = null);
        …
    }
    … // similar Parse/TryParse/TryFormat methods for each of:
    // Int32
    // Int64
    // UInt16
    // UInt32
    // UInt64
    // Decimal
    // Double
    // Single
    // Boolean (no NumberStyle/NumberFormatInfo args)
}

Separated out as https://github.com/dotnet/corefx/issues/22403 for implementation.

Then we should also support parsing a few common types out of ReadOnlySpan<char>:

namespace System
{
    public struct DateTime
    {
        public static DateTime Parse(ReadOnlySpan<char> s, IFormatProvider provider = null, System.Globalization.DateTimeStyles styles = DateTimeStyles.None);
        public static DateTime ParseExact(ReadOnlySpan<char> s, string format, IFormatProvider provider, DateTimeStyles style = DateTimeStyles.None);
        public static DateTime ParseExact(ReadOnlySpan<char> s, string[] formats, IFormatProvider provider, DateTimeStyles style);

        public static bool TryParse(ReadOnlySpan<char> s, out DateTime result, IFormatProvider provider = null, DateTimeStyles styles = DateTimeStyles.None);
        public static bool TryParseExact(ReadOnlySpan<char> s, string format, IFormatProvider provider, DateTimeStyles style, out DateTime result);
        public static bool TryParseExact(ReadOnlySpan<char> s, string[] formats, IFormatProvider provider, DateTimeStyles style, out DateTime result);

        public bool TryFormat(Span<char> destination, out int charsWritten, string format = null, IFormatProvider provider = null);
        …
    }

    public struct DateTimeOffset
    {
        public static DateTimeOffset Parse(ReadOnlySpan<char> input, IFormatProvider formatProvider = null, System.Globalization.DateTimeStyles styles = DateTimeStyles.None);
        public static DateTimeOffset ParseExact(ReadOnlySpan<char> input, string format, IFormatProvider formatProvider, DateTimeStyles styles = DateTimeStyles.None);
        public static DateTimeOffset ParseExact(ReadOnlySpan<char> input, string[] formats, IFormatProvider formatProvider, DateTimeStyles styles);

        public static bool TryParse(ReadOnlySpan<char> input, out DateTimeOffset result, IFormatProvider formatProvider = null, DateTimeStyles styles = DateTimeStyles.None);
        public static bool TryParseExact(ReadOnlySpan<char> input, string format, IFormatProvider formatProvider, DateTimeStyles styles, out DateTimeOffset result);
        public static bool TryParseExact(ReadOnlySpan<char> input, string[] formats, IFormatProvider formatProvider, DateTimeStyles styles, out DateTimeOffset result);

        public bool TryFormat(Span<char> destination, out int charsWritten, string format = null, IFormatProvider formatProvider = null);
        …
    }

    public struct TimeSpan
    {
        public static TimeSpan Parse(ReadOnlySpan<char> input, IFormatProvider formatProvider = null);
        public static TimeSpan ParseExact (ReadOnlySpan<char> input, string format, IFormatProvider formatProvider, TimeSpanStyles styles = TimeSpanStyles.None);
        public static TimeSpan ParseExact (ReadOnlySpan<char> input, string[] formats, IFormatProvider formatProvider, TimeSpanStyles styles = TimeSpanStyles.None);

        public static bool TryParse(ReadOnlySpan<char> input, out TimeSpan result, IFormatProvider formatProvider = null);

        public static bool TryParseExact (ReadOnlySpan<char> input, string format, IFormatProvider formatProvider, out TimeSpan result, TimeSpanStyles styles = TimeSpanStyles.None);
        public static bool TryParseExact (ReadOnlySpan<char> input, string[] formats, IFormatProvider formatProvider, out TimeSpan result, TimeSpanStyles styles = TimeSpanStyles.None);

        public bool TryFormat(Span<char> destination, out int charsWritten, string format = null, IFormatProvider provider = null);
        …
    }

    public class Version
    {
        public static Version Parse(ReadOnlySpan<char> input);
        public static bool TryParse(ReadOnlySpan<char> input, out Version result);
        public bool TryFormat(Span<char> destination, out int charsWritten, int fieldCount = 4);
        …
    }
}

EDIT 7/18/2017: Updated per API review DateTime{Offset} separated out as https://github.com/dotnet/corefx/issues/22358 for implementation. TimeSpan separated out as https://github.com/dotnet/corefx/issues/22375 for implementation. Version separated out as https://github.com/dotnet/corefx/issues/22376 for implementation.

System.Guid

Guids are often constructed from byte[]s, and these byte[]s are often new’d up, filled in, and then provided to the Guid, e.g. https://referencesource.microsoft.com/#mscorlib/system/guid.cs,b622ef5f6b76c10a,references We can avoid such allocations with a Span ctor, with the call sites instead creating a Span from 16 bytes of stack memory. Guids are also frequently converted to byte[]s to be output, e.g. https://referencesource.microsoft.com/#mscorlib/system/guid.cs,94f5d8dabbf0dbcc,references and we can again avoid those temporary allocations by supporting copying the Guid’s data to a span:

namespace System
{
    public struct Guid
    {
        public Guid(ReadOnlySpan<byte> b); // correspond to Guid(byte[] b)

        public bool TryWriteBytes(Span<byte> destination);

        public static Guid Parse(ReadOnlySpan<char> input);
        public static Guid ParseExact(ReadOnlySpan<char> input, string format);

        public static bool TryParse(ReadOnlySpan<char> input, out Guid result);
        public static bool TryParseExact(ReadOnlySpan<char> input, string format, out Guid result);

        public bool TryFormat(Span<char> destination, out int charsWritten, string format = null, IFormatProvider provider = null);
        …
    }
}

EDIT 7/18/2017: Updated per API review Separated out as https://github.com/dotnet/corefx/issues/22377 for implementation.

System.String

It’ll be very common to create strings from spans. We should have a ctor for doing so:

namespace System
{
    public class String
    {
        public String(ReadOnlySpan<char> value);
        …
    }
}

Separated out as https://github.com/dotnet/corefx/issues/22378 for implementation.

In addition to the new ctor on String, we should also expose an additional Create method (not a ctor so as to support a generic method argument). One of the difficulties today with String being immutable is it’s more expensive for developers to create Strings with custom logic, e.g. filling a char[] and then creating a String from that. To work around that expense, some developers have taken to mutating strings, which is very much frowned upon from a BCL perspective, e.g. by using the String(char, int) ctor to create the string object, then using unsafe code to mutate it, and then handing that back. We could handle such patterns by adding a method like this:

namespace System
{
    public delegate void StringCreateAction<TState>(TState state, Span<char> destination);

    public class String
    {
        public static string Create<TState>(int length, TState state, StringCreateAction action);
        …
    }
}

Separated out as https://github.com/dotnet/corefx/issues/22380 for implementation.

The value of overloads like this is amplified when you start using them together. For example, here’s an example of creating a random string: https://referencesource.microsoft.com/#System.Web.Mobile/UI/MobileControls/Adapters/ChtmlTextBoxAdapter.cs,62 This incurs the allocation of a byte[] to pass to Random, a char[] as a temporary from which to create the string, and then creating the string from that char[]; unnecessary copies and allocations.

Finally, we may also want to provide string.Format support for ReadOnlyBuffer<char> as an argument; although passing it as an object will box it, string.Format could write it to the generated string without needing an intermediary string created, so a smaller allocation and avoiding an extra copy.

System.IO.Stream

namespace System.IO
{
    public class Stream
    {
        public virtual int Read(Span<byte> destination);
        public virtual ValueTask<int> ReadAsync(Buffer<byte> destination, CancellationToken cancellationToken = default(CancellationToken));

        public virtual void Write(ReadOnlySpan<byte> source);
        public virtual Task WriteAsync(ReadOnlyBuffer<byte> source, CancellationToken cancellationToken = default(CancellationToken));
        …
    }
}

EDIT 7/18/2017: Updated per API review. Separated out as https://github.com/dotnet/corefx/issues/22381 for implementation.

The base {ReadOnly}Buffer-accepting methods can use TryGetArray to access a wrapped array if there is one, then delegating to the existing array-based overloads. If the buffer doesn’t wrap an array, then it can get a temporary array from ArrayPool, delegate and copy (for reads) or copy and delegate (for writes).

The base Span-accepting methods can do the ArrayPool/delegation approach in all cases.

We’ll then need to override these methods on all relevant streams: FileStream, NetworkStream, SslStream, DeflateStream, CryptoStream, MemoryStream, UnmanagedMemoryStream, PipeStream, NullStream, etc. to provide more efficient Span/Buffer-based implementations, which they should all be able to do. In a few corner cases, there are streams where the synchronous methods are actually implemented as wrappers for the asynchronous ones; in such cases, we may be forced to live with the ArrayPool/copy solution. However, in such cases, the synchronous implementation is already relatively poor, incurring additional costs (e.g. allocations, blocking a thread, etc.), and they’re that way in part because synchronous usage of such streams is discouraged and thus these haven’t been very optimized, so the extra overhead in these cases is minimal.

Once these additional methods exist, we’ll want to use them in a variety of places in implementation around corefx, e.g. https://github.com/dotnet/corefx/blob/d6b11250b5113664dd3701c25bdf9addfacae9cc/src/Common/src/System/Net/WebSockets/ManagedWebSocket.cs#L1140 as various pieces of code can benefit not only from the use of Span/Buffer, but also from the async methods returning ValueTask<T> instead of Task<T>, and the corresponding reduced allocations for ReadAsync calls that complete synchronously.

System.IO.BufferStream and System.IO.ReadOnlyBufferStream

Just as we have a MemoryStream that works with byte[] and an UnmanagedMemoryStream that works with a byte* and a length, we need to support treating Buffer<byte> and ReadOnlyBuffer<byte> as streams. It’s possible we could get away with reimplementing MemoryStream on top of Buffer<byte>, but more than likely this would introduce regressions for at least some existing use cases. Unless demonstrated otherwise, we will likely want to have two new stream types for these specific types:

namespace System.IO
{
    public class BufferStream : Stream
    {
        public BufferStream(Buffer<byte> buffer);
        public BufferStream(Buffer<byte> buffer, bool publiclyVisible);
        public bool TryGetBuffer(out Buffer<byte> buffer);
        … // overrides of all members on Stream
    }

    public class ReadOnlyBufferStream : Stream
    {
        public ReadOnlyBufferStream(ReadOnlyBuffer<byte> buffer);
        public ReadOnlyBufferStream(ReadOnlyBuffer<byte> buffer, bool publiclyVisible);
        public bool TryGetBuffer(out ReadOnlyBuffer<byte> buffer);
        … // overrides of all members on Stream, with those that write throwing NotSupportedException
    }
}

The name of BufferStream is unfortunately close to that of BufferedStream, and they mean very different things, but I’m not sure that’s important enough to consider a less meaningful name. Separated out as https://github.com/dotnet/corefx/issues/22404 for discussion and implementation.

Optional: It’s also unfortunate that all of these various memory-based streams don’t share a common base type or interface; we may want to add such a thing, e.g. an interface that anything which wraps a Buffer<T> can implement… then for example code that uses Streams and has optimizations when working directly with the underlying data can query for the interface and special-case when the underlying Buffer<T> can be accessed, e.g.

namespace System
{
    public interface IHasBuffer<T>
    {
        bool TryGetBuffer(out Buffer<T> buffer);
        bool TryGetBuffer(out ReadOnlyBuffer<T> buffer);
    }
}

We could implement this not only on BufferStream and ReadOnlyBufferStream, but also on MemoryStream, UnmanagedMemoryStream, and even on non-streams, basically anything that can hand out a representation of its internals as buffers. Separated out as https://github.com/dotnet/corefx/issues/22404 for discussion and implementation.

System.IO.TextReader and System.IO.TextWriter

As with streams, we should add the relevant base virtuals for working with spans/buffers:

namespace System.IO
{
    public class TextReader
    {
        public virtual int Read(Span<char> span);
        public virtual ValueTask<int> ReadAsync(Buffer<char> buffer);
        public virtual int ReadBlock(Span<char> span);
        public virtual ValueTask<int> ReadBlockAsync(Buffer<char> buffer);
        …
    }

    public class TextWriter
    {
        public virtual void Write(ReadOnlySpan<char> span);
        public virtual Task WriteAsync(ReadOnlyBuffer<char> buffer, CancellationToken cancellationToken = default(CancellationToken));
        public virtual void WriteLine(ReadOnlySpan<char> span);
        public virtual Task WriteLineAsync(ReadOnlyBuffer<char> buffer, CancellationToken cancellationToken = default(CancellationToken));
        …
    }
}

We’ll then also need to override these appropriately on our derived types, e.g. StreamReader, StreamWriter, StringReader, etc.

Separated out as https://github.com/dotnet/corefx/issues/22406 for implementation.

System.IO.BinaryReader and System.IO.BinaryWriter

As with TextReader/Writer, we also want to enable the existing BinaryReader/Writer to work with spans:

namespace System.IO
{
    public class BinaryReader
    {
        public virtual int Read(Span<byte> span);
        public virtual int Read(Span<char> span);
        …
    }

    public class BinaryWriter
    {
        public virtual int Write(ReadOnlySpan<byte> span);
        public virtual void Write(ReadOnlySpan<char> span);
        …
    }
}

The base implementations of these can work with the corresponding new methods on Stream. Separated out as https://github.com/dotnet/corefx/issues/22428 and https://github.com/dotnet/corefx/issues/22429 for implementation.

System.IO.File

EDIT 6/26/2017: For now I've removed these File methods, as it's unclear to me at this point that they actually meet the core need. If you're repeatedly reading and pass a span that's too small, you're going to need to read again after handling the read data, but with these APIs as defined, each read results in needing to open and the close the file again; similarly if you're using the write APIs to write a file in pieces. At that point you're better off just using FileStream and reading/writing spans and buffers. We should look to add helpers here when we can figure out the right helpers at the right level of abstraction, e.g. should we expose the ability to just create a SafeFileHandle and then have these helpers operate on SafeFileHandle rather than on a string path? The File class provides helpers for reading/writing data from/to files. Today the various “read/write data as bytes” functions work with byte[]s, which mean allocating potentially very large byte[]s to store the data to be written or that’s read. This can be made much more efficient with spans, allowing the buffers to be pooled (in particular for file reads).

// Previously proposed.  API decision: Won't add for now.
namespace System.IO
{
    public static class File
    {
        public static int ReadAllBytes(string path, long offset, Span<byte> span, out long remaining);
        public static ValueTask<(int bytesRead, long bytesRemaining)> ReadAllBytesAsync(string path, long offset, Buffer<byte> buffer);

        public static void WriteAllBytes(string path, ReadOnlySpan<byte> span);
        public static Task WriteAllBytesAsync(string path, ReadOnlyBuffer<byte> buffer);

        public static void AppendAllBytes(string path, ReadOnlySpan<byte> span);
        public static Task AppendAllBytesAsync(string path, ReadOnlyBuffer<byte> buffer);
        …
    }
}

System.Text.StringBuilder

StringBuilder is at the core of lots of manipulation of chars, and so it’s a natural place to want to work with Span<char> and ReadOnlySpan<char>. At a minimum we should add the following APIs to make it easy to get data in and out of a StringBuilder without unnecessary allocation:

namespace System.Text
{
    public class StringBuilder
    {
        public StringBuilder Append(ReadOnlySpan<char> value);
        public StringBuilder Insert(int index, ReadOnlySpan<char> value);
        public void CopyTo(int sourceIndex, Span<char> destination, int count);
        …
    }
}

Separated out as https://github.com/dotnet/corefx/issues/22430.

In addition to these, we’ll also want to consider a mechanism for exposing the one or more buffers StringBuilder internally maintains, allowing them to be accessed as ReadOnlySpan<char>. This is covered separately by dotnet/runtime#22371.

System.Text.Encoding

The Encoding class already exposes a lot of methods for converting between chars/strings and bytes, and it includes overloads for both arrays and pointers. While it might seem onerous to add an additional set for spans, the number of new overloads needed is actually fairly small, and we can provide reasonable default base implementations in terms of the existing pointer-based overloads. These methods could potentially have optimized derived overrides, but more generally they will help the platform to have a consistent feel, avoiding the need for devs with spans to use unsafe code to access the pointer-based methods.

namespace System.Text
{
    public class Encoding
    {
        public virtual int GetByteCount(ReadOnlySpan<char> chars);
        public virtual bool TryGetBytes(ReadOnlySpan<char> chars, Span<byte> bytes, out int charsConsumed, out int bytesWritten);
        public virtual int GetCharCount(ReadOnlySpan<byte> bytes);
        public virtual bool TryGetChars(ReadOnlySpan<byte> bytes, Span<char> chars, out int bytesConsumed, out int charsWritten);
        public virtual string GetString(ReadOnlySpan<byte> bytes);
        public virtual ReadOnlySpan<byte> Preamble { get; }
        …
    }
}

Separated out as https://github.com/dotnet/corefx/issues/22431.

System.Numerics

namespace System.Numerics
{
    public struct BigInteger
    {
        public BigInteger(ReadOnlySpan<byte> value);

        public int GetByteCount();
        public bool TryWriteBytes(Span<byte> destination, out int bytesWritten);

        public static BigInteger Parse(ReadOnlySpan<char> value, NumberStyles style = NumberStyles.Integer, IFormatProvider provider = null);
        public static bool TryParse(ReadOnlySpan<char> value, out BigInteger result, NumberStyles style = NumberStyles.Integer, IFormatProvider provider = null);
        …
    }
}

Separated out in https://github.com/dotnet/corefx/issues/22401 for implementation.

Even with BigInteger being a less-commonly-used type, there are still places even in our own code where this would be helpful: https://source.dot.net/#System.Security.Cryptography.X509Certificates/Common/System/Security/Cryptography/DerSequenceReader.cs,206 In fact, BigInteger’s implementation itself could benefit from this, such as with parsing: https://referencesource.microsoft.com/#System.Numerics/System/Numerics/BigNumber.cs,411

System.Net.IPAddress

It’s very common to create IPAddresses from data gotten off the network, and in high-volume scenarios. Internally we try to avoid creating byte[]s to pass to IPAddress when constructing them, such as by using an internal pointer-based ctor, but outside code doesn’t have that luxury. We should make it possible to go to and from IPAddress without additional allocation:

namespace System.Net
{
    public class IPAddress
    {
        public IPAddress(ReadOnlySpan<byte> address);
        public IPAddress(ReadOnlySpan<byte> address, long scopeid);

        public bool TryWriteBytes(Span<byte> destination, out int bytesWritten);

        public static IPAddress Parse(ReadOnlySpan<char> ipChars);
        public static bool TryParse(ReadOnlySpan<char> ipChars, out IPAddress address);
        public static bool TryFormat(Span<char> destination, out int charsWritten);
        …
    }
}

Even internally this will be useful in cases like these: https://source.dot.net/#System.Net.NetworkInformation/Common/System/Net/SocketAddress.cs,135 https://source.dot.net/#System.Net.NameResolution/Common/System/Net/Internals/IPAddressExtensions.cs,21

EDIT 7/25/2017: Updated based on API review Separated out as https://github.com/dotnet/corefx/issues/22607 for implementation.

System.Net.Sockets

This is one of the more impactful areas for spans and buffers, and having these methods will be a welcome addition to the Socket family.

Today, there are lots of existing methods on sockets, e.g. an overload for sync vs async with the APM pattern vs async with Task vs async with SocketAsyncEventArgs, and overload for Send vs Receive vs SendTo vs ReceiveMessageFrom vs… etc. I do not think we should add an entire new set of span/buffer-based methods, and instead we should start with the most impactful. To me, that means adding the following two synchronous and two Task-based asynchronous overloads for sending and receiving:

namespace System.Net.Sockets
{
    public class Socket
    {
        public int Receive(Span<byte> buffer, SocketFlags flags);
        public int Receive(Span<byte> buffer, SocketFlags socketFlags, out SocketError errorCode);

        public int Send(ReadOnlySpan<byte> buffer, SocketFlags flags);
        public int Send(ReadOnlySpan<byte> buffer, SocketFlags flags, out SocketError errorCode);
        …
    }

    public class SocketTaskExtensions
    {
        public static ValueTask<int> ReceiveAsync(this Socket socket, Buffer<byte> buffer, SocketFlags socketFlags, CancellationToken cancellationToken = default(CancellationToken));
        public static ValueTask<int> SendAsync(this Socket socket, ReadOnlyBuffer<byte> buffer, SocketFlags socketFlags, CancellationToken cancellationToken = default(CancellationToken));
        …
    }
}

Note that I’ve put the ReceiveAsync and SendAsync overloads on the SocketTaskExtensions class as that’s where the existing overloads live. We could choose to instead make these new overloads instance methods on Socket.

In addition, the highest-performing set of APIs with sockets are based on SocketAsyncEventArgs. To support those, we should add the following:

namespace System.Net.Sockets
{
    public class SocketAsyncEventArgs
    {
        public void SetBuffer(Buffer<byte> buffer);
        public Buffer<byte> GetBuffer();
        …
    }
}

The implementation currently stores a byte[] buffer. We can change that to store a Buffer<byte> instead, and just have the existing SetBuffer(byte[]) overload wrap the byte[] in a Buffer<byte>. There is an existing Buffer { get; } property as well. We can change that to just use TryGetArray on the internal buffer and return it if it exists, or else null. The GetBuffer method is then there to support getting the set Buffer<byte>, in case the supplied buffer wrapped something other than an array.

EDIT 7/25/2017: Updated based on API review Separated out as https://github.com/dotnet/corefx/issues/22608 for implementation

System.Net.WebSockets.WebSocket

Similar to Socket is WebSocket, in that it’s desirable to be able to use these APIs with finer-grain control over allocations than is currently easy or possible, and in high-throughput situations. We should add the following members:

namespace System.Net.WebSockets
{
    public class WebSocket
    {
        public virtual ValueTask<ValueWebSocketReceiveResult> ReceiveAsync(Buffer<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
        public virtual Task SendAsync(ReadOnlyBuffer<byte> buffer, CancellationToken cancellationToken = default(CancellationToken));
        …
    }
}

Note that ReceiveAsync is returning a ValueTask rather than a Task (to avoid that allocation in the case where the operation can complete synchronously), and it’s wrapping a ValueWebSocketReceiveResult instead of a WebSocketReceiveResult. The latter is the type that currently exists, but is a class. I’m suggesting we also add the following struct-based version, so that receives can be made to be entirely allocation-free in the case of synchronous completion, and significantly less allocating even for async completion.

namespace System.Net.WebSockets
{
    public struct ValueWebSocketReceiveResult
    {
        public ValueWebSocketReceiveResult(int count, WebSocketMessageType messageType, bool endOfMessage);
        public ValueWebSocketReceiveResult(int count, WebSocketMessageType messageType, bool endOfMessage, WebSocketCloseStatus? closeStatus, string closeStatusDescription);
        public WebSocketCloseStatus? CloseStatus { get; }
        public string CloseStatusDescription { get; }
        public int Count { get; }
        public bool EndOfMessage { get; }
        public WebSocketMessageType MessageType { get; }
     }
}

EDIT 7/25/2017: Updated per API review Separated out as https://github.com/dotnet/corefx/issues/22610 for implementation

System.Net.Http

This area still needs more thought. My current thinking is at a minimum we add the following:

namespace System.Net.Http
{
    public class HttpClient
    {
        public ValueTask<int> GetBytesAsync(Uri requestUri, Buffer<byte> buffer);
        …
    }

    public sealed class ReadOnlyBufferContent : HttpContent
    {
        public ReadOnlyBufferContent(ReadOnlyBuffer<byte> buffer);
        … // override relevant members of HttpContent
    }
}

The GetBytesAsync overload avoids the need for allocating a potentially large byte[] to store the result, and the ReadOnlyBufferContent makes it easy to upload ReadOnlyBuffer<byte>s rather than just byte[]s.

There is potentially more that can be done here, but I suggest we hold off on that until we investigate more deeply to understand what we’d need to plumb throughout the system. HttpClient is itself a relatively small wrapper on top of HttpClientHandler, of which there are multiple implementations (ours and others’), and to plumb a buffer all the way down through would involve new APIs and implementation in HttpClientHandler implementations. Definitely something to investigate.

Separated out as https://github.com/dotnet/corefx/issues/22612 for implementation (ReadOnlyBufferContent... we decided against the GetBytesAsync helper for now in 7/25/2017 API review)

System.Security.Cryptography namespace

Several hashing-related types in System.Security.Cryptography would benefit from being able to work with data without allocating potentially large byte[]s. One is HashAlgorithm, which has several methods for processing an input byte[] and allocating/filling an output byte[]:

namespace System.Security.Cryptography
{
    public abstract class HashAlgorithm
    {
        public bool TryComputeHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        protected virtual void HashCore(ReadOnlySpan<byte> source);
        protected virtual bool TryHashFinal(Span<byte> destination, out int bytesWritten);
        …
    }
}

EDIT 7/25/2017: Updated per API review Separated out as https://github.com/dotnet/corefx/issues/22613 for implementation

Similarly, several overloads would be beneficial on IncrementalHash:

namespace System.Security.Cryptography
{
    public sealed class IncrementalHash
    {
        public void AppendData(ReadOnlySpan<byte> data);
        public bool TryGetHashAndReset(Span<byte> destination, out int bytesWritten);
        …
    }
}

EDIT 7/25/2017: Updated per API review Separated out as https://github.com/dotnet/corefx/issues/22614 for implementation

Similarly, several encryption/signing related types would benefit from avoiding such byte[] allocations:

namespace System.Security.Cryptography
{
    public abstract class RSA : AsymmetricAlgorithm
    {
        public virtual bool TryDecrypt(ReadOnlySpan<byte> source, Span<byte> destination, RSAEncryptionPadding padding, out int bytesWritten);
        public virtual bool TryEncrypt(ReadOnlySpan<byte> source, Span<byte> destination, RSAEncryptionPadding padding, out int bytesWritten);
        protected virtual bool TryHashData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, out int bytesWritten);
        public virtual bool TrySignData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, RSASignaturePadding padding, out int bytesWritten);
        public virtual bool TrySignHash(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, RSASignaturePadding padding, out int bytesWritten);
        public virtual bool VerifyData(ReadOnlySpan<byte> data, ReadOnlySpan<byte> signature, HashAlgorithmName hashAlgorithm, RSASignaturePadding padding);
        public virtual bool VerifyHash(ReadOnlySpan<byte> hash, ReadOnlySpan<byte> signature, HashAlgorithmName hashAlgorithm, RSASignaturePadding padding);
        …
    }

    public abstract class DSA : AsymmetricAlgorithm
    {
        public virtual bool TryCreateSignature(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        protected virtual bool TryHashData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, out int bytesWritten);
        public virtual bool TrySignData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, out int bytesWritten);
        public virtual bool VerifyData(ReadOnlySpan<byte> data, ReadOnlySpan<byte> signature, HashAlgorithmName hashAlgorithm);
        public virtual bool VerifySignature(ReadOnlySpan<byte> rgbHash, ReadOnlySpan<byte> rgbSignature);
        …
    }

    public abstract class ECDsa : AsymmetricAlgorithm
    {
        protected virtual bool TryHashData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, out int bytesWritten);
        public virtual bool TrySignData(ReadOnlySpan<byte> source, Span<byte> destination, HashAlgorithmName hashAlgorithm, out int bytesWritten);
        public virtual bool TrySignHash(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
        public virtual bool VerifyData(ReadOnlySpan<byte> data, ReadOnlySpan<byte> signature, HashAlgorithmName hashAlgorithm);
        public virtual bool VerifyHash(ReadOnlySpan<byte> hash, ReadOnlySpan<byte> signature);
        …
    }
}

EDIT 7/25/2017: Update per API review Separated out of https://github.com/dotnet/corefx/issues/22615 for implementation

One other place where we’d want to be able to use span is ICryptoTransform. Lots of APIs, mainly CreateEncryptor and CreateDecryptor, methods return ICryptoTransform, and it has one it a TransformBlock and TransformFinalBlock method that works with byte[]. We’d really like support in terms of ReadOnlySpan<byte> and Span<byte>. I see four options here:

Do nothing.
Add a second ISpanCryptoTransform interface, and a second set of differently-named CreateEncryptor/Decryptor methods for creating instances of these.
Add a second ISpanCryptoTransform interface that’s also implemented by our implementations, and have consuming code query for the interface and use it if it exists.
If we get support for default interface methods (https://github.com/dotnet/csharplang/issues/52), add these span-based overloads to the interface, with a default implementation in terms of the existing methods. My strong preference is for (4), but that’s based on a feature that doesn’t yet exist. As such, my suggestion is to hold off on this until it’s clear whether such a feature will exist: if it will, do (4) once it does, otherwise do (3).

EDIT 7/25/2017: Decision: 4 if/when the feature exists, and 1 until then.

Other namespaces

There are some other namespaces that could likely benefit from Span/Buffer, but we should probably handle separately from this initial push:

System.Collections. It’s not clear to me to what extent we should add span/buffer-based APIs to collections. CopyTo overloads that take Span<T>? Callback-based methods for exposing the internal storage via a ReadOnlySpan<T> (if that makes sense for the relevant collection)? This needs more thought.
System.Collections.Immutable. Just as T[] has an implicit operator to a Span<T>, we should consider adding an ImmutableArray<T> operator for converting to a ReadOnlySpan<T>.
System.Reflection.Metadata. It’s very specialized, and I’ve not included it in this issue. It can be handled separately.

Known Open Issues

There are several cross-cutting open issues that apply to many of these APIs:

How to handle cases where data is too large for the supplied span/buffer provided? Throw? Simply return a “not big enough” indication? Return how much data was written along with such an indication? Return how much data was written and attempt to return how much more space is needed if possible? Etc. Related to that, do we want to enable an approach where you can call the method with an empty output span and get back an upper bound for the maximum amount of space that will be needed, such that you can then call it again with a sufficient buffer?
Argument names. For overloads, should we generally name arguments “span” for ReadOnlySpan<T> and Span<T>, and “buffer” for ReadOnlyBuffer<T> and Buffer<T>? What if the existing overload uses a different name, like “input” or “b”? What about cases where currently there’s a single argument (e.g. “input”) and then returns an array, and the new overload will have two span args, one input and one output?

Next Steps

Discussion of this “minimal” proposal. My goal is to quickly arrive at a minimal set; there may be additional APIs we want beyond these, but are there any here we shouldn’t have? Are there any absolutely critical ones missing?
API review. I want to review all of these en-mass to ensure we’re being consistent across all of these members.
Once the APIs are approved, open individual issues for each set that’ll be worked on by an individual together.
Implement them, test them, and expose them.

justinvp commented 7 years ago

System.Text.Encoding

Not sure if it should be considered as part of this effort or tracked separately, but it'd be nice if there was a non-allocating way to get an Encoding's preamble. Right now, the only way to get it is by calling Encoding.GetPreamble which returns a byte[] array.

I suggest two new methods that match the naming/style of other methods on Encoding:

namespace System.Text
{
    public class Encoding
    {
        public virtual int GetPreambleByteCount();
        public virtual int GetPreambleBytes(Span<byte> bytes);
    }
}

The first method can be used to get the size of the preamble in bytes (and would also serve as a non-allocating way to determine if the encoding has a preamble at all, which would be useful in optimizations like here).
The second method would be for copying the preamble to a Span<byte> without allocating.

We could provide a default implementation based on the existing GetPreamble and provide more efficient overrides for built-in Encoding subclasses (e.g. UTF8Encoding, UnicodeEncoding, etc.)

jkotas commented 7 years ago

public virtual int GetPreambleByteCount();
public virtual int GetPreambleBytes(Span<byte> bytes)

You can have just one non-allocating ReadOnlySpan<byte> GetPreamble() method instead.

justinvp commented 7 years ago

You can have just one non-allocating ReadOnlySpan<byte> GetPreamble() method instead.

Nice. But what would we call it? We can't overload on return type. GetPreambleSpan?

Edit: Maybe a property? ReadOnlySpan<byte> Preamble { get; }?

justinvp commented 7 years ago

DateTime DateTimeOffset TimeSpan

What about ParseExact?

System.Guid

I see TryParse mentioned. What about Parse and ParseExact?

System.Numerics.BigInteger System.Net.IPAddress

These have Parse and TryParse methods. Should we do those as well?

What about System.Enum's Parse/TryParse?

stephentoub commented 7 years ago

ReadOnlySpan Preamble { get; }

Seems worthwhile. I've updated the proposal with it.

Presumably the base implementation would just be:

public virtual ReadOnlySpan<byte> Preamble => GetPreamble();

and then derived implementations would do something like (e.g. for UTF8):

private static readonly byte[] s_preamble = new byte[] { 0xEF, 0xBB, 0xBF };
...
public override ReadOnlySpan<byte> Preamble => _emitUTF8Identifier ?
    new ReadOnlySpan<byte>(s_preamble) :
    ReadOnlySpan<byte>.Empty;

But that means that a misbehaving consumer could get a reference to the contents via DangerousGetPinnableReference and mutate the shared preamble state that others in the process could subsequently be using. @jkotas, since you suggested this, presumably we're ok with such abuse? Otherwise it seems like we'd need to go with a GetPreambleBytes(ReadOnlySpan<char>) approach like Justin originally suggested, such that the preamble would be copied to the supplied span and the original would remain immutable and inaccessible.

stephentoub commented 7 years ago

System.Guid, System.Numerics.BigInteger, System.Net.IPAddress

Yes, I meant to have both Parse and TryParse on these, but apparently didn't transfer them over properly from my notes. Fixed.

What about ParseExact?

@justinvp, why do you ask about it? You think it's critical to have across the board? I'd left it out intentionally as I've not seen it used frequently or in hot path situations, and it could always be added subsequently, but maybe I'm misinformed?

What about System.Enum's Parse/TryParse?

The implementations today are already allocation-heavy. If/when we can improve the implementations to be better, and if they're used on paths where it matters, then I could see adding them. But right now it doesn't seem like cost of getting a string from a ReadOnlySpan<char> would matter. Do you disagree?

jkotas commented 7 years ago

@jkotas, since you suggested this, presumably we're ok with such abuse?

Yes, that's why the method has the Dangerous prefix.

jnm2 commented 7 years ago

Really not a different situation than different than fixed with a string, right?

jkotas commented 7 years ago

Really not a different situation than different than fixed with a string, right?

Right.

stephentoub commented 7 years ago

Yes, that's why the method has the Dangerous prefix.

Ok. It just feels different than some of the other cases we've discussed, as it's not just allowing someone to mutate local state that was supposed to be read-only, but global state that's shared by others. But I'm ok with it as long as we understand the consequences. And I realize it's just an extension of some of the other cases we've discussed.

Really not a different situation than different than fixed with a string, right?

That requires unsafe code and pointers, at which point all bets are off. This doesn't.

jnm2 commented 7 years ago

That requires unsafe code and pointers, at which point all bets are off. This doesn't.

Are there scenarios where all bets aren't off when you use APIs with Dangerous in the name?

stephentoub commented 7 years ago

Are there scenarios where all bets aren't off when you use APIs with Dangerous in the name?

There's a difference between a naming convention and having unfettered ability to write to any arbitrary location in memory. I'm not arguing we should do something differently, just pointing out there is a meaningful difference between the two, at least to me.

jkotas commented 7 years ago

That requires unsafe code and pointers, at which point all bets are off. This doesn't.

The unsafe keyword just prevents you from using unsafe "all bets are off" language features. It does not prevent you from using equivalent library APIs. It has been the case since .NET Framework 1.0.

https://github.com/dotnet/roslyn-analyzers/issues/972 is a suggestion on how to fix this hole.

having unfettered ability to write to any arbitrary location in memory.

Marshal APIs (and many other more subtle APIs) let you do that without unsafe keyword since .NET Framework 1.0.

Are there scenarios where all bets aren't off when you use APIs with Dangerous in the name?

Yes, it has been the convention that we have been using recently. It is not new. It has been used (inconsistently) in the past - e.g. for APIs like SafeHandle.DangerousAddRef().

stephentoub commented 7 years ago

Marshal APIs (and many other more subtle APIs) let you do that without unsafe keyword since .NET Framework 1.0.

Yeah, good point.

jnm2 commented 7 years ago

Heh heh, reminds me of https://stackoverflow.com/questions/1128315/find-size-of-object-instance-in-bytes-in-c-sharp#comment45853923_1128674 😀

justinvp commented 7 years ago

What about ParseExact?

@justinvp, why do you ask about it? You think it's critical to have across the board? I'd left it out intentionally as I've not seen it used frequently or in hot path situations, and it could always be added subsequently, but maybe I'm misinformed?

The reason I asked about ParseExact was more of for consistency. It seems like there may be cases where you'd want to use ParseExact without having to allocate a substring, and it'd be a little annoying that you can do it for Parse/TryParse but not ParseExact. I don't have a specific need at the moment, so I'm fine with holding off, but I do think we'll eventually want it.

What about System.Enum's Parse/TryParse?

The implementations today are already allocation-heavy. If/when we can improve the implementations to be better, and if they're used on paths where it matters, then I could see adding them. But right now it doesn't seem like cost of getting a string from a ReadOnlySpan<char> would matter. Do you disagree?

I don't disagree. The reason I asked was for consistency with the other Parse/TryParse methods that are being proposed to be added throughout the base classes and just wanted to make sure it wasn't an oversight.

danmoseley commented 7 years ago

@davidfowl

stephentoub commented 7 years ago

@KrzysztofCwalina, @terrajobst, how are we thinking about ReadOnlySpan<char> as a substitution for strings? For example, I looked around corefxlab, but couldn't find a set of operations on ReadOnlySpan<char> that mirrored the various instance methods on string, like Trim, Replace, Compare, StartsWith, etc. Do they exist somewhere? I don't think this issue is the right place to propose this, but as I'm looking at what it'll take to implement some of these methods, I'm finding a need for at least some such operations, and if we expect ReadOnlySpan<char> to be generally usable in this fashion, others will obviously need some.

EDIT: I see we have some extensions related to this in System.Memory in corefx. We'll likely need some of those extensions moved to corelib to enable implementations there, in particular support for faster copying and comparing.

stephentoub commented 7 years ago

(I edited the initial post to tweak some of the signatures based on offline feedback from @KrzysztofCwalina , related to my open issues question around status reporting. It'd be great if folks could look at the specifics of the APIs and whether they meet needs regarding that.)

davidfowl commented 7 years ago

We also need to add methods to convert from primitive types to Span<byte> in different encodings. For example here is the code I need to write today if I want to format a integer into a UTF8 encoded byte[] here's what the code looks like:

var lengthString = payload.Length.ToString(CultureInfo.InvariantCulture);
var utf8LengthBytes = Encoding.UTF8.GetBytes(lengthString);

Get the UTF16 representation as a string
Encoding the UTF16 string into a UTF8 byte[]

Instead, it would be great if we could do something like this:

var buffer = new byte[10]; // max int
int written = payload.Length.CopyTo(buffer, CultureInfo.InvariantCulture, Encoding.UTF8);

Format directly into a Span<byte> with the right encoding without the intermediate transformations:

public struct Int32
{
    public int CopyTo(Span<byte> bytes, CultureInfo cultureInfo, Encoding encoding);
}

stephentoub commented 7 years ago

Thanks, @davidfowl. I agree that was an omission from my proposal and we should definitely include such core formatting APIs. A few thoughts/questions:

I assume that should be Span rather than ReadOnlySpan
I wonder if we really want to combine the character generation and the encoding like that. Would it not be more composable to have a separate public bool TryCopyTo(Span<char> chars, out int charsWritten, IFormatProvider provider = null); effectively to get the ToString representation, and then separately that could be used with the Encoding methods? There could then be a helper on top to combine them if necessary, but this would allow TryCopyTo to be used in situations where you really do just want the characters, formatting strings to larger strings, etc.

davidfowl commented 7 years ago

It's definitely more composable but less efficient when I just want encoded bytes (not chars) without the intermediate step. Maybe the UTF8 focused formatters in corefxlab work better for this scenario since they are particularly optimized for UTF8 bytes (the web pretty much decided on UTF8 http://utf8everywhere.org/)

stephentoub commented 7 years ago

but less efficient when I just want encoded bytes (not chars) without the intermediate step. Maybe the UTF8 focused formatters in corefxlab work better for this scenario

Yeah, I think that's probably the case. I'm inclined to say the core types in coreclr/corefx should start with the constituent operations, and we start with the more efficient/combined ones in the more-advanced-optimized-for-UTF8-and-every-last-ounce-of-perf-specialized-parsing/formatting library. @KrzysztofCwalina, opinion?

shaggygi commented 7 years ago

Will there be an easy way to parse to types based on a length instead of number of expected bytes based on types? Would it be possible to overload with a "length" or "numberOfBytes" parameter to use when parsing similar to the Invariant types (e.g. InvariantUtf8) CoreFxLab?

stephentoub commented 7 years ago

@shaggygi, I'm unclear what you're asking for. Given, e.g.

public struct Int16
{
    public bool TryParse(ReadOnlySpan<char> chars, out short result, NumberStyles style = NumberStyles.Integer, NumberFormatInfo info = null);
    …
}

what would the value of an overload that takes a length be? Why wouldn't you just slice the chars?

Or are you specifically talking about skipping chars entirely and parsing from bytes to types based on a specific encoding? For that I'm inclined to say that such use is the domain of the corefxlab libs, at least for this first go-around with getting a core set of APIs added to the core types.

shaggygi commented 7 years ago

@stephentoub I might have overlooked the Invariant APIs in CoreFxLab. There are cases today where I have a byte[] and need to read a number of bytes into a certain type (e.g. uint). Being 4 bytes for this type, it seems like I needed 4 bytes to convert to/from the type and byte[].

I was thinking the Invariant APIs included a "length" parameter so you could specifically tell how many bytes to use during the conversions. For example, there is a protocol I work with that includes number of bytes that are included in the packet further downstream to process to read a particular type. Since I know how many bytes, I could pass that into the method to parse into a particular type. I guess I could cast and use a ushort if I needed to use 2 bytes instead of 4 in this example.

So if my assumption was correct on the Invariant APIs, I was suggesting to add an overloading method to this topic for the TryParse and ToXXX methods.

stephentoub commented 7 years ago

@shaggygi, sorry, I'm still not clear on what you're suggesting. Can you propose a complete API signature and an example usage of that so that I can better understand the scenario? Sorry for being dense.

mellinoe commented 7 years ago

Maybe the UTF8 focused formatters in corefxlab work better for this scenario since they are particularly optimized for UTF8 bytes (the web pretty much decided on UTF8 http://utf8everywhere.org/)

FWIW, it's not just the web. Almost all of the game- and graphics-related native libraries also use UTF8 exclusively, even on Windows. OpenGL, Vulkan, SDL2, Assimp, imgui are some examples.

stephentoub commented 7 years ago

I'm walking a potentially fine line here. We've got all of these formatters and parsers in the works in corefxlab, and they're focused on and tuned for these UTF8 scenarios, while also employing powerful but more complex APIs. I could see us taking a few routes:

That's all we do around formatting/parsing for primitives, and we don't add any such APIs to the core primitives in coreclr/corefx, e.g. we don't add int.{Try}Parse overloads.
We add the "simple" overloads to coreclr/corefx and leave the more advanced and powerful APIs to this separate package. This does mean there's some overlap in functionality, but it also means that simple code remains simple, albeit with fewer allocations, e.g. in a discoverable way, you can tweak existing code from int.Parse(str.Substring(1,3)) to instead be int.Parse(str.AsSpan().Slice(1,3)), and now you're no longer allocating that intermediate string. But if you wanted to do something more intense, like know how many chars were consumed from the source span so that could move your parser along to parse the next item, you'd want to go to the dedicated formatters/parsers library.
Add all of the functionality from the formatters/parsers library to the core types in coreclr/corefx.

In this issue I've taken the approach of (2), which essentially boiled down to looking at existing APIs and creating overloads that created the smallest meaningful mapping between arrays/strings/pointers and spans/buffers. We could of course add additional UTF8-focused overloads, but I do think they would be additive rather than replacements, because we do want those "easy" situations like I mentioned to remain easy, and then as we add these we very quickly end up with (3). Maybe that's where we'll eventually end up, but I'm very hesitant to start there.

Reasonable?

shaggygi commented 7 years ago

@stephentoub Sorry for not explaining well. Let me try again.

Say I have an array like below. This tries to simplify as it represents 1 byte for length and the remaining bytes (3) is for the value (1193046) of the uint type. byte[] data = { 0x03, 0x56, 0x34, 0x12 };

To my knowledge, you would not be able to get the value just by using the following method because there is only 3 bytes.

public static ushort ToUInt16(ReadOnlySpan<byte> value);

I'm suggesting it would be nice to have something like:

public static uint ToUInt32(ReadOnlySpan<byte> value, int length);

Where "length" is the number of bytes to use to get the uint value. I'm under the impression that is what the CoreFxLab Invariant APIs will allow. Meaning, specify a number of bytes to use to get value for a certain type. Hence the name "invariant" I thought.

NOTE: First, I would need to write the length of 3 to the byte array to represent the next 3 bytes represent the value, but I didn't include just to focus on getting the value to array below.

This goes the same for converting the value to array, as well. If I use the following, the result of bytes would be { 0x56, 0x34, 0x12, 0x00 }. I don't want the 4th byte (0x00) in this case.

public static bool TryCopyBytes(uint value, Span<byte> bytes); TryCopyBytes((uint)1193046, bytes);`

It seems like you would have another "length" parameter like the following where the result of bytes would be { 0x56, 0x34, 0x12 }.

public static bool TryCopyBytes(uint value, Span<byte> bytes, int length); TryCopyBytes((uint)1193046, bytes, 3);

KrzysztofCwalina commented 7 years ago

Yeah, I think that's probably the case. I'm inclined to say the core types in coreclr/corefx should start with the constituent operations, and we start with the more efficient/combined ones in the more-advanced-optimized-for-UTF8-and-every-last-ounce-of-perf-specialized-parsing/formatting library. @KrzysztofCwalina, opinion?

Yeah, I think System.Text.Primitives (from corfxlab) should be used to format to UTF8.

KrzysztofCwalina commented 7 years ago

One library I would add to the list is System.Text.Encodings.Web. It would be good to add methods that can encode from and into Span, and possibly UTF8 Span.

maloo commented 7 years ago

Any chance we could fix Parse/TryParse signatures so we don't have to keep wrapping them in extension methods.

public static int Parse(ReadOnlySpan<char> chars, int @default = default, NumberStyles style = NumberStyles.Integer, NumberFormatInfo info = null); public static int? TryParse(ReadOnlySpan<char> chars, NumberStyles style = NumberStyles.Integer, NumberFormatInfo info = null);

These allows things like var x = int.Parse("bad", 5); or list.Select(badStr => new { x = int.TryParse(badStr) ?? 5 })

It also feels more natural than the int.TryParse("bad", out var x) ? x : 5

jnm2 commented 7 years ago

The Try* pattern has really grown on me with out var. I don't really like the fact that I have to write .Value all over the place with the nullable pattern.

Drawaes commented 7 years ago

I also have rekindled my fondness for the Try pattern, it's cool again :)

jnm2 commented 7 years ago

Especially once they allow deconstruction in place like out var (x, y) =)

KrzysztofCwalina commented 7 years ago

Some additional feedback after reviewing this carefully:

Int16 (and others) will have TryFormat. Why not DateTime, DTO, TS, GUID, etc.?
Should string have AsSpan, as opposed to the extension method we have today?
Why do we need BinaryReader.Read(Span\<char>)? Wouldn’t Span overload be enough? Remember that you can cast Span\<char> to Span\<byte> at very low cost.
I think it would be great to have File.AppendAllBytes in addition to WriteAllBytes.
HttpClient.GetBytesAsync needs to return the number of bytes that were written to the buffer. Also, what happened when the response does not fin into the buffer?
As to collections. I think it would be good to add AsSpan to List\<T>, and possibly other array-backed collections.

Lastly, @stephentoub, great thanks for writing this proposal! It's great and detailed!

stephentoub commented 7 years ago

Thanks for reviewing, @KrzysztofCwalina!

One library I would add to the list is System.Text.Encodings.Web. It would be good to add methods that can encode from and into Span, and possibly UTF8 Span.

Makes sense. I'm not very familiar with the library, but it looks like it'd be the base TextEncoder class? Could you suggest what overloads should be added there?

Int16 (and others) will have TryFormat. Why not DateTime, DTO, TS, GUID, etc.?

Yeah, I should probably add something for those. Done.

Should string have AsSpan, as opposed to the extension method we have today?

No strong opinion. Let's consider that as part of dotnet/corefx#21395 rather than here.

Why do we need BinaryReader.Read(Span<char>)? Wouldn’t Span overload be enough? Remember that you can cast Span<char> to Span<byte> at very low cost.

BinaryReader's ctor takes an Encoding, so using the Span<byte>-based overload and then just casting is not necessarily the same as using the existing char[]-based overload, and since such a char[]-based overload exists, seems like we should have the Span<char>-based one, too.

I think it would be great to have File.AppendAllBytes in addition to WriteAllBytes.

Seems reasonable. Added.

HttpClient.GetBytesAsync needs to return the number of bytes that were written to the buffer. Also, what happened when the response does not fin into the buffer?

Yeah, that was an oversight, I've added the number of bytes written. As for what happens if the response doesn't fit into the buffer, you asked for this method, so you tell me :smile:... what behavior would you want?

As to collections. I think it would be good to add AsSpan to List, and possibly other array-backed collections.

This makes me nervous, exposing the internal storage used by the list and in a way that's trivial to get wrong, since the List<T> might replace its buffer any time various operations are called on it. I of course see the value, but if we did decide to do this, at the very least it would need to be called DangerousGetSpan, or something like that. Probably makes sense to open a separate issue to discuss adding such a method to collections, and which ones, if you'd like to do that.

svick commented 7 years ago

As to collections. I think it would be good to add AsSpan to List, and possibly other array-backed collections.

This makes me nervous, exposing the internal storage used by the list and in a way that's trivial to get wrong, since the List<T> might replace its buffer any time various operations are called on it. I of course see the value, but if we did decide to do this, at the very least it would need to be called DangerousGetSpan, or something like that. Probably makes sense to open a separate issue to discuss adding such a method to collections, and which ones, if you'd like to do that.

That issue already exists: https://github.com/dotnet/corefx/issues/19814.

stephentoub commented 7 years ago

That issue already exists

Thanks, I thought I remembered having this discussion recently :smile:

stephentoub commented 7 years ago

@shaggygi, thanks for explaining further. I could be wrong, but to me this does not seem like a very common need. I can imagine a variety of custom encoding needs folks might have, and for that there's nothing stopping you from writing your own helpers, e.g. pseudo-code:

public static unsafe bool TryCopyBytes(uint value, Span<byte> bytes, int length) =>
    new Span<byte>((byte*)&value, 4).Slice(0, length).CopyTo(bytes);

axelheer commented 7 years ago

Not only the API of BigInteger would benefit from this, but the internal implementation should be much safer and maybe a little bit faster based on this.

We currently use pointers to spare some index calculations (performance) and "abstract" the differences between ordinary arrays and stackalloc based stuff --> use spans instead
We currently copy the array containing the actual number after every operation to trim leading zeros --> use span.slice instead?
And maybe there's more...

Should I file a separate issue?

stephentoub commented 7 years ago

Should I file a separate issue?

Yes, thanks. There are many places Span could be used internally in implementation, to clean up code that's currently unsafe, to help avoid allocation in places where stack memory could be used instead of heap, to consolidate code paths where stack-vs-heap paths were already employed, etc.

danmoseley commented 7 years ago

@JeremyKuhne as mentioned offline, a number of these could be good projects for @MaggieTsang @niustephanie @nchikanov eg the BitConverter ones. They could separate those off and go through API review, implemtation etc.

stephentoub commented 7 years ago

They could separate those off and go through API review, implemtation etc.

Let's please wait until we've completed reviewing this issue all-up. At that point I don't believe there will be more review needed for the BitConverter APIs in particular. Also please check with me before starting on any work for these.

stephentoub commented 7 years ago

All of the proposed APIs have been reviewed, and individual issues created for each piece. As such, this issue has served its purpose and I'm closing it. Further discussion on individual topics can/should be done on the associated issues, or new issues created for anything deemed missing. Thanks!

clrjunkie commented 7 years ago

@stephentoub Isn’t it true that in order gain value from all these overloads one would require to allocate the Span backed memory region from a pool of heap memory, like a pre-allocated array of value/reference types, for all inputs and outputs? If that’s the case, I would suggest highlighting this at the very beginning.

stephentoub commented 7 years ago

Isn’t it true that in order gain value from all these overloads one would require to allocate the Span backed memory region from a pool of heap memory, like a pre-allocated array of value/reference types, for all inputs and outputs?

No. While that's one use case, there are others, for example slicing a string to get a substring but without allocating/copying to a new one, or using stackalloc to get the memory to use with the span.

clrjunkie commented 7 years ago

for example slicing a string to get a substring but without allocating/copying to a new one,

I can appreciate the value of having a movable window over the String characters, but wouldn't resulting Span of characters incur one allocation and one more when converting it back to String? Point being isn't this really beneficial when working with buffers (of characters in this case) from a pool?

or using stackalloc to get the memory to use with the span.

I understand, but that's not part of above Api's contract.

stephentoub commented 7 years ago

wouldn't resulting Span of characters incur one allocation

No, Span is a struct.

Point being isn't this really beneficial when working with buffers (of characters in this case) from a pool?

No, there are many use cases where a heap-based pool isn't involved. The ones I've already mentioned, others where you get native memory handed to you over interop, etc. It's of course also useful in the pooling scenario, but it's absolutely useful beyond that.

but that's not part of above Api's contract.

I don't understand the comment. With spans, I can write a single method that works in terms of a span, and then I can pass to it a span that represents either managed or unmanaged memory, e.g.

Span<byte> buffer;
if (size < StackThreshold)
{
    byte* p = stackalloc byte[size];
    buffer = new Span<byte>(p, size);
}
else
{
    buffer = new byte[size];
}
CallMethod(buffer);

dotnet / runtime