dotnet / csharplang

The official repo for the design of the C# programming language

Provide support for fixed capacity, variable length value type (inline) strings. #2099

Closed Korporal closed 5 years ago

Korporal commented 5 years ago

Strings in C# are perceived as buffers with an (to all intents and purposes) unlimited capacity and for this reason cannot be stored inline as primitive types are. I'm proposing that consideration be given to introducing an additional string type which has a capacity declared at runtime, and thus a maximum possible length.

This then makes it possible to define classes or structs which contain strings yet have these strings appear inline within the datum's memory, much as primitive types do.

This is a problem that came up in a sophisticated very high performance client server design in which we got huge benefits by being able to define fixed length messages that contained strings. In our case we simulated fixed capacity strings as properties that encapsulated fixed buffers (char or byte). This worked well but was messy because the language offers no way for us to 'pass' (at compile time) a length into a fixed buffer declaration, one must actually declare the fixed buffer explicitly with a constant.

As a result we created a huge family of types named like this: ANativeString_64 and UNativeString_128 (ansi and unicode variants) and so on, as I say this worked but was messy.

Each type was a pure struct (as in the new generic constraint 'unmanaged') so when used as member fields in other structs left that containing struct pure, giving us contiguous chunks of memory that contained strings.

As I say this worked very well but was messy under the hood and challenging to maintain.
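For readers unfamiliar with the pattern, here is a minimal sketch of what one member of such a family might look like. It is not the poster's actual code: the name `AString32`, the NUL-terminated ASCII encoding, and the choice of conversions are all assumptions made for illustration. It needs `unsafe` code enabled and a runtime with the span-based `Encoding` overloads (.NET Core 2.1 or later).

```csharp
using System;
using System.Text;

// Hypothetical sketch: the type name and encoding are assumptions,
// not the types described in the post.
public unsafe struct AString32
{
    public const int Capacity = 32;

    // Inline storage: these 32 bytes live inside the struct itself.
    private fixed byte _buffer[Capacity];

    public static implicit operator AString32(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (text.Length >= Capacity)
            throw new ArgumentException("Text does not fit (one byte is reserved for the terminator).", nameof(text));

        AString32 result = default; // zero-filled, so the text ends up NUL-terminated
        Encoding.ASCII.GetBytes(text.AsSpan(), new Span<byte>(result._buffer, Capacity));
        return result;
    }

    public static implicit operator string(AString32 value)
    {
        var bytes = new ReadOnlySpan<byte>(value._buffer, Capacity);
        int end = bytes.IndexOf((byte)0);
        if (end < 0) end = Capacity;
        return Encoding.ASCII.GetString(bytes.Slice(0, end));
    }
}
```

The capacity has to be repeated as a literal constant in every such type, which is the duplication the post complains about.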

So could we consider a new primitive type:

string(64) user_name;

for example?

Such strings could be declared locally resulting in a simple stack allocated chunk, or as members within classes/structs in which case they appear inline just like fixed buffers do...

(just to be clear I'm not seeking the capacity to be defined at runtime but at compile time, and I know my syntax won't work but wanted to convey the idea).

Korporal commented 5 years ago

One can envisage a new type declaration syntax:

public struct Text (int X)
{
   private fixed char text[X];
   // All the String conversions and stuff go here...
}

Where the language implementation fully recognizes that X must be a compile time constant when an instance of Text is declared:

public struct SomeSimpleMessage
{
   public Text(64) OptionName;
   public decimal Price;
}

This - very reasonable - request for a new language feature would suffice; it would then be easy for developers to do these fixed capacity string things (and many other things).

CyrusNajmabadi commented 5 years ago

I can't discuss the supposed complexity for the C# language and CLR since I do not have the degree of insight into the internals as some here do.

You mentioned that "there is no sound reason". But things like complexity are definitely sound reasons.

Korporal commented 5 years ago

I can't discuss the supposed complexity for the C# language and CLR since I do not have the degree of insight into the internals as some here do.

You mentioned that "there is no sound reason". But things like complexity are definitely sound reasons.

@CyrusNajmabadi

Perhaps I shouldn't have phrased myself that way. My point is that numerous languages (PL/I, Pascal, COBOL), for many decades supported fixed capacity inline string data types either varying in length or fixed in length. Given their presence in such older languages (compilers for which I've worked on personally) I wouldn't presume their implementation to be particularly complex, of course I can't comment on how the CLR design may or may not lend itself to this.

YairHalberstadt commented 5 years ago

What you're suggesting is completely different from the current type system in C#. It would require enormous work to be done for it to work nicely.

Could it be done?

Sure.

But I'd much rather the team put in the effort to deliver on a feature that would more generally be helpful to a greater number of people.

If you genuinely believe in this, I would suggest you hack up a prototype in Roslyn. That would serve as a useful basis from which to start. With a concrete design and implementation, this feature would have a lot more of a chance.

YairHalberstadt commented 5 years ago

A more general solution would be to allow using constants as parameters in Generics. This would actually have a lot of nice properties, and can be done in C++ for example using templates.

For example you would be able to statically prove that the size of an m×n matrix multiplied by an n×o matrix is m×o.

However this would require changes to the runtime, and almost definitely, the cost is not worth the benefit.

CyrusNajmabadi commented 5 years ago

for many decades supported fixed capacity inline string data types either varying in length or fixed in length.

Sure. And C# supports that as well. It just doesn't put much effort into making it pleasant. Furthermore, being pleasant in the language is only half the story. You'd still need the ecosystem around the language to support this stuff.

arekbal commented 5 years ago

The fact that I can't create a simple struct containing text fields and have that struct be a contiguous block is IMHO a weakness and one that should be addressed if the C# language team seek to continue to improve performance, particularly in the growing area of IOT technology (think .Net Micro Framework for example).

Again, why can't you code generate it? And/or use custom string type based on data view with refs/pointers. A lot of custom options out there.

public struct Text (int X)
{
   private fixed char text[X];
   // All the String conversions and stuff go here...
}

I hope you are aware that Text(1) is not binary compatible with Text(2) and something like Text[] wouldn't work anymore the same way it did? What you hope for here is something like Text<int(1)>. Which for limited quantity could be replaced with something like Text_1 and Text_2 if you would dare to go into code generation. But for some I would just copy paste it with snippets (2, 4, 8, 16, 32). IMHO you should stick with T4 or some custom console app (Razor might be quite nice and maintainable alternative for simple code generation, Roslyn is another one).

In case C# would go with the idea of introducing the new type you mention, we would have a special case struct which would be incompatible with so many features that most users would consider it useless. Shortcomings would be plentiful. Actually the same problem applies to fixed buffers.

I would prefer to first have generics supporting constants with some option of type compatibility left(another specifier over where?) in case of these fixed sized but generic fields.

Korporal commented 5 years ago

@arekbal

The fact that I can't create a simple struct containing text fields and have that struct be a contiguous block is IMHO a weakness and one that should be addressed if the C# language team seek to continue to improve performance, particularly in the growing area of IOT technology (think .Net Micro Framework for example).

Again, why can't you code generate it?

Code generate what?

And/or use custom string type based on data view with refs/pointers. A lot of custom options out there.

You may be right, so show me an example of what you mean.

public struct Text (int X)
{
   private fixed char text[X];
   // All the String conversions and stuff go here...
}

I hope you are aware that Text(1) is not binary compatible with Text(2) and something like Text[] wouldn't work anymore the same way it did?

Of course I'm aware that a Text(1) would not be binary compatible with a Text(2), any more than an Int32 would be binary compatible with a Decimal. What's your point here?

What you hope for here is something like Text<int(1)>. Which for limited quantity could be replaced with something like Text_1 and Text_2 if you would dare to go into code generation. But for some I would just copy paste it with snippets (2, 4, 8, 16, 32). IMHO you should stick with T4 or some custom console app (Razor might be quite nice and maintainable alternative for simple code generation, Roslyn is another one).

The T4 we have works well but as I explained we simply cannot generate thousands of differently named types differing only in the length of their internal buffers. Being content with some family of values, although OK for most cases, is also frustrating in some cases.

In case C# would go with the idea of introducing new type you mention, we would have special case struct which would be incompatible with so much of the features most users would consider it useless.

I don't know how you reached that conclusion.

Shortcoming would be plentiful. Actually same problem applies to fixed buffers.

I would prefer to first have generics supporting constants with some option of type compatibility left (another specifier over where?) in case of these fixed sized but generic fields.

From what's been said here so far, being able to pass a compile time constant into a class/struct in such a way it can be used - as a constant - inside that class/struct strikes me as a good starting point that would solve my problem easily and perhaps enable a whole class of similar problems to be easily solved going forward.

Of course the definition of a struct like this:

public struct ANSIString (int Capacity)
{
   public fixed Byte buffer[Capacity];
   // various properties, string conversions etc follow below:
}

Would reside in some new, distinct assembly (presumably) and would require some support for compile time finalization, but that is something that's technically possible in principle I imagine.

CyrusNajmabadi commented 5 years ago

@Korporal i don't actually understand what your proposal actually is anymore. Could you condense it down to something simple, and plainly explain what actual language changes you'd be looking for? It would be valuable if you could explain what those language features would actually compile down to.

CyrusNajmabadi commented 5 years ago

Also, i stand by my point of: You'd still need the ecosystem around the language to support this stuff.

So this seems incredibly marginal and overly specific to add this language feature just to support a single serialization request. Once it was used for that purpose, it wouldn't have any other value.

So the value-prop here is just way out of whack. The choices are:

  1. you do this yourself with your own library code, with a little extra effort.
  2. the language does the work (at about 1000x the cost) to add this, but then you use it, and that's it...

arekbal commented 5 years ago

@Korporal

And/or use custom string type based on data view with refs/pointers. A lot of custom options out there.

You may be right, so show me an example of what you mean.

https://gist.github.com/arekbal/c6a90d324d2e38da4bb3d0504bfd4393 You would generate code like this from attributes on your "meta" types.

For some inspiration in this approach (pointers to larger buffer) you could have a look at this code: https://github.com/secana/PeNet

Korporal commented 5 years ago

@Korporal i don't actually understand what your proposal actually is anymore. Could you condense it down to something simple, and plainly explain what actual language changes you'd be looking for? It would be valuable if you could explain what those language features would actually compile down to.

@CyrusNajmabadi

This is simple, I'd like to see the language support fixed capacity strings analogous to existing fixed capacity buffers but providing string semantics. Like fixed buffers such "strings" would be allocated as contiguous blocks of memory situated inline within the field block for the instance.

Such "strings" will be fixed capacity precisely because we want the physical size of the containing struct itself - when serialized - to always be the same length for that struct irrespective of any of its fields' values.

Whether this was implemented with such "strings" having a fixed current length or a dynamic current length is of interest too but less of a concern. (For example one could include a "length" word that describes how many chars are "in" the string currently, whereas the capacity dictates the max number of chars the "string" could ever hold.)
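As an illustration of the "length word" idea, the following is a rough sketch of a length-prefixed layout. The name `LString32`, the single-byte length field, and the ASCII encoding are assumptions for the example only; nothing in the thread specifies them.

```csharp
using System;
using System.Text;

// Hypothetical length-prefixed variant; the name and layout are assumptions.
public unsafe struct LString32
{
    public const int Capacity = 32;

    private byte _length;                 // current length, 0..Capacity
    private fixed byte _buffer[Capacity]; // inline storage, always 32 bytes

    public int Length => _length;

    public static LString32 From(string text)
    {
        if (text is null) throw new ArgumentNullException(nameof(text));
        if (text.Length > Capacity)
            throw new ArgumentException("Text exceeds the fixed capacity.", nameof(text));

        LString32 result = default;
        // GetBytes returns the number of bytes written, which becomes the length.
        result._length = (byte)Encoding.ASCII.GetBytes(
            text.AsSpan(), new Span<byte>(result._buffer, Capacity));
        return result;
    }

    public override string ToString()
    {
        // 'this' may live on the heap, so the buffer must be pinned first.
        fixed (byte* p = _buffer)
        {
            return Encoding.ASCII.GetString(p, _length);
        }
    }
}
```

With this layout the struct always occupies the same number of bytes regardless of the current length, and all 32 buffer bytes are usable because no terminator is needed.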

Because the domain is high performance binary serialization one could provide support for single byte strings and unicode strings (we did this by generating Char-based and Byte-based implementations and named these UString_32 or AString_128 for example).

That's pretty much it I think. HOW to do this is of course a choice from several options and I'm not familiar enough with the technicalities of the compiler or CLR to do more than throw out ideas and hints.

One of these was to allow a compile-time constant to be passed into a declaration like this (contrived first pass):

public struct ANSIString (int Capacity) : ICloneable, IComparable, IComparable<string>, IConvertible, IEquatable<string>, System.Collections.Generic.IEnumerable<char>
{
   public fixed Byte buffer[Capacity];
   // various properties, string conversions etc follow below:
}

Then the consumer can create instances of ANSIString (inside their own structs) and those consumer defined structs would be pure value types and always be the same length.

Such consumer defined structs that contained instance members of type ANSIString would then also be capable of being used in generics where a T is constrained to unmanaged.

This is really it, I cannot say much more other than that - only those with more expertise can make a decision as to whether this is doable and at what cost.

Serializing and deserializing today is quite lame, even with protocol buffers. All of the fussing and fiddling that goes on under the hood is quite costly, as I say we can serialize small example structs (containing fields like AString_32 or UString_64 etc) at least ten times as rapidly as protocol buffers because there is almost no work to do!

Of course if one doesn't ever care about or need strings in their serialized items (which in our case become messages transmitted around a network) then today's C# is fine (although the CLR does not offer this kind of serialization so one must engineer it as we did).

As I said we can serialize example structs into Byte[] blocks on a single thread(core) at the rate of close to a million per second on an i7-3960 - I was unaware of any comparable technology a decade ago and nothing has changed.

mikedn commented 5 years ago

How is the length of such strings supposed to be encoded? Are they length prefixed or nul terminated? NUL terminated I'd guess, since your example doesn't show any length field.

Anyway, it's pretty unlikely that you'll ever see something like public struct ANSIString (int Capacity). The runtime does not support that and adding support for that would be quite a bit of work.

The C# compiler could perhaps emulate that to an extent, similar to the way it simulates fixed buffers today, but it's highly problematic to do so. All of a sudden you introduce this notion of a parametrized type into the C# type system, that doesn't have a direct mapping to a runtime type and has to be emulated by creating a bunch of runtime types. That's not going to fly too far, if at all.

Alternatives that involve existing/extended fixed size buffers and span & co., such as

And could have the language auto make a Span out of a fixed size buffer for you. Then, you could add methods/functionality to these guys, while still maintaining the benefits of the fixed-size stuff.

are far more feasible.

But then, going back to my original question, the case of strings is probably a bit more complicated due to the fact that length and capacity are 2 different things. The use of length prefixed strings would probably complicate any implementation that relies on, say, implicit conversion between fixed size buffers and span. And the use of NUL terminated strings isn't without problems: if you say to the users - hey, that span I just handed over to you, you know, has a NUL terminator in the middle and requires special manipulation - they're probably not going to be too happy.

sharwell commented 5 years ago

💭 It seems like this feature is already in the process of being supported via Span<T>/Memory<T>.

CyrusNajmabadi commented 5 years ago

What does "but providing string semantics" mean?

This is simple, I'd like to see the language support fixed capacity strings analogous to existing fixed capacity buffers

What is the major difference between the two? Why would a fixed-capacity-buffer not suffice? Why would a Span not suffice?

Because the domain is high performance binary serialization

This goes back to what i was saying before. This is a niche optimization of a niche area. There are already solutions for high-perf binary serialization. And creating an entire language feature to help (and it's not even clear to me what it helps with) one specific area like that seems like a poor decision.

// various properties, string conversions etc follow below:

Can you be more specific? What properties? Why can't you just write the above already today with, say Span<char> along with extension methods that give you the shape you want?

Serializing and deserializing today is quite lame, even with protocol buffers. All of the fussing and fiddling that goes on under the hood is quite costly,

You're conflating things. There is nothing about a language feature that dictates that the impl will be 'fast'. Someone might provide you with such a 'length-encoded-string' concept, and serialization/deserialization could still be slow.

as I say we can serialize small example structs (containing fields like AString_32 or UString_64 etc) at least ten times as rapidly as protocol buffers because there is almost no work to do!

Then why do you need a language feature? If you've already solved your niche use case, then why change the language?

Korporal commented 5 years ago

How is the length of such strings supposed to be encoded? Are they length prefixed or nul terminated? NUL terminated I'd guess, since your example doesn't show any length field.

My examples (and our current implementation) do not have a length word as part of the design so Length like operations aren't supported. If we were to design a proper implementation the length would be a neat feature, I've simply been alluding to it so far.

Anyway, it's pretty unlikely that you'll ever see something like public struct ANSIString (int Capacity). The runtime does not support that and adding support for that would be quite a bit of work.

Thanks, I have no idea myself but if you're correct then this suggestion may have no future.

The C# compiler could perhaps emulate that to an extent, similar to the way it simulates fixed buffers today, but it's highly problematic to do so. All of the sudden you introduce this notion of a parametrized type into the C# type system, that doesn't have a direct mapping to a runtime type and has to be emulated by creating a bunch of runtime types. That's not going to fly too far, if at all.

Yes I can see how this is a challenge.

Alternatives that involve existing/extended fixed size buffers and span & co., such as

And could have the language auto make a Span out of a fixed size buffer for you. Then, you could add methods/functionality to these guys, while still maintaining the benefits of the fixed-size stuff.

are far more feasible.

But then, going back to my original question, the case of strings is probably a bit more complicated due to the fact that length and capacity are 2 different things. The use of length prefixed strings would probably complicate any implementations that relies on, say, implicit conversion between fixed size buffers and span. And the use of NUL terminated strings isn't without problems, if you say to the users - hey, that span I just handed over to you, you know, has a NUL terminator in the middle and requires special manipulation - they're probably not going to be too happy.

The implementation whether it has a length word or not would of course be completely hidden from consumers and we'd design it to be so. I'm not sure if Span is a solution, all we want is to have a string-like type which has memory allocated inline just as we do for Double, Decimal, Int32 etc.

CyrusNajmabadi commented 5 years ago

The implementation whether it has a length word or not would of course be completely hidden from consumers and we'd design it to be so.

How do you manage this? If someone wants to be able to serialize/deserialize these guys... they're going to need to know where the actual data is, and how it is encoded. How else would your serialization/deserialization layer be able to work? Can you expand on how you would incorporate this into your existing stack without that data?

CyrusNajmabadi commented 5 years ago

I'm not sure if Span is a solution, all we want is to have a string-like type which has memory allocated inline just as we do for Double, Decimal, Int32 etc.

that sounds like a fixed-size-buffer of chars then. It's 'string-like' in that it's a contiguous sequence of characters. It's 'inline'. It has the length encoded with it. It's unclear to me why this is not a suitable solution today.

Korporal commented 5 years ago

I'm not sure if Span is a solution, all we want is to have a string-like type which has memory allocated inline just as we do for Double, Decimal, Int32 etc.

that sounds like a fixed-size-buffer of chars then. It's 'string-like' in that it's a contiguous sequence of characters. It's 'inline'. It has the length encoded with it. It's unclear to me why this is not a suitable solution today.

@CyrusNajmabadi

I'm sure I've explained this to you several times but let me try again. The goal is to be able to define types (primarily structs) that can contain string fields which are not references. Instead these are contiguous blocks allocated inline within the struct much as a fixed buffer is.

The contiguity here is important because then serialization and deserialization is then very very fast and consists of more or less a "memcpy". Providing the source struct is identical to the target struct this is the fastest possible mechanism for serialization since the work required is very very little (many architectures have a single machine instruction for such operations too). The support code for this is already developed and well tuned and lies outside the scope of this conversation.

So having explained that the challenge is then how to give the developer a "thing" that they can treat as a string (assign text to it, compare it with text etc). One way is the way we do it - we create a struct named AString_32 (for example) that encapsulates a fixed buffer of bytes [32] in length along with a set of conversions to/from string that makes it easy to use in code - in fact developers are to all intents and purposes unaware that an AString_32 is not a string, very neat.

But this is also inflexible, can you think why? Yes of course you can, it's because one needs many types, one for each supported capacity like AString_32, AString_64 etc, a whole family, and if one needs some value not in the family they are stuck or they must write their own version with the desired hard coded fixed buffer length.

Making a fixed buffer behave "like a string" is of course embedded in the type and consists of many conversion operators and so on - in fact one must implement a lot of interfaces to get as close as possible to a "real" string and facilitate easy and flexible use.

Because structs cannot inherit from an abstract base we can't put all these support methods in such a base, instead each set of methods must be duplicated over and over inside every variety we create. So AString_32 (for ANSI incidentally) and AString_64 and AString_8 all contain duplicated methods and so this is why we generate these types using a T4 template.

But we would never generate AString_1, AString_2, AString_3 ... AString_32767 because that is ridiculous and even with intellisense selecting one from such a large set is hardly pleasant.

Currently other than fixed buffers with wrapper types there seems to be no alternative to what we do yet because of the benefits we get from this serialization we do leverage this technique.

We also considered extension methods but guess what you can't pass into an extension method? Yep pointers, fixed buffers.

A developer would use these as follows:

// This is single contiguous block of bytes and contains no reference types.

public struct LoginMessage 
{
   AString_32 UserName;
   AString_16 Password;
   AString_8   OtherStuff;
   long     MoreOtherStuff;
   DateTime   SomeDate;
}

LoginMessage msg = new LoginMessage();
msg.UserName = "Charlie"; // current implementation is null-terminated text
msg.OtherStuff = "Other";

byte[] bytes = RuntimeSupport.Serialize(ref msg); // often less than a microsecond on an i7-3960

The type LoginMessage here is a pure value type and can be used in generic types that expect an unmanaged type; this is impossible if one uses ordinary strings.
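The `RuntimeSupport.Serialize` call above refers to the poster's own library, which is not shown in the thread. As a rough stand-in, a "memcpy"-style serializer for any `unmanaged` struct could be sketched as follows. The class name `BlittableSerializer` is hypothetical, and the sketch assumes a runtime that provides `MemoryMarshal.CreateReadOnlySpan` (.NET Core 2.1 or later).

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for the thread's RuntimeSupport.Serialize;
// not the poster's implementation.
public static class BlittableSerializer
{
    public static byte[] Serialize<T>(ref T value) where T : unmanaged
    {
        // View the struct's own memory as bytes, then copy it out in one go.
        ReadOnlySpan<byte> raw =
            MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
        return raw.ToArray();
    }

    public static T Deserialize<T>(ReadOnlySpan<byte> bytes) where T : unmanaged
    {
        // Reinterprets the leading bytes as a T; throws if the span is too short.
        return MemoryMarshal.Read<T>(bytes);
    }
}
```

The `unmanaged` constraint is what rules out ordinary `string` fields: a struct holding a reference cannot be copied out as raw bytes in this way.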

So can you perhaps suggest some alternative ways of getting string like inline fields whose length can be specified by the developer at compile time? Adjusting the language or the CLR is what I want to explore here, if it's a no-goer then fine but I did want to explore possibilities that's all.

(PS Perhaps the new default interface implementation feature may help here - I need to explore it more).

CyrusNajmabadi commented 5 years ago

So can you perhaps suggest some alternative ways of getting string like inline fields whose length can be specified by the developer at compile time?

Use fixed-size-buffers (FSB)...

So having explained that the challenge is then how to give the developer a "thing" that they can treat as a string

Create some helpers/extensions that work with Span<char>, and pass your FSBs to those extensions as Spans.
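One possible reading of this suggestion, sketched under assumptions: the helper names `SetText` and `GetText`, the `LoginHeader` struct, and the NUL-padding convention are all invented for illustration and do not come from the thread.

```csharp
using System;

// Hypothetical helpers; names and the NUL-padding convention are assumptions.
public static class FixedTextExtensions
{
    // Copies text into the buffer and pads the remainder with '\0'.
    public static void SetText(this Span<char> buffer, ReadOnlySpan<char> text)
    {
        if (text.Length > buffer.Length)
            throw new ArgumentException("Text exceeds the buffer capacity.", nameof(text));
        text.CopyTo(buffer);
        buffer.Slice(text.Length).Clear();
    }

    // Returns the characters before the first '\0' (or the whole buffer).
    public static ReadOnlySpan<char> GetText(this ReadOnlySpan<char> buffer)
    {
        int end = buffer.IndexOf('\0');
        return end < 0 ? buffer : buffer.Slice(0, end);
    }
}

public unsafe struct LoginHeader
{
    private fixed char _userName[32]; // inline, always 64 bytes

    public string UserName
    {
        get
        {
            fixed (char* p = _userName)
            {
                return new ReadOnlySpan<char>(p, 32).GetText().ToString();
            }
        }
        set
        {
            fixed (char* p = _userName)
            {
                new Span<char>(p, 32).SetText(value.AsSpan());
            }
        }
    }
}
```

The helpers are written once and shared by buffers of any capacity, but each field still needs its own small property, and the capacity literal appears in both the declaration and the accessor.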

Korporal commented 5 years ago

So can you perhaps suggest some alternative ways of getting string like inline fields whose length can be specified by the developer at compile time?

Use fixed-size-buffers (FSB)...

@CyrusNajmabadi

Please re-read what I posted it talks extensively of fixed buffers, I'm beginning to get the impression you are not reading carefully what I'm saying in my replies.

So having explained that the challenge is then how to give the developer a "thing" that they can treat as a string

Create some helpers/extensions that work with Span<char>, and pass your FSBs to those extensions as Spans.

Show me code that does what I just described but is "better" than the code I gave, can you do that without changing the C# language?

CyrusNajmabadi commented 5 years ago

Please re-read what I posted it talks extensively of fixed buffers, I'm beginning to get the impression you are not reading carefully what I'm saying in my replies.

I've read it all. It's still unclear to me why this is not a workable solution.

but is "better" than the code I gave,

I don't know what 'better' means. I also don't see why the onus is on me to provide the 'better' solution. I'm simply saying that i think your scenario is so niche that just using these primitives is an acceptable enough solution. But, for your example, i would do this:

public ref struct LoginMessage
{
    public ReadOnlySpan<char> UserName;
    public ReadOnlySpan<char> Password;
    public ReadOnlySpan<char> OtherStuff;
    public long MoreOtherStuff;
    public DateTime SomeDate;
}

class Test {
    void M() {
        var msg = new LoginMessage();
        msg.UserName = "Charlie".AsSpan();
        msg.OtherStuff = "Other".AsSpan();

        byte[] bytes = RuntimeSupport.Serialize(ref msg); // often less than a microsecond on an i7-3960
    }
}

Korporal commented 5 years ago

Please re-read what I posted it talks extensively of fixed buffers, I'm beginning to get the impression you are not reading carefully what I'm saying in my replies.

I've read it all. It's still unclear to me why this is not a workable solution.

but is "better" than the code I gave,

I don't know what 'better' means. I also don't see why the onus is on me to provide the 'better' solution. I'm simply saying that i think your scenario is so niche that just using these primitives is an acceptable enough solution. But, for your example, i would do this:

public ref struct LoginMessage
{
    public ReadOnlySpan<char> UserName;
    public ReadOnlySpan<char> Password;
    public ReadOnlySpan<char> OtherStuff;
    public long MoreOtherStuff;
    public DateTime SomeDate;
}

class Test {
    void M() {
        var msg = new LoginMessage();
        msg.UserName = "Charlie".AsSpan();
        msg.OtherStuff = "Other".AsSpan();

        byte[] bytes = RuntimeSupport.Serialize(ref msg); // often less than a microsecond on an i7-3960
    }
}

@CyrusNajmabadi

That is certainly interesting (though I hate the 'AsSpan' and it's been a while since I was actively working on this code so excuse my ignorance of Span related stuff).

But tell me how big is an instance of your LoginMessage? How many bytes does the struct occupy? In my example it occupies (more or less and ignoring possible alignment details) 32 + 16 + 8 + 8 + 8 bytes = 72 bytes - its size is always 72 and never varies, irrespective of the values assigned to its members.

This fixed size requirement is inherent in the "memcpy" serialization approach and is why this issue refers to "fixed capacity" strings.
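The fixed size can be checked directly with `sizeof`. The sketch below uses a hypothetical `LoginMessageLayout` struct that mirrors the example's fields, with raw fixed byte buffers standing in for the `AString_N` wrappers and a tick count standing in for `DateTime`; the explicit `Pack = 1` is an assumption made to keep padding out of the picture.

```csharp
using System.Runtime.InteropServices;

// Hypothetical layout mirroring the LoginMessage example; field types are
// stand-ins chosen so the size is easy to reason about.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public unsafe struct LoginMessageLayout
{
    public fixed byte UserName[32];   // 32 bytes
    public fixed byte Password[16];   // 16 bytes
    public fixed byte OtherStuff[8];  //  8 bytes
    public long MoreOtherStuff;       //  8 bytes
    public long SomeDateTicks;        //  8 bytes

    // Size of the struct in bytes: 32 + 16 + 8 + 8 + 8.
    public static int Size => sizeof(LoginMessageLayout);
}
```

Since a `long` is 8 bytes, `LoginMessageLayout.Size` comes to 72, and it stays 72 whatever text is stored in the buffers.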

theunrepentantgeek commented 5 years ago

This fixed size requirement is inherent in the "memcpy" serialization approach and is why this issue refers to "fixed capacity" strings.

The whole focus here on "memcpy" serialization bothers me a great deal.

Excepting cases using specialized (and expensive) hardware, as soon as you need to transition out of process, you're talking microseconds of overhead, at least. Transition out of machine, and you're talking milliseconds.

Saving a few dozen nanoseconds on data serialization seems like a micro-optimization that would have negligible effect on throughput. I'm not even sure how to reliably measure the difference.

CyrusNajmabadi commented 5 years ago

But tell me how big is an instance of your LoginMessage? How many bytes does the struct occupy? In my example it occupies (more or less and ignoring possible alignment details) 32 + 16 + 8 + 8 + 8 bytes = 72 bytes - its size is always 72 and never varies, irrespective of the values assigned to its members.

In my example above it would be varying in size. if you truly needed fixed size, you could use fixed-size buffers, a-la:

public unsafe ref struct LoginMessage2
{
    public fixed char _userName[32];
    public fixed char _password[16];
    public fixed char _otherStuff[8];
    public long MoreOtherStuff;
    public DateTime SomeDate;

    public ReadOnlySpan<char> UserName()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_userName, 32);
        }
    }

    public ReadOnlySpan<char> Password()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_password, 16);
        }
    }

    public ReadOnlySpan<char> OtherStuff()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_otherStuff, 8);
        }
    }

    // ReadOnlySpan<T>(void*, int) takes a length in elements (chars here), not bytes.
    private static ReadOnlySpan<char> CreateSpan(char* pointer, int charCount)
        => new ReadOnlySpan<char>(pointer, charCount);
}

Now you would have completely fixed size, but the convenience of working with ReadOnlySpans. You've mentioned:

all we want is to have a string-like type

Well, ReadOnlySpan<char> is string-like. And there are lots of helpers in MemoryExtensions (like https://docs.microsoft.com/en-us/dotnet/api/system.memoryextensions.indexof?view=netcore-2.2) that would help you treat these like strings.

arekbal commented 5 years ago

@theunrepentantgeek The unfortunate reality of internet is that in everything you do... latency matters and having better latency never hurts. Some examples: game streaming - think about the time from user input to the decoded frame displayed on his screen. Latency makes or breaks this business. VR game streaming with its higher requirements makes it even worse. Still, getting 4k on everything will become a requirement in the not so far future. All of this increases rendering, encoding, decoding times... there is even less headroom for waste.

In area of cloud computing business, every nanosecond is a "free lunch" somebody must pay for. Faster frameworks mean more requests (let's assume http, websocket) served per second which almost directly converts to cash. Many of these requests are going through multiple servers, which again, directly affects the "bottomline".

But I generally agree there are a lot of layers to the latency and if somebody cares (like it affects his business) he or she should go through the whole OSI model and probably cut through. In many areas people just do this.

CyrusNajmabadi commented 5 years ago

The unfortunate reality of internet is that in everything you do... latency matters and having better latency never hurts.

You can have better latency (examples and explanations have already been given on how that can be accomplished). The question is: why is this so critical that it needs to be codified at the language level. Why are the existing solutions insufficient?

--

It's like someone coming along and saying: Http clients are super important. Please add a "http" keyword that gives me a client. When people respond with "but you can already do that at a library level" there needs to be substantial justification as to why that's not a good-enough solution and why the language needs to build in this functionality (and also support it for all time).

CyrusNajmabadi commented 5 years ago

In area of cloud computing business, every nanosecond is a "free lunch" somebody must pay for. Faster frameworks mean more requests (let's assume http, websocket) served per second which almost directly converts to cash. Many of these requests are going through multiple servers, which again, directly affects the "bottomline".

None of what you've said has anything to do with the language. You even mention this is a matter of "faster frameworks". I'm fully behind @Korporal writing a high-perf library to help him out here. But, so far, i haven't seen anything stopping him from doing that. Nor have i seen a compelling case for why the language needs to codify this support directly when it's pretty trivial to use the existing language features for the same purpose.

arekbal commented 5 years ago

Because structs cannot inherit from an abstract base we can't put all these support methods in such a base, instead each set of methods must be duplicated over and over inside every variety we create. So AString_32 (for ANSI incidentally) and AString_64 and AString_8 all contain duplicated methods and so this is why we generate these types using a T4 template.

That could be partly solved with interfaces and constrained generic extension methods. Which - if done well - shouldn't result in boxing.
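A minimal sketch of what I mean, assuming a made-up interface IFixedText and a made-up struct Text8 (neither is from the NativeStrings library). Each struct implements only raw element access; the shared logic is written once as generic extension methods constrained to struct, so the interface calls should not box.

```csharp
using System;

// Hypothetical contract: each fixed-capacity struct exposes only raw element access.
public interface IFixedText
{
    int Capacity { get; }
    char GetChar(int index);
    void SetChar(int index, char value);
}

// One concrete size; other sizes would differ only in the constant.
public unsafe struct Text8 : IFixedText
{
    private fixed char _buffer[8];

    public int Capacity => 8;

    public char GetChar(int index)
    {
        if ((uint)index >= 8u) throw new ArgumentOutOfRangeException(nameof(index));
        fixed (char* p = _buffer) return p[index];
    }

    public void SetChar(int index, char value)
    {
        if ((uint)index >= 8u) throw new ArgumentOutOfRangeException(nameof(index));
        fixed (char* p = _buffer) p[index] = value;
    }
}

// Shared logic written once for every struct that implements IFixedText.
public static class FixedTextExtensions
{
    public static void Assign<T>(this ref T target, string text)
        where T : struct, IFixedText
    {
        if (text.Length > target.Capacity)
            throw new ArgumentException("Text exceeds capacity.", nameof(text));
        for (int i = 0; i < target.Capacity; i++)
            target.SetChar(i, i < text.Length ? text[i] : '\0'); // pad with '\0'
    }

    public static string AsString<T>(this ref T source)
        where T : struct, IFixedText
    {
        // Length is the position of the first '\0', or the full capacity.
        int length = 0;
        while (length < source.Capacity && source.GetChar(length) != '\0')
            length++;
        var chars = new char[length];
        for (int i = 0; i < length; i++)
            chars[i] = source.GetChar(i);
        return new string(chars);
    }
}
```

Usage would be `var name = new Text8(); name.Assign("Charlie"); string s = name.AsString();`. Each size still needs its own struct declaration, so this cuts the duplicated methods but not the family of types.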

@CyrusNajmabadi I was responding there to @theunrepentantgeek, who argued - as I read it - that these nano/micro/milliseconds do not matter as much.

CyrusNajmabadi commented 5 years ago

I was responding there to @theunrepentantgeek, who argued - as I read it - that these nano/micro/milliseconds do not matter as much.

I agree with him. Those are not things that matter to the language to warrant a specialized construct over an existing sufficient construct.

The more niche your scenario (and all the cases mentioned so far are super niche) the less important it is to elevate library-specializations and coding-patterns into an actual language feature.

arekbal commented 5 years ago

C# with one of the most wasteful enumerators, lambdas and async/await on the market... Same language used for scripting in most popular game engine in the world - Unity... To make it efficient, they had to destroy the language and go back to primitives because guys like you wouldn't bother dealing with these issues here in the language discussion space. Spans are fine and dandy, I am very glad someone decided to start taking care of performance, but it is a bit late. Game market itself is a monstrously big thing... Activision-Blizzard is market capped at 35.97B. Game streaming market is billions of $ of worth and expected to grow rapidly (but latencies need to go lower)... very puny... very niche... Http web... very niche. It is not fossil fuels... I have no idea of the world you live in, but I am assuming you got there through some sort of cave. 😃

theunrepentantgeek commented 5 years ago

arekbal wrote

[@theunrepentantgeek] argued - as I read it - that these nano/micro/milliseconds do not matter as much.

It's a matter of scale. Putting significant amounts of effort into shaving nanoseconds is futile if there are millisecond scale factors outside of your control.

The unfortunate reality of internet is that in everything you do... latency matters and having better latency never hurts.

Agreed - to a point.

But if you've got something that routinely takes 3-4 milliseconds, shaving 1000 nanoseconds off that time isn't going to make a meaningful difference.

(This sidebar is wandering well off topic for the feature proposed on this thread, so perhaps we should discard this line of discussion ...)

CyrusNajmabadi commented 5 years ago

I have no idea of the world you live in, but I am assuming you got there through some sort of cave.

I have no idea what your rant is about. Could you explain what it has to do with providing "support for fixed capacity strings"?

VR game streaming with its higher requirements makes it even worse. Still, getting 4k on everything will become a requirement in the not so far future. All of this increases rendering, encoding, decoding times... there is even less headroom for waste.

What waste are you talking about? How is it helped by adding support for "fixed capacity strings". How do those "fixed capacity strings" improve performance over what is already possible today with fixed-size-buffers/spans/etc.?

AFAICT, your posts are completely off topic here.

Korporal commented 5 years ago

But tell me how big is an instance of your LoginMessage? How many bytes does the struct occupy? In my example it occupies (more or less and ignoring possible alignment details) 32 + 16 + 8 + 4 + 8 bytes = 68 bytes - its size is always 68 and never varies, irrespective of the values assigned to its members.

In my example above it would vary in size. If you truly needed a fixed size, you could use fixed-size buffers, à la:

"if" I truly need? I have repeatedly stressed this as a core objective, by overlooking this you've simply wasted space and chit chat between us!

public unsafe ref struct LoginMessage2
{
    public fixed char _userName[32];
    public fixed char _password[16];
    public fixed char _otherStuff[8];
    public long MoreOtherStuff;
    public DateTime SomeDate;

    public ReadOnlySpan<char> UserName()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_userName, 32);
        }
    }

    public ReadOnlySpan<char> Password()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_password, 16);
        }
    }

    public ReadOnlySpan<char> OtherStuff()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_otherStuff, 8);
        }
    }

    // The ReadOnlySpan<T>(void*, int) constructor takes a length in elements (chars), not bytes.
    private static ReadOnlySpan<char> CreateSpan(char* pointer, int charCount)
        => new ReadOnlySpan<char>(pointer, charCount);
}

Now you would have completely fixed size, but the convenience of working with ReadOnlySpans. You've mentioned: "all we want is to have a string-like type"

Well, ReadOnlySpan<char> is string-like. And there are lots of helpers in MemoryExtensions (like https://docs.microsoft.com/en-us/dotnet/api/system.memoryextensions.indexof?view=netcore-2.2) that would help you treat these like strings.

How is your struct better (more maintainable, easy to reason about, easier to read) than mine:


// This is a single contiguous block of bytes and contains no reference types.

public struct LoginMessage 
{
   // Fields are public so that the assignments below compile.
   public AString_32 UserName;
   public AString_16 Password;
   public AString_8  OtherStuff;
   public long       MoreOtherStuff;
   public DateTime   SomeDate;
}

LoginMessage msg = new LoginMessage();
msg.UserName   = "Charlie"; // current implementation is null-terminated text
msg.OtherStuff = "Other";

byte[] bytes = RuntimeSupport.Serialize(ref msg); // often less than a microsecond on an i7-3960

In mine the buffers are invisible and the manipulation/conversion properties/methods are all implemented inside the types AString_XX; yours is functionally comparable but a huge burden on the developer who simply wants to declare a struct with a few inline strings!

Yours also has to repeat the integer constants, for example 32 has to be written twice; what if the dev wrote the first 32 but then by mistake wrote the second as 23?

This is my goal: to allow string-like inline fields to be used as readily as Int32 or DateTime or Decimal, with no special convoluted code just for things that contain text. I regard this as a reasonable thing to be able to expect and do in a modern and evolving programming language.

So far I've seen nothing that has greater simplicity and readability than what I've described, so my question remains - are there options (including changing the language/CLR) that would avoid the need to have a large library of pre-created AString_XX and UString_XX types?

I can see these as candidates:

  1. Support: public fixed text[32] as a syntax sugar for a fixed buffer and a bunch of conversion operators.
  2. Support the ability to pass a compile time constant into a type: public AString(32) text;

Korporal commented 5 years ago

This fixed size requirement is inherent in the "memcpy" serialization approach and is why this issue refers to "fixed capacity" strings.

The whole focus here on "memcpy" serialization bothers me a great deal.

@theunrepentantgeek

That's fine, there are certainly considerations but nothing that makes the approach inadvisable.

Excepting cases using specialized (and expensive) hardware, as soon as you need to transition out of process, you're talking microseconds of overhead, at least. Transition out of machine, and you're talking milliseconds.

But bear in mind this is not being done purely for network IO (but that is helped hugely because there is almost zero serialization baggage like we see in XML/JSON). We may be serializing to local files for example.

Saving a few dozen nanoseconds on data serialization seems like a micro-optimization that would have negligible effect on throughput. I'm not even sure how to reliably measure the difference.

The technique I'm describing is at least ten times less CPU than protocol buffers; that's not a few nanoseconds, it's tens of microseconds.
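For anyone unfamiliar with the "memcpy" style of serialization being discussed, here is a minimal sketch of the idea. It is not the RuntimeSupport.Serialize implementation mentioned earlier (that code is not shown in this thread); LoginRecord and BlitSerializer are hypothetical names, plain fixed buffers stand in for the AString_XX types, and it assumes .NET Core 2.1 or later for MemoryMarshal.CreateReadOnlySpan.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical fixed-size message with no reference-type fields.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public unsafe struct LoginRecord
{
    public fixed byte UserName[32]; // e.g. null-terminated ANSI text
    public fixed byte Password[16];
    public long Sequence;
}

public static class BlitSerializer
{
    // Copies the raw bytes of the struct; the 'unmanaged' constraint rules out
    // types that contain references.
    public static byte[] Serialize<T>(ref T value) where T : unmanaged
    {
        ReadOnlySpan<T> one = MemoryMarshal.CreateReadOnlySpan(ref value, 1);
        return MemoryMarshal.AsBytes(one).ToArray();
    }

    // Reinterprets the leading bytes as T; throws if the input is too short.
    public static T Deserialize<T>(ReadOnlySpan<byte> bytes) where T : unmanaged
        => MemoryMarshal.Read<T>(bytes);
}
```

The output is always the size of the struct (56 bytes for this layout), which is why every field, strings included, has to have a fixed capacity. The bytes also depend on field order, packing and endianness, so both ends must agree on the exact layout.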

HaloFour commented 5 years ago

I have repeatedly stressed this as a core objective

That's the solution, not the problem. The question is whether the problem itself is worth the effort of a solution, let alone yours. Yes, it impacts your solution, but your solution is pretty far removed from normal C# development.

How is your struct better (more maintainable, easy to reason about, easier to read) than mine:

Uses existing C# and CLR constructs and supports new Span<char> APIs. That latter point is probably more important, and maybe one you could consider instead of trying to recreate the breadth of string members in all of your custom structs.

But bear in mind this is not being done purely for network IO (but that is helped hugely because there is almost zero serialization baggage like we see in XML/JSON). We may be serializing to local files for example.

Also bear in mind that this form of serialization wasn't that uncommon ~30 years ago. It's inherently fragile and can't be versioned without a lot of additional work. Persisting raw memory structures to blobs is how Office originally saved/read documents. COBOL data files are effectively this as well.

The technique I'm describing is at least ten times less CPU than protocol buffers; that's not a few nanoseconds, it's tens of microseconds.

Custom and very targeted serialization can almost always beat general purpose serialization. The advantage of the latter is in the maintainability. That said, I'd still want to see benchmarks. And I bet the .NET code generation for protocol buffers could benefit enormously from Span<T> further reducing whatever gap there is today.

Korporal commented 5 years ago

I have repeatedly stressed this as a core objective

That's the solution, not the problem. The question is whether the problem itself is worth the effort of a solution, let alone yours. Yes, it impacts your solution, but your solution is pretty far removed from normal C# development.

@HaloFour

In our case it was the problem because this area (fast serialization) was identified as a major bottleneck in what we were doing at the time; this arose from profiling and load testing.

How is your struct better (more maintainable, easy to reason about, easier to read) than mine:

Uses existing C# and CLR constructs and supports new Span<char> APIs. That latter point is probably more important, and maybe one you could consider instead of trying to recreate the breadth of string members in all of your custom structs.

As I asked Cyrus show me an example (in which the message struct is constant size).

But bear in mind this is not being done purely for network IO (but that is helped hugely because there is almost zero serialization baggage like we see in XML/JSON). We may be serializing to local files for example.

Also bear in mind that this form of serialization wasn't that uncommon ~30 years ago. It's inherently fragile and can't be versioned without a lot of additional work. Persisting raw memory structures to blobs is how Office originally saved/read documents. COBOL data files are effectively this as well.

Like I said in an earlier reply, there are considerations (there always are) and so long as the assumptions and restrictions are understood this is perfectly viable. However one could easily support options here, for example version support might be an option as might type compression and so on, lots of options with performance trade-offs, very routine design challenges for us engineers.

The technique I'm describing is at least ten times less CPU than protocol buffers; that's not a few nanoseconds, it's tens of microseconds.

Custom and very targeted serialization can almost always beat general purpose serialization. The advantage of the latter is in the maintainability. That said, I'd still want to see benchmarks. And I bet the .NET code generation for protocol buffers could benefit enormously from Span<T> further reducing whatever gap there is today.

Again if someone can show use of Span that results in a fixed size message struct then I'll consider that.

PS: Also recall the binary serialization that came with .Net in the early days? almost nobody used this for anything real despite it offering version support etc. This was because it was notoriously expensive (We found a Microsoft document too that described the internals, it was flexible but ridiculously slow).

HaloFour commented 5 years ago

@Korporal

Again if someone can show use of Span that results in a fixed size message struct then I'll consider that.

Span<T> is agnostic as to where the data is located. That's the point. You can create a fixed inline buffer of bytes and have Span<char> wrap it, which gives you immediate access to the body of APIs that support Span<char>. But the creation of the fixed buffer still falls on you. Span<T> doesn't care where you put it.
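A small sketch of that point, with hypothetical names (InlineBuffer, SpanLocationDemo): the same Fill routine writes into stack memory, a heap array, and a fixed buffer inside a struct, and the span-based code never needs to know which one it was given.

```csharp
using System;

// Hypothetical struct holding an inline, fixed-size char buffer.
public unsafe struct InlineBuffer
{
    public fixed char Chars[32];
}

public static class SpanLocationDemo
{
    // Writes text into any char storage and clears the unused tail.
    public static void Fill(Span<char> destination, string text)
    {
        if (text.Length > destination.Length)
            throw new ArgumentException("Text exceeds capacity.", nameof(text));
        text.AsSpan().CopyTo(destination);
        destination.Slice(text.Length).Clear();
    }

    public static string ViaStack(string text)
    {
        Span<char> storage = stackalloc char[32];
        Fill(storage, text);
        return storage.Slice(0, text.Length).ToString();
    }

    public static string ViaArray(string text)
    {
        Span<char> storage = new char[32];
        Fill(storage, text);
        return storage.Slice(0, text.Length).ToString();
    }

    public static unsafe string ViaInlineBuffer(string text)
    {
        InlineBuffer holder = default;
        // 'holder' is a local, so its buffer has a stable address for this call.
        Span<char> storage = new Span<char>(holder.Chars, 32);
        Fill(storage, text);
        return storage.Slice(0, text.Length).ToString();
    }
}
```

Only ViaInlineBuffer needs unsafe code, and only at the point where the buffer is wrapped; that is the part that "still falls on you".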

PS: Also recall the binary serialization that came with .Net in the early days? almost nobody used this for anything real despite it offering version support etc. This was because it was notoriously expensive (We found a Microsoft document too that described the internals, it was flexible but ridiculously slow).

Yes, it sucked, for a multitude of reasons. Mostly due to reflection. It didn't offer versioning in the earlier versions of the framework, which made it just as fragile. The addition of versioning didn't alleviate all of the other problems with it, tho.

Korporal commented 5 years ago

@Korporal

Again if someone can show use of Span that results in a fixed size message struct then I'll consider that.

Span<T> is agnostic as to where the data is located. That's the point. You can create a fixed inline buffer of bytes and have Span<char> wrap it, which gives you immediate access to the body of APIs that support Span<char>. But the creation of the fixed buffer still falls on you. Span<T> doesn't care where you put it.

@HaloFour

So this amounts (more or less) to the example given earlier by Cyrus?

public unsafe ref struct LoginMessage2
{
    public fixed char _userName[32];
    public fixed char _password[16];
    public fixed char _otherStuff[8];
    public long MoreOtherStuff;
    public DateTime SomeDate;

    public ReadOnlySpan<char> UserName()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_userName, 32);
        }
    }

    public ReadOnlySpan<char> Password()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_password, 16);
        }
    }

    public ReadOnlySpan<char> OtherStuff()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_otherStuff, 8);
        }
    }

    // The ReadOnlySpan<T>(void*, int) constructor takes a length in elements (chars), not bytes.
    private static ReadOnlySpan<char> CreateSpan(char* pointer, int charCount)
        => new ReadOnlySpan<char>(pointer, charCount);
}

Which I've already critiqued and compared to my example (e.g. why can't the developer have the ability to declare and use an inline string as easily as they can for Int32, DateTime, hard to maintain etc).

The presence of so much repetitive boiler plate code is surely a demonstration of a language inadequacy?

Consider the need to have multiple message structs that must contain a "UserName", changing that from 32 to 64 length would be fragile to say the least.

PS: Also recall the binary serialization that came with .Net in the early days? almost nobody used this for anything real despite it offering version support etc. This was because it was notoriously expensive (We found a Microsoft document too that described the internals, it was flexible but ridiculously slow).

Yes, it sucked, for a multitude of reasons. Mostly due to reflection. It didn't offer versioning in the earlier versions of the framework, which made it just as fragile. The addition of versioning didn't alleviate all of the other problems with it, tho.

Korporal commented 5 years ago

Speaking of ReadOnlySpan where on earth is it? I'm editing a simple .Net Framework project and can't find this...

Korporal commented 5 years ago

Man I can't get the sizeof a fixed buffer, I had hoped that with all the effort on ref etc recently this would have been done!

Joe4evr commented 5 years ago

Speaking of ReadOnlySpan where on earth is it? I'm editing a simple .Net Framework project and can't find this...

You may have to install System.Memory from NuGet then.

Korporal commented 5 years ago

@Joe4evr - Thx.

Korporal commented 5 years ago

This is still a challenge. ReadOnlySpan is certainly a way of doing this but leads to the same problem as our AString_XX problem. Namely we can't create flexible types that encapsulate the boiler plate stuff unless we create a family of them.

e.g.


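    public unsafe struct UserName //: ICloneable, IComparable, IComparable<string>
    {
        private fixed char text[32];

        public ReadOnlySpan<char> Text
        {
            get
            {
                // Caution: the span outlives the pin, so this is only safe while
                // the struct lives on the stack or in otherwise pinned memory.
                fixed (void * c = text)
                {
                    // The length argument counts chars, not bytes.
                    return new ReadOnlySpan<char>(c, sizeof(UserName) / sizeof(char));
                }
            }
        }
    }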

And this is a minimal implementation; if we add more string related support the list of methods grows and must be repeated (because the code requires access to the actual buffer instance) for every type we create (be they named UserName, Password or String_32, String_16).

mikedn commented 5 years ago

I'm not sure if Span is a solution, all we want is to have a string-like type which has memory allocated inline just as we do for Double, Decimal, Int32 etc.

Well, Span is not a solution, never was and never will be. Because your main problem is not how to reference these fixed size, inline strings in code. Your main problem is how to define them in the first place and span has nothing to do with that. Once you're able to define them then yes, Span can be a useful tool to pass such strings around.

While I'm at it - it would be interesting to know how do you pass your own custom types around? Copying? Or you don't, you just rely on implicit conversion to and from string?

if we add more string related support the list of methods grows and must be repeated (because the code requires access to the actual buffer instance) for every type we create (be they named UserName, Password or String_32, String_16).

The idea is that you shouldn't add such methods to these custom types. They should be used only to define the layout of your serialized objects while the rest of the code should rely on span or other means to use these types.

Korporal commented 5 years ago

I'm not sure if Span is a solution, all we want is to have a string-like type which has memory allocated inline just as we do for Double, Decimal, Int32 etc.

Well, Span is not a solution, never was and never will be. Because your main problem is not how to reference these fixed size, inline strings in code. Your main problem is how to define them in the first place and span has nothing to do with that. Once you're able to define them then yes, Span can be a useful tool to pass such strings around.

While I'm at it - it would be interesting to know how do you pass your own custom types around? Copying? Or you don't, you just rely on implicit conversion to and from string?

if we add more string related support the list of methods grows and must be repeated (because the code requires access to the actual buffer instance) for every type we create (be they named UserName, Password or String_32, String_16).

The idea is that you shouldn't add such methods to these custom types. They should be used only to define the layout of your serialized objects while the rest of the code should rely on span or other means to use these types.

@mikedn

Our implementation exposes string conversion operators so one can assign from or to ordinary strings. But that conversion is akin to the Span stuff and requires each type (String_16, String_32 etc) to implement these operations (there is some generic code but not much, every String_XX type must implement boilerplate logic).

Here's what one looks like (can't get at the source just now, so this is a Visual Studio go-to-definition view):

    public struct AString_32 : INativeString, IComparable
    {
        public const int MaxLength = 32;
        [FixedBuffer(typeof(byte), 33)]
        public byte* buffer;

        public AString_32(string InitialText);

        public int Length { get; }

        public override string ToString();

        public static implicit operator AString_32(string SourceText);
        public static implicit operator string(AString_32 SourceText);

        [CompilerGenerated]
        [UnsafeValueType]
        public struct <buffer>e__FixedBuffer3
        {
            public byte FixedElementField;
        }
    }

This hasn't been looked at much for over four years so does not take advantage of the new C# stuff like improved ref support etc.

Korporal commented 5 years ago

OK Here is the source (created from a T4 template)

public interface INativeString
{
   string ToString();
}

public unsafe struct AString_8 : INativeString, IComparable
{

    public const int MaxLength = 8;

    public fixed Byte buffer[9];

    public AString_8(String InitialText)
    {
      fixed (Byte * p = buffer) {p[0] = 0;}
      if (InitialText == null)
         return;
      Text = InitialText;
    }

    public override string ToString()
    {
      fixed (Byte* p = buffer) return (StringWrapper.ANSIPtrToString(p, sizeof(AString_8)));
    }

    private string Text
    {
      set{fixed (Byte* p = buffer) StringWrapper.StringToANSIPtr(value, p, sizeof(AString_8));}
      get{return(ToString());}
    }

    public static implicit operator AString_8(string SourceText)
    {
      return new AString_8(SourceText);
    }

    public static implicit operator string (AString_8 SourceText)
    {
      return(SourceText.ToString());
    }

    public int Length
    {
      get{return(Text.Length);}
    }

    int IComparable.CompareTo(object obj)
    {
      return(Text.CompareTo(obj));
    }

}

A single source file is created that contains many instances of this, differing by size (AString_8, AString_16 etc). The file contains around 128,000 lines of C# (both ANSI strings AString_XX and Unicode strings UString_XX are all in the same file).

The end result is an assembly (NativeStrings.DLL) which is referenced just as any other utility assembly.

Here's an example of a typical message (fragment):

    public class ExceptionSignal : Message<SystemMsgTypes>
    {
        #region Message Payload Fields
        private int os_error_code;
        private int user_error_code;
        private AString_256 error_message;
        private AString_128 method_name;
        private AString_128 original_exception_class_name;
        #endregion

mikedn commented 5 years ago

Our implementation exposes string conversion operators so one can assign from or to ordinary strings. But that conversion is akin to the Span stuff and requires each type (String_16, String_32 etc) to implement these operations (there is some generic code but not much, every String_XX type must implement boilerplate logic).

Yeah, makes sense.

I think it's fair to say that the only way you'd get what you want is to add some kind of parametrized types to the language. And that's unlikely to happen because it's extremely complicated.

Beyond that, it's all a lot of hand waving really (and at almost 100 posts it's a lot of hand waving). Because, while fixed size buffers may get the job done to an extent, they'll always have some limitations. For example, the language could probably be extended to support some kind of implicit conversion from a fixed buffer to a span. So that your users can write something reasonable such as:

class User {
    fixed char _name[32];
    public ReadOnlySpan<char> Name => _name;
}

But that won't get you very far due to the string length issue. The resulting string's length won't really be the length, it will be the capacity. And to fix that you'd need to somehow override the implicit conversion provided by the language with some custom logic. And there's no obvious place where you could put that.
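To illustrate the length versus capacity issue with today's language, here is a rough sketch of a hand-written struct that stores a length next to the buffer. InlineText32 is a hypothetical name, and a separate length field is just one option (a '\0' terminator is another); the point is that this logic has to live in custom code rather than in any conversion the language could supply.

```csharp
using System;

// Hypothetical fixed-capacity, variable-length text stored inline.
public unsafe struct InlineText32
{
    public const int Capacity = 32;

    private fixed char _text[Capacity];
    private int _length; // number of chars in use, at most Capacity

    public int Length => _length;

    public void Set(ReadOnlySpan<char> value)
    {
        if (value.Length > Capacity)
            throw new ArgumentException("Text exceeds capacity.", nameof(value));
        fixed (char* p = _text)
        {
            var destination = new Span<char>(p, Capacity);
            value.CopyTo(destination);
            destination.Slice(value.Length).Clear(); // zero the unused tail
        }
        _length = value.Length;
    }

    public override string ToString()
    {
        fixed (char* p = _text)
        {
            // Wrap only the used portion, not the full capacity.
            return new ReadOnlySpan<char>(p, _length).ToString();
        }
    }
}
```

After `t.Set("Charlie".AsSpan())`, `t.Length` is 7 while the storage is still 32 chars, which is exactly the distinction a plain buffer-to-span conversion would lose.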

Korporal commented 5 years ago

Our implementation exposes string conversion operators so one can assign from or to ordinary strings. But that conversion is akin to the Span stuff and requires each type (String_16, String_32 etc) to implement these operations (there is some generic code but not much, every String_XX type must implement boilerplate logic).

Yeah, makes sense.

I think it's fair to say that the only way you'd get what you want is to add some kind of parametrized types to the language. And that's unlikely to happen because it's extremely complicated.

Beyond that, it's all a lot of hand waving really (and at almost 100 posts it's a lot of hand waving). Because, while fixed size buffers may get the job done to an extent, they'll always have some limitations. For example, the language could probably be extended to support some kind of implicit conversion from a fixed buffer to a span. So that your users can write something reasonable such as:

class User {
    fixed char _name[32];
    public ReadOnlySpan<char> Name => _name;
}

But that won't get you very far due to the string length issue. The resulting string's length won't really be the length, it will be the capacity. And to fix that you'd need to somehow override the implicit conversion provided by the language with some custom logic. And there's no obvious place where you could put that.

@mikedn

Yes I think I agree, when we initially crafted this we could see how neat a parameterized typename would be here but I can also see it is not trivial to add.

This is why I mentioned (99 posts ago!) perhaps introducing a new type that was in fact a value type "string" with a Length and Capacity and a fixed buffer. I guess that is not trivial either, but it could perhaps look like this:

public fixed string _name[32];

Of course this is illegal now but would be used by the compiler to generate something like our UString_32 or something along those lines...

mikedn commented 5 years ago

public fixed string _name[32];

Right, that's something that could work. Basically the compiler can generate the necessary (hidden) fields and probably provide an implicit conversion to/from span or something.