Provide support for fixed capacity, variable length value type (inline) strings.

Korporal commented 5 years ago

Strings in C# are perceived as buffers with, to all intents and purposes, unlimited capacity, and for this reason cannot be stored inline as primitive types are. I'm proposing that consideration be given to introducing an additional string type whose capacity is declared at compile time, and which thus has a maximum possible length.

This then makes it possible to define classes or structs which contain strings yet have these strings appear inline, within the datum's memory, much as primitive types are.

This is a problem that came up in a sophisticated very high performance client server design in which we got huge benefits by being able to define fixed length messages that contained strings. In our case we simulated fixed capacity strings as properties that encapsulated fixed buffers (char or byte). This worked well but was messy because the language offers no way for us to 'pass' (at compile time) a length into a fixed buffer declaration, one must actually declare the fixed buffer explicitly with a constant.

As a result we created a huge family of types named like this: ANativeString_64 and UNativeString_128 (ansi and unicode variants) and so on, as I say this worked but was messy.

Each type was a pure struct (as in the new generic constraint 'unmanaged') so when used as member fields in other structs left that containing struct pure, giving us contiguous chunks of memory that contained strings.

As I say this worked very well but was messy under the hood and challenging to maintain.

So could we consider a new primitive type:

string(64) user_name;

for example?

Such strings could be declared locally resulting in a simple stack allocated chunk, or as members within classes/structs in which case they appear inline just like fixed buffers do...

(just to be clear I'm not seeking the capacity to be defined at runtime but at compile time, and I know my syntax won't work but wanted to convey the idea).

mikedn commented 5 years ago

Though it's probably too simplistic for your needs. That is, it's either exactly public fixed string _name[32]; and that simply generates a hidden int length field and the 32 char fixed array.

Or it's something far more complicated, that allows you to customize the character type and the length field type (or the lack of it). And the more complicated it gets, the less likely it is for it to happen.
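To make the simpler of those two options concrete, here is a minimal sketch of what a declaration like `fixed string _name[32]` might expand to: a hidden length field plus a 32-char fixed buffer. The type name `FixedString32` and its members are hypothetical, nothing here is an existing API, and the code needs unsafe blocks enabled in the project.

```csharp
using System;

// Hypothetical type: roughly what "fixed string _name[32]" might lower to.
public unsafe struct FixedString32
{
    public const int Capacity = 32;

    private int _length;                  // the "hidden" length field
    private fixed char _chars[Capacity];  // inline storage, no heap reference

    public int Length => _length;

    // Copies at most Capacity chars into the inline buffer (longer input is truncated).
    public void Set(ReadOnlySpan<char> text)
    {
        int count = Math.Min(text.Length, Capacity);
        fixed (char* p = _chars)
        {
            text.Slice(0, count).CopyTo(new Span<char>(p, Capacity));
        }
        _length = count;
    }

    // Allocates a new System.String from the inline chars.
    public override string ToString()
    {
        fixed (char* p = _chars)
        {
            return new string(p, 0, _length);
        }
    }
}
```

Because the struct holds only an `int` and a fixed buffer, it satisfies the `unmanaged` constraint and can sit inline in another struct. Silent truncation in `Set` is just one possible policy; throwing on overflow would be equally reasonable.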

CyrusNajmabadi commented 5 years ago

How is your struct better

Simple. I can do it today with no difficulty. I don't need anything else because it's already available and pretty easy (both codewise and conceptually)

I mean, your question isn't that relevant. It's like asking 'why are methods better than this hyperspecialized method-like construct I'm proposing for an exceptionally niche scenario'.

You keep flipping the burden around. It's not my responsibility to explain why the status quo is better. The onus is on you to defend why the language needs to change here for your specific needs.

CyrusNajmabadi commented 5 years ago

Namely we can't create flexible types that encapsulate the boiler plate stuff unless we create a family of them.

I don't understand this. Why is that the case? You have ReadOnlySpan<char> and Span<char>. Why can't you write helpers that work with those? I mean, lots of those helpers already exist as extensions. What are you missing? You're woefully under-specifying what you're actually looking for here.

i.e. you're saying "C# needs to provide something to help here" and when alternatives are offered you say they are insufficient, but you haven't explained why they're insufficient. But you then use that as a continued argument why the language needs to do something. But i don't even know what it is you want from the language because the deficiencies in what's there with Span/ROS aren't explained.

Korporal commented 5 years ago

@CyrusNajmabadi

I don't understand this.

That's painfully clear to us.

YairHalberstadt commented 5 years ago

@Korporal Who exactly is the us? You seem to be the only person asking this. CyrusNajmabadi is one of the main contributors to Roslyn, and was a member on the LDC. He has an extremely thorough understanding of both language and compiler design. If you want your proposal accepted, you are going to need to convince people like him it's a good idea. Insults don't help your case. Perhaps instead of lashing out, you could take some of the feedback he offers into consideration. He has often disagreed with me in the short time I've participated in this repo and in Roslyn, and he is almost always right.

Korporal commented 5 years ago

@YairHalberstadt

Mikedn's replies to me make it clear that he understands the problem. Cyrus genuinely seems not to; this is not an insult, he himself said he doesn't understand and I agree with that.

Incidentally, I've dealt with these kinds of issues in C# for over a decade and I'm no lightweight myself, I've also developed compilers. In fact I was the first to report this C# compiler bug a month ago, precisely because of high performance C# work.

https://github.com/dotnet/roslyn/issues/31439

Korporal commented 5 years ago

@CyrusNajmabadi - Alright I will try yet again to explain to you the problem I am describing.

How is your struct better

Simple. I can do it today with no difficulty. I don't need anything else because it's already available and pretty easy (both codewise and conceptually)

Yours and mine are both something we can "do today", so that is hardly "better", Cyrus. Now "need" is subjective and no doubt a common theme when discussing programming languages; since it's subjective we have no formal definition.

Now here is my struct followed by your proposed struct:

// This is a single contiguous block of bytes and contains no reference types.

public struct LoginMessage 
{
   AString_32 UserName;
   AString_16 Password;
   AString_8  OtherStuff;
   long       MoreOtherStuff;
   DateTime   SomeDate;
}

LoginMessage msg = new LoginMessage();
msg.UserName   = "Charlie"; // current implementation is null-terminated text
msg.OtherStuff = "Other";

byte[] bytes = RuntimeSupport.Serialize(ref msg); // often less than a microsecond on an i7-3960
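`RuntimeSupport.Serialize` is not shown anywhere in the thread. As a rough illustration of why a struct without reference types can be serialized with a single block copy, here is a minimal sketch using `MemoryMarshal`. The class name `BlittableSerializer` is hypothetical, and the sketch assumes no length or type header is written, which differs from the format described later in the discussion.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper: copies the raw bytes of any struct that contains no references.
public static class BlittableSerializer
{
    public static byte[] Serialize<T>(ref T value) where T : unmanaged
    {
        // View the single value as a span of bytes, then copy it out in one go.
        ReadOnlySpan<byte> bytes =
            MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref value, 1));
        return bytes.ToArray();
    }

    public static T Deserialize<T>(ReadOnlySpan<byte> bytes) where T : unmanaged
    {
        // Reads sizeof(T) bytes from the start of the span; throws if the span is too short.
        return MemoryMarshal.Read<T>(bytes);
    }
}
```

The `unmanaged` constraint is what makes this safe to write generically: a struct with a `string` field would be rejected at compile time.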

and your proposal:

public unsafe ref struct LoginMessage2
{
    public fixed char _userName[32];
    public fixed char _password[16];
    public fixed char _otherStuff[8];
    public long MoreOtherStuff;
    public DateTime SomeDate;

    public ReadOnlySpan<char> UserName()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_userName, 32);
        }
    }

    public ReadOnlySpan<char> Password()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_password, 16);
        }
    }

    public ReadOnlySpan<char> OtherStuff()
    {
        fixed (LoginMessage2* c = &this)
        {
            return CreateSpan(c->_otherStuff, 8);
        }
    }

    // The span length is counted in chars (elements), not bytes.
    private static ReadOnlySpan<char> CreateSpan(char* pointer, int charCount)
        => new ReadOnlySpan<char>(pointer, charCount);
}

If you think that LoginMessage2 is "better" than LoginMessage then we're at an impasse and I cannot force you to adjust your view.

Yours requires the developer of LoginMessage2 to write a set of properties, and this number increases as the number of (buffer) string fields increases. It also exposes Spans rather than strings, making the manipulation of a LoginMessage2 all the more verbose.

Yours requires the developer to ensure that the 32 or the 16 (buffer sizes) is repeated correctly in both the buffer declaration and the property that manipulates the buffer; mine does not.

Your code would break if the developer altered a buffer declaration length but forgot to alter the property too; mine has no such shortcoming.

As is clear from what I've said so far my LoginMessage works fine, it runs - "today" - which is why I'm puzzled you would use "do it today" as some form of differentiator between what we do "today" and what you can write "today"; it isn't.

I mean, your question isn't that relevant. It's like asking 'why are methods better than this hyperspecialized method-like construct I'm proposing for an exceptionally niche scenario'.

Yet I said no such thing Cyrus.

You keep flipping the burden around. It's not my responsibility to explain why the status quo is better. The onus is on you to defend why the language needs to change here for your specific needs.

I'm suggesting that formal serious consideration be given to enabling the creation of code like LoginMessage without the need for me to create a large set of types (AString_32 etc). The simplicity of LoginMessage should be crystal clear and enabling this at a language level is what I'm discussing, as I said @mikedn clearly understands what I'm discussing so perhaps you should read some of his replies.

I really cannot help you understand any further and I have no idea why you cannot understand my position here. I refuse to repeat myself any further and if that means I earn your disfavor and the issue gets closed - so be it.

Thank you.

YairHalberstadt commented 5 years ago

@Korporal What you can't do though, is use your AString_n types in methods that accept a string. And you would need huge code duplication to get a method to work for all AString_n types.

However the ecosystem already supports Span in a lot of places. That is the advantage.

Korporal commented 5 years ago

@Korporal What you can't do though, is use your AString_n types in methods that accept a string. And you would need huge code duplication to get a method to work for all AString_n types.

However the ecosystem already supports Span in a lot of places. That is the advantage.

@YairHalberstadt

Thanks, if the language (or CLR) cannot be changed to provide this then that's fine - I am simply seeking to examine alternatives. If we can't add this to the language then fine, our current strategy of generating a family of AString_XX types works well but is not ideal; it certainly offers far more than Cyrus's proposed approach - IMHO.

YairHalberstadt commented 5 years ago

Can you think of any way the language could be changed to support this, excluding duplicating every method that accepts a string to accept all sizes of AString_n types?

YairHalberstadt commented 5 years ago

And the CLR is not going to be changed to support this. Changes to the CLR API only ever occur when there is an overwhelming benefit to do so, and the .Net Framework API looks like it's not going to be updated at all.

Korporal commented 5 years ago

Can you think of any way the language could be changed to support this, excluding duplicating every method that accepts a string to accept all sizes of AString_n types?

@YairHalberstadt - That's a great question, I had hoped there'd be suggestions from the gurus here but clearly this is not straightforward. I will post some ideas with more detail for you guys to consider/critique.

Thx

YairHalberstadt commented 5 years ago

I think the only way to do so would be using ReadOnlySpan and/or ReadOnlyMemory.

But once you are doing that I believe your string types could be generated once using CodeGen, and you're sorted. No need to add a language feature to do CodeGen for you, unless it's a seriously common use case.

Korporal commented 5 years ago

@YairHalberstadt @CyrusNajmabadi @mikedn

One idea is to leverage the upcoming support for interfaces with default implementations. Then we could write:

public unsafe struct UserName : IValueTypeString<UserName>
{
    private fixed char text[32];
}

where we have

public interface IValueTypeString<T> where T : unmanaged
{
   // Stuff to get at and manipulate the "text" field in "this" instance.
   // Also need to get at the length of "text" too.
   // Ideally include static members so we can cache details from reflection.
   public string Text
   {
      get {...}
      set {...}
   }
}

I'm unfamiliar with the new interface type's rules so this may not even work. But if it did then this would be a step forward because a developer could create an inline string quite easily, e.g. our AString_32 would become:


public unsafe struct AString_32 : IValueTypeString<AString_32>
{
   private fixed char text[32];
}

Although the developer does need to define the type it is very easy for them to do so, the underlying interface would do most of the manipulation/conversion in a general purpose way.

YairHalberstadt commented 5 years ago

That would require boxing the struct every time a method is called on it
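For readers unfamiliar with the term, here is a small sketch of what boxing a struct means in practice. It does not use default interface implementations (the struct implements its members itself); it only shows that converting a struct to an interface type creates a separate heap copy. The names `ICounter`, `Counter` and `BoxingDemo` are made up for illustration.

```csharp
public interface ICounter
{
    void Increment();
    int Count { get; }
}

public struct Counter : ICounter
{
    private int _count;
    public void Increment() => _count++;
    public int Count => _count;
}

public static class BoxingDemo
{
    public static (int Original, int Boxed) Run()
    {
        Counter counter = default;

        // Converting the struct to its interface type allocates a box on the
        // heap and copies the struct's bytes into it.
        ICounter boxed = counter;
        boxed.Increment();

        // The increment happened on the boxed copy, not on the original value.
        return (counter.Count, boxed.Count);
    }
}
```

For an inline string type this matters twice over: each such conversion allocates, and it copies the whole buffer.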

Korporal commented 5 years ago

@YairHalberstadt

What about some variant of the String type then?

I can envisage a type - like String (call it VString for now) - in which the type is a value type that contains a fixed buffer along with an actual instance of (slightly modified) String in which the string's buffer pointer is the address of the fixed buffer rather than some block allocated from the managed heap...

In principle all the data would be inline in the declaring outer struct...

Basically this amounts to an ability to allocate a String object and its text buffer inline - in the struct's memory block - rather than the managed heap.

These are just thoughts and no doubt bad!

YairHalberstadt commented 5 years ago

What's wrong with this as a code-generated API?

using System;

public unsafe struct AString_32
{
    public fixed char chars[32];

    public ReadOnlySpan<Char> AsSpan()
    {
        fixed (char*  c = chars)
        {
            return new ReadOnlySpan<Char>(c, 32); // length is in chars, not bytes
        }
    }
}

Korporal commented 5 years ago

@YairHalberstadt

What's wrong with this as a code-generated API?

using System;

public unsafe struct AString_32
{
    public fixed char chars[32];

    public ReadOnlySpan<Char> AsSpan()
    {
        fixed (char*  c = chars)
        {
            return new ReadOnlySpan<Char>(c, 32); // length is in chars, not bytes
        }
    }
}

Nothing wrong, but it's less powerful than what we generate already:

public interface INativeString
{
   string ToString();
}

public unsafe struct AString_8 : INativeString, IComparable
{

    public const int MaxLength = 8;

    public fixed Byte buffer[9];

    public AString_8(String InitialText)
    {
      fixed (Byte * p = buffer) {p[0] = 0;}
      if (InitialText == null)
         return;
      Text = InitialText;
    }

    public override string ToString()
    {
      fixed (Byte* p = buffer) return (StringWrapper.ANSIPtrToString(p, sizeof(AString_8)));
    }

    private string Text
    {
      set{fixed (Byte* p = buffer) StringWrapper.StringToANSIPtr(value, p, sizeof(AString_8));}
      get{return(ToString());}
    }

    public static implicit operator AString_8(string SourceText)
    {
      return new AString_8(SourceText);
    }

    public static implicit operator string (AString_8 SourceText)
    {
      return(SourceText.ToString());
    }

    public int Length
    {
      get{return(Text.Length);}
    }

    int IComparable.CompareTo(object obj)
    {
      return(Text.CompareTo(obj));
    }

}

Korporal commented 5 years ago

Looking at String it's pretty complex; being able to leverage this logic or clone it in some way so the instance and its buffer are both allocated inline (within a struct's field block) would be interesting.

Here's one place where the buffer is accessed - the String code may be largely agnostic to where the buffer actually is.

YairHalberstadt commented 5 years ago

I can envisage a type - like String (call it VString for now) - in which the type is a value type that contains a fixed buffer along with an actual instance of (slightly modified) String in which the string's buffer pointer is the address of the fixed buffer rather than some block allocated from the managed heap...

What will happen when you do something like this:

string M()
{
    var vString = new VString("HelloWorld");

    return vString.String;
}

Then you would have a pointer to invalid memory in your string

Essentially this is impossible without an ownership model.

What you're suggesting is doable in C++, and idiomatic in Rust. It is however impossible in C#.

YairHalberstadt commented 5 years ago

Your current code generated API requires boxing the struct, allocating a new string, and copying over the chars into the new string every time you want to call a string method on it. Using a ReadOnlySpan solves that problem.

Korporal commented 5 years ago

@YairHalberstadt

I can envisage a type - like String (call it VString for now) - in which the type is a value type that contains a fixed buffer along with an actual instance of (slightly modified) String in which the string's buffer pointer is the address of the fixed buffer rather than some block allocated from the managed heap...

What will happen when you do something like this:
string M()
{
    var vString = new VString("HelloWorld");

    return vString.String;
}
Then you would have a pointer to invalid memory in your string

Essentially this is impossible without an ownership model.

What you're suggesting is doable in C++, and idiomatic in Rust. It is however impossible in C#.

We could impose a rule similar to that used for fixed buffers - only valid within a struct...

Korporal commented 5 years ago

Your current code generated API requires boxing the struct, allocating a new string, and copying over the chars into the new string every time you want to call a string method on it. Using a ReadOnlySpan solves that problem.

Yes, the code is dated however, and I think it could be improved by using the recently enhanced ref support, but I'd have to dive in to get more on that.

Anyway the main goal is to have the raw data inline - that's what enables very fast serialization; the overhead of setting and getting the string is secondary.

For example we can write a stream of messages to a disk file very rapidly (and read from a file) because the serialization support includes length and type data. The runtime cost of getting at this or that string property isn't a big concern.

We can (for example) get at message 124,236 in a file and deserialize it very rapidly indeed.
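The random access described here follows from every record having the same size: record `index` starts at byte `index * size`. A minimal sketch, assuming a headerless file of back-to-back records (the real format in the thread also carries length and type data, which is not modelled here). `FixedRecordFile` and `ReadRecord` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical reader for a stream of fixed-size, reference-free records.
public static class FixedRecordFile
{
    public static T ReadRecord<T>(Stream stream, long index) where T : unmanaged
    {
        int size = Unsafe.SizeOf<T>();

        // Fixed-size records mean the offset is a simple multiplication.
        stream.Seek(index * size, SeekOrigin.Begin);

        // Stream.Read may return fewer bytes than requested, so loop until full.
        byte[] buffer = new byte[size];
        int read = 0;
        while (read < size)
        {
            int n = stream.Read(buffer, read, size - read);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }

        // Reinterpret the bytes as T; no per-field parsing is involved.
        return MemoryMarshal.Read<T>(buffer);
    }
}
```

With a variable-width format the same lookup would need an index or a scan, which is the trade-off raised later in the thread.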

Korporal commented 5 years ago

@YairHalberstadt

What I find interesting (and this is not a criticism of anyone, the team or the language) is that something that seems on the surface straightforward actually presents such a big challenge.

YairHalberstadt commented 5 years ago

We could impose a rule similar to that used for fixed buffers - only valid within a struct...

So how would you ever use it? You can't pass it into a method which accepts a string, as maybe the method stores the string.

YairHalberstadt commented 5 years ago

So if you have a codegened API that works for you, what exactly do you need from the language?

Korporal commented 5 years ago

@YairHalberstadt

So if you have a codegened API that works for you, what exactly do you need from the language?

Simply because the pre-generated code cannot include every conceivable buffer capacity; we gen AString_8, AString_16, AString_24 up to something like AString_10240, with in-between sizes unavailable.

This doesn't kill us but I wanted to explore (with the experts) possible options for making this a first class language feature, if this is truly very challenging and costly then that's fine but I am not the best judge - you guys are.

My frustration with Cyrus is that he didn't seem to know what I was trying to explain and that was becoming an impasse.

YairHalberstadt commented 5 years ago

If all you want is some codegened types which someone else is responsible for maintaining, then why don't you suggest they add them in CoreFX?

As far as I can see your current proposal has two parts.

A) provide a shorthand syntax for declaring these fixed size strings (string(32) instead of string_32). Not going to happen - no upside to this.

B) make a string_32 usable as a normal string. This is impossible given the programming model of the CLR. The best you can do is use ReadOnlySpan, but that doesn't require any changes from the language.

So what exactly are you asking for?

Korporal commented 5 years ago

If all you want is some codegened types which someone else is responsible for maintaining, then why don't you suggest they add them in CoreFX?

As far as I can see your current proposal has two parts.

A) provide a shorthand syntax for declaring these fixed size strings (string(32) instead of string_32). Not going to happen - no upside to this.

B) make a string_32 usable as a normal string. This is impossible given the programming model of the CLR. The best you can do is use ReadOnlySpan, but that doesn't require any changes from the language.

So what exactly are you asking for?

@YairHalberstadt - The starting problem statement is asking if it would be possible for C# to support a mutable, inline, fixed capacity, variable length string "type" so that we can create pure value type structs that contain text values as well as primitive values.

Currently pure (as in "unmanaged" generic constraint) structs can only be composed of primitive types or other structs composed of primitive types none of which have any text/string like capabilities.

Recognizing that inline fixed buffers are already supported, I wanted to see if that support could be enhanced or built upon as a possible means of doing this. Being able to assign these from and to a conventional string is the primary goal.

YairHalberstadt commented 5 years ago

The answer is no.

That is fundamentally not how the .Net programming model works.

Korporal commented 5 years ago

@YairHalberstadt - What about some additional operators then, for example tofixed and tostring:

public unsafe struct SomeMessage
{
   private fixed char username[32];
   public string Username
   {
      get {return tostring(username);}
      set { tofixed(value,username);}
   }
}

These operators being confined to working with fixed buffers? This would be better overall than having to generate the code we do; despite the fact the developer must define the property, it's very easy to do - with some kind of "operator" like this.

Note that we can't write (e.g. static) helper methods like this now because getting the capacity of an arbitrary fixed buffer is very hard to do, unless we jump through hoops (as I show in a different thread).

HaloFour commented 5 years ago

Being able to assign these from and to a conventional string is the primary goal.

This is impossible without CLR support, and that's unlikely to happen as System.String can be passed around arbitrarily and doing such with stack space is inherently dangerous, hence the strict rules that C# has around ref locals/returns. As it stands System.String is always heap allocated*. Your current APIs don't avoid these allocations or their costs, they just defer them. And that I think would greatly impact what you consider your deserialization performance if you're not also taking into account the cost of negotiating the string properties of those structs.

* I want to say that I've seen hacks that would allow you to treat stack space as a managed heap object, but you'd have to allocate that memory to match what the reference type expects. For System.String that would be a length and a pointer to the actual string data, so you'd be forced to rewrite the buffer to match that format with the pointer pointing to a location in the buffer. You wouldn't be able to deserialize any blob of bytes as-is.

YairHalberstadt commented 5 years ago

Why not just write a function toString and toFixed?

The general consensus among C# language wonks is that C# has too many operators to start with. An operator just adds complexity to the language with very little benefit. Especially for such a rare scenario as yours.

Korporal commented 5 years ago

@YairHalberstadt - see recent edit:

Note that we can't write (e.g. static) helper methods like this now because getting the capacity of an arbitrary fixed buffer is very hard to do, unless we jump through hoops (as I show in a different thread).

YairHalberstadt commented 5 years ago

Then the sizeof operator for fixed size buffers is the relevant addition to the language, not these operators.

Besides, you could currently just cache the length and pass that in. The effort of doing so is not worth a language feature.

Korporal commented 5 years ago

@HaloFour

Being able to assign these from and to a conventional string is the primary goal.

This is impossible without CLR support, and that's unlikely to happen as System.String can be passed around arbitrarily and doing such with stack space is inherently dangerous, hence the strict rules that C# has around ref locals/returns. As it stands System.String is always heap allocated*. Your current APIs don't avoid these allocations or their costs, they just defer them. And that I think would greatly impact what you consider your deserialization performance if you're not also taking into account the cost of negotiating the string properties of those structs.

We're not too concerned about the conversion costs, the alternative is a different form of serialization where String is fully supported in our message types. But that immediately becomes a far greater cost than what we do now (we compared this) and prevents us from passing pointers to these structs around, this is another point (and why we did some of this) is that we can create structs that contain text fields yet we can get their address - not possible when struct contains reference types.

A key cost in high performance system like trading systems and so on is needlessly moving data, the less data you move and the faster you can move it the better. Particularly when you make heavy use of IPC as we do.

I want to say that I've seen hacks that would allow you to treat stack space as a managed heap object, but you'd have to allocate that memory to match what the reference type expects. For System.String that would be a length and a pointer to the actual string data, so you'd be forced to rewrite the buffer to match that format with the pointer pointing to a location in the buffer. You wouldn't be able to deserialize any blob of bytes as-is.

Korporal commented 5 years ago

@YairHalberstadt

Then the sizeof operator for fixed size buffers is the relevant addition to the language, not these operators.

Yes this is probably a better request.

Besides, you could currently just cache the length and pass that in. The effort of doing so is not worth a language feature.

I'm inclined to agree but the caching incurs a runtime cost (even after being cached to a dictionary) all to get a simple integer constant that was known at compile time. The more types and buffer sizes one has the greater that cost becomes too as the dictionary grows.

Getting the physical size of a fixed buffer (which is always wholly composed of 'n' fixed size primitive types) should, I argue, not require user code, caches etc. and the associated cost - this is a compile time constant, don't forget.

How complex would it be to enable sizeof to accept an identifier that is a fixed buffer declaration, which simply returns n * sizeof(buffer_type) - a compile time constant?
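One workaround available today, and apparently the one the `AString_8` code earlier in the thread relies on with `sizeof(AString_8)`, is to wrap the buffer in its own struct and take the size of the wrapper. A minimal sketch, assuming the wrapper contains nothing but the buffer; `CharBuffer32` and `FixedBufferInfo` are hypothetical names, and unsafe blocks must be enabled.

```csharp
// Hypothetical wrapper whose only field is the fixed buffer.
public unsafe struct CharBuffer32
{
    public fixed char Chars[32];
}

public static class FixedBufferInfo
{
    // sizeof(T) is allowed for any unmanaged T inside an unsafe context.
    // Dividing by the element size recovers the element count, provided
    // the struct holds only the buffer and no other fields or padding.
    public static unsafe int CharCapacity<T>() where T : unmanaged
        => sizeof(T) / sizeof(char);
}
```

This avoids the dictionary cache, but it only works when each buffer gets its own wrapper type, which is the family-of-types problem the thread started with.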

YairHalberstadt commented 5 years ago


public interface IFixedBuffer
{
    int FixedBufferLength { get; }
}

public static class FixedBufferExtensions
{
    public static string ToString<T>(this T buffer) where T : IFixedBuffer
    {
        var length = buffer.FixedBufferLength;
        ...
    }
}

using System;

public unsafe struct AString_32 : IFixedBuffer
{
    public fixed char chars[32];

    public ReadOnlySpan<Char> AsSpan()
    {
        fixed (char*  c = chars)
        {
            return new ReadOnlySpan<Char>(c, 32); // length is in chars, not bytes
        }
    }

    public int FixedBufferLength => 32;
}

svick commented 5 years ago

@Korporal

I think the main problem with your arguments is that you're asking for a language feature that would specifically benefit your codebase. I don't think that's going to happen, not without demonstrating how that feature would benefit many other codebases.

Specifically:

Anyway the main goal is to have the raw data inline - that's what enables very fast serialization; the overhead of setting and getting the string is secondary.

I would like to see some evidence for that. It seems to me that you're not eliminating costs, you're just moving them around. That can be beneficial in some cases (e.g. when you're working with a single property on a large type), but are those cases widespread enough?

We can (for example) get at message 124,236 in a file and deserialize it very rapidly indeed.

That's an argument for fixed-width serialized format, but not necessarily fixed-width in-memory format. Also, a similar effect could be achieved by using a variable-width format along with an index, or even a database.

A key cost in high performance system like trading systems and so on is needlessly moving data, the less data you move and the faster you can move it the better.

That's what confuses me about your approach: you are needlessly moving data, when compared with simple string fields:

When you write a property, you copy the whole string, instead of just a pointer.
When you read a property, you always allocate the string (which includes a copy), instead of allocating it only once at deserialization. (And you could probably do even better if you used Span<char> instead.)
When you copy the struct, you copy all the strings, instead of just few pointers.

Korporal commented 5 years ago

@svick

@Korporal

I think the main problem with your arguments is that you're asking for a language feature that would specifically benefit your codebase. I don't think that's going to happen, not without demonstrating how that feature would benefit many other codebases.

It seems that you're correct here, also from what others say even wide appeal features stand only a small chance of getting included.

Specifically:

Anyway the main goal is to have the raw data inline - that's what enables very fast serialization; the overhead of setting and getting the string is secondary.

I would like to see some evidence for that. It seems to me that you're not eliminating costs, you're just moving them around. That can be beneficial in some cases (e.g. when you're working with a single property on a large type), but are those cases widespread enough?

Consider updating say an option price, we can do it pretty much like this:

Option * option_ptr = datastore.GetItem<Option>(key); // can be updated soon to use new "ref" support.

option_ptr->bid_price = new_price;

This is a tiny cost (including the GetItem()) and enables updates to data at a very high rate and very low CPU cost, perhaps just a 16-byte change (e.g. a Decimal), despite the fact that the Option might have many fields (including text fields like name, exchange etc).
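The comment in the snippet above mentions moving from pointers to the newer "ref" support. A minimal sketch of what that could look like with ref returns, using a plain managed array as a stand-in for the proprietary datastore (which, per the thread, actually lives in native memory). `Option`, `OptionStore` and `PriceUpdater` are hypothetical names.

```csharp
// Hypothetical record type with only unmanaged fields.
public struct Option
{
    public long Id;
    public decimal BidPrice;
    public decimal AskPrice;
}

// Stand-in for the datastore: hands out references to records in place.
public sealed class OptionStore
{
    private readonly Option[] _items;

    public OptionStore(int capacity) => _items = new Option[capacity];

    // A ref return exposes the stored struct itself, not a copy.
    public ref Option GetItem(int index) => ref _items[index];
}

public static class PriceUpdater
{
    public static void UpdateBid(OptionStore store, int index, decimal newPrice)
    {
        // Only the BidPrice field is written; the rest of the record is untouched.
        ref Option option = ref store.GetItem(index);
        option.BidPrice = newPrice;
    }
}
```

The update writes a single field of the stored record, which is the same effect as the pointer version but without an unsafe context.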

The datastore incidentally is rather specialized and proprietary and local to the machine running the update operations, we can write to the store like this for example:

Option some_new_option = ...;

datastore.Write(ref some_new_option);

Because the code (a bit dated now but we can convert a ref to a ptr and vice versa with support code) can serialize very rapidly (using what I'm calling "memcpy" for ease of discussion) this too is very fast and low CPU.

We can (for example) get at message 124,236 in a file and deserialize it very rapidly indeed.

That's an argument for fixed-width serialized format, but not necessarily fixed-width in-memory format. Also, a similar effect could be achieved by using a variable-width format along with an index, or even a database.

As soon as the format begins to deviate from its in-memory layout you begin to incur significant costs. Nothing comes close to a single "memcpy" (e.g. CopyBlock). We can do this and have "strings" because of the AString_XX stuff we have.

Furthermore, because the data is stored in an identical structure to its managed memory layout we can use managed code (via pointers, but we could use ref more now since it's been extended) to update the data because the layout is identical.

A key cost in high performance system like trading systems and so on is needlessly moving data, the less data you move and the faster you can move it the better.

That's what confuses me about your approach: you are needlessly moving data, when compared with simple string fields:

Not really, most of the work is updates and most of it to non-string fields.

When you write a property, you copy the whole string, instead of just a pointer.

When you read a property, you always allocate the string (which includes a copy), instead of allocating it only once at deserialization. (And you could probably do even better if you used Span<char> instead.)

When you copy the struct, you copy all the strings, instead of just a few pointers.

This is true but as I've said earlier we don't update the "string" stuff much at all, these may be part of a lookup key or data that's used when reports are pulled for example. But 85% of the work is perhaps updating primitive numeric fields and 15% perhaps writing new items both of which operations are very fast.

Bear in mind that the datastore is part of the update service's (a Windows service) address space but not part of the AppDomain. This is a specialized datastore technology (with much of it written in C and Win32 as a native API), and without knowing that, some of what I've said in this thread may not appear to make much sense.

MillKaDe commented 5 years ago

@Korporal

Check this proposal, which would add int parameter(s) to generics: #749

If that proposal gets implemented, you could do something like this:

struct ValueString<CH, const int SZ> {
  fixed CH chars[SZ]; // fixed size inline array with SZ elements of type CH
  // misc functions, properties, operators, ...
}

ValueString<char, 16> MyStringU16; // fixed size string-like value type with 16 Unicode chars
ValueString<byte, 64> MyStringA64; // fixed size string-like value type with 64 Ansi/Ascii chars

To reduce code bloat, the functions of ValueString could be implemented in an inner private empty (field-less) static class / struct. These inner helper functions would take a Span<> (which contains size and address of the fixed array) as parameter. The functions of the outer ValueString struct would be simple (and therefore maybe inline-able) wrappers around the inner work functions ...

Note, that proposal 749 is not limited to chars, bytes, strings, one-dimensional arrays, fixed arrays ...
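Until something like #749 exists, the same shape can be approximated by hand for one capacity at a time. The sketch below is illustrative only: `ValueString16` and `ValueStringOps` are hypothetical names, the capacity is a hard-coded constant, and the code must be compiled with unsafe blocks enabled. It follows the layout described above, with the logic in a field-less helper that works on spans and the struct acting as a thin wrapper around a fixed buffer.

```csharp
using System;

// Field-less helper shared by every capacity variant; it only sees spans.
static class ValueStringOps
{
    // Copies as much of text as fits, zero-fills the rest, returns the length stored.
    public static int Store(ReadOnlySpan<char> text, Span<char> buffer)
    {
        int count = Math.Min(text.Length, buffer.Length);
        text.Slice(0, count).CopyTo(buffer);
        buffer.Slice(count).Clear();
        return count;
    }
}

// One hand-written capacity; the proposal would turn 16 into a generic argument.
unsafe struct ValueString16
{
    public const int Capacity = 16;

    private fixed char _chars[Capacity]; // stored inline in the struct
    private int _length;

    public int Length => _length;

    public void Set(ReadOnlySpan<char> text)
    {
        fixed (char* p = _chars)
        {
            _length = ValueStringOps.Store(text, new Span<char>(p, Capacity));
        }
    }

    public override string ToString()
    {
        fixed (char* p = _chars)
        {
            return new string(p, 0, _length);
        }
    }
}

static class Demo
{
    static void Main()
    {
        var name = new ValueString16();
        name.Set("alice");
        Console.WriteLine(name.ToString()); // alice
    }
}
```

Input longer than the capacity is silently truncated here; whether to truncate or throw is a design choice this sketch does not try to settle.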

Korporal commented 5 years ago

@Korporal

Check this proposal, which would add int parameter(s) to generics: #749

If that proposal gets implemented, you could do something like this:

struct ValueString<CH, const int SZ> {
  fixed CH chars[SZ]; // fixed size inline array with SZ elements of type CH
  // misc functions, properties, operators, ...
}

ValueString<char, 16> MyStringU16; // fixed size string-like value type with 16 Unicode chars
ValueString<byte, 64> MyStringA64; // fixed size string-like value type with 64 Ansi/Ascii chars

To reduce code bloat, the functions of ValueString could be implemented in an inner private empty (field-less) static class / struct. These inner helper functions would take a Span<> (which contains size and address of the fixed array) as parameter. The functions of the outer ValueString struct would be simple (and therefore maybe inline-able) wrappers around the inner work functions ...

Note, that proposal 749 is not limited to chars, bytes, strings, one-dimensional arrays, fixed arrays ...

@MillKaDe - Good lord, how did I miss that (I think someone else mentioned it and I glossed over it - inexcusable).

You are absolutely right, that is exactly what's called for. I think this would work for me, very glad you mentioned this!

Thanks

CyrusNajmabadi commented 5 years ago

If you think that LoginMessage2 is "better" than LoginMessage then we're at an impasse and I cannot force you to adjust your view.

I definitely think it's better. It's something you can do today. It uses the well-supported and understood 'Span/ReadOnlySpan' types. It's really simple (though it does require some boilerplate in a few places). It will interoperate with the rest of the high-perf, low-overhead side of C#/.NET.

Creating something new for this niche case seems pretty objectively worse. It would take years to get it. Would likely need an entirely new way of working with it. Would have to have a design around how it could work in the ref/span world, etc. etc.

CyrusNajmabadi commented 5 years ago

What I find interesting (and this is not a criticism of anyone, the team or the language) is that something that seems on the surface straightforward actually presents such a big challenge.

You're proposing something that wants to introduce a very different programming model than the one that C# has had since 1.0, while also interoperating seamlessly with 20+ years of existing APIs. That's non-trivial.

It's equivalent to me coming to Rust and asking it to have a totally different ownership model than what it has today. Or going to C++ and wanting lexical scoping to work entirely differently. It may be 'something that seems on the surface straightforward', but it only seems that way because it ignores the deep design decisions and history involved here.

Korporal commented 5 years ago

@CyrusNajmabadi - All I can say in response to your most recent remarks is that it seems to me you've ultimately designed yourselves into a corner. If inline string data types cannot be supported (and this is a rather trivial concept just look at strings in Pascal or PL/1) without the Herculean effort you claim, then that has to tell us something about how you've all designed this.

I can see now why you've been so critical, it's not that what I asked for is some huge piece of functionality, it's because your design and model is too restrictive, too inflexible.

YairHalberstadt commented 5 years ago

@Korporal

If inline string data types cannot be supported (and this is a rather trivial concept just look at strings in Pascal or PL/1) without the Herculean effort you claim, then that has to tell us something about how you've all designed this.

Indeed it does. It tells us that C# is a safe, garbage collected language without an ownership model.

You might as well say that if Prolog cannot support object oriented programming without Herculean effort then that has to tell us something about how they've all designed it.

This is how the .Net programming model works. End of story. If you need to do something the programming model doesn't support, use a different language.

Korporal commented 5 years ago

@YairHalberstadt @CyrusNajmabadi - These analogies don't really help nor do I regard them as valid to be frank. Creating a supposed analogy (make Prolog more OO or change the scoping rules in C++) and then discrediting the analogy is referred to as a strawman argument in philosophy and logic, it has no place in a serious technical discussion.

HaloFour commented 5 years ago

@Korporal

If inline string data types cannot be supported

Span<char> inline = stackalloc char[100];

And it seems that there may be interest in treating fixed buffers as spans, which eliminates some boilerplate as you can use them in an expanding ecosystem of APIs.

These analogies don't really help nor do I regard them as valid to be frank.

Every language is a tradeoff of different philosophical concerns. Languages that allow arbitrary stack allocation and reinterpretation are inherently much less safe than C#, especially if they don't have an ownership model. C# and the CLR never have to concern themselves with whether or not the memory backing a string has gone out of scope. This is why the compiler is so strict when it comes to ref locals/returns.
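To make the stackalloc line above a little more concrete, here is a small sketch of filling such a buffer without allocating until the very end. `FormatUser` is a hypothetical helper invented for this example; `Span<char>`, `stackalloc`, `CopyTo`, and `int.TryFormat` are standard APIs.

```csharp
using System;

static class Demo
{
    // Hypothetical helper: writes "<name>#<id>" into destination, returns chars written.
    static int FormatUser(ReadOnlySpan<char> name, int id, Span<char> destination)
    {
        name.CopyTo(destination); // throws if destination is too short
        destination[name.Length] = '#';

        if (!id.TryFormat(destination.Slice(name.Length + 1), out int written))
        {
            throw new ArgumentException("Destination too small.", nameof(destination));
        }

        return name.Length + 1 + written;
    }

    static void Main()
    {
        // The buffer lives in this method's stack frame; nothing is heap-allocated here.
        Span<char> buffer = stackalloc char[100];

        int length = FormatUser("alice", 42, buffer);

        // The only allocation: materializing a string at the boundary.
        string text = buffer.Slice(0, length).ToString();
        Console.WriteLine(text); // alice#42
    }
}
```

The strictness mentioned above shows up as soon as the span tries to outlive the frame: storing `buffer` in a class field or returning it from `Main`'s callee is rejected at compile time.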

YairHalberstadt commented 5 years ago

It is not a strawman argument. You're arguing that something which is easy in a language with a completely different programming model is difficult in C#. Hence C# is badly designed.

We're pointing out that this is an obviously nonsensical argument, and giving some examples of the sort of nonsense conclusions you would come to if you applied that argument.

Korporal commented 5 years ago

@Korporal

If inline string data types cannot be supported

Span<char> inline = stackalloc char[100];

And it seems that there may be interest in treating fixed buffers as spans, which eliminates some boilerplate as you can use them in an expanding ecosystem of APIs.

These analogies don't really help nor do I regard them as valid to be frank.

Every language is a tradeoff of different philosophical concerns. Languages that allow arbitrary stack allocation and reinterpretation are inherently much less safe than C#, especially if they don't have an ownership model. C# and the CLR never have to concern themselves with whether or not the memory backing a string has gone out of scope. This is why the compiler is so strict when it comes to ref locals/returns.

Clearly there is no prospect of getting what I sought and that's fine; if the experts see this as a huge challenge then I respect that. But I never asked for arbitrary stack allocation or reinterpretation! What I did seek was a mutable, fixed-capacity, string-like value type which could be used in struct fields in much the same way as primitive types or fixed buffers.

dotnet / csharplang

Provide support for fixed capacity, variable length value type (inline) strings. #2099