dotnet / csharplang

The official repo for the design of the C# programming language

Provide support for fixed capacity, variable length value type (inline) strings. #2099

Closed Korporal closed 5 years ago

Korporal commented 5 years ago

Strings in C# are, to all intents and purposes, buffers of unlimited capacity, and for this reason they cannot be stored inline as primitive types are. I'm proposing that consideration be given to introducing an additional string type whose capacity is declared at compile time, and which therefore has a maximum possible length.

This then makes it possible to define classes or structs which contain strings yet have those strings appear inline, within the datum's memory, much as primitive types do.

This is a problem that came up in a sophisticated, very high performance client-server design in which we got huge benefits by being able to define fixed-length messages that contained strings. In our case we simulated fixed capacity strings as properties that encapsulated fixed buffers (char or byte). This worked well but was messy, because the language offers no way for us to 'pass' (at compile time) a length into a fixed buffer declaration; one must actually declare the fixed buffer explicitly with a constant.

As a result we created a huge family of types named like ANativeString_64 and UNativeString_128 (ANSI and Unicode variants) and so on. As I say, this worked but was messy.

Each type was a pure struct (as in the new generic constraint 'unmanaged'), so when used as member fields in other structs they left the containing struct pure, giving us contiguous chunks of memory that contained strings.

As I say this worked very well but was messy under the hood and challenging to maintain.

So could we consider a new primitive type:

string(64) user_name;

for example?

Such strings could be declared locally, resulting in a simple stack-allocated chunk, or as members within classes/structs, in which case they would appear inline just like fixed buffers do...

(Just to be clear, I'm seeking the capacity to be defined at compile time, not at runtime, and I know my syntax won't work, but I wanted to convey the idea.)

HaloFour commented 5 years ago

How do you propose something like this be implemented? Types in the CLR must have a known size, so the only method would be to emit a different (and incompatible) struct for every size of this "string".

Korporal commented 5 years ago

@HaloFour - In a similar way to how fixed buffers are. In fact these could be implemented as fixed buffers wrapped in some syntactic sugar.

Korporal commented 5 years ago

string(64)

Becomes

unsafe struct string_64
{
    int curr_length;
    fixed char text[64];
}

plus a bunch of properties etc.
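For illustration, a hedged sketch of what that "bunch of properties" might look like on such a struct (the member names and behavior here are assumptions, not the actual generated code):

using System;

// Sketch only: assumed shape of a generated fixed-capacity string struct.
public unsafe struct string_64
{
    private int curr_length;
    private fixed char text[64];

    public int Capacity => 64;
    public int Length => curr_length;

    // Copy the inline characters out into an ordinary string.
    public override string ToString()
    {
        fixed (char* p = text)
            return new string(p, 0, curr_length);
    }

    // Copy an ordinary string in, truncating to the fixed capacity.
    public static implicit operator string_64(string value)
    {
        var result = default(string_64);
        result.curr_length = Math.Min(value?.Length ?? 0, 64);
        for (int i = 0; i < result.curr_length; i++)
            result.text[i] = value[i];
        return result;
    }

    public static implicit operator string(string_64 value) => value.ToString();
}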

HaloFour commented 5 years ago

You mention that fixed buffers didn't work for your solution. Any similar solution for strings would have the same limitations, the length would have to be a constant known at compile time. Why weren't fixed buffers sufficient for your purposes?

HaloFour commented 5 years ago

plus a bunch of properties etc.

That's the other problem, you have to generate a bunch of separate members just to make these things workable, and they'd all be incompatible with one another, as well as all normal string APIs.

Korporal commented 5 years ago

@HaloFour

We wanted consumers of our client/server API to be able to freely declare messages; here's what a user-defined message might look like (the code is unavailable to me just now, so excuse typos).

public class LoginMessage : Message
{
    public UNativeString_64 Name;        // 'U' for Unicode
    public UNativeString_16 Password;
    public UNativeString_32 Application;
    ...
}

This is how we wanted consumers to use it (and they do), but as you can see we needed a family of types (structs) for a large set of predefined capacities. We use a T4 template to generate these types. Because structs cannot inherit, we could not use an abstract base class, so we had to rely on an interface (INativeString).

That interface defined to/from string conversions and compare etc.

With the above design it worked well, but a user could not use a UNativeString_132 if that wasn't one of the variants we created, and we could not create one for every possible capacity; we stopped at 2048 or so and went up in steps like 4, 8, 16, 32, 64, 96, 128, etc.

So as you can see, the consumer has no idea how UNativeString is implemented, and they cannot even see the underlying fixed buffer. (There's also an ANativeString for single-byte charsets.)

In user code these members were interchangeable with string because of the conversion operators, so consumers had no idea that these were not actually strings.

At runtime a LoginMessage was a single contiguous block of memory, something we can serialize very quickly indeed (a million instances per second on an i7 3960 CPU core).
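For concreteness, the INativeString interface described above might have looked something like the sketch below (its actual members aren't shown in this thread, so these are assumptions):

// Assumed shape only; the real INativeString definition is not shown in the thread.
public interface INativeString
{
    int Capacity { get; }
    int Length { get; }
    string AsString();           // to-string conversion
    void Assign(string value);   // from-string conversion, truncating to Capacity
    int CompareTo(INativeString other);
}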

Korporal commented 5 years ago

plus a bunch of properties etc.

That's the other problem, you have to generate a bunch of separate members just to make these things workable, and they'd all be incompatible with one another, as well as all normal string APIs.

Exactly, that's the motivation for this post, to introduce a new kind of string type that provides all this out of the box.

HaloFour commented 5 years ago

Seems like a very highly specialized solution that would have very narrow benefits but is massively complicated.

Exactly, that's the motivation for this post, to introduce a new kind of string type that provides all this out of the box.

You're not asking for one string type, you're asking for 2 billion potential string types, every single one of them with a separate set of members. That would certainly result in metadata explosion.

Korporal commented 5 years ago

Seems like a very highly specialized solution that would have very narrow benefits but is massively complicated.

Exactly, that's the motivation for this post, to introduce a new kind of string type that provides all this out of the box.

You're not asking for one string type, you're asking for 2 billion potential string types, every single one of them with a separate set of members. That would certainly result in metadata explosion.

@HaloFour

Well I'm certainly asking for something but not quite that. What we want (ultimately) is some syntactic mechanism that can convert this:

string(75) user_name;

into this:

unsafe struct string_75
{
    fixed char user_name[75];

    // ...various properties...
}

Or ideally a new type that is implemented better than this, but one that enables the user to declare the capacity at compile time without them needing any knowledge of the implementation.

C# could perhaps support the passing of constants into the declaration of type instances; this would probably be enough.

I mean support this:

public class MyClass(int size) // This language feature would require the supplied value to be a compile-time constant.
{
    private fixed byte Name[size];
}

This way we could create types whose layout depends on a compile-time constant, while letting the consumer of the type supply that constant.

Then a consumer could just code:

MyClass(79) SomeMessage;

This is I think the fundamental requirement here, a way to propagate compile time constants into type declarations...

HaloFour commented 5 years ago

This is I think the fundamental requirement here, a way to propagate compile time constants into type declarations...

That would involve CLR changes, and the end result is effectively the same, it's just that now the CLR has to generate potentially 2 billion different flavors of that class.

Korporal commented 5 years ago

This is I think the fundamental requirement here, a way to propagate compile time constants into type declarations...

That would involve CLR changes, and the end result is effectively the same, it's just that now the CLR has to generate potentially 2 billion different flavors of that class.

@HaloFour

Perhaps, so I guess what I'm asking for is a better way to solve this problem. On the surface it sounds rudimentary: provide support for fixed capacity (value type based, and hence inline) strings. If we forget what I've said above and what we currently do to implement this, and just step back and view this as an abstract problem, are there options?

Many other languages support the idea of fixed capacity strings so in principle this isn't a major challenge...or is it?

We used fixed buffer wrapper structs only because we had no other way to deliver this, but being able to do this at a deeper language/CLR level might well be slicker and less messy; some of the problems you mention may be due purely to the way we implemented it and not necessarily inherent in the problem itself.

CyrusNajmabadi commented 5 years ago

Let's work backwards on this a bit. In many cases the language has adopted these sorts of 'less easy to use' but 'much more performant' solutions when the gains were made quite explicit. Could you show some real world examples that would benefit from this (along with measurements)? Basically, a real world piece of code that you would envision using this.

To get the perf measurements, it would likely suffice to convert that real code to use fixed-size-buffers and see what the resulting difference was.

HaloFour commented 5 years ago

Many other languages support the idea of fixed capacity strings so in principle this isn't a major challenge...or is it?

The challenge here is the CLR, which offers no real facility to accomplish this. Without the CLR it would be relatively easy. Languages that did support fixed-length strings, like Visual Basic, lost them in the transition to .NET.

Korporal commented 5 years ago

Let's work backwards on this a bit. In many cases the language has adopted these sorts of 'less easy to use' but 'much more performant' solutions when the gains were made quite explicit. Could you show some real world examples that would benefit from this (along with measurements)? Basically, a real world piece of code that you would envision using this.

To get the perf measurements, it would likely suffice to convert that real code to use fixed-size-buffers and see what the resulting difference was.

@CyrusNajmabadi @HaloFour

The application is a .NET to .NET messaging platform. Because of this, instances can be serialized (basically) as a memcpy, and the performance is outstanding (as I mentioned, one benchmark sees 1,000,000 instances per second on an i7 3960 core; these are small messages, but they did contain a few of these fixed length strings).

Other forms of serialization do not achieve these levels.

Recent advances in C# (improved ref support, generic pointers and the unmanaged constraint) solve many of the problems we had to address through our lower level code, and we could rewrite some of these lower layers now and simplify them quite a lot.

However, the inline fixed capacity strings remain a contrivance in our design; as I explained, this is quite ugly (but works very well).

Because a type layout in the CLR is the same across different machines (guaranteed if we're using the same version of the CLR at each end), it is very easy to serialize an instance (even of a class) provided all fields are inline. The recipient then simply creates an instance of the type and "overwrites" its field block with the received block of bytes.

That's the principle anyway (of course we also need to send and cache type descriptions and so on but this is all part of the low level handshaking and protocol).

All of this sits on top of a robust async socket management layer with ring buffers and so on, but at the outer level app developers can just create classes that inherit from Message and everything works very well and very fast (we do runtime checks that ensure their class truly consists only of unmanaged fields, caching this info for later reuse).

Of course sending the source for this is not easy as the system contains lots of other proprietary stuff and includes some dynamic method generation for certain things.

I'm sure you get the idea though and I'm sure you can understand how this achieves very high performance.
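For comparison, blit-style serialization of an unmanaged message can be sketched with today's Span APIs roughly as follows (this is not the proprietary implementation described above; the QuoteMessage type and all names are purely illustrative):

using System;
using System.Runtime.InteropServices;

// Illustrative message: every field is inline, no reference types.
public unsafe struct QuoteMessage
{
    public long Timestamp;
    public double Price;
    public fixed char Symbol[16];
}

public static class Blit
{
    // Reinterpret the struct as raw bytes and copy them into the outgoing buffer.
    public static int Serialize<T>(ref T message, Span<byte> destination) where T : unmanaged
    {
        ReadOnlySpan<byte> bytes = MemoryMarshal.AsBytes(MemoryMarshal.CreateReadOnlySpan(ref message, 1));
        bytes.CopyTo(destination);
        return bytes.Length;
    }
}

// Usage sketch:
//   var msg = new QuoteMessage { Timestamp = 1, Price = 42.5 };
//   Span<byte> wire = stackalloc byte[64];
//   int written = Blit.Serialize(ref msg, wire);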

HaloFour commented 5 years ago

@Korporal

Because a type layout in the CLR is the same across different machines

That sounds like a very dangerous assumption. From my understanding, unless you're using explicit layout, the CLR will lay out the members of that struct any way it sees fit, which can differ based on platform.

Anyhow, in regards to benchmarking, you're probably going to have to demonstrate that difference and why only this particular solution is suitable. And you're likely going to have to compare that to the various other high-performant serialization libraries out there.
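As a side note, the layout concern can be reduced by pinning the layout down explicitly with StructLayout; a minimal sketch (type and field names are illustrative):

using System.Runtime.InteropServices;

// Sequential layout with Pack = 1 removes padding differences between platforms,
// at the cost of potentially unaligned field access.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public unsafe struct LoginMessageWire
{
    public int NameLength;
    public fixed char Name[64];
    public int PasswordLength;
    public fixed char Password[16];
}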

CyrusNajmabadi commented 5 years ago

The application is a .NET to .NET messaging platform. Because of this, instances can be serialized (basically) as a memcpy, and the performance is outstanding (as I mentioned, one benchmark sees 1,000,000 instances per second on an i7 3960 core; these are small messages, but they did contain a few of these fixed length strings).

Other forms of serialization do not achieve these levels.

  1. Can you give numbers on what the values would be here with a normal string?
  2. can you provide a small, but somewhat realistic example program? i.e. i don't want a total micro-benchmark, but i would like to see something showing an expected usage pattern, with normal incoming data, and how this would be different.

Thanks!

Korporal commented 5 years ago

@Korporal

Because a type layout in the CLR is the same across different machines

That sounds like a very dangerous assumption. From my understanding, unless you're using explicit layout, the CLR will lay out the members of that struct any way it sees fit, which can differ based on platform.

Anyhow, in regards to benchmarking, you're probably going to have to demonstrate that difference and why only this particular solution is suitable. And you're likely going to have to compare that to the various other high-performant serialization libraries out there.

@HaloFour - All designs involve compromises and assumptions; provided one is very clear about what these are and takes steps to verify them at runtime where necessary, what we do works very well. For users who use Windows and the same hardware architecture (Intel, AMD) on the participating nodes (not a huge requirement), they can get these gains in performance.

CyrusNajmabadi commented 5 years ago

Anyhow, in regards to benchmarking, you're probably going to have to demonstrate that difference and why only this particular solution is suitable. And you're likely going to have to compare that to the various other high-performant serialization libraries out there.

Agreed. This is very much feeling like a library problem currently. It may be necessary to elevate it to a CLR/language problem. But it would be really necessary to demonstrate why existing library solutions are insufficient.

Note: CoreFx/asp.net was pretty involved in passing feedback along about the areas they needed help for perf. That's what led to all the ref/span/readonly stuff. I don't recall any feedback about this particular area. And they're def trying to make high-perf servers.

Korporal commented 5 years ago

The application is a .NET to .NET messaging platform. Because of this, instances can be serialized (basically) as a memcpy, and the performance is outstanding (as I mentioned, one benchmark sees 1,000,000 instances per second on an i7 3960 core; these are small messages, but they did contain a few of these fixed length strings).

Other forms of serialization do not achieve these levels.

  1. Can you give numbers on what the values would be here with a normal string?
  2. can you provide a small, but somewhat realistic example program? i.e. i don't want a total micro-benchmark, but i would like to see something showing an expected usage pattern, with normal incoming data, and how this would be different.

Thanks!

  1. I could spend time doing that, but as soon as the object's fields contain reference types one must use an alternative serialization method, and even protocol buffers do not come close.

  2. Not sure exactly what you're seeking here; do you mean what the user would write, or what the underlying architecture looks like?

At the outermost level an app developer creates an instance of MessageChannel that provides sync/async ways to send/recv data. For example SendMessage (sync) takes a Message instance (that is, a user class derived from Message).

The base message class contains a lot of low level mechanisms that enable us to get the address and size of the object's field block and then "memcpy" it to a byte[] (the rest you can probably envisage).

Korporal commented 5 years ago

I should add too that the system includes various optional compression and encryption modes, but these are not rocket science as you can imagine. I cannot overstress the impact that making all data inline has; this is the key to outstanding performance (I used to work in the City of London many years ago and have a lot of experience in this area on various platforms and languages).

CyrusNajmabadi commented 5 years ago

the rest you can probably envisage).

I'd prefer if there was a real piece of code that could be used as the exemplar case here. :)

Honestly, i'm not trying to make your life hard. I'm just pointing out that for language features that exist solely for perf needs, we need real world code to look at and understand, so we can best assess what the right sort of solutions would be (and just to validate how things would improve).

--

Another way of putting it:

Imagine if we added this feature... and then you used it... and it didn't make performance any better. The feature would be a failure at its core goal. So we actually need some way of validating things.

Furthermore, imagine if we added this feature, and you couldn't use it, because there was some limitation (akin to the ref limitations we have), and your own use case violated that limitation. This would also then fail.

CyrusNajmabadi commented 5 years ago

I should add too that the system includes various optional compression and encryption modes, but these are not rocket science as you can imagine. I cannot overstress the impact that making all data inline has...

But you need to. Because other teams that are doing precisely this are not coming to us with this being a use case that must be addressed. These other teams are working in very competitive arenas, trying to squeeze out all the perf they can. Right now, this isn't a place they are finding problematic. So it's hard to gauge for certain if what you are saying is generally applicable, or if this is a very specific problem to your domain.

CyrusNajmabadi commented 5 years ago

Another way of putting it: You're the one asking for this to be done. Like it or not, that means the legwork is on you to provide enough compelling data to make others feel like this is worthwhile. It's unlikely that anyone else is going to go do it for you. So, to maximize your chance of success here, it is necessary to go beyond just saying you'd find it useful for your use case :)

Korporal commented 5 years ago

the rest you can probably envisage).

I'd prefer if there was a real piece of code that could be used as the exemplar case here. :)

Honestly, i'm not trying to make your life hard. I'm just pointing out that for language features that exist solely for perf needs, we need real world code to look at and understand, so we can best assess what the right sort of solutions would be (and just to validate how things would improve).

--

Another way of putting it:

Imagine if we added this feature... and then you used it... and it didn't make performance any better. The feature would be a failure at its core goal. So we actually need some way of validating things.

Furthermore, imagine if we added this feature, and you couldn't use it, because there was some limitation (akin to the ref limitations we have), and your own use case violated that limitation. This would also then fail.

@CyrusNajmabadi - I understand, Cyrus. I guess the only way to show the benefit would be to alter our library to support an alternative serialization method, but the system is deeply predicated on this (at the core anyway), so this would be quite a lot of work.

Bear in mind that the performance gain here is pure CPU; the cost of CPU time in sending and receiving these messages is far lower than with something that uses XML serialization, MS binary serialization or protocol buffers.

A memcpy of a message is very tiny, perhaps hundreds of nanoseconds or less, on an i7 3960 (our reference CPU when working on this).

Your question is interesting because we need to compare this architecture with another one and I don't have that other one.

HaloFour commented 5 years ago

@Korporal

Your question is interesting because we need to compare this architecture with another one and I don't have that other one.

What I would suggest is reimplementing a very basic form of this serialization architecture that could be compared directly to other serialization methods. After all, any solution here would have to be very general purpose.

Korporal commented 5 years ago

I did have benchmarks of the serialization layer; I'll see if I can dig these out - that might be a start!

Korporal commented 5 years ago

@Korporal

Your question is interesting because we need to compare this architecture with another one and I don't have that other one.

What I would suggest is reimplementing a very basic form of this serialization architecture that could be compared directly to other serialization methods. After all, any solution here would have to be very general purpose.

I agree, and recent support for generic pointers and the unmanaged constraint could be used to remove some of our runtime verification steps (like where we check that a type contains no reference fields).
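A minimal sketch of how the unmanaged constraint can replace that runtime check (illustrative only; the type names are assumptions):

using System.Runtime.CompilerServices;

public static class MessageInfo
{
    // The 'unmanaged' constraint makes the compiler reject any T that contains
    // reference-type fields, so no reflection-based verification is required.
    public static int SizeOf<T>() where T : unmanaged => Unsafe.SizeOf<T>();
}

// MessageInfo.SizeOf<LoginMessage>() compiles only if LoginMessage is a pure struct;
// a type with a string field (or any other reference field) fails at compile time.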

Korporal commented 5 years ago

I must get a flight soon, but thanks for taking the time to explore this.

tannergooding commented 5 years ago

This proposal is very similar to the fixed-sized buffer proposal which is already championed: https://github.com/dotnet/csharplang/issues/1314. However, this proposal seems more specialized.

CyrusNajmabadi commented 5 years ago

@CyrusNajmabadi - I understand, Cyrus. I guess the only way to show the benefit would be to alter our library to support an alternative serialization method, but the system is deeply predicated on this (at the core anyway), so this would be quite a lot of work.

Understood :) But... well... that comes with the territory. If you want to make a language change (esp. one related to perf), that's just how it goes. The only way to escape it is to get someone excited enough to do it for you :)

CyrusNajmabadi commented 5 years ago

This proposal is very similar to the fixed-sized buffer proposal which is already championed: #1314. However, this proposal seems more specialized.

@tannergooding Agreed. One thing interesting about this space is that @Korporal seems to want to be able to provide additional functionality on top of a fixed-size buffer. So, for example, imagine you had a fixed char mystring[256]. If you could say something like:

public static int IndexOf(this in Span<char> str, in Span<char> value) { ... }

And could have the language automatically make a Span out of a fixed-size buffer for you. Then you could add methods/functionality to these guys, while still maintaining the benefits of the fixed-size stuff.
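A hedged sketch of what bridging a fixed-size buffer to Span<char> looks like by hand today, so that the usual span-based APIs can be used (the struct and its members are illustrative):

using System;

public unsafe struct FixedString256
{
    public int Length;
    public fixed char Text[256];

    public int IndexOf(ReadOnlySpan<char> value)
    {
        fixed (char* p = Text)
        {
            // The span must not escape the fixed block, since the containing struct
            // may live inside a movable object on the managed heap.
            return new ReadOnlySpan<char>(p, Length).IndexOf(value, StringComparison.Ordinal);
        }
    }
}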

HaloFour commented 5 years ago

@tannergooding

Sorta, only to the extent that it also deals with fixed data. But if anything it layers on top of fixed char which is already supported, but with an additional field somewhere for "actual length" and with additional APIs to help with interop with string/Span<char>. But the primary use case here seems to be a very specialized serialization mechanism which wouldn't be a benefit to anyone else.

What also really concerns me is that general support for this in the language might encourage developers to go hog-wild with it and declare their entity types with all "fixed-length strings" resulting in tons and tons of these structs being generated.

CyrusNajmabadi commented 5 years ago

@tannergooding Good points!

tannergooding commented 5 years ago

But if anything it layers on top of fixed char which is already supported,

Right, which is part of what the generalized fixed-sized buffer proposal does. The generalized fixed-sized buffer proposal relies on some of the new functionality around ref types so that it would work with any type (including reference types like string or object). What isn't explicit in the fixed-sized buffer proposal today is implicit conversion to Span, but I don't think that would be unreasonable to expose (and it would be trivial to create an extension method wrapper otherwise).

resulting in tons and tons of these structs being generated.

Yes. For primitive types (like char) that have a guaranteed size on all platforms, there isn't too much concern. You get a struct with a single field and an explicit size (just like fixed-sized buffers today), so the metadata bloat isn't bad. However, for non-primitive structs or for platform-sized types (like IntPtr or reference types), you end up having to declare a field per element to ensure that the layout will be correct across various platforms/architectures. This does quickly bloat metadata and would need a solution to mitigate it. One solution that I had brought up before was providing some helper types in S.R.CompilerServices that represent the various powers of 2 (up to some pre-determined limit, around "normal" usage). This ensures that each power only needs 2 fields (of the previous power) to be represented and reduces any metadata bloat. Them being part of the framework means that they can be shared across libraries as well.
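A rough illustration of that power-of-two helper-type idea (these are not actual framework types; the names and shapes are assumptions):

// Each power of two is just two blocks of the previous power, so a buffer of N
// elements can be composed from O(log N) shared definitions instead of N fields per type.
public struct Block1<T> where T : unmanaged { public T E0; }
public struct Block2<T> where T : unmanaged { public Block1<T> Lo, Hi; }
public struct Block4<T> where T : unmanaged { public Block2<T> Lo, Hi; }
public struct Block8<T> where T : unmanaged { public Block4<T> Lo, Hi; }
// ...and so on up to some predetermined limit.

// A fixed "buffer" of 8 chars plus a current length is then just two fields:
public struct Name8
{
    public Block8<char> Chars;
    public int Length;
}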

Korporal commented 5 years ago

I've been thinking about this and have an outline for a nearly-as-fast serialization design. However, this still requires us to know that a given string has a max capacity. I'm now considering that a simple attribute could be used; the serializer would use this value and we could use ordinary strings. The domain here is high volume options trading.

The problem, though, is that we could not leverage the new generic pointer capability or the unmanaged constraint (we did the equivalent ourselves) because strings are reference types.

Much of what I've said here today amounts to a request to provide a value type string data type; that's the crux of this.

In our current implementation users can create message instances containing string-like fields and we can do a memcpy of the entire message - this works now, and I would dearly like to revise the code to use the new ref and generic pointer features etc.
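A minimal sketch of the attribute idea mentioned above (the attribute and its name are hypothetical, not an existing API):

using System;

// Hypothetical marker: a serializer reads Capacity and reserves that many chars
// inline in the wire format, while user code keeps working with ordinary strings.
[AttributeUsage(AttributeTargets.Field | AttributeTargets.Property)]
public sealed class FixedCapacityAttribute : Attribute
{
    public FixedCapacityAttribute(int capacity) => Capacity = capacity;
    public int Capacity { get; }
}

public class LoginMessage
{
    [FixedCapacity(64)] public string Name;
    [FixedCapacity(16)] public string Password;
}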

HaloFour commented 5 years ago

@Korporal

IIRC the ASP.NET team is currently working on a new JSON library that will be based on ref and Span<char> for the sake of performance; it might be worth paying attention to.

I'd wager that it's exceptionally rare for people to want to serialize into a raw memory format like this, so I'm not sure that would stand by itself as a justifiable use case. And if you're deserializing by blitting into these fixed buffers and then converting that to strings afterwards it sounds like you're just delaying the performance penalty, not avoiding it.

arekbal commented 5 years ago

@Korporal TLDR It is not enough to have something like what you hope for in the CLR. You should default to code generation...

As I often argued (in discussions outside of this "forum")... any powerful serializer would HAVE to use code generation as the basic approach. Reflection sucks if you care about performance (and you should), especially when you talk about custom types and binary serialization, which could be packed in so many ways if only you agree to use a custom contract for producer and consumer. Examples:

  • SMS codes: that's only a subset of upper-case letters and some of the numbers (no O vs 0 confusion and so on).
  • All quantities could be described with variable-length ints.
  • Delta encoding with smaller bit sizes for arrays of recognized values. And so on... so on...

Ideally, you wouldn't want to build full CLR types out of it, because it means big objects and data duplication (CLR types + buffers), which is lame for performance. So, in a performance-oriented world, buffer builders for outputs and/or buffer views on inputs would be all you got... I think a DBMS design orientation would be the right way to do it. I am willing to provide help to anyone building something like that. ;)
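As one concrete example of the "variable length ints" item above, a protobuf-style varint writer looks roughly like this (a sketch, not code from the thread):

using System;

public static class VarInt
{
    // Writes a 32-bit value 7 bits at a time; small values take a single byte.
    public static int Write(uint value, Span<byte> destination)
    {
        int count = 0;
        while (value >= 0x80)
        {
            destination[count++] = (byte)((value & 0x7F) | 0x80); // low 7 bits + continuation flag
            value >>= 7;
        }
        destination[count++] = (byte)value;
        return count;
    }
}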

Korporal commented 5 years ago

@Korporal

IIRC the ASP.NET team is currently working on a new JSON library that will be based on ref and Span<char> for the sake of performance; it might be worth paying attention to.

I'd wager that it's exceptionally rare for people to want to serialize into a raw memory format like this, so I'm not sure that would stand by itself as a justifiable use case. And if you're deserializing by blitting into these fixed buffers and then converting that to strings afterwards it sounds like you're just delaying the performance penalty, not avoiding it.

Hi @HaloFour - The rarity is no doubt true, but it does not arise from a lack of utility, only from a lack of opportunity, since such an option is all but impossible given the language's current serialization capabilities. It's like arguing against investing in mouse technology during the days of keyboard-only interaction; mice were, as you know, rare.

The cost of conversion between a .NET string and a fixed buffer implementation of fixed capacity strings isn't a significant concern. It is the ability to serialize from/to contiguous blocks that provides the huge performance gains; this is pretty much impossible to do without support for a fixed capacity string, which forces much more costly serialization mechanisms to be used.

Korporal commented 5 years ago

@Korporal TLDR It is not enough to have something like what you hope for in the CLR. You should default to code generation...

As I often argued (in discussions outside of this "forum")... any powerful serializer would HAVE to use code generation as the basic approach. Reflection sucks if you care about performance (and you should), especially when you talk about custom types and binary serialization, which could be packed in so many ways if only you agree to use a custom contract for producer and consumer. Examples:

  • SMS codes: that's only a subset of upper-case letters and some of the numbers (no O vs 0 confusion and so on).
  • All quantities could be described with variable-length ints.
  • Delta encoding with smaller bit sizes for arrays of recognized values. And so on... so on...

Ideally, you wouldn't want to build full CLR types out of it, because it means big objects and data duplication (CLR types + buffers), which is lame for performance. So, in a performance-oriented world, buffer builders for outputs and/or buffer views on inputs would be all you got... I think a DBMS design orientation would be the right way to do it. I am willing to provide help to anyone building something like that. ;)

@arekbal

Much of what you write is true, but bear in mind we do not currently rely on simplistic reflection. Instead we simply get a pointer to the start of an instance's fields, having already determined (a one-time operation) the fixed number of bytes these occupy.

This lets us copy that raw byte block to a comms buffer, and upon receipt we do the opposite - create an instance of the same type, get the address of its instance fields, and simply (to all intents and purposes) do a "memcpy". That's it, that's all we do, and we can do this for objects that contain strings ONLY because those strings are of a known fixed capacity.

The simple fact here is that by having fixed capacity "strings" we can leverage this technique for objects that must contain text (very, very common in the securities industry).

Note that getting the physical size of an arbitrary object's instance fields is also not supported on any platform; we do this using a dynamically generated helper method that we create once only, during the first attempt to serialize/deserialize a type.

Both of these (fixed capacity strings and determination of the true size of an object's fields) are necessary for our technique, but the latter isn't much effort and is buried inside our lower level support. Fixed capacity strings, however, are MUCH better when represented and fully supported at the C# language level; there is no sound reason in most programming languages to insist that all text/string data be assumed to never have a maximum capacity.
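The receive side described above can be sketched in the same spirit (again illustrative, not the proprietary code; QuoteMessage is the example type from the earlier sketch):

using System;
using System.Runtime.InteropServices;

public static class BlitReader
{
    // Reinterpret the received bytes as an instance of the unmanaged message type.
    public static T Deserialize<T>(ReadOnlySpan<byte> source) where T : unmanaged
        => MemoryMarshal.Read<T>(source);
}

// Usage sketch:
//   QuoteMessage msg = BlitReader.Deserialize<QuoteMessage>(receivedBytes);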

jnm2 commented 5 years ago

I'd wager that it's exceptionally rare for people to want to serialize into a raw memory format like this

https://capnproto.org/

Korporal commented 5 years ago

In fact I recall in days of old using PL/I, which supported "fixed" and "varying" strings. Neither of these abstractions is represented in Java or C#, yet both are incredibly practical when it comes to interop - very odd.

Korporal commented 5 years ago

I'd wager that it's exceptionally rare for people to want to serialize into a raw memory format like this

https://capnproto.org/

Indeed, and this technique is very common in older non-OO languages, particularly in the securities industry. The stuff we have is comparable to what's described on that website but is for .NET, and boy is it fast.

CyrusNajmabadi commented 5 years ago

@Korporal Have you considered that your needs are simply too specialized/niche to warrant a general purpose solution at the language level?

CyrusNajmabadi commented 5 years ago

Indeed, and this technique is very common in older non-OO languages, particularly in the securities industry.

I mean... there's lots of stuff in older (and newer) languages not represented in C#. That's pretty much par for the course :) C# isn't trying to be a superset of all things.

CyrusNajmabadi commented 5 years ago

there is no sound reason in most programming languages to insist that all text/string data be assumed to never have a maximum capacity.

Of course there is. Such a concept adds complexity to the language. And it is totally reasonable for the language to want to limit complexity in any given area.

Korporal commented 5 years ago

@Korporal Have you considered that your needs are simply too specialized/niche to warrant a general purpose solution at the language level?

@CyrusNajmabadi

Yes, I've considered that which is why despite the fact this technology was crafted ten years ago I've never publicly described it.

CyrusNajmabadi commented 5 years ago

Both of these (fixed capacity strings and determination of the true size of an object's fields) are necessary for our technique, but the latter isn't much effort and is buried inside our lower level support.

Is there a reason that fixed-size buffers wouldn't work for you? Or simply a struct with a char* in it where you know the size and memcpy accordingly?

Why do you need a new concept for this? What would that actually buy you here?

CyrusNajmabadi commented 5 years ago

Yes, I've considered that which is why despite the fact this technology was crafted ten years ago I've never publicly described it.

Ok, given that, it doesn't seem surprising to me that C# would not view this as special enough to warrant focus.

Contrast that with, say, UTF8 strings which would have a large and immediate benefit for a much broader set of users. Your space just seems too niche to warrant all the exceptional effort that would be needed here.

Now, if .net/asp/corefx came along and said: the lack of this is what is killing us right now. Then, that might change things.

Korporal commented 5 years ago

there is no sound reason in most programming languages to insist that all text/string data be assumed to never have a maximum capacity.

Of course there is. Such a concept adds complexity to the language. And it is totally reasonable for the language to want to limit complexity in any given area.

@CyrusNajmabadi

I can't discuss the supposed complexity for the C# language and CLR, since I do not have the degree of insight into the internals that some here do. I will say, though, that an inline fixed capacity string with either a fixed or varying (current) length is rather a simple concept IMHO.

The fact that I can't create a simple struct containing text fields and have that struct be a contiguous block is IMHO a weakness, and one that should be addressed if the C# language team seeks to continue to improve performance, particularly in the growing area of IoT technology (think .NET Micro Framework, for example).

Korporal commented 5 years ago

Yes, I've considered that which is why despite the fact this technology was crafted ten years ago I've never publicly described it.

Ok, given that, it doesn't seem surprising to me that C# would not view this as special enough to warrant focus.

Contrast that with, say, UTF8 strings which would have a large and immediate benefit for a much broader set of users. Your space just seems too niche to warrant all the exceptional effort that would be needed here.

Now, if .net/asp/corefx came along and said: the lack of this is what is killing us right now. Then, that might change things.

It's not for me to decide what the language team do and do not decide to devote their efforts to. Adding (what is in effect) a value type form of string (call it text) may be complicated or may be easy. The implementation I have described in fact serves as a way to do this and could conceivably be added as syntactic sugar, and I'm sure many here could suggest how this could be written.

For example:

public struct option_quote
{
   public double Price;
   public inline(32) string OptionName;
}

This could be a true reference type string as it exists now, but where the memory for it is not allocated from some pool (e.g. the stack or managed heap) but is actually a reference to an address literally situated within the struct instance itself.

Or we could wrap a fixed buffer in a struct and add supporting conversions etc., except that there is no means within the language to pass a constant into a struct declaration in such a way that the constant can be used in the fixed buffer length specification.

e.g.

public unsafe struct Text
{
    private fixed char text[X];
    // Conversion properties etc. follow here
}

C# currently affords me no way to pass a constant int X into such a struct declaration and therefore no means of creating a general purpose struct that can do what we want here.

This is why we had to create (using T4 technology) Text_16, Text_32, Text_64, etc. - so perhaps introducing a way to pass such a compile-time constant into a type like this is all that's needed - is that easier to do?
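For what it's worth, one workaround available in the language today is to carry the capacity in a marker type parameter rather than an integer; a hedged sketch (all names are illustrative):

using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// The capacity is encoded in a buffer struct rather than an int, so a single generic
// wrapper can serve every capacity without T4-generating Text_16, Text_32, ...
public unsafe struct CharBuffer64 { public fixed char Chars[64]; }

public struct Text<TBuffer> where TBuffer : unmanaged
{
    public TBuffer Buffer;                  // inline storage
    public int Length;                      // current length, in chars

    public int Capacity => Unsafe.SizeOf<TBuffer>() / sizeof(char);

    // Static so the returned span's lifetime is tied to the caller's variable
    // rather than to a copy of 'this'.
    public static Span<char> Chars(ref Text<TBuffer> text)
        => MemoryMarshal.Cast<TBuffer, char>(MemoryMarshal.CreateSpan(ref text.Buffer, 1));
}

// Usage sketch:
//   var name = default(Text<CharBuffer64>);   // 64-char inline capacity
//   Span<char> chars = Text<CharBuffer64>.Chars(ref name);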