dotnet / csharplang

The official repo for the design of the C# programming language
11.45k stars 1.02k forks source link

Champion "slicing" / Range (16.3, Core 3) #185

Open gafter opened 7 years ago

gafter commented 7 years ago

Introduce a .. syntax, such that x..y ends up creating a Range of x and y.

See also

mstrother commented 7 years ago

A lot of great discussion here. I don't think I've seen anyone mention Go, https://blog.golang.org/go-slices-usage-and-internals

Slices need to make the language easier to use.

var slice1 = data[2:]
var slice2 = data[:2]
var slice3 = data[2:2]

Start and end indices should be optional and default to 0 and the maximum length.

I don't think slicing multidimensional arrays is worth the added complexity.

I think it's intuitive, much needed, and could be extended to other areas of the language.

ArraySegment should be deprecated when slices are introduced.

wanton7 commented 7 years ago

Will there be support for single item slice wrapper? We have lot of code that uses IList<MyClass> as parameter or property in our code, that could be changed to use a slice. Most of the time we only have single item in that IList<MyClass> property or parameter. We do lot of this new [] { myClass }. It would so great if there was way to wrap single instance to an slice without allocating from heap.

mstrother commented 7 years ago

Is this poposal taking into account the work done previously for Slice.NET ?

mstrother commented 7 years ago

I just realized Slice.NET became Span in corefx. Great to see this moving ahead. A long overdue feature.

ghost commented 7 years ago

I agree Pascal-based syntax array[3..5] is more natural. Also it can be more readable in case of variables:

array[start..end]
array[start:end]

But maybe it is not so important and it can be any...

I am worry much more about memory allocation. When we should use existing array block and when we need to create a copy?

This code can easy look like an introducing of variable to the same memory block: var slice = array[500..1000];

Let's say it is some sort of projections. I guess other members of array can be collected by GC if nothing references it anymore. In theory it can fragment memory in case of a lot such long lived projections. But all that can be discussed in addition.

Regarding copying... Maybe to differentiate copying of array we can use 'new' keyword in this way: var slice = new array[500..1000]; Or anything else... but main idea: differentiate 'projecting' and 'copying'.

alrz commented 7 years ago

array[start..end]

Actually it can be generalized to range pattern/literals (https://github.com/dotnet/csharplang/issues/198), as it is in Rust.

Richiban commented 6 years ago

I love this idea, since it's one of the features that (it feels like) almost all major languages support now but not C#!

I personally prefer the .. operator but that's probably because that's what F# uses and I already feel that the colon in C# is already very overloaded.

I'd love if this is able to be overridden as an operator:

public static BigInteger operator..(BigInteger start, BigInteger end)
{
// Blah
}

It gets even more wonderful if we can combine it with Shapes / Typeclasses (#110), where a range can be generated automatically for any type that defines (+), One, and (<). E.g.

IEnumerable<T> Range<T>(T start, T end)
{
    var current = start;

    while (current < end)
    {
        yield return current;
        current = current + T.One;
    }
}
solvingproblemswithtechnology commented 6 years ago

I see that all of discussion is around the "normal slice" notation instead of the "extended" one which I find much more useful. The extended notation also have a third argument named "step". Typical examples:

var numbers = Enumerable.Range(0, 20);
var even = numbers[::2]
var odd = numbers[1::2]
var reverse = numbers[::-1]

It makes it much more powerful and complete than the proposed slicing. I think that the ':' syntax makes it more compact and it's only a single character, which seems cleaner than [1....-1] [1..(empty)..-1].

The slice itself should have only reference to the items in the collection, not a reference to the collection itself, as I read in some comments, so it can be collected along with the other items.

Hope this comment helps a bit. And I can't tell how much I would like this feature to simplify my development and make C# more powerful.

jnm2 commented 6 years ago

How often do you use steps other than 1?

bondsbw commented 6 years ago

I would use the negative step more than any others. Though it isn't necessary if we have a way to specify negative length.

solvingproblemswithtechnology commented 6 years ago

@jnm2 In matemathical and simple string works for example. For zoom in graphs (you can use the step to have different levels of detail), etc... There are many different uses for that beyond the examples of even/odd numbers.

@bondsbw numbers[1:-1] is not the same as numbers[1::-1]. The first one is forward, from 0 to 10 for example, while the second one is reverse, from 10 to 0 for example.

ErikSchierboom commented 6 years ago

What's the current status on this?

gafter commented 6 years ago

A preview is being prepared and will be made available within a day or two.

ErikSchierboom commented 6 years ago

Great, looking forward to it.

gafter commented 6 years ago

The preview is now live at https://github.com/dotnet/csharplang/wiki/vNext-Preview

LYP951018 commented 6 years ago

I still could not figure out why it is worth putting this syntax into language. XXX.Slice is much more clear, and it is only 4-5 characters more than the syntax. As to Range, why not adopt the Shapes proposal and generalize Enumerable.Range to make it accept any incrementable type? Then with using static Enumerable, Range is also only 3-4 characters more than the new syntax.

Please, STOP adding these useless, "fancy" syntax to language, which increasing the complexity of the compiler AND the language unnecessary.

bbarry commented 6 years ago

Actually @LYP951018, adding syntax like this allows the language to specify a pattern that lowers into more optimal representations than would otherwise be possible due to a requirement for maintaining compatibility.

For example while:

//int[] array
var items = array[1..3];

might compile to:

Span<int> items = array[Range.Create(1, 3)];

a different usage like:

foreach (var i in array[1..10]) { ... }

may become:

for(int _i = 1; _i <= 10; _i++ ) { var i = array[_i];  ... }

The language can make this possible because the range syntax can be specified to be free from observable side effects (I'm not saying that specific optimization will happen, but rather that there exists a class of optimizations that are allowable when bringing in a new syntax that is not when relying on a library implementation alone).

iam3yal commented 6 years ago

@LYP951018

Please, STOP adding these useless, "fancy" syntax to language, which increasing the complexity of the compiler AND the language unnecessary.

Two things:

  1. Saying that it increases the complexity of the compiler means nothing, with all due respect it's basically nonsense without telling people how the compiler is affected.

  2. "Increasing the complexity of the language" so what? so do many other features, if you don't need it, don't want to use it then don't upgrade or just ignore it exists but if you want to convince people that it's not worth it you should make a compelling arguments against it so please START there.

LYP951018 commented 6 years ago

@eyalsk

if you don't need it, don't want to use it then don't upgrade or just ignore it

So, we could add as many features as we can because users could "just ignore" them? Also, what I want to say is "unnecessary".

so please START there.

I have written why I think this feature is useless.

iam3yal commented 6 years ago

@LYP951018 Yes you wrote why in your opinion it's useless, you just didn't provide any facts as to why it's useless.

LYP951018 commented 6 years ago

@eyalsk So which sentence isn't fact? Also, I haven't seen any fact that indicates the feature is useful. :)

iam3yal commented 6 years ago

@LYP951018 Well, again, two things:

  1. If you think this feature is useless then please explain why it's useless.

  2. If you claim it increases the complexity in the compiler and the language then please explain what you mean by that because it can mean many things and is open to interpretation.

Now, telling people that they shouldn't add this feature because you think it's useless isn't an argument, telling them that it increases the complexity without any facts to prove your claims is again just as useless as telling them the feature is useless but if you think that this is the best you can do I won't be the one to stop you but I'll just be a good community member and tell you it's not really the kind of feedback they are looking for but obviously you can do whatever you like. :)

LYP951018 commented 6 years ago

@eyalsk

If you think this feature is useless then please explain why it's useless.

I have written down:

I still could not figure out why it is worth putting this syntax into language. XXX.Slice is much more clear, and it is only 4-5 characters more than the syntax. As to Range, why not adopt the Shapes proposal and generalize Enumerable.Range to make it accept any incrementable type? Then with using static Enumerable, Range is also only 3-4 characters more than the new syntax.

Ok, it's not useless. My opinion is actually "benefit little".

Thaina commented 6 years ago

I too was against the syntax. Just have indexer that accept Range and return Span or IRefEnumerable is enough

The syntax is useless because we should actually use indexer, not making a syntax that look like indexer

Maybe

class Array<T>
{
    Span<T> this[Tuple<int,int> range] => get;
}

var span = array[(1,10)]; // can it still be optimized by compiler? We know that this is array and this tuple is constant
iam3yal commented 6 years ago

@LYP951018 Okay, not sure whether you watched build 2018 but here it is where Mads and Dustin are speaking about it. Hope this clarifies some things.

LYP951018 commented 6 years ago

@eyalsk After watching the video, I feel more awful...

  1. indexing string and returning code units is error prone.... for example,
string str = "😀";
str[^1] // what's this?

Modern programming languages, like Rust, does not provide index operation for strings at all, see std::string::String. I think this fancy syntax make the situation worse, like:

string str = "h🙂😄";
var sp = str[0..2]; //what's this again?
  1. If we could have something, like IReversableEnumerator and IIndexableEnumerator (also, Rust, C++ have something like this), we could just (without allocation, I know today's implmentation of Reverse has allocation)
var reversedString = str.Reverse();
char last = reversedString[1];

which is more readable than str[^1], and less magic (library only).

bondsbw commented 6 years ago

@LYP951018

string str = "😀";
str[^1] // what's this?

I would expect the exact same result as str[str.Length - 1].

string str = "h🙂😄";
var sp = str[0..2]; //what's this again?

I would expect the same result as str.Substring(0, 2).

var reversedString = str.Reverse();
char last = reversedString[1];

which is more readable than str[^1], and less magic (library only).

I completely disagree. I have to parse several items to understand exactly what you are trying to accomplish, while it's more intuitive with the shorthand syntax.

ufcpp commented 6 years ago

@LYP951018

Tensor x = new Tensor();

// unclear
var y1 = x.Slice(1, x.Length(0) - 1, 1, x.Length(1) - 1, 1, x.Length(2) - 1);

// too long
var y2 = x.Slice(
    start1: 1, endExclisive1: x.Length(0) - 1,
    start2: 1, endExclisive2: x.Length(1) - 1,
    start3: 1, endExclisive3: x.Length(2) - 1);

// too long
var y3 = x.Slice(
    Range.Create(1, new Index(1, true)),
    Range.Create(1, new Index(1, true)),
    Range.Create(1, new Index(1, true)));

// with range operator
var y4 = x.Slice(1..^1, 1..^1, 1..^1);
ufcpp commented 6 years ago

@LYP951018

// https://github.com/ufcpp/GraphemeSplitter
// https://www.nuget.org/packages/GraphemeSplitter/
using GraphemeSplitter;
using System;
using System.Linq;

static class GraphemeExtensions
{
    public static string Grapheme(this string s, Index index)
    {
        var g = s.GetGraphemes();
        var i = GetIndex(index, g.Count());
        return g.ElementAt(i).ToString();
    }

    public static string Grapheme(this string s, Range r)
    {
        var g = s.GetGraphemes();
        var (gs, gl) = GetStartAndLength(r, g.Count());
        var start = g.Skip(gs).First().Offset;
        var length = g.Skip(gs).Take(gl).Sum(seg => seg.Count);
        return s.Substring(start, length);
    }

    private static (int start, int length) GetStartAndLength(Range range, int entityLength)
    {
        var start = range.Start.FromEnd ? entityLength - range.Start.Value : range.Start.Value;
        var end = range.End.FromEnd ? entityLength - range.End.Value : range.End.Value;
        var length = end - start;

        return (start, length);
    }

    private static int GetIndex(Index index, int entityLength)
    {
        return index.FromEnd ? entityLength - index.Value : index.Value;
    }
}

public class Program
{
    static void Main()
    {
        string str = "h🙂😄";
        Console.WriteLine(str.Grapheme(^1)); // 😄
        Console.WriteLine(str.Grapheme(0..2)); // h🙂
    }
}
LYP951018 commented 6 years ago

@bondsbw

I would expect the exact same result as str[str.Length - 1].

which is bad.

I would expect the same result as str.Substring(0, 2).

which is also bad.

I completely disagree. I have to parse several items to understand exactly what you are trying to accomplish, while it's more intuitive with the shorthand syntax.

  1. This library way is more general, and could remove allocations in Reverse.

  2. Er... serveral items? maybe two? Then ^1 has two items, too: '^' token, and '1', and I also have to parse them, EVERYTIME I use it. But with reversedString, I only parse it once.

@ufcpp

So is this feature ONLY useful for tensor? OMG... How many C# users use Tensors??? Also, as @Thaina said, ValueTuple literals is good enough.

Even, we could have user-defined literals like C++ 11, which is much more general way compared to this ^.

Also, providing indexing for graphemes is a bad idea because indexing should be a O(1) operation, but in the grapheme case, it's O(n).

CyrusNajmabadi commented 6 years ago

So is this feature ONLY useful for tensor?

Roslyn has many usages of sequences of data. I can see many places where this would be helpful.

It's a pretty poor argument to take someone's example of usefulness and then imply that that's hte only good use case for this feature. The example was for demonstrative purposes. It was clearly not implying that the feature exists purely for that case.

CyrusNajmabadi commented 6 years ago

Also, as @Thaina said, ValueTuple literals is good enough.

The LDM disagrees. There's been enough feedback over the years that this is unpleasant. It's come to a particular head with all the emphasis on making working with sequences of data easier.

Also, as @Thaina said, ValueTuple literals is good enough.

You are welcome to use the types explicitly. The syntax is just shorthand for them. The LDM feels like there is enough value in providing explicitly syntax to warrant providing a nicer shorthand. You may disagree. And that's ok. You don't have to use it after all :)

which increasing the complexity of the compiler AND the language unnecessary.

The compiler complexity here is extremely minimal. The LDM is populated by people working directly on the compiler. They take complexity very seriously, and will not add language features whose cost don't justify their value. That's not the case here. This is a nice (certainly not huge) improvement, with low cost. And it makes working with a lot of these types nicer and cleaner for many developers.

the language unnecessary.

Technically most language features that have come to c# has been 'unnecessary'. They don't actually let you do something you couldn't do before.

'var'? You could just write the type. 'queries'? You could just write out the calls by hand. 'async/await'? you could just use callbacks.
'lambdas'? you could just write the methods yourself and get delegates to them. 'tuples'? you could just make your own types.

etc. etc. etc. etc.

The purpose of hte LDM is to decide at which point adding an actual language feature adds enough value to outweigh the negatives that come along with adding language features. This has been the way the LDM has worked since the beginning and its the same process and system that has led to all the improvements since V1.

LYP951018 commented 6 years ago

@CyrusNajmabadi Yes, that is poor. But I think tensor may be the only use case that benefit A LOT (change str.SubString to str[m..n] is NOT a lot, that's minor, as I said previously).

Also, I was always saying that we have other more general ways to do what this proposal does.

The compiler complexity here is extremely minimal.

The sum of minimal complexity is not minimal.

Technically most language features that have come to c# has been 'unnecessary'. They don't actually let you do something you couldn't do before.

Oh no. The features you list could significantly save my typing, prevent me from repeating some verbose, error prone patterns. But this proposal NOT.

After these discussion, I still could not figure out why this proposal could make my life easier (if I don't use tensor)...

Finally, I could ignore the proposal at all, but I don't wanna C# to be a feature monster: too many features which is not general enough...

LYP951018 commented 6 years ago

Forget it. Since the C# team has decided to put this into C# 8, the discussion is just pointless. Hope I could enjoy this feature in the near future.

ashmind commented 6 years ago

What's the latest Range spec? Still unclear on whether foreach (var i in 1..3) should work out of the box once this is finalized.

I don't see any use cases for slicing (in my code), so only interested in enumerations and pattern matching, e.g. case 1..3:.

If this is slicing-only, is .. a final decision? Would have preferred something like : for a slicing edge case, and .. for proper ranges.

bondsbw commented 6 years ago

@LYP951018

Forget it. Since the C# team has decided to put this into C# 8, the discussion is just pointless.

I think it's important to have these discussions even when a feature had been demoed at a conference. Several features which have been demoed in the past did not get implemented in the targeted version. Some still haven't been implemented.

CyrusNajmabadi commented 6 years ago

Yes, that is poor. But I think tensor may be the only use case that benefit A LOT (change str.SubString to str[m..n] is NOT a lot, that's minor, as I said previously).

This is a strawman. As i mentioned already, Roslyn itself deals with sequential data a lot. Lists/Sequences of things are very common across the board, and there is often a desire/need to work with these sequences in these manners. You are taking an example you don't like "strings" and poo-pooing the feature, and you are using another "example" 'tensors' as a means to ridicule things. These have just been examples. They were an attempt to demonstrate that if you work with sequential data it can often be very nice to think about working with that in a range-based manner, and once you start doing that, having a cleaner and easier-to-use syntax can be nicer.

Is it OMGHUEGIMPORTANT. No, of course not. Is it OMGHUEGNECESSARY? No, of course not. People coudl just use the structs themselves. But we've seen enough feedback and enough usage already and interest toward using these types in many codebases (including CoreFx) that we think there is value in making it more pleasant.

That's all. Just an extremely low cost attempt to make the language more pleasant, especially for all the people we think will be using thse types all the time.

CyrusNajmabadi commented 6 years ago

Also, I was always saying that we have other more general ways to do what this proposal does.

As i've stated, the point was precisely to give specialization because it is felt that this will be common enough to warrant it. i mean, there were general ways to accomplish what async/await did. But specializing this pattern to make life pleasant was worth it. There were general ways to do deconstruction. But having specialized ways to do it was worth it. Same with pattern matching/linq/etc.etc.

As i explained almost every language feature we've added in the last 15 years already had a 'general way' to accomplish it. That was entirely the point. The language creates features most of the time because the feeling is that the general way is just not going to be as nice to use as a specialized construct.

CyrusNajmabadi commented 6 years ago

The sum of minimal complexity is not minimal.

I don't know what this means. The actual engineers working on the compiler and who contribute directly to the LDM have a deep understanding of this space and do not consider this hugely complex. So i have no idea what your basis is for these claims.

A someone who was principally responsible for a lot of the roslyn codebase for many years, i can tell you this is absolutely what would be considered minimal. Not as 'micro' as something like "binary literals" but still, extremely tiny.

CyrusNajmabadi commented 6 years ago

Oh no. The features you list could significantly save my typing, prevent me from repeating some verbose, error prone patterns. But this proposal NOT.

I disagree. I think it's much more confusing to do length based computations on sequences. And i've loved languages that have given me these sorts of tools to make thins clearer. IN a similar vein, while i could accomplish the same stuff today by explicitly creating and using types like "Range" and "Index" it would be excessively verbose.

To me, it's very similar to 'generic inference'. Sure, we coudl have insisted you write: customers.Select<Customer, int>((Customer c) => c.Age), but we thought that saving you the typing and limiting the verbosity there was valuable. So we made sure that you could type: customers.Select(c => c.Age). Sure. I could have written it myself. It wouldn't really have been error prone. It would just have been tedious and numbing, and would have filled my code with so many types that were clear from context.

It's the same for me with ranges/slices/indexes/etc. Sure, i could write out explicit types for all of these... but geez, why would i want to. It's just so verbose and heavyweight and just hides the intent of the code which can so clearby expressed with a simpler syntax.

CyrusNajmabadi commented 6 years ago

Forget it. Since the C# team has decided to put this into C# 8, the discussion is just pointless

The feature is still under active design. You should still provide feedback. Of course, i would recommend not expecting that the outcome will def be what you want.

For example, there are parts of slices/ranges that i don't like. I've provided my feedback on it with my reasoning. I hope it will be considered. But it may very well not be. C'est la vie.

bondsbw commented 6 years ago

I would expect the exact same result as str[str.Length - 1].

which is bad.

The point of this feature isn't to fix System.String. There are other proposals to do that. This feature would presumably work with a new UTF-8 string implementation in the way you suggest.

Joe4evr commented 6 years ago

Did I miss something about the demo with the ^n literal syntax?

It seems inconsistent that [0] gives the first element, but [^1] the last element. It may be widely preferable if it were [^0] instead.

CyrusNajmabadi commented 6 years ago

Note: tihs conversation went over to gitter here: https://gitter.im/dotnet/csharplang?at=5af73120a2d9513633350651

I agree with @Joe4evr The way reverse indexing works by default feels counterintuitive to me. It feels like i want to explain how many steps i'm moving "from the end" just how things work with start indexes.

Note: part of the problem here seems to be that we only have one range operator which is exclusive. if i could have an inclusive range operator, i would then like to be able to write:

0..=^0 where ..= is an inclusive range.

Has there been enough consideration given to providing all the reasonable forms of inclusive/exclusive ranges/indexing? I personally don't think one single form will be enough. it def won't be one-size fits all (IMO).

--

Note: after lots of community discussion, the rust guys just added inclusive ranges in their language as of 1.26. I have a feeling if we don't do this now, we'll just end up needing to do it in the future.

--

I've also just been a fan of allowing [a, b] and [a, b) as the notation. Very close to the mathematical forms people are already taught, and it keeps the inclusive/exclusive notation out of the way of reverse-indexes. i.e. [a, ^b] and [a, ^b) read a lot better to me than a..=^b and a..<^b and a..^b

jnm2 commented 6 years ago

@Joe4evr:

so wouldn't it make more sense that str[^0] is the last char of some string? since str[0] is the first

Not if you see from the perspective that zero-based indexing doesn't mean the first item is 0; it means that 0 is where that item starts and 1 is where it stops. 0.5 indicates that you've traveled halfway across the item. This isn't the counting intuition we learned as kids, but it's the only one that makes sense in the face of fractions. Beyond that consistency, it also immediately gets rid of having to remember to add and subtract 1 when measuring lengths.



hat



Don't think of it like a calendar; think of it like a tape measure.


0 means the first tick mark, having traveled across 0 characters from the beginning. ^0 means the last tick mark, having traveled across 0 characters from the end.

Then, 0..^0 takes the whole string. 0..1 collects a single character (into a string), just like str[0]. 0 and 1 are not referring to characters; they are referring to tick marks between the characters.

Our current indexers imply n..(n+1) when we say str[n]. str[6] throws since 6..(6+1) is past the end of the string. str[^0] would also throw, since ^0..(^0+1) is also past the end. (The +1 doesn't flip. That would be truly confusing.)

CyrusNajmabadi commented 6 years ago

@jnm2 Did you make that image yourself? It's great! I think it should likely be included in all user facing documentation around ranges/slices/etc. :)

Joe4evr commented 6 years ago

0 means the first tick mark, having traveled across 0 characters from the beginning. ^0 means the last tick mark, having traveled across 0 characters from the end.

Yes, but if you're traveling from the end, you're going in the opposite direction, ie towards the beginning. Putting it this way: If str[1] is idempotent to str.ElementAt(1), then str[^1] should be idempotent to str.Reverse().ElementAt(1).

qrli commented 6 years ago

The image of @jnm2 is exactly the way of thinking when I was working with pixel-accurate graphics. But it could be not so intuitive for many.

Another way of thinking it: ^n is the short hand for Length - n. Then it is very clear that ^1 is the last instead of ^0. I don't think the Reverse().ElementAt(1) way of thinking fit it well, because the sequence is never physically/logically reversed.

Though, I don't know the story why ^ is chosen, while I see Python/Powershell/Matlab/etc. use just a negative index number, where it is obvious that -1 is the last while 0 cannot.

That ^ also makes the Range type need to carry more properties than Start/End.

TylerBrinkley commented 6 years ago

I prefer -1 to ^1. It's more intuitive and more common elsewhere from my experience and also better explains starting at a value of 1 vs 0. Also it's confusing as it could easily be interpreted as the unary bitwise not operator which in the case of ^1 would evaluate to -2.

Thaina commented 6 years ago

@qrli I don't think it related to graphics

We are dealing with index so if it is -0 it not go anywhere

In other language there are also negative slice, I think js too, has slice -1 mean the last one