Raku / problem-solving

🦋 Problem Solving, a repo for handling problems that require review, deliberation and possibly debate
Artistic License 2.0

Comma-separated `...` "triple-dot" sequences (e.g. for array indexing), produce bizarre results. #407

Open jubilatious1 opened 7 months ago

jubilatious1 commented 7 months ago

"Triple-dot" (...) sequences are useful for array indexing/subsetting. But comma-separated "Triple-dot" (...) sequences produce bizarre and unpredictable results:

~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.

To exit type 'exit' or '^D'
[0] > say grep({$_ == 1}, 0...5)
()
[0] > say 0...5
(0 1 2 3 4 5)
[0] > say 0...5,3...7
(0 1 2 3 4 7)
[0] > say 0...5;3...7
(0 1 2 3 4 5)
[0] > 0...5,3...7
(0 1 2 3 4 7)
[1] > (0...5,3...7)
(0 1 2 3 4 7)
[2] > (0...5,3...7,)
(0 1 2 3 4 7)
[3] > (0...5,6...7,)
(0 1 2 3 4 5 6 7)
[4] > (0..5,3..7,)
(0..5 3..7)
[5] > put (0..5,3..7,)
0 1 2 3 4 5 3 4 5 6 7

Also (thanks to @doomvox for whittling this down):

## seems strange:
say 0...5,3...7;
# (0 1 2 3 4 7)

## is raku parsing it like this?
say (0)...(5,3)...(7);
# (0 1 2 3 4 7)

## so let's try that in pieces:
say (0)...(5,3);
# (0 1 2 3 4 5 3)

## and...
say (5,3)...(7);
# ()

## Here there be LTA afoot.

(see https://github.com/doomvox/raku-study/blob/2e645a58a3be6fcb26b84abf44c42756cb96c1b6/notes/meeting_2023sep10.org#L125)

Special thanks to the "Raku Study Group" for taking a look at this during our 2023 Sept 10 Meetup.

@TimToady

jubilatious1 commented 7 months ago

For example, I download a CSV file from here:

https://www.microsoft.com/en-us/download/details.aspx?id=45485

Then I try to subset columns, let's say "First Name", "Last Name", "Address", "City", "State or Province", "ZIP or Postal Code", "Country or Region". In bash or zsh (output columns visualized in Vim):

$ perl6 -ne '.split(",")[1...2,10...14].say;'  Import_User_Sample_en.csv
( First Name  Last Name          Address  Country or Region)
(      Chris      Green  1 Microsoft way      United States)
(        Ben    Andrews  1 Microsoft way      United States)
(      David   Longmuir  1 Microsoft way      United States)
(    Cynthia      Carey  1 Microsoft way      United States)
(    Melissa    MacBeth  1 Microsoft way      United States)

Above using Raku I lose the "City", "State or Province", "ZIP or Postal Code" columns. Not sure what's going on here.

Below, a similar example in the R-Programming language (R-Console, i.e. REPL):

> read.csv("/Users/admin/Import_User_Sample_en.csv")[,c(2:3,11:15)]
  First.Name Last.Name         Address    City State.or.Province ZIP.or.Postal.Code Country.or.Region
1      Chris     Green 1 Microsoft way Redmond                Wa              98052     United States
2        Ben   Andrews 1 Microsoft way Redmond                Wa              98052     United States
3      David  Longmuir 1 Microsoft way Redmond                Wa              98052     United States
4    Cynthia     Carey 1 Microsoft way Redmond                Wa              98052     United States
5    Melissa   MacBeth 1 Microsoft way Redmond                Wa              98052     United States
> 

The R-Programming language gives the desired/expected answer.

@TimToady

coke commented 7 months ago

$ perl6

We're called raku now. And while I know this ticket is about ..., please note that your example works as you expect when using .. ranges.

jubilatious1 commented 7 months ago

@coke

Except it doesn't ( '...work as you expect when using .. ranges...' ).

~$ raku -ne '.split(",")[1..2,10..14].say;'  Import_User_Sample_en.csv
((First Name Last Name) (Address City State or Province ZIP or Postal Code Country or Region))
((Chris Green) (1 Microsoft way Redmond Wa 98052 United States))
((Ben Andrews) (1 Microsoft way Redmond Wa 98052 United States))
((David Longmuir) (1 Microsoft way Redmond Wa 98052 United States))
((Cynthia Carey) (1 Microsoft way Redmond Wa 98052 United States))
((Melissa MacBeth) (1 Microsoft way Redmond Wa 98052 United States))

Which is why a newbie might reach for ... sequences instead.

(Not saying non-flattening is a bad thing--but naive code doesn't produce a naive answer).

coke commented 7 months ago

Apologies, I wasn't clear you wanted flattening - you can, of course, specifically flatten the combined ranges if you like, but I realize that's probably not helpful for the original ask.

jubilatious1 commented 7 months ago

I just think it's a common task...get a list of values (let's say comma-separated), split on the separator, and select out desired elements. So for example, get rows of employee information and drop the phone numbers to create mailing labels.

Not to belabor the point, but a newbie might continue on such a Raku journey thusly (having heard the | operator is useful for flattening), and still not get anywhere:

~$ raku -ne '.split(",")[|(1..2,10..14)].say;'  Import_User_Sample_en.csv
((First Name Last Name) (Address City State or Province ZIP or Postal Code Country or Region))
((Chris Green) (1 Microsoft way Redmond Wa 98052 United States))
((Ben Andrews) (1 Microsoft way Redmond Wa 98052 United States))
((David Longmuir) (1 Microsoft way Redmond Wa 98052 United States))
((Cynthia Carey) (1 Microsoft way Redmond Wa 98052 United States))
((Melissa MacBeth) (1 Microsoft way Redmond Wa 98052 United States))

Maybe now you can see why a newbie might say, "hey I'll try ... 'triple-dot' sequences instead".

coke commented 7 months ago

Oddly (to me), flat works, but | does not:

$ raku -e ' dd |(1..4, 10..12)' # syntax requires parens
1..4
10..12
$ raku -e ' dd  flat 1..4, 10..12' #same result with or without parens
(1, 2, 3, 4, 10, 11, 12).Seq

2colours commented 7 months ago

That's probably because flat somehow list-ifies its arguments, while a mere slipping wouldn't go that far. Actually, what strikes me is more why flat does that. (1..4, 10..12) is a flat, two-element List containing two Ranges. But this would probably lead very far.

jubilatious1 commented 7 months ago

I'm going to cut to the chase here and suggest that the problem results from improper invocation of the OEIS System.

There's no way the following should happen:

~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.

To exit type 'exit' or '^D'
[0] > (1..4, 10..12)
(1..4 10..12)
[1] > put (1..4, 10..12)
1 2 3 4 10 11 12
[1] > put (1...4, 10...12)
1 2 3 4 10 12
[1] > put (1..6, 3..8)
1 2 3 4 5 6 3 4 5 6 7 8
[1] > put (1...6, 3...8)
1 2 3 4 5 8

A newbie should be able to figure out how to drop/duplicate an element from a List/Array in 30 seconds or so. Every ... triple-dot return above produces bizarre results, undermining confidence in the entire ../... range/sequence system. It's less insanely brilliant than brilliantly insane. Triple-dot ... sequences (and ^ endpoint-less variants) should be thought of as non-lazy ranges that can be combined easily with commas. Period.

We can do better. If the problem turns out to be the OEIS System, then the OEIS System needs its own methods/routines/functions:

#wished-for solution, REPL-like example:

[0] > put oeis(1,3,5,7...13)
1 3 5 7 9 11 13
[0] > put oeis 1,3,5,7...13
1 3 5 7 9 11 13
[0] > put (1,3,5,7...13).oeis
1 3 5 7 9 11 13
[0] > (1,3,5,7...13).oeis.put
1 3 5 7 9 11 13

Thank you for your kind attention.

@TimToady @thoughtstream @pmichaud

librasteve commented 2 months ago

I just realised that this is an open issue. There is a StackOverflow question that also has some context in relation to two comma-separated triple-dot sequence operators. I commend the comments of @raiph and @brad_gilbert.

I will try to get that question and this issue into my head and see if either sheds light on the other.

librasteve commented 2 months ago

My current understanding is this:

The triple-dot sequence operator is intended as a way to have a continuum of sequence operators that makes a continuum of data points when comma separated.

say 1 ... 3, 7 ... 15, 11 ... 3 ... 1; #(1 2 3 7 11 15 11 7 3 2 1)

Some applications of this are:

There are two mini aspects of this design to note:

Therefore, using comma-separated triple-dot operators as indexes is (usually) wrong.

The double-dot range is usually what you want.

librasteve commented 2 months ago

I tried the above example and it works fine with double-dot operators:

raku -ne '.split(",")[1..2,10..14].flat.say;' Import_User_Sample_en.csv

(First Name Last Name Address City State or Province ZIP or Postal Code Country or Region)
(Chris Green 1 Microsoft way Redmond Wa 98052 United States)
(Ben Andrews 1 Microsoft way Redmond Wa 98052 United States)
(David Longmuir 1 Microsoft way Redmond Wa 98052 United States)
(Cynthia Carey 1 Microsoft way Redmond Wa 98052 United States)
(Melissa MacBeth 1 Microsoft way Redmond Wa 98052 United States)

NB: the .flat is there to avoid returning two lists. Since the sub-lists are not itemized (i.e. not $(a,b), $(c,d,e,f) but (a,b),(c,d,e,f)), .flat is effective. I think the idea is to give the coder the option to preserve the index structure or to flatten it explicitly.
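
A minimal sketch of the itemization point, using made-up values rather than the CSV columns (the variable names are only for illustration): .flat dissolves plain sub-lists but leaves itemized $(...) ones alone.

```raku
# Plain (non-itemized) sub-lists, like those returned by a slice with two ranges.
my $plain    = ((1, 2), (3, 4, 5));
# The same data with each sub-list itemized via $( ... ).
my $itemized = ($(1, 2), $(3, 4, 5));

say $plain.flat.elems;     # 5: both sub-lists were dissolved
say $itemized.flat.elems;  # 2: itemized sub-lists are kept intact
```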

librasteve commented 2 months ago

This is what is going on within the index with .slip and .flat ... I would say it is strangely consistent.

> ddt (1..2,10..14)
(2) @0
├ 0 = 1..2.Range
└ 1 = 10..14.Range
> ddt |(1..2,10..14)
1..2.Range
10..14.Range
> ddt (1..2,10..14).flat
.Seq(7) @0
├ 0 = 1   
├ 1 = 2   
├ 2 = 10   
├ 3 = 11   
├ 4 = 12   
├ 5 = 13   
└ 6 = 14   

^^ so, in a microcosm of what you can do with the results, you can flatten the index to get the same outcome
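
As a rough sketch of flattening the index itself, assuming a stand-in array in place of one split CSV row (@fields and @picked are hypothetical names):

```raku
# Stand-in for one split CSV row: fifteen fields, indices 0..14.
my @fields = 'a' .. 'o';

# Flatten the two ranges into a single list of indices before slicing.
my @picked = @fields[flat 1..2, 10..14];
say @picked;   # [b c k l m n o]
```

This gives one flat slice directly, instead of slicing into two sub-lists and flattening the result afterwards.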

librasteve commented 2 months ago

so far, my theory of how and why triple-dot operators work the way they do matches all the examples above, then I read this one

say 1...6, 3...8;   #(1 2 3 4 5 8)

oh, merde

and then I found another bad apple

say  1...6,4...10;   #(1 2 3 4 5 10)

I would say that this is a bug (or at least cause for a proper explanation)

pmichaud commented 2 months ago

If someone already pointed this out and I missed it, my apologies.

TLDR: The comma operator has higher precedence than the sequence operator. Both are list associative. You can't combine sequences using a simple comma -- parens are required.

say flat (1...6), (4...10); # (1 2 3 4 5 6 4 5 6 7 8 9 10)

Longer read:

Because the comma is higher precedence than sequence, a statement like

say 1...6,4...10;

gets parsed as if it is

say 1...(6,4)...10;

Since ... is list associative, it probably gets invoked something like

say &infix:<...>(1, (6,4), 10); # (1 2 3 4 5 10)

Here's a different example that might help to illustrate what is happening:

say 1...4, 7...20; # (1 2 3 4 7 10 13 16 19 20)

Note how the (4,7) produces an "increment by 3 sequence" all the way up to the 20, with the sequence up to the 4 tacked onto the beginning.

For something like (6,4) as the middle argument to ..., the sequence operator deduces a descending sequence ( 6, 4, 2, 0, -2, -4 ... ) and because 6 is already beyond the endpoint (10) the whole thing becomes an empty list.

say 6, 4 ... 10; # ()

That's likely why the 6 and 4 disappear entirely from the original example -- because the (6,4) ... 10 sequence results in an empty list. (I'm at a bit of a loss as to why the 10 still shows up.)

say 1...6, 4...10; # (1 2 3 4 5 10)

If you're thinking "oh, let's change precedence of comma and sequence"... the comma operator pretty much has to be higher precedence than the ... sequence operator in order for the following to work:

say 1, 2, 4 ... 256; # (1 2 4 8 16 32 64 128 256)

In this last example, the ... operator receives two arguments, one is a List with three values (1,2,4) and the other is an Int (256).
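
A small sketch of that two-argument call written out in function form (the variable names @op and @fn are only for illustration):

```raku
# Operator form: the comma builds the list (1, 2, 4) before ... sees it.
my @op = 1, 2, 4 ... 256;

# Function form with the same two arguments spelled out.
my @fn = &infix:<...>((1, 2, 4), 256);

say @op;          # [1 2 4 8 16 32 64 128 256]
say @op eqv @fn;  # True
```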

Hope this is a bit helpful. Again, the bottom line is that you can't concatenate sequences using just a comma, because comma has higher precedence than sequences. Parentheses (and possibly "flat") are needed to concatenate two sequences.

Pm

librasteve commented 2 months ago

great explanation, particularly about the descending list being empty ... my understanding from the doc (as quoted above in my previous note) is that the final value is always produced unless a '^' caret prefix is used

raiph commented 2 months ago

@librasteve

my understanding from the doc (as quoted above in my previous note) is that the final value is always produced unless a '^' caret prefix is used

That's only true of the current behavior for chained sequences, not standalone ones:

say 1,3...      10; # (1 3 5 7 9)
say 5,7...      10; #     (5 7 9)
say 1,3...5,7...10; # (1 3 5 7 9 10)

@MustafaAydin first wrote about this in the SO, and it still seems wrong and I don't think anyone has explained it.

(Rereading my SO comments I see how I may have accidentally given the impression to you I had figured it out. I haven't. I'm still with MustafaAydin ("I mean 4, 7 ... 15 alone produces (4, 7, 10, 13). But 1... 4, 7...15 now produces 7, 10, 13, 15 in the tail. Why is 15 included? Maybe i'm missing something idk") and pmichaud: "(I'm at a bit of a loss as to why the 10 still shows up.)").)

raiph commented 2 months ago

Some testing related to this has led me to further oddities. Maybe I'm missing something? I'll leave them here for now given that this issue already exists with a very generic title and is currently still open. I prefer to avoid generating an uncontrolled blizzard of ... issues before we're sure they're well founded. Two bugs for the price of one comment?:

say 1 ...^ 3;          # (1 2)
#say 1 ...^ 3 ...^ 5;  # Error while compiling...
                       # Calling infix:<...^>(Int, Int, Int) will never work with
                       # signature of the proto ($, Mu, *%)
say 1 ...^ 3,4 ...^ 5; # Too many positionals passed; expected 2 arguments but got 3

🤪

lizmat commented 2 months ago

FWIW, the whole of ... is too magic: I once spent several weeks trying to make it sane without breaking any spectest, but that's just not possible.

The only thing I use ... for is to be able to do 10 ... 1. And I would recommend other people to only use that aspect of it.
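
For illustration, a short sketch of that one use next to the range-based alternatives:

```raku
# Counting down with the sequence operator.
say 10 ... 1;            # (10 9 8 7 6 5 4 3 2 1)

# A Range whose start is above its end is simply empty.
say (10 .. 1).elems;     # 0

# A Range-only way to get the same countdown.
say (1 .. 10).reverse;   # (10 9 8 7 6 5 4 3 2 1)
```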

librasteve commented 2 months ago

I have spelunked into the spectest for chained sequences.

It looks to me that @pmichaud is 100% correct in his comment above since the tests reflect this.

The question mark about (non-excluded) endpoint values being tacked on only to chained sequences remains.

There is a commented out spectest at L63:

# The following is now an infinite sequence...
# is (0, 2 ... 7, 9 ... 14).join(' '),
#     '0 2 4 6 7 9 11 13',
#     'chained arithmetic sequence with unreached limits';

My belief is that this test is correct, and that the current behaviour would fail it - i.e. this is a bug.

It is a mystery why it was commented out, and it is clearly not an infinite sequence.

Unless there is any dissent, I propose that we now focus this issue on fixing this bug.

PS. I think the behaviours that @raiph outlines are caused by the language implementation and tests not covering chaining of sequence operators with cat-ears, and that for now these should fail with an error msg like "chaining sequences with cat-ears is not yet implemented"

jubilatious1 commented 2 months ago

I tried that in an old REPL (2023.05) and the 14 endpoint shows up:

~ % raku
Welcome to Rakudo™ v2023.05.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2023.05.

To exit type 'exit' or '^D'
[0] > (0, 2 ... 7, 9 ... 14).join(' ')
0 2 4 6 7 9 11 13 14
[1] >

And...also if you swap the 7, 9 around to read 9, 7 you get the 14 endpoint again, but the second "sequence" disappears otherwise:

[1] > (0, 2 ... 9, 7 ... 14).join(' ')
0 2 4 6 8 14
[2] >

librasteve commented 2 months ago

@jubilatious1 - you are correct, this is also the behaviour of Rakudo™ v2024.03.

You will have to go back 13 years plus to get back to anything different since the L63 spectest I mention above was commented out back then and has not been part of the rakudo release checks since. It was marked as "obsolete".

My point is:

say 1,3...      10; # (1 3 5 7 9)             <== good, the endpoint (10) is not produced
say 5,7...      10; #     (5 7 9)             <== same
say 1,3...5,7...10; # (1 3 5 7 9 10)          <== bad, the chained behaviour should match non-chained

The L63 test was designed to catch this issue - so the historic situation agrees with our desired behaviour.

BUT - someone erroneously removed that test and now we are getting undesired behaviour.

ab5tract commented 2 months ago

I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.

As @lizmat mentioned, it's a non-trivial situation to try and make sane changes to the ... operator.

jubilatious1 commented 2 months ago

@ab5tract said:

I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.

I hesitate to write this but if you want Raku to be adopted by the Data Science community, you'll have to figure out a way to let programmers reliably input (discontinuous) integer sequences.

Can anyone tell me what they would expect this code to return?

c( 0 : 9, 1 : 10, 2 : 11 )

This is what the R-programming language returns (in the R-Console a.k.a. REPL):

> c( 0 : 9, 1 : 10, 2 : 11 )
 [1]  0  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7  8  9 10  2  3  4  5  6  7  8  9 10 11
> 

And the reverse:

> c( 11 : 2, 10 : 1, 9 : 0 ) 
 [1] 11 10  9  8  7  6  5  4  3  2 10  9  8  7  6  5  4  3  2  1  9  8  7  6  5  4  3  2  1  0
>

Here's an equivalence test (using rev() to reverse one sequence). The two examples demonstrate Commutative property:

> rev(c( 0 : 9, 1 : 10, 2 : 11 )) == c( 11 : 2, 10 : 1, 9 : 0 )
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> c( 0 : 9, 1 : 10, 2 : 11 ) == rev(c( 11 : 2, 10 : 1, 9 : 0 ))
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
>

Oh, you can call all in R to give "Raku-junction"-like behavior (Commutative property demonstrated with rev()):

> all(rev(c( 0 : 9, 1 : 10, 2 : 11 )) == c( 11 : 2, 10 : 1, 9 : 0 ))
[1] TRUE
> all(c( 0 : 9, 1 : 10, 2 : 11 ) == rev(c( 11 : 2, 10 : 1, 9 : 0 )))
[1] TRUE
>

So why not just steal this c( ) construct from R ? FYI, R is an Open Source project (it was initially named GNU-S).
AFAIK, R is primarily written in C.

"Combine Values into a Vector or List" https://search.r-project.org/R/refmans/base/html/c.html

@pmichaud @thoughtstream @TimToady

lizmat commented 2 months ago

On 25 Apr 2024, at 23:19, jubilatious1 wrote: I hesitate to write this but if you want Raku to be adopted by the Data Science community, you'll have to figure out a way to let programmers reliably input (discontinuous) integer sequences.

Then let's figure out a syntax that allows that, that does NOT depend on the magic of ...

FCO commented 2 months ago

I might be misunderstanding the problem, but wouldn't something like this be enough?

|^10, |(1..10), |(2..11)

pmichaud commented 2 months ago

> c( 0 : 9, 1 : 10, 2 : 11 )
 [1]  0  1  2  3  4  5  6  7  8  9  1  2  3  4  5  6  7  8  9 10  2  3  4  5  6  7  8  9 10 11

In raku:

> say flat (0...9),(1...10),(2...11);
(0 1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 10 2 3 4 5 6 7 8 9 10 11)
> c( 11 : 2, 10 : 1, 9 : 0 ) 
 [1] 11 10  9  8  7  6  5  4  3  2 10  9  8  7  6  5  4  3  2  1  9  8  7  6  5  4  3  2  1  0

In raku:

> say flat (11...2),(10...1),(9...0);
(11 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 1 9 8 7 6 5 4 3 2 1 0)

Seems pretty straightforward to me, actually. The pattern even works for smart sequences:

> say flat (1,2,4...64),(1,3,9...243);
(1 2 4 8 16 32 64 1 3 9 27 81 243)

If writing "flat" and parens is "just too much typing", one can undoubtedly create a local c() function equivalent that provides the flattening and whatever else is wanted. I don't think this specific use case is well enough understood or explored yet to create a custom language construct for it.
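
One possible shape for such a helper, as a rough sketch only (c is a hypothetical local sub named after R's c(), not an existing Raku routine):

```raku
# Hypothetical helper named after R's c(); not part of Raku itself.
# The *@ slurpy flattens non-itemized Iterables such as Ranges and Seqs.
sub c(*@values) { @values.List }

say c(0..2, 5..7);            # (0 1 2 5 6 7)
# Sequences still need their own parentheses inside the argument list.
say c((11...9), (3...1));     # (11 10 9 3 2 1)
```

The parentheses around each sequence are still required, because the comma would otherwise be consumed by the ... operator before c() ever sees the arguments.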

Pm

librasteve commented 2 months ago

@ab5tract said:

I'd rather we simply throw an exception in the case of commas being used to conjoin sequences, rather than try to make the endpoints align.

That would be sad, because a lot of effort went into designing sequences, including chained ones, and there are 43 lines of spectest just for the chained examples. My bug fix proposes "just" adjusting the endpoint test ... but I can understand reluctance since anything here is non-trivial.

However, I see the need to focus raku effort on more pressing features, so if we do throw an exception for all chained sequences, I propose a message along the lines of "please use brackets and slips when chaining sequences, e.g. |(1,3...5),|(7,9...10)", which I think would boost code readability.
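
A quick sketch of that brackets-and-slips form in use (@chained is just an illustrative name):

```raku
# Each sequence is closed off by its own parentheses, then slipped into one list.
my @chained = |(1, 3 ... 5), |(7, 9 ... 10);
say @chained;   # [1 3 5 7 9]
```

Here the unreached endpoint 10 is not tacked on, matching the standalone behaviour of each sequence.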

librasteve commented 2 months ago

> (|(0...9), |(1...10), |(2...11)).reverse == (|(11...2), |(10...1), |(9...0)) #True

Usually the range .. operator would be fine (per @FCO ), but the sequence ... operator handles descending values also.
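
One caveat worth hedging here: as far as I know, == between two lists compares their element counts, so a sketch of an element-wise check (closer to what R's all(... == ...) does) might look like this, with @a and @b as illustrative names:

```raku
my @a = |(0...9), |(1...10), |(2...11);
my @b = |(11...2), |(10...1), |(9...0);

# == numifies each list to its element count, so unequal lists can pass.
say (1, 2, 3) == (4, 5, 6);       # True

# Z== compares pairwise; all(...) requires every comparison to hold.
say so all(@a.reverse Z== @b);    # True
```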