Raku / problem-solving


What are the intended semantics of <== and <<== #27

Closed: lizmat closed this issue 4 years ago

lizmat commented 5 years ago

See https://github.com/rakudo/rakudo/issues/2899 for the start of this discussion.

Kaiepi commented 5 years ago

With https://github.com/rakudo/rakudo/pull/2903, I think <== and ==> are brought in line with the spec.

The question is: how are <<== and ==>> supposed to work? Should code like this be allowed to run?

[4,5,6] ==>> [1,2,3] ==>> my @foo;

Or should only one appending feed operator be allowed at a time?

my @foo;
@foo <<== [1,2,3];
@foo <<== [4,5,6];

If more than one should be allowed, should they be allowed in combination with their respective assigning operators, like this?

my @even <== grep { $_ %% 2 } <== 1..^100;
@even <<== grep { $_ %% 2 } <== 100...*;
Kaiepi commented 5 years ago

Also, from the parallelization pullreq:

There's a problem with this... this benches slower than the current implementation of feed operators, even when there's blocking I/O going on at the same time. I think more discussion is needed about whether or not this should be implemented.

Feed operators were benching much faster in the first pullreq I made. Should we ignore the spec about parallelizing feed operators?

lizmat commented 5 years ago

FWIW, I don't think feeds need to create containers, so we can have that performance benefit. It's only the storing in the endpoint that should create containers, if the receiving end wants that (e.g. Array vs List).
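
To illustrate that endpoint distinction (an example of mine, not from the thread): an Array endpoint puts each stored value into a Scalar container, while a bound List does not containerize its elements.

(1, 2, 3) ==> my @a;     # Array endpoint: each element ends up in a Scalar container
say @a[0].VAR.^name;     # OUTPUT: Scalar
my $list := (1, 2, 3);   # List: no per-element containers
say $list[0].VAR.^name;  # OUTPUT: Int
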

Kaiepi commented 5 years ago

Disregard what I said about ignoring the spec: I figured out how to get parallelized feed operators to run 5x faster than the current implementation.

Kaiepi commented 5 years ago

Before I can continue with my pullreq, there's something that needs to be resolved. Modules in the ecosystem are using feed operators with things that aren't iterable. Here's an example from CUID:

sub timestamp {
    (now.round(0.01) * 100)
    ==> to-base36()
    ==> adjust-by8()
    ==> padding-by8()
}

Should this behaviour be preserved?

lizmat commented 5 years ago

Does that currently return an array or a scalar?

Kaiepi commented 5 years ago

A scalar

lizmat commented 5 years ago

Then I think an nqp::p6store will take care of that eventuality.

jnthn commented 5 years ago

Before I can continue with my pullreq, there's something that needs to be resolved. Modules in the ecosystem are using feed operators with things that aren't iterable.

My feeling is that any function you feed a value into had better be happy with getting its input as a final extra Iterable argument (presumably a Seq with an underlying iterator that is pulling from a Channel). Or, once we support it, such an argument at the insertion point.
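
A minimal sketch of that calling convention (the sub doubled here is made up for illustration, not from the thread): ==> passes the values on its left as a final argument to the call on its right.

sub doubled(@values) { @values.map(* * 2) }  # receives the fed values as its final argument
(1, 2, 3) ==> doubled() ==> my @out;
say @out; # OUTPUT: [2 4 6]
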

If we have things in the ecosystem that don't play well with that - which I don't believe the example given here does - we may need to preserve the existing semantics for 6.d and below, and introduce the new ones from 6.e.PREVIEW onwards.

The feed operators really haven't gotten that much attention to date. The implementation before the recent work was very much a case of "first draft", and certainly didn't explore the parallel aspects alluded to in the language design docs. I'd be surprised if we can make them behave usefully going forward without breaking some of the (less thought out, and probably accidental) past behaviors.

jnthn commented 5 years ago

Also, some notes on the parallelism model with feed operators: it's quite different from the hyper/race approach.

In the hyper/race case, we take the data, divide it up into batches, and work on it. Where possible, for the sake of locality, we try to push a single batch through many operations, e.g. if you do @source.race.map(&foo).grep(&bar).map(&baz) then we'd send a batch, do the maps/grep in the worker, and send back the resulting values. In this model, the parallelism comes from dividing the input data. The back-pressure here is provided by the final consumer of the pipeline.
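
For instance, a sketch of that batching model using the existing race API (batch size and degree chosen arbitrarily here); each worker pushes a whole batch through the map/grep/map chain before sending results back:

my @source = 1..10_000;
my @result = @source.race(:batch(256), :degree(4))  # split input into batches across workers
                    .map(* + 1)
                    .grep(* %% 2)
                    .map(* div 2);                  # one batch flows through all three stages
say @result.elems; # OUTPUT: 5000
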

By contrast, the feed model is about a set of steps that execute in parallel. The parallelism is in the stages of the pipeline being run in parallel, not in dividing up the data items. It can be seen as a simple case of a Staged Event-Driven Architecture. Since a given stage is single-threaded, it may be stateful - whereas if you try to do stateful things in a map block in a hyper/race it's going to be a disaster. The back-pressure model here would ideally be that once a queue becomes full, you cannot put anything more into it. One possible solution here would be to make Channel take an optional bound. Then a send into a Channel that is considered full would block, so you can't put more in, meaning a fast stage can't overwhelm a slow one.
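
A minimal sketch of that staged model using today's Channel (the bounded, back-pressuring Channel is the proposal above, not an existing feature). Note the stage-local $sum: it is safe precisely because each stage runs on a single task.

my Channel $in  .= new;
my Channel $out .= new;
start { $in.send($_) for 1..5; $in.close }   # stage 1: produce values
start {                                      # stage 2: stateful running sum
    my $sum = 0;                             # per-stage state, one task per stage
    for $in.list { $out.send($sum += $_) }
    $out.close;
}
say $out.list; # OUTPUT: (1 3 6 10 15)
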

One slightly more general problem is that Channel today doesn't really fit our overall concurrency model very well: it blocks a real OS thread when we try to receive from it, whereas in reality we like non-blocking awaiting of things where possible. I mention that here mostly because I think the stages in a pipeline should be spawned on the thread pool scheduler, but it's quite clear that they won't be the best-behaved schedulees with Channel as it exists today. Probably we should solve that at the level of Channel, though, so I'd just use Channel between the stages today. It means we get error and completion conveyance, which are easy to get wrong, so I'd rather not have more implementations of those. :-)
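
For contrast, a small example of the non-blocking consumption style with today's Channel: instead of a thread-blocking .receive, a whenever taps the Channel and awaits values without tying up an OS thread.

my Channel $c .= new;
start { $c.send($_) for 1..3; $c.close }
react {
    whenever $c -> $value {   # non-blocking await; react exits when the Channel closes
        say $value;
    }
}
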

Some problems will be better parallelized with hyper/race, some with feed pipelines, but there's also the issue that some things aren't worth parallelizing at all. I fear the ==> operator is especially vulnerable to that: while I don't think too many folks will write .hyper because it looks prettier, they probably will write ==> for that reason. If we magically speed up their programs with parallelism that's great, but there's a decent chance it won't be worth it and will in fact slow things down. That's a tricky problem, and it's also one we'll have to solve for the hyper/race model too. For now, I'd say just do the parallel implementation, and we'll investigate such heuristics and automatic decision making later. I don't think usage of ==> is widespread enough yet for us to really upset anything.

Kaiepi commented 5 years ago

The parallelization part of this is done; all that's left is support for <<==, ==>>, and *. I have a question regarding how <<== and ==>> should work, though:

my @foo = (1, 2, 3);
(4, 5, 6) ==>> @foo ==>> my @bar;
say @bar; # OUTPUT: (1, 2, 3, 4, 5, 6)

What should the value of @foo be after running this? (1, 2, 3, 4, 5, 6) or (4, 5, 6)? I think (4, 5, 6) DWIMs better, but I'm not entirely sure.