Add MappingIterator for synchronous transformations

jeswr commented 2 years ago

An alternative to #58 which uses callbacks to close the iterator rather than creating a new class.

coveralls commented 2 years ago

Coverage remained the same at 100.0% when pulling 3be4178f2fa0780b577016c40b2a178744a37691 on feat/faster-transforming-no-limit-class into a536bca2dfe84697e1a095c10eb1fcce40f7182c on main.

jeswr commented 2 years ago

@RubenVerborgh When I run it locally node 10 coveralls check is complaining about a comment in line 38 - so not sure how to resolve that one...

RubenVerborgh commented 2 years ago

Rather than close, how about we call the passed map function with (item, iterator) as arguments? Then limit can close the iterator itself, and we don't need an exception for it.

jeswr commented 2 years ago

This is ready for review @RubenVerborgh - given that this is now much cleaner than #58 I suggest that we merge the branch more-or-less in this state and worry about possible performance tweaks in the order of 10-20% at a later date.

RubenVerborgh commented 2 years ago

Excellent. Thanks, @jeswr; I can work with this!

RubenVerborgh commented 2 years ago

@jeswr Thanks for your patience. Gave this a deep review.

This is almost good to go, except the readable getter, which uses a trick that will get us into trouble. The readable mechanism is quite intricate, and is also used to fire the readable event at the correct times. So we really need something that sets this.readable = true when the source becomes readable, similar to how TransformIterator has source.on('readable', destinationFillBuffer).

As a fun challenge (and looping in @joachimvh), I gave up on the typing of private readonly _mappings: MapFunction<any, any, MappingIterator<S, D>>[]; which should really be private readonly _mappings: MapChain<S, D> where MapChain<S, D> is recursively defined as [MapFunction<S, D>] | [MapFunction<S, X>, ...MapChain<X, D>] with some infer X magic. However, I couldn't do it. Not a necessity though 🙂

RubenVerborgh commented 2 years ago

Sorry for the haphazard comments, @jeswr! They may look random, but it's actually three separate waves of working on this, and some crucial thinking work in between.

However, this is the conclusion I've come to:

Let's first merge this without the repeated .map optimization.
- That will allow us to get readable right.
- There are a couple of conceptual difficulties with the combined mapping from the root. One of them is the exposure of the iterator argument in the map callback. When you compress multiple mappings into one, this argument becomes meaningless/powerless: calling .end on these intermediate iterators becomes hard to define.
We should not expose the iterator argument in the map function, as this excludes certain optimizations in the future.
- We could define a special return value (a Symbol or so) that means "end the iterator", just like null now means "don't emit this item". I don't think that's a great solution though in general, to have such special return values. Or we could throw a special kind of error. But still not my favorite.
The previous point probably also means we need a different take iterator; I propose we just inline the solution of #58 (but not have a separate HeadIterator, that seems overkill).
In this simpler implementation, let's focus on getting all edge cases right.
- For instance, the source should not be read from if the source if closed, or if the mapped iterator is closed.
- And we'll need tests for all these (similar to TransformIterator).

I know some of the above go against my earlier suggestions, but reviewing the resulting code in depth brought new insights.

jeswr commented 2 years ago

I haven't fully context switched back to this issue so apologies if any of these q's/notes are obvious or obviously incorrect.

For reference here is the original state of the PR when it was opened which I will also refer to a few times below.

Firstly I just wanted to make a clarification that I suspect answers this comment. The source is what you refer to as root in this description. The mappingRoot (previously called upstream) would be the iterator constituting root.map(f1).map(f2) if we are considering the iterator for root.map(f1).map(f2).map(f3) as our 'current iterator'.

The original PR also did not contain constructor overloads so I'll need a little more time to understand what has been done there before I can comment on whether there is indeed a bug now.

This is almost good to go, except the readable getter, which uses a trick that will get us into trouble.

I'm still not convinced that this was a problem when I started this PR. The logic flow I was following is that when .readable is set to true via a getter, and in turn set true on the source (i.e. root) as is done in my original version of the PR, the consequence of this is that the root becomes readable and emits a readable event. Then as a result of this line the iterator for root.map(f1) would emit readable, and in turn root.map(f1).map(f2) would emit readable and finally root.map(f1).map(f2).map(f3) would emit readable.

This ensures that readable events are fired when required, whilst also ensuring that an iterator is readable if and only if (at least while neither iterator is closed) the source is readable which is to be expected when doing synchronous transformations.

Since the setter has since been removed from the original PR there will indeed be a bug around readable events.

As for the issue around ending, this was the reason for checking whether the state is CLOSED or not for readability. This means that if the root has ended then root.map(f1), root.map(f1).map(f2) and root.map(f1).map(f2).map(f3) can never be readable again - and also this line ensures that each of the downstream iterators have the end event called on them. Furthermore, similarly to the readable case, if the iterator corresponding to root.map(f1).map(f2) were ended then only it and root.map(f1).map(f2).map(f3) would be ended.

We should not expose the iterator argument in the map function, as this excludes certain optimizations in the future

What kind of optimisations does this exclude? Importantly, can they be achieved if we just pass a close callback rather than the full iterator (this was the set-up when the PR was originally opened). A close callback would also avoid the need for a custom take iterator.

RubenVerborgh commented 2 years ago

I haven't fully context switched back to this issue so apologies if any of these q's/notes are obvious or obviously incorrect.

Yeah, sorry that I had to leave this one for a while. I think it's the last thing I worked on before I caught covid, so I've never been able to complete my original review. Back on things now though!

The original PR also did not contain constructor overloads

The only thing this does is allowing people to pass in a single mapping function, so it not needing to be a one-element array. The array constructor is considered private, given that it is an internal optimization; so I essentially split the two cases (external versus internal construction).

I'm still not convinced that this was a problem when I started this PR. The logic flow I was following is that when .readable is set to true via a getter, and in turn set true on the source (i.e. root) as is done in my original version of the PR, the consequence of this is that the root becomes readable and emits a readable event.

Right. I think what happened is that I started adjusting the unit tests a bit, and noticed weird failures. So the unit tests now are correct, but I believe they don't work with the original PR because of readable events (some iterators were hanging).

My memory is that I fixed the readable to the point the tests were running, but I knew there would be other problems, hence the comment I wrote. For instance, given my (this._state < CLOSED) && edit, I presume that readable events were being fired at times where this was not possible. So even though my fix worked, I could think of other scenarios that would go wrong.

The main challenge with AsyncIterator is correctness; it's sometimes quite hard to reason about edge cases. All tests might work, and suddenly 1 out of a 100 times a 1 million stream will stall and it will be impossible to figure out why. So that's why I really suggest sticking to the existing readable mechanism, that has all the state checks etc., and has all the corresponding order-based unit tests attached to it. Or, if overwriting the mechanism, then a whole bunch of new readable tests will be necessary (and they will surface a couple of bugs).

Since the setter has since been removed from the original PR there will indeed be a bug around readable events.

Yes, but having the setter might give rise to a whole other range of problems. The readable setter was actually supposed to be private, but TypeScript doesn't allow such an asymmetry. Messing with the state of other iterators can be tricky.

I think if you put the setter back, you will see test failures.

What kind of optimisations does this exclude? Importantly, can they be achieved if we just pass a close callback rather than the full iterator (this was the set-up when the PR was originally opened).

Whether we pass an iterator or a close method tied to an iterator, it means that there is always the concept of an iterator at that stage. I.e., the map function then has knowledge about the execution. If we take a shortcut, where actually one iterator performs the work of 3 others in a single go, then things get confusing. So I don't think we should pass anything.

A close callback would also avoid the need for a custom take iterator.

Indeed, but at a cost. Perhaps it does make sense to have an UntilIterator or so, which takes a function as an argument that, when true, stops iteration. Then we have generalized stopping.

The main question is how we want to approach this PR now, I realize it's very inconvenient that, due to circumstances, there were 2 months between your and my initial work on this, and today. It's not easy to get the context back from then.

Do you still have the bandwidth to work on this? Just because you were volunteering back then, does not mean you still have the time today, and I want to be respectful of your contributions.

A way forward that I see at the moment is, given the current unit tests (which are a great harness to have), to re-envision the implementation a bit. I.e., if this were myself working on it, I would start with the existing code from the main branch, and make it work with the unit tests from this branch by incrementally copying in the code from this branch. I'd get a stable version without the sequential map optimization, merge it, and then add the optimization as needed.

Let me know what you think and what's feasible for you. In any case, thanks a lot for working on this.

jeswr commented 2 years ago

Do you still have the bandwidth to work on this?

Yep - I can have a go today branching off main, use the current test harness, and then follow the incremental approach suggested. The responses above appear to provide some very useful explanations that I can use to help inform that design. If after that there are still some conceptual problems then it might be faster for you to take over rather than having to review several more iterations of my code - but will see how I go today before making that call.

thanks a lot for working on this.

No worries! I always enjoy working on OS code and I've learned a lot from the work and reviews. Thanks for all your work on the reviews!

RubenVerborgh commented 2 years ago

The responses above appear to provide some very useful explanations that I can use to help inform that design. If after that there are still some conceptual problems then it might be faster for you to take over rather than having to review several more iterations of my code - but will see how I go today before making that call.

You're absolutely nailing it in #75, so let's finish things there 👍👍

RubenVerborgh / AsyncIterator

Add MappingIterator for synchronous transformations #59