Closed jeswr closed 2 years ago
Just came back from a break week with no laptop and catching up on things. When it comes to the behaviors described in https://github.com/RubenVerborgh/AsyncIterator/pull/79#discussion_r934927797 and https://github.com/RubenVerborgh/AsyncIterator/pull/79#discussion_r934928512, I don't think that those can and/or should be addressed in UnionIterator
as the latter can not make any assumption on how sources are generated in the first place.
In the case of sources being lazily generated in an asynchronous manner, ensuring correct error propagation and disposal of sources would be best addressed in a dedicated sources iterator that is bound to be context-specific. The only thing that UnionIterator
should do is to catch errors both at the single source level and at the iterator-of-sources level (which is already happening at https://github.com/RubenVerborgh/AsyncIterator/blob/6e7f093364bef006db6b06136aeb80eafce82d64/asynciterator.ts#L1668-L1673).
IMHO, when it comes to error propagation and source disposal this PR is already complete as it is.
In the case of sources being lazily generated in an asynchronous manner, ensuring correct error propagation and disposal of sources would be best addressed in a dedicated sources iterator that is bound to be context-specific.
So then I guess the only remaining question for this PR is when an array of sources is passed into the UnionIterator
whether we should immediately start listening for the error event by subscribing that listener in the constructor.
So then I guess the only remaining question for this PR is when an array of sources is passed into the
UnionIterator
whether we should immediately start listening for the error event by subscribing that listener in the constructor.
I don't think so as that would lead to slightly different behaviors in the case of sources passed as an array vs. sources passed as an iterator. Wrapping the array into an iterator with wrap()
and consuming it as if it had always been an iterator leads to more consistent behavior.
Ok great - in that case I am happy with the status of this PR and consider it ready for review :)
Actually, I realized you were likely making a point that I failed to grasp. If sources are passed as an array of iterators that is fed to wrap()
, the resulting iterator will not relay errors from the sources themselves as there's nothing in wrap()
that handles that specific case (rightly so as wrap()
should not modify its behavior based on item type).
In this case I would either leave things as they are right now or create a different iterator class that is specific to the use case of abstracting an array of sources into an iterator. Then, the UnionIterator
constructor could do something like:
if (Array.isArray(sources)) {
this._sources = new SourceArrayIterator(sources);
} else { ... }
However, the implementation of SourceArrayIterator
would be a non-trivial one:
1) should there be an option to simply discard errored sources? 2) should an errored source automatically lead to the destruction of other sources?
These questions seem very context-specific to me and I'm sure I could come up with plenty more. Perhaps it's worth thinking of SourceArrayIterator
as an orthogonal feature to UnionIterator
and better addressed in a different PR if needed at all.
Perhaps it's worth thinking of SourceArrayIterator as an orthogonal feature to UnionIterator and better addressed in a different PR if needed at all.
So to conclude this PR is ready to be reviewed in its current state.
Bump @RubenVerborgh
@jeswr Some questions about the circular list:
CircularList
to a separate file like LinkedList
?CircularList
strictly needed? I.e., do the performance gains really come from not having an array for the iterators?I do agree that, conceptually, it is indeed a circular list though.
Given that the number of iterators is assumed to be small (10–20ish?) and not often changed (?), I am curious if there is a big impact.
reasons for a major version bump
So this is a v4? (I followed the links, but no result for keyword major.)
Shall we extract
CircularList
to a separate file likeLinkedList
?
That's probably sensible. When I was first writing I didn't have clarity on what APIs would need to be exposed; but I can now try and refactor that retrospectively.
Is a CircularList strictly needed? I.e., do the performance gains really come from not having an array for the iterators?
The reason I have introduced it is that we need to always be able to delete an element in O(1)
time when the end
event for that particular iterator is called; which we cannot do with an array or the current LinkedList
as both require the traversal of O(n)
elements (where n
is index of the element in the array/linkedlist).
The other type of data structure that would allow for this type of deletion would be a Doubly LinkedList, which could become part of the current linkedlist implementation, at a memory cost to other iterators.
So this is a v4? (I followed the links, but no result for keyword major.)
Yes, since the erroring behavior will change in some cases this will need to be a v4.
both require the traversal of O(n) elements
I agree; but it seems to be that it would be O(10)
at most; unless we are looking at scenarios where the number of iterators is really large?
I guess I'm simultaneously asking: what are the main sources of performance improvement in this PR? Is it indeed this new list structure?
I agree; but it seems to be that it would be O(10) at most; unless we are looking at scenarios where the number of iterators is really large?
Comunica has some cases where the number of iterators in a union could become very large (property paths and link traversal). However, currently those cases do not make use of the UnionIterator yet, so perhaps not necessary to optimize for those.
Okay; I'm happy to have a circular list.
(Although, side-note, what we actually need is a list with O(1) item removal. Which can perfectly be a LinkedList
with lazy deletion: you keep a Set of items to be deleted, and as you iterate through elements, you remove nodes as needed. But we might be getting into premature optimization territory here 😅.)
(Actually, scratch that. There might be multiple occurrences of an item—at least for lists in general.)
(But do note that you don't need a doubly linked list. All we need is a Map from item to previous list node; which is currently implemented via Symbol. Also the current implementation breaks if an element is added multiple times to a list; which is not allowed for iterators, but it is for lists in general.)
(So my points still hold if we simply don't allow duplicate items.)
Maybe OrderedSet
extends LinkedList
, adding O(1) has
and delete
operations via a Set or Map?
Then we also keep the LinkedList
without the Map memory and perf overhead.
And there could be a .looped
iterator that is infinite/circular!
Maybe OrderedSet extends LinkedList, adding O(1) has and delete operations via a Set or Map?
I'll have a crack at this in the morning (jet lag is starting to hit me hard). To confirm - this would be preferred over a circular linked list implementation (provided this has similar performance to the current implementation, which it should)
this would be preferred over a circular linked list implementation
Not necessarily; my remark was more that all we need is O(1) deletion.
Circular lists as such doesn't offer that; it's first an O(n) lookup, and only then deletion is O(1). Which is why you added the Symbol lookup to make it O(1).
So what I'm arguing is that the Symbol lookup trick is basically what we need (and that we can also implement it with a Map); and that we get O(1) deletion by making it point at the previous node rather than current node.
Yes, since the erroring behavior will change in some cases this will need to be a v4.
Could you details these a bit?
Because we should line up all the changes we want for v4, and see which ones we can bundle.
Could you details these a bit?
See https://github.com/RubenVerborgh/AsyncIterator/pull/79#discussion_r934929002. The reason that test broke is because we do not subscribe to the error event of iterators until after we start reading from them. So if the iterator errors before we first call .read()
on it, then the error will not be forwarded.
Though, if we remove pretty well all autostarting behavior in #45 - then there really shouldn't be errors being emitted prior to the first .read()
call anyway.
we do not subscribe to the error event of iterators until after we start reading from them
Can't we change that behavior? Not just for backwards compatibility reasons, but it's a necessity for correct functioning.
The moment an iterator is handed over to a component, that component takes full responsibility of that iterator. That includes handling its errors, which would terminate the process when unhandled (default EventEmitter behavior).
So in this concrete case, when a source iterator is handed over to a union iterator, that union iterator should immediately attach error listeners.
Can't we change that behavior?
Just to make this even more concrete; if I have
const source = range(1, Infinity).map(x => range(1, 10))
const unionRange = union(source)
Then source
will have an error listener attached immediately, but each of the infinitely generated range(1, 10)
iterators will not have an error listener attached until immediately before the first .read()
call. In this case, and cases with a large number of generated iterators it does not make sense to try and retrieve them all from the source
and attach error listeners before we start being able to iterate over unionRange
.
This is still breaking relative to the last version as the fact that the UnionIterator
could autoStart and buffer could mean that listeners are attached at an earlier point in time.
The one case in which it may make sense to attach error listeners to all source iterators in advance is if we are taking a union over an array of iterators rather than an iterator of iterators; i.e.
const unionRange = union([range(1, 10), range(1, 10)])
But in the same token; I'm not sure if it makes sense to change this behavior when taking the union over an iterator vs. the union over an array.
The only way to attach error handlers to a union of lazily-generated sources would be to let upstream instantiate those sources and buffer all of them, even though we might need only a few such as in the case of destroying the union after a set number of items. That would likely have significant performance and resource utilization issues.
But in the same token; I'm not sure if it makes sense to change this behavior when taking the union over an iterator vs. the union over an array.
This is why I suggested using a separate class to handle that specific use case in https://github.com/RubenVerborgh/AsyncIterator/pull/81#issuecomment-1216802882 - to keep the UnionIterator
class as consistent as possible.
Closing as I have a better implementation in the upcoming v4 PR.
Supercedes #79
The concerns about behavior (and reasons for a major version bump) that I pointed out there still remain. But not sure if I have a good solution for those at this point.
Before For loop with 5x10^9 elems: 5.207s UnionIterator 2x10^7 iterators: 30.297s UnionIterator 1000x500 iterators: 1.897s UnionIterator 1000x500 iterators - max parallelism of 1: 1.837s
After For loop with 5x10^9 elems: 7.832s UnionIterator 2x10^7 iterators: 3.844s UnionIterator 1000x500 iterators: 276.211ms UnionIterator 1000x500 iterators - max parallelism of 1: 157.52ms