jacoscaz closed this pull request 2 years ago
I should add that this PR is not ready to be merged, as it's lacking tests and I'm sure the code can be improved. However, it is ready for a first round of feedback to better direct my efforts. Tagging @jeswr, @rubensworks and @RubenVerborgh to keep everyone in the loop. BTW, this would have to be rebased on main once #59 gets merged.
Things become much more interesting when we consider what happens with a chain established by wrap()ing:
I've also seen this pattern in the core of Comunica (with a few synchronous transforms in between), mainly in the context of handling iterators that are wrapped in a promise. It would be good to have benchmarks on that case as well - i.e.
let i = 0;
const arr = new Array(200_000).fill(true).map(() => i++);
const source = new ArrayIterator(arr);
const opts = { letIteratorThrough: false };
const then = Date.now();
wrap(Promise.resolve(wrap(Promise.resolve(wrap(Promise.resolve(wrap(Promise.resolve(source), opts)), opts)), opts)), opts)
  .on('data', () => {})
  .on('end', () => {
    console.log('elapsed', Date.now() - then, 'ms');
  });
@jeswr I've tweaked your test a bit to ignore the time spent establishing the chain, just in case it could make a difference (likely insignificant, but still).
let i = 0;
const arr = new Array(200_000).fill(true).map(() => i++);
const source = new ArrayIterator(arr);
const opts = { /* letIteratorThrough: true */ };
const chain = wrap(Promise.resolve(wrap(Promise.resolve(wrap(Promise.resolve(wrap(Promise.resolve(source), opts)), opts)), opts)), opts);
chain.once('readable', () => {
  const then = Date.now();
  chain.on('data', () => {}).on('end', () => {
    console.log('elapsed', Date.now() - then, 'ms');
  });
});
This test runs in ~450 ms with the current main and ~12 ms with this PR. letIteratorThrough makes no difference, as having to return an iterator synchronously forces the use of WrappingIterator even for AsyncIterator sources, since we still need to wait for the promise to resolve asynchronously.
Not sure what's going on with coveralls, but this is probably ready for review by @RubenVerborgh (in due time, after #59 is merged). I'll resume working on cloning.
@RubenVerborgh I have just rebased this one on main. I'm not sure what is happening with the failing test on Node 10.x, but I would prefer that you take a look before further work is done on this.
Thanks, removed the second option 16fcfb638a97df3b26cb11aa3d4da060b120c146 and will talk to Jesse over lunch about the other one!
Awesome!
The request for that was in relation to performance when wrapping N3.js. By default, wrap(store.match(...)) would end up using fromAsyncReadable, which in turn buffers within the WrappingIterator (all results get buffered into an array as a result), as opposed to just using the IterableWrapper, which is much faster as it pulls out quads on demand. IIRC we cannot change the default prioritization order for backwards compatibility reasons.
The main consequence of removing this is that downstream clients like Comunica will sometimes need to write
if (isIterable(source) || isIterator(source))
  return fromIterable<T>(source);
else
  return wrap<T>(source);
rather than
return wrap(source, { prioritizeIterable: true });
Thanks @jeswr; I'll take a look at the backwards compat thing too then. But we should def put performance-friendly options first. (Also this will be one of the last changes before semver.major, so then we can happily break.)
Thanks @jeswr for the refresher. I hadn't realized we already have a use case for the prioritizeIterable flag. Long-term, I suspect more such flags might follow (e.g. for native async iteration). Perhaps an explicit ordering might be better here? Something like priority: ['iterable', 'asynciterable', 'array', ...]?
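A sketch of how such an explicit-ordering option could work, assuming hypothetical names (pickStrategy and the check helpers are illustrative only, not part of the library): the priority array simply determines which type check is tried first.

```javascript
// Hypothetical sketch of a priority-driven strategy picker: the first entry in
// `priority` whose type check matches the source wins.
function isIterable(source) {
  return source != null && typeof source[Symbol.iterator] === 'function';
}

function isAsyncIterable(source) {
  return source != null && typeof source[Symbol.asyncIterator] === 'function';
}

const checks = {
  array: s => Array.isArray(s),
  iterable: s => isIterable(s),
  asynciterable: s => isAsyncIterable(s),
};

function pickStrategy(source, priority = ['array', 'iterable', 'asynciterable']) {
  for (const kind of priority)
    if (checks[kind](source))
      return kind; // first matching entry wins
  return 'default';
}
```

With this shape, a flag like prioritizeIterable becomes just a reordering of the array: pickStrategy(myArray, ['iterable', 'array', ...]) would select the iterable strategy even for plain arrays.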
@jacoscaz @jeswr Done my in-depth review—what do you think?
I like where this ended up! Thanks @RubenVerborgh, +1 from me to merge, and I'll try to rebase #78 post-merge as soon as possible so that it can also be merged, completing work on wrapping.
A major thank you to @jacoscaz as well as @jeswr for your efforts and patience.
I realize it has taken me a lot of time to review and complete this PR, so I definitely wanted to emphasize my appreciation for helping out with open source.
Because of its precise and high-performance nature, AsyncIterator is a special project in the sense that every PR tends to require one uninterrupted day for me to work on it. There's a lot of thinking work on finding which code makes it easiest to reason about correct behavior in a wide array of edge cases. So if you find me replacing or rewriting large chunks of code you have contributed, it's not you—it's me! Basically the hard work is finding the initial code that works, and covering it with an appropriate set of extensive tests, which is what you've done. Then it's my job to re-interpret the code so that it satisfies those tests, and that's what tends to take me quite some time to get right.
Thanks for bearing with me, and I look forward to getting to the next PRs!
Published as v3.6.0 🥳
Because of its precise and high-performance nature, AsyncIterator is a special project in the sense that every PR tends to require one uninterrupted day for me to work on it.
@RubenVerborgh thank you for this library in the first place and all the effort that goes into maintaining it! I enjoy your reviews as they always lead to objectively better code from a correctness / performance standpoint. I also enjoy seeing the same code rewritten in a style that is not the one I started out with - there's always a great deal to learn.
This PR covers step 2. in the plan described in https://github.com/RubenVerborgh/AsyncIterator/issues/44#issuecomment-1086950726 and aims to 1) lower the overhead of the wrap() function and 2) expand support to additional types of sources (namely Iterator and Iterable objects).

The rationale for this PR is to take the onus of figuring out how to best wrap common types of data sources away from users of this library. Users should be able to expect wrap() to use the strategy that is most appropriate for any given source (as long as it's a supported source).

Performance-wise, this PR does two things:

1) It introduces source-specific iterators that synchronously read from their sources, skipping the internal buffering done by TransformIterator (i.e. the class currently used for wrapping)
2) It introduces a specific option that allows AsyncIterator instances to fall through as they are, with no wrapping applied

The following test, admittedly naive, takes ~116 ms with the current main branch and ~8 ms with this PR (letIteratorThrough makes virtually no difference). This mostly highlights the difference made by skipping TransformIterator's internal buffering.

Things become much more interesting when we consider what happens with a chain established by wrap()ing:

The above test takes ~250 ms with the current main branch, ~15 ms with this PR and letIteratorThrough: false, and ~8 ms with this PR and letIteratorThrough: true. I haven't seen consistent differences with or without warmups.

My use case is the optimization of the interface between quadstore, which implements the RDF/JS Source interface, and Comunica, through which the former can be queried via SPARQL: https://github.com/comunica/comunica/blob/a2f31270778730911bd0bdf8b8c7ade567153319/packages/actor-rdf-resolve-quad-pattern-rdfjs-source/lib/RdfJsQuadSource.ts#L29. As many algebraic operations (matching, joining, ...) happen in-memory, requiring the streaming of large amounts of quads from the persistence layer into the engine, removing even just one layer of indirection/buffering can make a significant difference.
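The buffering difference described above can be made concrete with a small, self-contained sketch (plain generators, not the library's TransformIterator or IterableWrapper code): a buffering wrapper touches every source item before delivering the first one, while an on-demand wrapper touches exactly one.

```javascript
// Illustrative sketch of buffered vs. on-demand wrapping.

// Buffering: the whole source is copied into an array before the first item
// can be delivered downstream.
function* bufferedWrap(iterable) {
  const buffer = [...iterable]; // full pass over the source
  yield* buffer;
}

// On-demand: items flow straight through, one pull at a time.
function* directWrap(iterable) {
  for (const item of iterable)
    yield item;
}

// Helper that counts how many items the wrapper pulls from the source.
function countingIterable(n, counter) {
  return {
    *[Symbol.iterator]() {
      for (let i = 0; i < n; i++) {
        counter.reads++;
        yield i;
      }
    },
  };
}
```

For a 1000-item source, the first next() on bufferedWrap forces 1000 reads; the first next() on directWrap forces only one, which is why skipping intermediate buffering matters when streaming large numbers of quads.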