Generic promise batching implementation

poltak commented 7 years ago

Mostly relates to what is being discussed in #28 regarding batch running of a fetch&analyse function. Work on that function is being done elsewhere, so instead this is a generic batching logic that will be able to be used with fetch&analyse function (and any other async function).

More information (from original discussion on WorldBrain PR #15 - when batching logic was coupled to fetch&analyse):

Spent the last day getting familiar with RxJS usage. Used elsewhere in this project and heard about it enough lately to prompt me to play with it. Interesting stuff. Anyway, I got the idea of maybe trying to write the batch fetch&analyse logic using an Observable.

This may be completely the wrong direction, as my understanding may be very flawed at this point, however I managed to get something working that was surprisingly simple (code-wise; still took ages to get working) and works (at a basic level so far).

Pre-knowledge

The previous work on this PR essentially resolved in: store-page can now work with and without a browser tab. Meaning you can just pass a URL and have all the dedup, data fetching + analysis logic run on it. The store-page interface takes url and returns a Promise. This Promise eventually creates a page stub and resolves to an object containing:

page (contains the _id of the inserted page stub + url)
finalPagePromise (another promise that does all the analysis and updates the page stub, if needed)

The first promised returned by the store-page interface usually resolves pretty fast, as it's just doing a DB op. The finalPagePromise usually takes a bit longer, as it requires doing an XHR and some other stuff.

In my mind, the batch/whatever should be able to take a bunch of URLs and then batch up these finalPagePromises, waiting on n at any time. It should also be able to stop anytime and pick-up again from where it left off.

Idea

The native ES6 Promise (and asyncawait) API doesn't seem powerful enough on its own to do this without a lot of extra code. However I found mergeMap in RxJS that seems to be able to do this coupled with defer to stop all the promises from running at the same time (limits to the concurrency level specified).

On a stream of URLs, mergeMap takes each one and defers the finalPagePromise returned from calling that URL on the store-page interface. Concurrency level will allow n promises to run at anytime while the others are still deferred until the currently waiting promises get resolved.

An internal state is stored, which currently URLs get written to as they get pushed off the mergeMap stream, meaning their finalPagePromise has resolved. This allows restarting of the batch as long as the state is stored somewhere (either internally in the batch function, or externally if the function needs to be free'd).

Confused part

Because the store-page interface returns a promise which resolves to a bit of data + another promise, I am currently doing some weird stuff to get that inner promise to make sure the Observable.defer gets it (instead of outer promise). It means I can't emit other information (like the page ID or URL) to the streams (or at least I don't understand how). example:

observable.mergeMap(
    input => Rx.Observable.defer(
        () => returnsNestedPromise(input).then(
            res => res.innerPromise)))

Example

Here's an example CodePen I was playing with to understand everything. It runs the same batching logic, but with a mocked fetch&analyse function, that also logs everything to console to see how it works: https://codepen.io/poltak/pen/XRRdeR

Treora commented 7 years ago

Nice challenge, to make a pausable promise executor using Rx. It is ideal if the complexity is contained in the file while the outside interface is simple. I wonder how exactly this would be used in the application.

I think you took out the struggle with the promise-in-a-promise construct? This whole construct (that I made for deduplication) is not so pretty and could be changed; actually an Observable may be a cleaner way for such a twice-resolving promise; or the finalPagePromise may not be needed at all (I suppose we could listen to database changes instead).

I can't yet fully wrap my head around it though.

poltak commented 7 years ago

I think you took out the struggle with the promise-in-a-promise construct? This whole construct (that I made for deduplication) is not so pretty and could be changed

I've left it as is right now, but I'm thinking instead of returning the finalPagePromise, return a function that returns finalPagePromise. The reason is that Promises will start to process as soon as they're created, and enclosing the creation in a function will essentially defer the Promise until it's called. So I can call a Rx.Observable.defer on that function, but if I try to defer on the Promise that's already created, it may be already resolved by the time it gets to that part of the stream (making the idea of "promise-batching" useless, as there's no control over which promises are allowed to resolve at any time). Not sure if that makes sense...

actually an Observable may be a cleaner way for such a twice-resolving promise

Added optional support for these cases in 0dc70b7. Still has main callback arg which sets up the async function to batch, but if that returns another async func/Promise, it'll be handled. (In Rx, need to use .flatMap operator rather than .map for these cases).

WebMemex / webmemex-extension