Background/non-tab-dependent page fetch and analysis logic

poltak commented 7 years ago

Mirrors PR #15 on WorldBrain fork. In short it provides a fetch&analyse logic, as discussed in #28, allowing non-browser-tab dependent fetching and analyse logic.

Creates new background fetch-page-data logic, which mirrors what happens when a user manually visits a page, but does it on a DOM gotten via XHR rather than DOM in-tab.

Updates calling functions in page-analysis/background and page-storage/store-page to default to the new background fetch&analysis if no tabId of a browser tab is supplied (tried to minimise interface changes; just make tabId optional).

poltak commented 7 years ago

since we are changing the current analysis flow to handle two scenarios [...], we may want do a more thorough refactoring. I fear we end up with fragile code if we add many if tabId kind of statements in several places.

Yeah I understand the concern, but I haven't thought of the best way yet.

Looking over the entire two scenarios now, the only differences are:

tab-based scenario stores screenshot; background does not (page-analysis/background/index.js#L46)
each scenario go down different fetch&analyse routes (the very next line). Both of these fetch&analyse call the same extract-page-content eventually, they just differ in the way they get the DOM/Document (from browser tab or over network via XHR)

Given they are the only differences I can think of, at the moment, I don't think we should try separating the code, but maybe just making different entry points?

I cleaned up the conditional tabId code (and stopped passing in of different extract-page-content callbacks to the common fn) in page-analysis/background in 91813eb6f947e2a2cef5f57118092f41c22c6117. Now calling either entrypoint fn will result in the same thing, just using different methods.

If you think that's on-the-right-track the next step would be to do the same with page-storage/store-page (one step "back" in the scenario flows, and currently acts at the entry point to all this logic -- from log-page-visit at least).

So for example, turn:

reidentifyOrStorePage: ({ url: string, tabId?: string }) =>
    Promise<{ page: IPage, fetchFinalPage: () => Promise<IPage> }>

into:

reidentifyOrStorePageInTab: ({ url: string, tabId: string }) =>
    Promise<{ page: IPage, fetchFinalPage: () => Promise<IPage> }>

reidentifyOrStorePageInBackground: ({ url: string }) => 
    Promise<{ page: IPage, fetchFinalPage: () => Promise<IPage> }>

Maybe should change their names as well...

Treora commented 7 years ago

(pulling this general conversation out of the in-line discussion as that may become hidden at some moment)

in your mind, what does each process consist of?

Browser history import would do what the import-history I drafted does: import browser history, convert and store it as pages and visits. The batch-fetch process would download docs and also add them as pages (now with more content) to the database. Either can be invoked without the other (we could pass the batch-fetch process any list of URLs).

Note this all just concerns the technical design. The UI could hide the separation, and run the history import and start bath fetching in direct succession.

I imagine that starting batch fetching could mean just adding a bunch of URLs to a list in the database, which the background process reads and processes as a queue whenever it feels like (possibly only during inactive & power-plugged moments).

Treora commented 6 years ago

Closing this old PR. This might actually be an interesting feature still, but it would have to redesigned&rewritten if it gets attention again.

WebMemex / webmemex-extension

Background/non-tab-dependent page fetch and analysis logic #97