WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense
271 stars 18 forks source link

Lazy crawler+bundler implementation #44

Closed Gozala closed 2 years ago

Gozala commented 5 years ago

Proposed changes introduce concept of Resource which on instantiation is just a pointer to the corresponding sub-resource link (where link is what it used to be). Each resource is a lazily freeze-dried on demand:

Primary API has also changed from freezeDry(opts):Promise<string> to freezeDry(opts):Resource there for to accomplish what old API did user will need to await freezeDry(opts).text() instead. This change also means user could do await freezeDry(opts).blob() or in more advanced use case choose to hand roll alternative serialization:

const root = freezeDry(opts);
const resources = await root.resources()
for (const resource of resources) {
  switch (resource.type) {
    case "image": 
    case "audio":
    case "video":
    case "font": {
      const content = await resource.dowload()
      const blob = await content.blob()
      const url = await blobToDataURL(blob)
      resource.replaceURL(url)
      break
    }
    case "style": {
      const subresources = await resource.resources()
      for (const subresource of subresource) {
        // .... probably recurse here
      }
    }
    case "document": {
      const subresources = await resource.resources()
      for (const subresource of subresource) {
        // ....
      }
    }
  }
}
Gozala commented 5 years ago

Hmm... So it seems that current implementation of freeze-dry actually carries on crawling / inlining resources it's just it will also start serializing DOM so I would guess on timeout you will wind up with some resources inlined and others stay intact. There are also few TODOs to abort tasks on timeout and to do something about resources that have not being processed.

I think it might be worth discussing what would be a good solution here, I don't think replicating current behavior would be ideal, although it is possible it would introduce some messiness to otherwise (IMO) well structured code.

Gozala commented 5 years ago

(Mostly noting this so I don't forget) I think I have an idea for a way of dealing with timeouts. We could use AbortController that can be passed to each fetch that will be issued by dreezeDry and used to abort all of them as necessary.

Gozala commented 5 years ago

@Treora I'm having trouble upgrading jest-fetch-mock to latest version that should have support for AbortSignal, it seems that everything breaks after upgrade.

I did notice that after update await (await fetch(docUrl)).text() is [object Object] instead of markup, which I suspect is the reason. I also noticed that fetch.Response and Response used in mockFetch aren't equal and that makes me think it's related.

I'm also not sure where Response and Blob globals are coming from in first place, would you mind helping with that ?

Gozala commented 5 years ago

I have figured the source of the issue. It's related to https://github.com/bitinn/node-fetch/issues/392#issuecomment-483333843:

New jest-mock-fetch uses cross-fetch that in turn uses node-fetch that has it's own Blob implementation that is incompatible with one that jsdom expects, which basically ends up calling blob.toString() and that's where [object Object] comes from.

Unfortunately I can't see any way to resolve this without mass-forking the whole dependency chain in attempt to resolve this.

jimmywarting commented 5 years ago

jsdom was unfortunately unwilling to use the new Public blob API to read the content of the blob with promises.

Maybe you have to use jsdom's own fetch implementation instead of node-fetch (to get there internal blob)? the alternativ is if you can: avoid creating a res.blob() and get it as arrayBuffer using res.arrayBuffer() directly instead of calling something like fr.readAsArrayBuffer(blob)

the other way maybe could be to get it as a arraybuffer and then construct a blob using jsdom if you really need a blob

res.arrayBuffer().then(ab => {
  new Blob([ab])
})

Or try to convince him to use the FileReader to allow reading jsdom like object with the public api - i have already tried and done my part

Gozala commented 5 years ago

I end up resorting to a horrible hack to get jest-mock-fetch > cross-fetch > node-fetch interop with jsdom. I wish there was a better way but there isn't suggestion above assumes we control the full stack, but we don't.

Treora commented 5 years ago

I'm also not sure where Response and Blob globals are coming from in first place, would you mind helping with that ?

As you probably found out already, these are set by jest-fetch-mock.

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

We could use AbortController that can be passed to each fetch that will be issued by dreezeDry and used to abort all of them as necessary.

Sounds good. Also, something signal-based approach like AbortController seems a good API we could offer instead of the timeout option. (a user could just use setTimeout themself to still get the timeout behaviour)

Gozala commented 5 years ago

I have not fully grasped the issue you are dealing with (and would be very glad if I do not need to); but if you found some changes are required in the test setup etcetera, of course feel free to propose. I'm quite indifferent about what test frameworks and such we use; as long as not too much effort is wasted maintaining them. I hope the hacks can mostly be abstracted away from sight, and added in such a way that we can easily remove them if the problematic libraries will have solved issues underneath..

Problem is version of jest-fetch-mock did not had support of AbortController. New version does, but it also switches underlying implementation with cross-fetch which uses node-fetch in node. After updating jest-fetch-mock all tests were breaking, because global.Request / global.Response are from node-fetch while global.Blob is from jsdom worse yet Response constructor does not understand jsdom.Blob and just serializes it to [object String], furthermore Response(...).blob is a Blob instance from node-fetch that JSDOM does not understand.

This means even if we managed to make this.Blob be the one from node-fetch (which we can't because maintainers of this library don't want to expose it https://github.com/bitinn/node-fetch/issues/392), we would still have an issue with passing Response's back to JSDOM world because when it attempts to use response.blob() will not be JSDOM.Blob but would be NodeFetch.Blob instead.

Ideally differences here would be resolved across jsdom and node-fetch by eliminating competing Blob implementations and just create a common one that both would use (https://github.com/jsdom/jsdom/issues/2555#issuecomment-483384509). Until that happens I end up making a hack that changes JSDOM.Blob.prototype.__proto__ = NodeFetch.Blob.prototype adds methods from NodeFetch.Blob.prototype to call corresponding methods on JSDOM.Blob.prototype. Which is horrible, but only way to make blob instanceof Blob work regardless in Blob is from jsdom or node-fetch.

Sadly they monkeypatching can't "be out of sight" that is because NodeFetch.Blob isn't really available and can only be obtained async ((await Response(...).blob().constructor)

Gozala commented 5 years ago

@Treora I discovered one peculiar behavior as I was trying to change subresources from being an Iterable<Resource> to a Map<string, Resource> - that is current implementation does not recognize duplicates: link to the same image from both document and styles will end up fetching it twice.

I would like to address this with an alternate design, where resource:link mapping isn't 1 : 1, but one 1:n. This pull is huge already & it might make more sense to do change to address this as followup instead.

Treora commented 5 years ago

current implementation does not recognize duplicates: link to the same image from both document and styles will end up fetching it twice.

Yes, thus far it seemed best to keep things simple and leave deduplication to the fetch function: the user-supplied fetchResource callback, or even the browser's cache. This could be changed, but would need to be done carefully as a URL does not necessarily always return the same resource: request headers, e.g. the Accept'ed content types, could matter.

Treora commented 5 years ago

This pull is huge already & it might make more sense to do change to address this as followup instead.

Indeed; this already PR is already so large that I suppose the most workable plan will be, when you feel you got to a somewhat stable solution you feel confident about, to try collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.

Treora commented 5 years ago

Problem is version of jest-fetch-mock did not had support of AbortController. ......

Thanks for the clear explanation, that helps!

Gozala commented 5 years ago

Indeed; this already PR is already so large that I suppose the most workable plan will be, when you feel you got to a somewhat stable solution you feel confident about, to try collaboratively review and incorporate various aspects of it in a few iterations. It is very helpful when any orthogonal changes are put into separate commits, or possibly in separate branches.

I already feel like it's done & I can't really do any meaningful progress without a review. So I suggest we do that.

Gozala commented 4 years ago

Hi @Treora,

Any chance we could figure out a way to get some of this work into the tree ? It looks like I'll be needing this in my future work and I would very much prefer not to go on with my own fork.

Treora commented 4 years ago

I am sorry I ended up doing exactly what I intended to avoid: post-poning this work for several months. However, getting back to work on freeze-dry has been slowly rising up on my to-do list. Besides getting this bundling customisability incorporated I figured it may be worth calling it v1.0 and making some better documentation and publicity for it (just a web-page, no road-side billboards). Your reminder may just get me started on this, thanks for poking.

As I think I said before, I will probably not merge this PR as-is, but rather take the architecture and API design you demonstrated here as a guide and make changes in a few steps; while trying to keep the code a bit more modular than this PR.

Also, I am still eager to make the code typed (see issue #28), but might do that afterwards.

Gozala commented 4 years ago

As I think I said before, I will probably not merge this PR as-is, but rather take the architecture and API design you demonstrated here as a guide and make changes in a few steps; while trying to keep the code a bit more modular than this PR.

Is there way we could collaborate ? I do want to use freeze-dry for the work & that would require things like changes in this pull request & more. However I am not sure at this point what would be a best strategy to go about it without diverging further.

Also, I am still eager to make the code typed (see issue #28), but might do that afterwards.

For what it's worth I would advice going about it other way round. Type system pose API constraints that otherwise could be overlooked. Additionally type checker is very helpful in large refactoring as it will identify places you've missed.

Gozala commented 4 years ago

Assuming you're in favor of public API & risk of sounding biased, I would suggest you to consider landing this without rewriting and then followup witch changing internals to match desired modularity. Benefits of this approach would be:

  1. We all get desired functionality sooner.
  2. Big chunk of code will already be typed making followup changes easier.
  3. It will enable us to collaborate - My fork won't need to diverge further and you will be able to land incremental changes to the internal modules.
Treora commented 4 years ago

Is there way we could collaborate ? I do want to use freeze-dry for the work & that would require things like changes in this pull request & more. However I am not sure at this point what would be a best strategy to go about it without diverging further.

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

Assuming you're in favor of public API & risk of sounding biased, I would suggest you to consider landing this without rewriting and then followup witch changing internals to match desired modularity.

I am considering this indeed, but need to look into the code again and fully understand it. As I intend to do a substantial rewrite, and I am already having trouble wrapping my head around some aspects, I suspect that for me it would be much easier and thus quicker to rewrite code I am most familiar with.

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one; I’d like to make the code more ‘graph-oriented’ (perhaps passing subgraphs of subresources&links around), and provide the ‘callback-oriented’ API you created only as a convenience function on top of that. Not sure that makes sense to you, nor if it will keep making sense to me as I dig deeper. :)

Treora commented 4 years ago

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one

I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.

Treora commented 4 years ago

Also, I am still eager to make the code typed (see issue #28), but might do that afterwards.

For what it's worth I would advice going about it other way round. Type system pose API constraints that otherwise could be overlooked. Additionally type checker is very helpful in large refactoring as it will identify places you've missed.

Good points. I was thinking that it may slow down rewriting, especially as I am not fluent with typed javascript. Could work out either way, I guess.

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.

Gozala commented 4 years ago

As for which type system/language to use; I think you once said you use flow for its more powerful templating system, but that typescript may be helpful because of its wider uptake, community, etc. How is your opinion on this now? I have been a little biased towards adopting typescript, but if you recommend flow I am fine with following your lead on this.

I still think flow has a better & more flexible type system, however typescript is probably a better choice given it’s popularity.

Only argument I would consider is that with typescript you’ll need to do builds etc... In that regard I think flow has an advantage as you can add types as comments (like in this pull) and code would load natively in runtimes that support ES modules.

Gozala commented 4 years ago

As for the API, I am still thinking about making ‘two levels’ of what you have sort of done in one I may be inaccurate here, as you do leave the possibility to interact with the resource graph without using the resolveURL callback (as you exemplify above). I will look into this again and see how much that already matches what I had in mind.

Indeed you can traverse it as a graph, and resolveURL is just a convince kind of like array.reduce but you don’t have to use it.

That being said I find it fairly convenient way to avoid multiple passes over the graph.

I think what you’re disliking about the API is side effect of it being designed to traverse and bundle graph in a single pass, which in turn requires pipelining fetch, parse, remap, save operations such that as you descend it does all of them.

I don’t know if there’s a better way to go about it, but my experience so far had being it allows me to customize behavior very simply

Gozala commented 4 years ago

Alternative API could follow visitor pattern which accomplishes more or less same goals, but is more OOP style.

Current solution allows you to either use pushed base interface via resolveURL hook which is your visitor pattern I suppose but it also allows you to use pull based interface as shown in the example in the description.

Gozala commented 4 years ago

I do not know either what may be the best strategy. I suppose it helps if we exchange more details about our plans; and if we try to explain changes we make/need so they could be redone (‘rebased’, conceptually) if needed.

I am curious, what kind of changes do you expect the “& more” will contain?

I do not exactly know yet, but here is a bit of overview of the context I'm using freeze-dry right now https://github.com/gozala/artifacts

Readme does not quite talks about it yet, but streamingable web bundle formats are also on todo, meaning I'll need to start stream bundle as document is being parsed and bundled & ideally such that parallel requests to sub-resources could be streamed in parallel (not forcing them to queue up).

Another thing that I intend to change about current architecture is the source DOM tree mutation, that currently occurs during freeze-drying.

Finally I want to also add a way to plug things like CSS parsers which currently are baked in.

Hope this gives enough context of what's on my list.

Treora commented 4 years ago

Happy anniversary to this PR. :birthday: (just kidding, not proud of this..)

As discussed, I started with converting the project to typescript; took a while, but it’s pretty much done, see #52! (any comments there are welcome, it’s my first typescript project)

Next up is getting this functionality in, using this PR as reference but obviously writing everything anew.