WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense
271 stars 18 forks source link

make resource fetcher configurable #37

Closed Gozala closed 5 years ago

Gozala commented 6 years ago

This change introduces fetchResource option, which by default is self.fetch (to match current behavior). This allows passing custom resource fetching mechanism that can overcome CORS limitations imposed by the host. Specifically it enables:

  1. Route resource fetches through own CORS proxy to overcome lack of Access-Control-Headers.
  2. Allows one to use alternate resource fetching mechanism e.g. Beakers globalFetch.
Treora commented 6 years ago

Thanks a lot for moving this forward! I will have to take a better look at this, it may have to wait until the weekend.

Treora commented 5 years ago

Sorry it took a while to get around to this. Jotting down my current thoughts, will try think about it again this weekend.

Passing a custom function that is API-compatible with fetch feels like a natural solution; however this API is rather big, while we only really need a blob and the final destination url. It would be fine if you want to just swap in another fetch implementation or arguments, but I think there are also cases where more customisation is desired.

For example, one use case is to have fetch be executed in the background script of a browser extension, while freeze-dry runs in the content script (see issue #7). However, since a Response cannot be passed between the scripts, I would have to recreate a fake Response object in the content script. Of this Response only the .url .blob() and .text() attributes are used (of which the last could be dropped, a blob suffices). Annoyingly using new Response(blob) would not work, as the url cannot be set manually..

I could try use something like comlink to solve this issue around non-serialisability of objects with methods; but there may be cases where one wants to implement a full custom fetcher using e.g. local storage instead of an actual fetch implementation.

Therefore, I wonder if it is worth refactoring a little, and have our custom fetcher have a simpler API, returning e.g. just an object { blob, url }. Or we could support both this simple replacement of the fetch implementation, and a way to replace the whole fetching logic, but that feels like overdoing it.

TL;DR: After writing this I am most tempted to try refactor fetchSubresource to return a { blob, url } object, and allow the user to substitute fetchSubresource as a whole. Happy to hear if you have thoughts on this.

Gozala commented 5 years ago

Sorry it took a while to get around to this. Jotting down my current thoughts, will try think about it again this weekend.

No worries I’m pretty happy using my forked version. I’m actually amazed how well it works, thank you for putting such great tool together!!!

Passing a custom function that is API-compatible with fetch feels like a natural solution; however this API is rather big, while we only really need a blob and the final destination url.

I was actually wondering why final url was needed ?

It would be fine if you want to just swap in another fetch implementation or arguments, but I think there are also cases where more customisation is desired.

For example, one use case is to have fetch be executed in the background script of a browser extension, while freeze-dry runs in the content script (see issue #7). However, since a Response cannot be passed between the scripts, I would have to recreate a fake Response object in the content script. Of this Response only the .url .blob() and .text() attributes are used (of which the last could be dropped, a blob suffices). Annoyingly using new Response(blob) would not work, as the url cannot be set manually..

As a matter of fact thats is exactly what I’m doing :) And got bit by it too. That being said it’s easy to overcome as I described here.

https://github.com/beakerbrowser/beaker-core/issues/16

So I wouldn’t really worry about that too much. That being said, if implementation actually just uses url & blob you could juet request that & maybe even make url optional falling back to requested url.

I could try use something like comlink to solve this issue around non-serialisability of objects with methods; but there may be cases where one wants to implement a full custom fetcher using e.g. local storage instead of an actual fetch implementation.

Again I think creating Request instances is fairly easy & url limitation is annoying but also fairly easy to address, you could probably even expose function that takes Request args creates instances and does ser url so that it’s not even annoying. Another thing I’d suggest to lax url requirement so that plan Request will do & use requested url if returned instance has none.

Therefore, I wonder if it is worth refactoring a little, and have our custom fetcher have a simpler API, returning e.g. just an object { blob, url }. Or we could support both this simple replacement of the fetch implementation, and a way to replace the whole fetching logic, but that feels like overdoing it.

It’s up to you, but I wouldn’t worry until it’s a problem. It’s also worse noting that Request could be constructed from streams and forcing blobs would impose unnecessary buffering.

TL;DR: After writing this I am most tempted to try refactor fetchSubresource to return a { blob, url } object, and allow the user to substitute fetchSubresource as a whole. Happy to hear if you have thoughts on this.

Gozala commented 5 years ago

so to summarize I’d suggest following:

  1. Make returned resource url an optional and if non provided assume no redirects occurred.
    1. Expose freezeDry.Request subclass of Request with compatible API except also accepting & setting url option, so when needed it could be used hassle free.
Treora commented 5 years ago

I was actually wondering why final url was needed ?

Because it is the base URL for relative URLs inside the resource.

It’s also worse noting that Request could be constructed from streams and forcing blobs would impose unnecessary buffering.

Well noted; though I fear I have forced buffering into the design of freeze-dry anyhow (we crawl down the resource tree, and only when the whole tree is available we walk up again from the leaves).

I feel it is nice to allow the callback to return a { blob, url }; this object actually matches with what is called a 'resource' throughout the code base. Possibly useful in the future, this approach would also allow the callback to add some custom field to the object (e.g. { blob, url, note }), which it can then read again in a custom subresourceToUrl that would receive the resource (discussed in #8).

Nevertheless, though I usually like to have APIs with a single way of using it, in this case it seems trivial to allow for either way: we just check if .blob is a function, and if so we call and await it:

async function fetchSubresource(url, options) {
    const fetchFunction = options.fetchResource || self.fetch
    // TODO investigate whether we should supply origin, credentials, ...
    const resourceOrResponse = await fetchFunction(url, {
        cache: 'force-cache',
        redirect: 'follow',
    })
    const resource = {
        // If we got a Response, we wait for the content to arrive.
        blob: typeof resourceOrResponse.blob === 'function'
            ? await resourceOrResponse.blob()
            : resourceOrResponse.blob,
        // Read the final URL of the resource (after any redirects), or fall back to the known URL.
        url: resourceOrResponse.url || url,
    }
    return resource
}

I also added the .url fallback you suggested. I am not certain about this though, as it might lead to people forgetting to provide the final URL, and not noticing their bug until some website uses relative URLs in a redirected stylesheet.

Treora commented 5 years ago

I moved forward with the approach explained in my previous message, except I reverted the .url fallback for the mentioned fear that it would let bugs go unnoticed.

I merged into master, but feel free to still suggest changes if anything could be improved.

@Gozala: For clarity, could you confirm you are okay with waiving any copyrights on your contribution, to publish it in the public domain?

Gozala commented 5 years ago

@Gozala: For clarity, could you confirm you are okay with waiving any copyrights on your contribution, to publish it in the public domain?

Yes, I'm ok with that

Gozala commented 5 years ago

I moved forward with the approach explained in my previous message, except I reverted the .url fallback for the mentioned fear that it would let bugs go unnoticed.

Sounds good! I also came around to agreeing that if it's better to require .url instead of just assuming it's the same as requested.