WebMemex / freeze-dry

Snapshots a web page to get it as a static, self-contained HTML document.
https://freezedry.webmemex.org
The Unlicense

Is it possible to use freeze-dry from server? #57

Open em429 opened 2 years ago

em429 commented 2 years ago

Hi! Thank you for this awesome library!

I'm building a simple website archival API (currently it just submits URLs to selected archive sites) and I'd love to add freeze-dry to it. I'm relatively new to the JavaScript world though, so I'm a bit lost on how to approach this.

I understand freeze-dry runs in the browser context (?), so something like Playwright will be needed, which is what I've been trialing.

I tried to modify and run the Playwright tests in the 'customisation' branch as a hacky starting point, and I'm currently stuck with this error when running npm run test:

page.evaluate: ReferenceError: freezeDry is not defined

   > 17 |   const html = await page.evaluate('freezeDry(document, { now: new Date(1534615340948) })')
           |                           ^
      18 |   console.log(html)

em429 commented 2 years ago

I understand Snowpack is used to bundle freeze-dry for the Playwright tests; is there a direct way of doing this when using it as a module, rather than running Playwright in the test context?

Treora commented 2 years ago

Hi, glad you like freeze-dry. Running freeze-dry in a headless browser is indeed the solution; using a test framework like playwright may be unconventional, but seems at least worth a try (I was thinking of trying it too).

In the playwright-based tests I made, I had some trouble getting freezeDry to be available inside the page, and the quick and easy solution I chose was to just include freezeDry in each test page. It could be included in other ways, e.g. playwright’s evaluate function can be used:

const html = await page.evaluate(`
    (async () => {
      const freezeDryModule = await import('…somewhere…/freeze-dry/index.js')
      const freezeDry = freezeDryModule.default
      return await freezeDry()
    })()
  `);

This requires freeze-dry to be hosted somewhere, perhaps on a server running on localhost (like the Snowpack dev server used in the tests). Since that is a different origin than the page being snapshotted, the file would need to be served with a CORS header: Access-Control-Allow-Origin: *.
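
As a rough sketch of the localhost option, a few lines of Node can serve a bundled file with that header. The helper names (bundleHeaders, createBundleServer), the bundle path, and the port are all hypothetical, and this assumes a single-file ES module bundle of freeze-dry already exists:

```javascript
const http = require('http');
const fs = require('fs');

// Response headers for the bundle. The CORS header lets a page on any
// origin import the module; the content type marks it as JavaScript.
function bundleHeaders() {
  return {
    'Content-Type': 'text/javascript',
    'Access-Control-Allow-Origin': '*',
  };
}

// Builds (but does not start) a server that serves one file,
// the bundle at bundlePath, under /freeze-dry.es.js.
function createBundleServer(bundlePath) {
  return http.createServer((req, res) => {
    if (req.url !== '/freeze-dry.es.js') {
      res.writeHead(404);
      res.end();
      return;
    }
    fs.readFile(bundlePath, (err, data) => {
      if (err) {
        res.writeHead(500);
        res.end();
        return;
      }
      res.writeHead(200, bundleHeaders());
      res.end(data);
    });
  });
}

// Example start-up (path and port are assumptions):
// createBundleServer('./dist/freeze-dry.es.js').listen(8080);
```

With the server listening, the import(...) in the snippet above would point at http://localhost:8080/freeze-dry.es.js.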

Alternatively, you could try passing the whole freeze-dry code to page.evaluate(…) (possibly even by passing a data: URL to the import statement above).
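
The data: URL idea might look roughly like the following. This is an untested sketch: toModuleDataUrl and buildSnapshotExpression are hypothetical helpers, the bundle path is an assumption, and a page's Content Security Policy may still refuse to import data: URLs.

```javascript
const fs = require('fs');

// Encodes JavaScript module source as a data: URL that import() can load.
function toModuleDataUrl(source) {
  const base64 = Buffer.from(source, 'utf8').toString('base64');
  return `data:text/javascript;base64,${base64}`;
}

// Builds the expression string to hand to page.evaluate(...).
// JSON.stringify turns the URL into a safely quoted string literal.
function buildSnapshotExpression(moduleUrl) {
  return `(async () => {
    const freezeDryModule = await import(${JSON.stringify(moduleUrl)})
    const freezeDry = freezeDryModule.default
    return await freezeDry(document)
  })()`;
}

// Example use (bundle path is an assumption):
// const source = fs.readFileSync('./dist/freeze-dry.es.js', 'utf8');
// const html = await page.evaluate(buildSnapshotExpression(toModuleDataUrl(source)));
```

This avoids running a server, at the cost of sending the whole bundle with every evaluate call.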

In either case, you’d first need freeze-dry as a JS script/module. A tool like Vite can bundle the code into a single file, which could help here. I plan to make such a pre-bundled file available asap, for which I am considering swapping Snowpack for Vite (Vite is now also recommended by Snowpack, as the latter just announced it will no longer be maintained).

em429 commented 2 years ago

Thank you for the detailed answer and the code sample, got me started! :)

em429 commented 2 years ago

Success!! Thanks so much! I managed to get Vite bundling going and got a working CLI PoC, for example:

node freeze.js https://100r.co > 100r.html

This assumes the Vite-bundled ES module is served at http://localhost/freeze-dry.es.js. I'm still looking into how to pass the bundled file to page.evaluate directly.

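// Usage: node freeze.js <url> > snapshot.html
const puppeteer = require('puppeteer');

(async () => {
  const url = process.argv[2];
  if (!url) {
    console.error('Usage: node freeze.js <url>');
    process.exit(1);
  }

  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);

    // This string is evaluated inside the page, where freeze-dry has to run.
    const html = await page.evaluate(`
      (async () => {
        const freezeDryModule = await import('http://localhost/freeze-dry.es.js')
        const freezeDry = freezeDryModule.default
        return await freezeDry(document)
      })()
    `);

    process.stdout.write(html);
  } finally {
    // Close the browser even if navigation or the snapshot fails.
    await browser.close();
  }
})();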

I'll try sending in a PR with the Vite build soon.

trenta3 commented 2 years ago

Hi @qirpi, can I ask you to share a working way to use puppeteer and freezeDry to archive webpages from Node? I tried the script you provided in the reply above, together with the freeze-dry.es.js from PR #58, but for me it fails with the following error:

Error: Evaluation failed: TypeError: Object prototype may only be an Object or null: undefined
    at Function.create (<anonymous>)
    at _inheritsLoose (http://localhost/freeze-dry.es.js:3459:33)
    at http://localhost/freeze-dry.es.js:3464:5
    at http://localhost/freeze-dry.es.js:3485:4
    at http://localhost/freeze-dry.es.js:3489:3
    at ExecutionContext._ExecutionContext_evaluate (file:///XXX/node_modules/puppeteer/lib/esm/puppeteer/common/ExecutionContext.js:231:19)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async ExecutionContext.evaluate (file:///XXX/node_modules/puppeteer/lib/esm/puppeteer/common/ExecutionContext.js:114:16)
    at async main (file:///XXX/archive.js:20:18)

Thank you!