N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

Investigate generating WARCS through headless Chromium / puppeteer #2

Closed: floledermann closed this issue 5 years ago

floledermann commented 6 years ago

https://github.com/GoogleChrome/puppeteer/ seems to be the hot thing today for rendering web pages in headless/remote-controlled environments. It would be great to add this as an option for generating WARCs, i.e. a PuppeteerWARCGenerator. One main advantage is that it installs seamlessly through npm and runs on headless boxes, so the setup effort is minimized.

I will investigate this in my current project and may come back with a PR. I just wanted to collect any thoughts that may already have been given to this.

N0taN3rd commented 6 years ago

@floledermann a puppeteer PR would be awesome and most welcome! When node-warc was released, puppeteer had not been publicly announced, and since then I have not had the time to include it. But I have looked at its API and it should not be hard to add.

Currently, node-warc backs two other projects, Squidwarc (an archival crawler that uses Chrome with or without a head) and WAIL, which may give some insight into how to accomplish this.

It would also be awesome to use puppeteer as a backend for Squidwarc, since Squidwarc relies on the host system to provide Chrome, while puppeteer automatically downloads Chromium for you.

N0taN3rd commented 6 years ago

@floledermann per #4, is there anything node-warc can do to help with this effort, if it is still underway?

N0taN3rd commented 5 years ago

@floledermann node-warc now supports puppeteer (via both CDPSession and page.on('request')). I will close this issue once I publish this update to npm. However, you can use it today via yarn add https://github.com/N0taN3rd/node-warc.git
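
For readers unfamiliar with the event-listener route mentioned above, the following is a minimal sketch of the general pattern: listen for `request` events on a page object, collect them, and serialize each one as a WARC `request` record. It is not node-warc's implementation or API. `RequestCollector` and `toWarcRequestRecord` are hypothetical names, and a plain Node `EventEmitter` stands in for a puppeteer page so the example runs without a browser. The stand-in request objects mirror the `url()`, `method()` and `headers()` accessors of puppeteer's request objects.

```javascript
// Sketch only: RequestCollector and toWarcRequestRecord are hypothetical
// helpers for illustration, not part of node-warc or puppeteer.
const { EventEmitter } = require('events')
const { randomUUID } = require('crypto')

const CRLF = '\r\n'

// Collects every object emitted on the page's 'request' event.
class RequestCollector {
  constructor (page) {
    this.page = page
    this.requests = []
    this.onRequest = request => { this.requests.push(request) }
  }

  start () { this.page.on('request', this.onRequest) }
  stop () { this.page.off('request', this.onRequest) }
}

// Serializes one captured request as a WARC/1.0 'request' record.
function toWarcRequestRecord (request, date = new Date()) {
  const url = new URL(request.url())
  // Skip any captured host header so it is not written twice.
  const headerLines = Object.entries(request.headers())
    .filter(([name]) => name.toLowerCase() !== 'host')
    .map(([name, value]) => `${name}: ${value}`)
  const httpBlock = [
    `${request.method()} ${url.pathname}${url.search} HTTP/1.1`,
    `host: ${url.host}`,
    ...headerLines
  ].join(CRLF) + CRLF + CRLF
  const warcHeader = [
    'WARC/1.0',
    'WARC-Type: request',
    `WARC-Target-URI: ${url.href}`,
    // WARC 1.0 dates carry no fractional seconds.
    `WARC-Date: ${date.toISOString().replace(/\.\d{3}Z$/, 'Z')}`,
    `WARC-Record-ID: <urn:uuid:${randomUUID()}>`,
    'Content-Type: application/http; msgtype=request',
    `Content-Length: ${Buffer.byteLength(httpBlock)}`
  ].join(CRLF)
  // A record is header, blank line, content block, then two CRLFs.
  return warcHeader + CRLF + CRLF + httpBlock + CRLF + CRLF
}

// Stand-in for a puppeteer page (assumption: only on/off/emit are needed).
const fakePage = new EventEmitter()
const collector = new RequestCollector(fakePage)
const fakeRequest = {
  url: () => 'http://example.com/index.html?x=1',
  method: () => 'GET',
  headers: () => ({ accept: 'text/html' })
}

collector.start()
fakePage.emit('request', fakeRequest)
collector.stop()
fakePage.emit('request', fakeRequest) // ignored: the collector has stopped

const records = collector.requests.map(r => toWarcRequestRecord(r, new Date(0)))
process.stdout.write(records.join(''))
```

With a real puppeteer page, the collector would be started before `page.goto(...)`, and response records would also be needed for a replayable archive. This sketch covers only the request side.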

N0taN3rd commented 5 years ago

node-warc v3.1.0 is now live on npm