Closed floledermann closed 5 years ago
@floledermann a puppeteer PR would be awesome and most welcome! When node-warc was released, puppeteer had not been publicly announced, and since then I have not had the time to include it. But I have looked at its API and it should not be hard to add.
Currently, node-warc backs two other projects, Squidwarc (an archival crawler using Chrome with or without a head) and WAIL, which may give some insight into how to accomplish this.
It would also be awesome to use puppeteer as a backend for Squidwarc, since Squidwarc relies on the host system to provide Chrome, while puppeteer automatically downloads Chromium for you.
@floledermann per #4, is there anything node-warc can do to help with this effort, if it is still underway?
@floledermann node-warc now supports puppeteer (via both CDPSession and page.on('request')).
I will be closing this issue once I publish this update to npm.
However, you can use this today via yarn add https://github.com/N0taN3rd/node-warc.git
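For anyone curious what the page.on route looks like in practice, here is a minimal sketch of listening for puppeteer's 'request' event and recording each request. It is not node-warc's actual implementation. The collectRequests helper and the fakePage stand-in are hypothetical names made up for illustration; a real run would pass a Page from puppeteer's browser.newPage() instead of the EventEmitter used here.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper (not part of node-warc): attach a 'request' listener
// to a puppeteer-style page and record the basics of every request seen.
function collectRequests(page) {
  const seen = [];
  const onRequest = (request) => {
    seen.push({
      url: request.url(),
      method: request.method(),
      headers: request.headers(),
    });
  };
  page.on('request', onRequest);
  return {
    seen,
    // Detach the listener once the capture is done.
    stop: () => page.off('request', onRequest),
  };
}

// Assumption: a plain EventEmitter stands in for a puppeteer Page so the
// sketch runs without launching a browser.
const fakePage = new EventEmitter();
const capture = collectRequests(fakePage);

// Simulate the event puppeteer emits for each outgoing request.
fakePage.emit('request', {
  url: () => 'https://example.com/',
  method: () => 'GET',
  headers: () => ({ accept: 'text/html' }),
});

capture.stop();
console.log(capture.seen.length); // prints 1
```

Turning the recorded entries into WARC records is the part node-warc handles and is deliberately left out here.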
https://github.com/GoogleChrome/puppeteer/ seems to be the hot thing today for rendering web pages in headless/remote-controlled environments. It would be great to add this as an option to generate WARCs, i.e. a PuppeteerWARCGenerator. One main advantage is that it installs completely seamlessly through npm and runs on headless boxes, so the setup effort is minimized.
I will investigate this in my current project and may come back with a PR - just wanted to collect any thoughts that may have been given to this already beforehand.