N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

Added support for writing Request (library!) WARC records #15

Closed hyl closed 5 years ago

hyl commented 5 years ago

Hi! I've written a WARC writer for request, erm... requests. Due to the nature of WARC files and the opinionated defaults that request has, the request made (or the Request library defaults) need to be instantiated with the following options:

{
    resolveWithFullResponse: true, // get all request and response properties
    simple: false // return 4xx, 5xx etc as "successful" responses so we can WARC the result 
}

Again, I'm using this PR to try and kick off a wider discussion on this: the main thing that concerns me is the need to configure request in a certain manner for WARC writing to work. Could we optionally expose our own version of request from inside node-warc, as a RequestCap that is pre-configured to make the request in the right manner?
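To make the idea concrete, here is a minimal sketch of how such a pre-configured client could force the two options above. The helper name `withWarcOptions` and the `RequestOptions` type are hypothetical and not part of node-warc or request; only the two option names come from the snippet above.

```typescript
// Hypothetical helper: neither the name nor the type exists in node-warc or request.
interface RequestOptions {
  [key: string]: unknown;
}

// The two options the WARC writer depends on (taken from the snippet above).
const WARC_REQUIRED_OPTIONS = {
  resolveWithFullResponse: true, // expose the full request and response objects
  simple: false, // do not reject on 4xx/5xx so those responses can be archived too
} as const;

// Merge caller options with the required ones. The required options are
// spread last, so they win even if the caller passes conflicting values.
function withWarcOptions(options: RequestOptions = {}): RequestOptions {
  return { ...options, ...WARC_REQUIRED_OPTIONS };
}
```

The merged object could then be passed to the library's `defaults()` call, for example `rp.defaults(withWarcOptions({ timeout: 10000 }))`, so callers cannot accidentally switch the required options off.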

N0taN3rd commented 5 years ago

@hyl I believe, as you suggested, that a custom RequestCap to make the requests would be in order for this.

Though I wonder if it would make sense to combine RequestCap and a Writer for this purpose?

Rationale: The request library is a node "native" wrapper around HTTP (node <--> HTTP), not a wrapper around the CDP (node <--> CDP <--> browser <--> HTTP). Each HTTP method wrapper that request provides is essentially a generateWarcEntry, since we normally make the Network.getResponseBody / Network.getRequestPostData calls there anyway.

N0taN3rd commented 5 years ago

@hyl super big thanks for this and merging this PR finally!

mooniker commented 3 years ago

Is there an example of how to use the request lib? I'm trying to use node-warc with some existing code that uses Axios. I'm not following how to use this without the generateWarc method. Do I have to make my own capturer?

hyl commented 3 years ago

Hi @mooniker!

Here’s an example for RequestLibWARCWriter. I’m using request-promise in this example:

import { RequestLibWARCWriter } from 'node-warc';
import rp from 'request-promise';

const url = 'https://www.example.com';

// set up request client with some defaults so that we can access full request info 
const requestClient = rp.defaults({
    resolveWithFullResponse: true,
    simple: false,
    encoding: null
});

const resp = await requestClient({
    uri: url,
    method: 'GET',
    // kill the socket after 10s of inactivity
    timeout: 10000
}).catch((err: any) => {
    // console is used here so the snippet runs on its own; swap in your own logger
    console.error(`Could not download ${url}`);
    console.error(err.stack);
    return false;
});

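// resp is false when the request failed above, so only write a record for a real response
if (resp) {
    const requestWriter = new RequestLibWARCWriter();
    requestWriter.initWARC('/path/to/warcfile', true);
    await requestWriter.generateWarcEntry(resp);
}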

If you want to use Axios in place of node-request, I imagine a new writer class would have to be created, unless the response object from Axios is interchangeable with the one that comes from node-request. Hopefully RequestLibWARCWriter is a good starting point if you decide to go down that path!
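As a rough idea of what such a writer would need to do, here is a sketch that turns an Axios-shaped response into the raw HTTP response head that a WARC response record stores. It is not taken from node-warc: the `AxiosLikeResponse` interface is a local stand-in for the `status`, `statusText` and `headers` fields of an Axios response, `serializeResponseHead` is a hypothetical name, headers are assumed to be a plain object, and the HTTP version defaults to 1.1 because I'm assuming it isn't available on the response.

```typescript
// Local stand-in for the Axios response fields used here (no axios import needed).
interface AxiosLikeResponse {
  status: number;
  statusText: string;
  headers: Record<string, string | string[] | undefined>;
}

// Hypothetical helper: builds the status line plus header lines, each
// terminated by CRLF, followed by the blank line that ends an HTTP head.
function serializeResponseHead(resp: AxiosLikeResponse, httpVersion = '1.1'): string {
  const lines = [`HTTP/${httpVersion} ${resp.status} ${resp.statusText}`];
  for (const [name, value] of Object.entries(resp.headers)) {
    if (value === undefined) continue; // skip headers with no value
    // Multi-valued headers (e.g. set-cookie) become one line per value.
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) lines.push(`${name}: ${v}`);
  }
  return lines.join('\r\n') + '\r\n\r\n';
}
```

On the Axios side, the rough equivalents of the request options above would be `validateStatus: () => true` (so 4xx/5xx responses are not thrown) and `responseType: 'arraybuffer'` (so the body comes back as raw bytes). The serialized head and the body could then be handed to whatever record-writing methods the base writer class exposes.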

mooniker commented 3 years ago

@hyl Thanks heaps!

That initWARC method on the base class was exactly what I was missing.