cyrus-and / chrome-har-capturer

Capture HAR files from a Chrome instance
MIT License

Not capturing all page requests #59

Closed smenzer closed 6 years ago

smenzer commented 6 years ago

I've implemented the chrome-har-capturer to do a pretty simple scan of a website. However, it doesn't seem to be capturing all requests as compared to when I browse to the same webpage myself in Chrome.

Here's the relevant part of my code:

const chromeLauncher = require('chrome-launcher');
const chc = require('chrome-har-capturer');
const fs = require('fs');
const outputFileName = './output/' + new Date().toISOString() + '.har';
const pageLoadTimeout = 15000;

chromeLauncher.launch({
    chromeFlags: [
        '--disable-gpu',
        '--headless'
    ]
}).then(chrome => {
    var sites = getSiteList(); // returns an array of urls to scan
    var c = chc.run(sites, {
        host: chrome.host,
        port: chrome.port,
        timeout: pageLoadTimeout
    });

    c.on('har', function(har) {
        fs.writeFileSync(outputFileName, JSON.stringify(har), 'utf8');
        chrome.kill();
    });
});

If I use the following url as the only site to scan - https://www.msn.com/fr-fr/divertissement/celebrity/photos-comme-adriana-karembeu-ces-femmes-sont-devenues-m%C3%A8re-apr%C3%A8s-45-ans/ss-AAvhng6?li=BBoJIji - I get a HAR with 20 requests (viewing it at http://www.softwareishard.com/har/viewer/); while if I manually browse to the website in Chrome Incognito mode, I see well over 300 requests being made. Is there something wrong with my code?

Is there a way to tell the run function to wait a certain amount of time before finishing? I think part of the issue is that the site takes several seconds to load, while the HAR is being generated pretty quickly, so it's just not waiting long enough. I've tried the timeout option as shown above, but that doesn't seem to make much of a difference - the code completes much faster than the 15 seconds I'm trying to wait.

I've attached a HAR file that I just generated with this code: 2018-04-26T08:30:13.965Z.har.zip

Thanks! Scott

cyrus-and commented 6 years ago

This happens because most of the requests start asynchronously after the page load event and thus are not captured. You need to use the -g,--grace option of the command line utility, or implement something similar if you plan to use it as a library. The timeout option is only the maximum time allowed for loading a page before giving up; it is not a minimum wait.

In short, you need something like:

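function postHook(url, client) {
    return new Promise((fulfill, reject) => {
        // allow the user specified grace time
        // (`program` is the command line utility's parsed options object;
        // in library code, replace `program.grace` with your own delay in ms)
        setTimeout(fulfill, program.grace || 0);
    });
}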

var c = chc.run(sites, {
    host: chrome.host,
    port: chrome.port,
    postHook
});
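For library use, where no `program` object exists, the same idea can be sketched as a small factory that takes the grace period as a parameter. The name `makeGraceHook` and the 5000 ms value in the usage comment are assumptions for illustration, not part of chrome-har-capturer:

```javascript
// Build a postHook that waits `graceMs` milliseconds after the page load
// event, so that requests started after the load event can still be recorded.
// `makeGraceHook` is a hypothetical helper name, not a library API.
function makeGraceHook(graceMs) {
    return function postHook(url, client) {
        return new Promise((fulfill) => {
            setTimeout(fulfill, graceMs);
        });
    };
}

// Possible usage with the options from the question (values are examples):
// var c = chc.run(sites, {
//     host: chrome.host,
//     port: chrome.port,
//     timeout: 15000,
//     postHook: makeGraceHook(5000)
// });
```

A fixed delay is simple, but it always costs the full grace period per page, even when the page goes quiet sooner.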

smenzer commented 6 years ago

Thank you...that worked exactly as I needed!

flotwig commented 5 years ago

This is perfect for what I needed. FYI, if you're using Bluebird Promises, there's an even simpler way to write it:

var c = chc.run(sites, {
    host: chrome.host,
    port: chrome.port,
    postHook: () => Promise.delay(WAIT_MS)
});

EDIT: Even better, you can use the CDP instance to wait until there's no network traffic being sent:

var c = chc.run(sites, {
    host: chrome.host,
    port: chrome.port,
    postHook: (_, cdp) => {
      let timeout;
      return new Promise((resolve) => {
        // start the timer right away so the hook still resolves
        // if no further data arrives after the load event
        timeout = setTimeout(resolve, 1000);
        cdp.on('event', (message) => {
          if (message.method === 'Network.dataReceived') {
            // reset timer
            clearTimeout(timeout);
            timeout = setTimeout(resolve, 1000);
          }
        });
      });
    },
});
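A page that polls or streams continuously may never go quiet, so the idle wait can be given an upper bound. The following is a minimal sketch under a few assumptions: `waitForNetworkIdle`, `idleMs` and `maxMs` are hypothetical names, the emitter is expected to emit `'event'` with a `{ method }` message as in the snippet above, and any `Network.*` event (not only `Network.dataReceived`) counts as activity:

```javascript
// Resolve once no Network.* event has been seen for `idleMs` milliseconds,
// or after `maxMs` milliseconds in total, whichever comes first.
// The listener is removed on completion so it does not leak across pages.
function waitForNetworkIdle(emitter, idleMs, maxMs) {
    return new Promise((resolve) => {
        let idleTimer;
        const finish = () => {
            clearTimeout(idleTimer);
            clearTimeout(maxTimer);
            emitter.removeListener('event', onEvent);
            resolve();
        };
        const onEvent = (message) => {
            if (message.method.startsWith('Network.')) {
                // activity seen: restart the idle countdown
                clearTimeout(idleTimer);
                idleTimer = setTimeout(finish, idleMs);
            }
        };
        // hard cap on the total wait
        const maxTimer = setTimeout(finish, maxMs);
        // start counting immediately in case the page is already quiet
        idleTimer = setTimeout(finish, idleMs);
        emitter.on('event', onEvent);
    });
}

// Possible usage as a postHook (values are examples):
// postHook: (_, cdp) => waitForNetworkIdle(cdp, 1000, 10000)
```

The cap trades completeness for a guaranteed finish: a very chatty page is cut off at `maxMs` rather than holding up the run.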