cloudfour / lighthouse-parade

A Node.js command line tool that crawls a domain and gathers lighthouse performance data for every page.
MIT License

add option to limit number of pages crawled? #71

Open techieshark opened 3 years ago

techieshark commented 3 years ago

Hey, this is a cool tool. Here's a feature idea, assuming you're open to it.

Currently one can use:

--max-crawl-depth  2

to get, say, the index page plus its directly linked pages.

But maybe there are a lot of linked pages, and you just need a representative sample: more than one page, but not tons of pages.

So maybe another option like:

--max-crawled-pages N # or just --max-pages ?

and the crawler stops after it exhausts all pages allowed by the other options, or after crawling N pages, whichever comes first.

One might then use it like:

lighthouse-parade --max-crawl-depth 2 --max-crawled-pages 20 example.com
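The proposed stopping rule could be sketched like this (a hypothetical illustration, not lighthouse-parade's actual crawler; `crawl` and its queue are invented names). The queue is assumed to already be constrained by the other flags such as `--max-crawl-depth`, and the page cap simply ends the walk early:

```javascript
// Hypothetical sketch of the proposed --max-crawled-pages behavior.
// The crawler visits pages until the queue is empty OR the cap is hit,
// whichever comes first.
const crawl = (queue, maxCrawledPages) => {
  const visited = [];
  while (queue.length > 0 && visited.length < maxCrawledPages) {
    visited.push(queue.shift()); // visit the next queued page
  }
  return visited;
};

console.log(crawl(['/', '/a', '/b', '/c'], 3)); // → [ '/', '/a', '/b' ]
```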
calebeby commented 3 years ago

Hmm, this is an interesting suggestion! One concern is that the pages that are crawled would be nondeterministic, i.e. if you ran lighthouse-parade twice with the same flags it could crawl a different set of pages because of pages loading at different speeds, throttling, etc. The "first n pages" is not necessarily a representative sample of all the pages on the site. Do you have a suggestion of how to make the crawled pages more deterministic & representative of the whole site?

techieshark commented 3 years ago

Interesting challenge; that hadn't occurred to me.

For my use, I'd only ever want the page I point it at, plus some (and ideally yes, always the same) set of linked pages from that page. So like, index page plus 20 pages linked off it. In that case, it seems like it would always be deterministic to the extent the page itself isn't changing.

Given linked_page_limit = --max-linked-pages N (or --max-leaf/outer-pages N):

  1. fetch the index page
  2. let all_index_links be an array of all links on the index page (removing duplicates).
  3. let index_links be the first linked_page_limit items in all_index_links
  4. fetch all pages in index_links
  5. run lighthouse on the array [index page, ... index_links]
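The link-selection part of the steps above could be sketched as a pure function (a hypothetical helper, not part of lighthouse-parade; the regex-based link extraction is a simplification for illustration). Because it dedupes in document order and then takes the first `limit` entries, repeated runs against the same HTML always yield the same set, which addresses the determinism concern:

```javascript
// Hypothetical sketch of steps 2-3: collect links from the index page's
// HTML, remove duplicates (preserving document order), and keep only the
// first `limit` of them.
const extractLimitedLinks = (html, limit) => {
  const links = [...html.matchAll(/href="([^"]+)"/g)].map((m) => m[1]);
  const unique = [...new Set(links)]; // dedupe, keeping first occurrence
  return unique.slice(0, limit); // first `limit` links only
};

const html = `
  <a href="/about">About</a>
  <a href="/blog">Blog</a>
  <a href="/about">About (footer link)</a>
  <a href="/contact">Contact</a>
`;
console.log(extractLimitedLinks(html, 2)); // → [ '/about', '/blog' ]
```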

I could see the complication increasing if this were used with a max crawl depth of more than 1, so maybe they're just mutually exclusive options? One would either specify to crawl some fixed depth entirely, or use this "index page plus N children" mode.

OTOH, yeah if it is going to add too much complexity perhaps it's not necessary just for my use case. (Maybe if I could just run lighthouse parade multiple times, specifying a single page each time, and have those results all lumped into the same csv/report that could do the trick?)

mgifford commented 1 year ago

Just wanted to voice support for this. --max-crawl-depth 2 on one site gives me 50 pages. With --max-crawl-depth 3, I stopped it somewhere after 2k pages.

There should be somewhere in between.

calebeby commented 1 year ago

@mgifford the new version (currently on the next branch) will support stopping the command with ctrl-c once you have enough output; the results gathered up to that point will all be saved, so you can stop it at any point you want.

mgifford commented 1 year ago

Excellent. Happy to hear this.

mgifford commented 1 year ago

@calebeby what is the best way to test with the next branch?

I'm currently running with: npx lighthouse-parade https://www.example.com ./lighthouse-parade-data --max-crawl-depth 3

calebeby commented 1 year ago

Hi @mgifford! I published a beta of it on the next tag on npm: https://www.npmjs.com/package/lighthouse-parade?activeTab=versions. You can install it with npm i -g lighthouse-parade@next, or you can use it through npx like this: npx lighthouse-parade@next https://www.example.com/ ./lighthouse-parade-data --max-crawl-depth 3. I have been meaning to finalize the release for quite a while now, but have been super busy with school.

Let me know if you run into anything else!

mgifford commented 1 year ago

That's great. Might want to just add that to https://github.com/cloudfour/lighthouse-parade

Thanks!