HTTPArchive / data-pipeline

The new HTTP Archive data pipeline built entirely on GCP
Apache License 2.0

Rerun 2022_05_01 #44

Closed rviscomi closed 2 years ago

rviscomi commented 2 years ago

Anything else?

pmeenan commented 2 years ago

The metadata has a tested_url field that contains the page URL independent of anything the agent might do (the fixes to report the URL correctly are also in place, but the metadata is the safest source).
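For illustration, a minimal sketch of preferring the metadata's tested_url over the agent-reported URL when reading a HAR. The top-level `_metadata` key and field layout here are assumptions for the sketch, not the pipeline's confirmed HAR structure:

```python
import json

def get_page_url(har_path: str) -> str:
    """Prefer the metadata's tested_url; fall back to the agent-reported URL."""
    with open(har_path) as f:
        har = json.load(f)

    # Assumed location of the crawl metadata; adjust to the real HAR layout.
    metadata = har.get("_metadata", {})
    if metadata.get("tested_url"):
        return metadata["tested_url"]

    # Fallback: the URL of the first request in the HAR log.
    return har["log"]["entries"][0]["request"]["url"]
```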

rviscomi commented 2 years ago

Lighthouse is updating to 9.6 today. Is it possible to update the test agents before the crawl reruns?

pmeenan commented 2 years ago

They will auto-update when they spin up, as long as the npm stable package has been updated.


rviscomi commented 2 years ago

@pmeenan: the crawl should be ready to restart when you see this in the morning (Thursday the 12th).

@giancarloaf and I went through the remaining TODO items at the top of this issue, and we should be good to go. I left the "flush Pub/Sub queue" item unchecked because we were still seeing some lingering messages coming through from the GCS backup of the first May crawl. @giancarloaf will be monitoring the Pub/Sub messages tonight to ensure that the queue is completely flushed by morning. (If not, at worst we'll have some summary data from both crawls in BQ, which we can clear out in SQL as needed.)
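As a rough sketch of that kind of monitoring, the subscription backlog can be read from the Cloud Monitoring num_undelivered_messages metric. The project and subscription IDs below are placeholders, not the pipeline's actual resource names:

```python
import time
from google.cloud import monitoring_v3

PROJECT = "my-gcp-project"     # placeholder project ID
SUBSCRIPTION = "har-gcs-sub"   # placeholder subscription ID

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

# Pull the last 10 minutes of the undelivered-message gauge for the subscription.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{SUBSCRIPTION}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    latest = series.points[0].value.int64_value  # points arrive newest-first
    print(f"{SUBSCRIPTION}: {latest} undelivered messages")
```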

Update: the dashboard is still showing many messages coming through:

(screenshot: Pub/Sub dashboard showing a steady stream of incoming messages)

Update: still going strong as of 7am... I don't think we can start the crawl until that settles down :(

Update: a rogue process kept moving HAR files between crawls/ subdirectories and triggering Pub/Sub messages. @giancarloaf killed the process and the noise has subsided. We should be good to start the crawl.

pmeenan commented 2 years ago

@rviscomi @giancarloaf, it looks like the Android and Chrome May 1 crawls/ directories have tests in them (from the rogue process moving things around?). Do they need to be moved into the backup folder first?

giancarloaf commented 2 years ago

> @rviscomi @giancarloaf, it looks like the Android and Chrome May 1 crawls/ directories have tests in them (from the rogue process moving things around?). Do they need to be moved into the backup folder first?

Yep, this is currently in progress using the worker VM. Rick is seeing a very slow transfer rate (~100K files per hour) and has decided it would be best to start a new crawl under a different name, to be renamed later.
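For context, GCS has no server-side rename, so moving objects between prefixes is a per-object copy followed by a delete, which is consistent with the slow per-file rate. A minimal sketch with placeholder project, bucket, and prefix names:

```python
from google.cloud import storage

client = storage.Client(project="my-gcp-project")      # placeholder project ID
bucket = client.bucket("my-crawls-bucket")              # placeholder bucket name
src_prefix = "crawls/chrome-May_1_2022/"                # placeholder prefixes
dst_prefix = "crawls_backup/chrome-May_1_2022/"

# GCS has no rename: each "move" is a copy into the new prefix, then a delete.
for blob in client.list_blobs(bucket, prefix=src_prefix):
    new_name = dst_prefix + blob.name[len(src_prefix):]
    bucket.copy_blob(blob, bucket, new_name)
    blob.delete()
```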

I will also be restarting the streaming pipeline to incorporate changes from #49 merged earlier today.

rviscomi commented 2 years ago

Closing this out. We're rerunning the crawl with today's date to avoid overwriting any of the previous data.