The metadata has a `tested_url` field which has the page URL independent of anything the agent might do (the fixes to report the URL correctly are also in place, but the metadata is safest).
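For reference, a minimal sketch of pulling that field out of a HAR file, assuming the metadata object sits at the top level of the HAR JSON under a `metadata` key (the exact layout and field names here are assumptions, not confirmed from the pipeline code):

```python
import json

def get_tested_url(har_path):
    """Return the page URL recorded by the crawler, independent of any
    redirects or URL rewrites the test agent may have applied."""
    with open(har_path) as f:
        har = json.load(f)
    # Assumed location of the crawl metadata; adjust to the actual HAR layout.
    return har.get("metadata", {}).get("tested_url")

print(get_tested_url("example.har"))
```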
Lighthouse is updating to 9.6 today. Is it possible to update the test agents before the crawl reruns?
They will auto update when they spin up as long as the npm stable package is updated
@pmeenan: the crawl should be ready to restart when you see this in the morning (Thursday the 12th).
@giancarloaf and I went through the remaining TODO items at the top of this issue and we should be good to go. I left the "flush Pub/Sub queue" one unchecked because we were still seeing some lingering messages coming through from the GCS backup of the first May crawl. @giancarloaf will be monitoring the Pub/Sub messages tonight to ensure that the queue is completely flushed by morning. (If not, at worst we'll have some summary data from both crawls in BQ, which we can clear out in SQL as needed.)
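For anyone checking whether the queue has actually drained, something like the sketch below works against the Cloud Monitoring API. The project and subscription names here are placeholders, not the real ones:

```python
import time
from google.cloud import monitoring_v3

PROJECT = "my-project"        # placeholder
SUBSCRIPTION = "my-har-sub"   # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}  # last 10 min
)

# Query the undelivered-message backlog for the subscription.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT}",
        "filter": (
            'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id="{SUBSCRIPTION}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    # Points are returned newest-first.
    print("undelivered messages:", series.points[0].value.int64_value)
```

A sustained zero here would confirm the queue is flushed before restarting the crawl.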
Update: the dashboard is still showing many messages coming through.
Update: still going strong as of 7am... I don't think we're able to start the crawl until that settles down :(
Update: a rogue process kept moving HAR files between crawls/ subdirectories and triggering Pub/Sub messages. @giancarloaf killed the process and the noise has subsided. Should be good to start the crawl.
@rviscomi @giancarloaf, It looks like the android and chrome May 1 crawls/ directories have tests in them (from the rogue process moving things around?). Do they need to be moved into the backup folder first?
Yep, this is currently in progress using the worker VM. Rick is seeing a very slow transfer rate (~100K files per hour) and has decided it would be best to start a new crawl under a different name, to be renamed later.
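For context, the move is essentially the loop below (bucket and prefix names are placeholders, not the actual ones):

```python
from google.cloud import storage

BUCKET = "my-bucket"                         # placeholder
SRC_PREFIX = "crawls/chrome-May_1_2022/"     # placeholder
DST_PREFIX = "crawls_backup/chrome-May_1_2022/"  # placeholder

client = storage.Client()
bucket = client.bucket(BUCKET)

for blob in client.list_blobs(BUCKET, prefix=SRC_PREFIX):
    new_name = DST_PREFIX + blob.name[len(SRC_PREFIX):]
    # GCS has no server-side "move" for individual objects: copy, then delete.
    bucket.copy_blob(blob, bucket, new_name)
    blob.delete()
```

Moving objects one at a time like this is why the transfer rate is so low, which is what motivated starting a fresh crawl instead of waiting for the backup to finish.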
I will also be restarting the streaming pipeline to incorporate changes from #49 merged earlier today.
Closing this out. We're rerunning the crawl with today's date to avoid overwriting any of the previous data.
`url`, `pageid`, and `requestid` fields from the metadata above. @giancarloaf Anything else?