Closed github-actions[bot] closed 2 years ago
Note we're still missing summary_pages.2022_05_01_mobile.
@rviscomi I've lost track of where we ended up with the first May run. Is that gone now? Or is it still available?
Note, this is not a rush and can wait until your return. But when the June crawl finishes, the reports will try to run for any missing data (i.e. May, mid-May and June) and so might fill in weird values if the May runs are in such a bad state.
Yeah 2022_05_01 should only be the home pages from the 2022_05_12 crawl. The action item in https://github.com/HTTPArchive/data-pipeline/issues/72#issuecomment-1141359954 is to regenerate the summary tables for the 2022_05_12 mobile crawl, so once that's complete we can filter it down to the home page data and alias to 2022_05_01 for the reporting.
Ah OK. So we've thrown away 2022_05_01 completely and back-populated it from 2022_05_12?
Only thing is the 2022_05_12 dataset will still show up as another point in the graph (with the same values as 2022_05_01 if they are a complete copy).
Or is the plan to drop the 2022_05_12 tables after back-populating 2022_05_01?
Yeah the first run of 2022_05_01 is no longer around because it had bad url values. We're backdating home page data from the 2022_05_12 crawl instead. More info about the migration plan to secondary pages in this issue: https://github.com/HTTPArchive/data-pipeline/issues/51. In short:
- all.pages and all.requests tables should combine home and secondary pages
- all pipeline is still WIP so 2022_05_12 is sticking around to give everyone early access to secondary page data
@tunetheweb would you be able to regenerate the reports now that the 2022_05_01 tables are all set up?
Do the 20220512 tables still exist with home + secondary? If so, they will be included as well, which we probably don't want until we come up with a strategy for how to include them.
Or we can explicitly run just 20220501 for now, and we'll need to decide on this before 20220601.
Yeah the 05_12 tables still exist with secondary pages. IIUC by setting the YYYY_MM_DD param, generate_reports.sh will cap the timeseries at 05_01. When 06_01 runs, we'll hopefully have moved the 05_12 data into the new all dataset.
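To make the "cap the timeseries" idea concrete, here is a minimal sketch of date-capped selection. It is not taken from generate_reports.sh; the function names and the sample crawl labels are assumptions chosen for illustration, and the real script may work quite differently.

```python
from datetime import date


def parse_crawl_date(label: str) -> date:
    """Parse a YYYY_MM_DD crawl label such as '2022_05_01' (hypothetical helper)."""
    year, month, day = (int(part) for part in label.split("_"))
    return date(year, month, day)


def cap_timeseries(crawl_labels, cap_label):
    """Return the crawl labels dated on or before the cap, oldest first."""
    cap = parse_crawl_date(cap_label)
    kept = [label for label in crawl_labels if parse_crawl_date(label) <= cap]
    return sorted(kept, key=parse_crawl_date)


# Illustrative labels only: the mid-May crawl falls after the cap and is skipped.
crawls = ["2022_04_01", "2022_05_01", "2022_05_12"]
capped = cap_timeseries(crawls, "2022_05_01")
print(capped)
```

With a cap of 2022_05_01, the 2022_05_12 crawl never becomes a point in the series, which is the behaviour described above.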
Yep. Will kick that off in about an hour or so.
That's running now. Will check in on it tomorrow am.
That's all complete now.
@pmeenan any thoughts on the massive performance improvements on these graphs for mobile: https://httparchive.org/reports/loading-speed
My first guess/worry was going to be the CPU throttling but looking at the mobile template for tests, the throttling is still configured for 8x and the bandwidth is set to the 4G speeds.
I'm going to see if I can spot check a few pages to see if anything jumps out. Could be that the CPU throttling in Chrome broke again (or that I did something wrong on the switch to the new pipeline)
Ok let’s close this issue and open a new one for that. I’ll do that now.
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/vulnJs.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/a11yButtonName.json
See latest log in GitHub Actions
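For readers wondering what kind of check produces the two message types above, here is a rough sketch, assuming the check simply requests each report URL and inspects the status code and body. The function name, its parameters, and the injected fetch callable are hypothetical; only the URL shapes and message wording are taken from the log, and the actual GitHub Actions job may be implemented differently.

```python
BASE = "https://cdn.httparchive.org/reports"


def check_reports(crawl, lenses, dated_reports, undated_reports, fetch):
    """Collect problems for one crawl date (hypothetical helper).

    `fetch` is any callable taking a URL and returning (status_code, body_text),
    so the sketch can be exercised without network access.
    An empty lens string stands for the unlensed top-level reports.
    """
    problems = []
    # Per-crawl JSON files live under a date folder and must exist.
    for lens in lenses:
        prefix = f"{BASE}/{lens}" if lens else BASE
        for name in dated_reports:
            url = f"{prefix}/{crawl}/{name}.json"
            status, _ = fetch(url)
            if status != 200:
                problems.append(f"Incorrect Status code {status} found for {url}")
    # Undated JSON files are shared across crawls and must mention the crawl date.
    for lens in lenses:
        prefix = f"{BASE}/{lens}" if lens else BASE
        for name in undated_reports:
            url = f"{prefix}/{name}.json"
            status, body = fetch(url)
            if status != 200:
                problems.append(f"Incorrect Status code {status} found for {url}")
            elif crawl not in body:
                problems.append(f"{crawl} not found in body for {url}")
    return problems


def fake_fetch(url):
    """Stand-in for an HTTP client: only numUrls.json exists, without the May date."""
    pages = {f"{BASE}/numUrls.json": (200, '[{"date": "2022_04_01"}]')}
    return pages.get(url, (404, ""))


problems = check_reports("2022_05_01", [""], ["bootupJs"], ["numUrls"], fake_fetch)
for line in problems:
    print(line)
```

Passing the fetch function in as an argument keeps the sketch testable offline; a real run would substitute an HTTP client and the full list of lenses (drupal, magento, wordpress, top1k and so on).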