HTTPArchive / httparchive.org

The HTTP Archive website hosted on App Engine
https://httparchive.org
Apache License 2.0

Some reports have failed for 2022_05_01 #601

Closed github-actions[bot] closed 2 years ago

github-actions[bot] commented 2 years ago

Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/drupal/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/magento/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/wordpress/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top10k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top100k/2022_05_01/vulnJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/bootupJs.json
Incorrect Status code 404 found for https://cdn.httparchive.org/reports/top1m/2022_05_01/vulnJs.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/drupal/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/magento/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/wordpress/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top10k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top100k/a11yButtonName.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/numUrls.json
2022_05_01 not found in body for https://cdn.httparchive.org/reports/top1m/a11yButtonName.json

See latest log in GitHub Actions
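
For readers unfamiliar with the check, the failures above fall into two kinds: dated report files that return a non-200 status, and timeseries files that load but do not yet mention the crawl date. The following is a minimal sketch of that logic, not the actual GitHub Actions script. The names `check_reports`, `LENSES`, `DATED_REPORTS`, `TIMESERIES_REPORTS`, and the `fetch` callable are hypothetical, and the lens and report lists are inferred only from the URLs in this log.

```python
from typing import Callable, List, Tuple

# Base URL and path layout are taken from the URLs in the log above.
CDN_BASE = "https://cdn.httparchive.org/reports"

# Assumption: these lists are inferred from the log, not from the real config.
LENSES = ["", "drupal", "magento", "wordpress", "top1k", "top10k", "top100k", "top1m"]
DATED_REPORTS = ["bootupJs", "vulnJs"]              # one JSON file per crawl date
TIMESERIES_REPORTS = ["numUrls", "a11yButtonName"]  # one JSON file covering all dates

# Hypothetical fetch signature: URL in, (status code, response body) out.
Fetch = Callable[[str], Tuple[int, str]]


def _prefix(lens: str) -> str:
    """Return the report root for a lens; the empty lens is the unfiltered dataset."""
    return f"{CDN_BASE}/{lens}" if lens else CDN_BASE


def check_reports(date: str, fetch: Fetch) -> List[str]:
    """Return one message per report that is missing or lacks the given date."""
    problems: List[str] = []

    # Dated reports must exist at <root>/<date>/<name>.json.
    for lens in LENSES:
        for name in DATED_REPORTS:
            url = f"{_prefix(lens)}/{date}/{name}.json"
            status, _ = fetch(url)
            if status != 200:
                problems.append(f"Incorrect Status code {status} found for {url}")

    # Timeseries reports must exist and mention the date somewhere in the body.
    for lens in LENSES:
        for name in TIMESERIES_REPORTS:
            url = f"{_prefix(lens)}/{name}.json"
            status, body = fetch(url)
            if status != 200:
                problems.append(f"Incorrect Status code {status} found for {url}")
            elif date not in body:
                problems.append(f"{date} not found in body for {url}")

    return problems
```

Passing `fetch` in as a parameter keeps the sketch testable without network access; a real run would supply a function that performs the HTTP request and returns the status code and body text.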

tunetheweb commented 2 years ago

Note we're still missing summary_pages.2022_05_01_mobile.

@rviscomi I've lost track of where we ended up with the first May run. Is that gone now? Or is it still available?

Note, this is not a rush and can wait until your return. But when the June crawl finishes, the reports will try to run for any missing data (i.e. May, mid-May and June) and so might fill in weird values if the May runs are in such a bad state.

rviscomi commented 2 years ago

Yeah 2022_05_01 should only be the home pages from the 2022_05_12 crawl. The action item in https://github.com/HTTPArchive/data-pipeline/issues/72#issuecomment-1141359954 is to regenerate the summary tables for the 2022_05_12 mobile crawl, so once that's complete we can filter it down to the home page data and alias to 2022_05_01 for the reporting.

tunetheweb commented 2 years ago

Ah OK. So we've thrown away 2022_05_01 completely and back-populated it from 2022_05_12?

The only thing is that the 2022_05_12 dataset will still show up as another point in the graph (with the same values as 2022_05_01 if they are a complete copy).

Or is the plan to drop the 2022_05_12 tables after back-populating 2022_05_01?

rviscomi commented 2 years ago

Yeah the first run of 2022_05_01 is no longer around because it had bad url values. We're backdating home page data from the 2022_05_12 crawl instead. More info about the migration plan to secondary pages in this issue: https://github.com/HTTPArchive/data-pipeline/issues/51. In short:

rviscomi commented 2 years ago

@tunetheweb would you be able to regenerate the reports now that the 2022_05_01 tables are all set up?

tunetheweb commented 2 years ago

Do the 20220512 tables still exist with home + secondary? If so, they will be included as well, and we probably don't want that until we come up with a strategy for how to include them.

Or I can explicitly run just 20220501 for now, and we'll just need to decide on this before 20220601.

rviscomi commented 2 years ago

Yeah the 05_12 tables still exist with secondary pages. IIUC by setting the YYYY_MM_DD param, generate_reports.sh will cap the timeseries at 05_01. When 06_01 runs, we'll hopefully have moved the 05_12 data into the new all dataset.
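
To illustrate the capping idea only: the sketch below is not the logic of generate_reports.sh, which is not shown in this thread. It assumes crawl dates are strings in `YYYY_MM_DD` form, which sort chronologically as plain strings, and the helper name `cap_timeseries` is hypothetical.

```python
from typing import Iterable, List


def cap_timeseries(dates: Iterable[str], cap: str) -> List[str]:
    """Return the crawl dates on or before `cap`, in chronological order.

    Assumes zero-padded YYYY_MM_DD strings, so string comparison
    matches date order.
    """
    return sorted(d for d in dates if d <= cap)


# Example: with a cap of 2022_05_01, the 2022_05_12 crawl is left out.
available = ["2022_04_01", "2022_05_01", "2022_05_12"]
print(cap_timeseries(available, "2022_05_01"))
```

Under this assumption, a later run with a cap of 2022_06_01 would pick up 2022_05_12 again unless those tables have been moved elsewhere by then, which is the concern raised above.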

tunetheweb commented 2 years ago

Yep. Will kick that off in about an hour or so.

tunetheweb commented 2 years ago

That's running now. Will check in on it tomorrow am.

tunetheweb commented 2 years ago

That's all complete now.

@pmeenan any thoughts on the massive performance improvements on these graphs for mobile: https://httparchive.org/reports/loading-speed

pmeenan commented 2 years ago

My first guess/worry was going to be the CPU throttling but looking at the mobile template for tests, the throttling is still configured for 8x and the bandwidth is set to the 4G speeds.

I'm going to see if I can spot check a few pages to see if anything jumps out. Could be that the CPU throttling in Chrome broke again (or that I did something wrong on the switch to the new pipeline)

tunetheweb commented 2 years ago

OK, let’s close this issue and open a new one for that. I’ll do that now.