GoogleChrome / CrUX

The place to share queries, ideas, or issues related to the Chrome UX Report
https://developers.google.com/web/tools/chrome-user-experience-report/
Apache License 2.0

Make page level data available in BigQuery #10

Open derekperkins opened 3 years ago

derekperkins commented 3 years ago

The origin-level data is already there, and BigQuery is perfectly suited for broader page-level analysis. The API quotas make it hard to analyze large sites.

rviscomi commented 3 years ago

Unfortunately we're unable to add page-level data to BigQuery. Could you describe the API limitations you're hitting? Also are you doing any kind of rate limiting or batching?
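On the rate-limiting question: one common way to stay under a per-minute quota while looping over many URLs is a small sliding-window limiter. A minimal sketch in Python (the 150/min default is the chromeuxreport.googleapis.com figure discussed below; the class itself is illustrative, not part of any official client):

```python
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `limit` calls per `period` seconds."""

    def __init__(self, limit=150, period=60.0):
        self.limit = limit
        self.period = period
        self.calls = deque()  # timestamps of recent calls

    def wait_time(self, now):
        """Seconds to wait at time `now` before the next call is allowed."""
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()  # drop timestamps outside the window
        if len(self.calls) < self.limit:
            return 0.0
        return self.period - (now - self.calls[0])

    def record(self, now):
        self.calls.append(now)

# Usage: before each API call, sleep(limiter.wait_time(time.monotonic())),
# then call limiter.record(time.monotonic()).
```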

derekperkins commented 3 years ago

The official API docs aren't super clear about quotas, but according to https://github.com/treosh/crux-api#batch-request, each individual request inside a batch counts towards the quota. Before today, everything I read seemed to point at pagespeedonline.googleapis.com/default being the relevant quota, limited to 25k requests/day. Today I found chromeuxreport.googleapis.com/default, which maxes out at 150/min. I'm not sure how that interacts with the batch system that lets you include up to 1,000 queries in a single request. At 150/min, you're limited to 216k requests/day, which doesn't allow for much analysis per URL if you want to segment by device, country, or connection speed. If I'm misunderstanding and the rate limit only applies once per batch, that puts the limit at 216M/day, which would be much better.
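The two daily limits above follow directly from the per-minute quota and the batch size. A quick check of the arithmetic:

```python
# Quota arithmetic from the comment above (assumes every sub-request
# inside a batch counts toward the 150 requests/min quota).
RATE_PER_MIN = 150        # chromeuxreport.googleapis.com/default limit
BATCH_SIZE = 1000         # max sub-requests per batched call
MINUTES_PER_DAY = 60 * 24

per_request_daily = RATE_PER_MIN * MINUTES_PER_DAY   # 216,000/day
per_batch_daily = per_request_daily * BATCH_SIZE     # 216,000,000/day, if only
                                                     # the batch itself counted
print(per_request_daily, per_batch_daily)
```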

A way to query all URLs included in the CrUX database for a specific origin would also help. Querying specific URLs is a lot of hit and miss, and when looking at origin data it's often hard to tell which group of URLs are the worst offenders. https://twitter.com/jlhernando/status/1389648558614368258

As mentioned there, being able to query for coverage instead of hitting the API repeatedly for individual URLs would reduce the need for so much quota.

Are you able to share anything about the reasoning for not making page-level data available in BigQuery? Are there privacy concerns?

rviscomi commented 3 years ago

The Treo docs are correct that queries within a batched request still count towards the quota.

> As mentioned there, being able to query for coverage instead of hitting the API repeatedly for individual URLs would reduce the need for so much quota.

Could you elaborate on what you mean by "query for coverage"? Not sure if it's referring to getting feedback on current quota usage or a feature request for better coverage of URLs.

> Are you able to share anything about the reasoning for not making page-level data available in BigQuery? Are there privacy concerns?

Yeah, we would want to avoid anyone being able to say "show me all pages for a given origin", even if it's not their site. Site owners should know what all of their URLs are and how popular they are, so it should be possible for them to create an ordered list of URLs and query the most popular ones. Those are the most likely to be included in the dataset and have the biggest influence on the site's aggregate CWV performance.
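For a site owner working from their own ordered URL list, page-level data can be pulled one URL at a time via the public `queryRecord` endpoint. A minimal sketch in Python using only the standard library (the API key is a placeholder, and `form_factor` values are the ones the API documents, e.g. `"PHONE"`, `"DESKTOP"`, `"TABLET"`):

```python
import json
from urllib.request import Request

ENDPOINT = "https://chromeuxreport.googleapis.com/v1/records:queryRecord"

def build_query(page_url, api_key, form_factor=None):
    """Build a CrUX API queryRecord request for a single page URL."""
    body = {"url": page_url}
    if form_factor:
        body["formFactor"] = form_factor
    return Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (requires a real API key and network access):
# with urllib.request.urlopen(build_query("https://example.com/page",
#                                         "YOUR_API_KEY")) as resp:
#     record = json.load(resp)
```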