HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Write queries and add to the repo #62

Closed rviscomi closed 5 years ago

rviscomi commented 5 years ago

When the Analyst team generates queries for each metric, they should create a PR to merge them into the repo. This has two benefits: the PR process provides an opportunity for peer review, and the repo becomes the place to share and maintain the canonical queries. On the Almanac website we can link directly to the queries from each respective chapter/figure so readers can see exactly how each metric was calculated and fork the query for their own analysis.

For testing queries, you can query the new almanac dataset, which contains desktop/mobile sample tables for 1,000 websites. This smaller dataset should help you refine your queries without incurring the full cost for all ~5M websites.

Query guidelines:

rviscomi commented 5 years ago

@HTTPArchive/developers any takers for the first task? We need to figure out a good home for the queries and create the directory structure.

I think it should be named according to this pattern /sql/2019/05/05.3.sql, but I'm open to suggestions. Some open questions:

rviscomi commented 5 years ago

@KJLarson FYI this is another good first issue (task 1 of 3, creating directory system)

KJLarson commented 5 years ago

Oooh, oooh! I can do this! I've been doing records management for the last couple years and have a Masters of Library and Information Science...this is totally in my realm!

rviscomi commented 5 years ago

Sold! Thanks @KJLarson!

Have a look at the open questions in https://github.com/HTTPArchive/almanac.httparchive.org/issues/62#issuecomment-505404617 and feel free to start a PR with the new sql directories.

KJLarson commented 5 years ago

Here are some initial thoughts and questions I have after looking at the questions from comment #62, the metrics triage spreadsheet, the file structure of HTTPArchive.org, and some records management naming convention articles:

KJLarson commented 5 years ago

Here's my first directory structure thought (I didn't fill it all in; hopefully this is enough to get a picture of what it would look like):

src
+--- sql
     +--- 2019
     |    +--- 01_JavaScript
     |    |       01_01.sql
     |    |       01_02.sql
     |    +--- 02_CSS
     |    |       02_01.sql
     |    |       02_02.sql
     |    |       02_03.sql
     |    +--- 03_Markup
     |    |       03_01.sql
     |    |       03_02.sql
     |    +--- 04_Media
     |    |       04_01.sql
     |    +--- 05_ThirdParties
     |    |       05_01.sql
     |    +--- 06_Fonts
     |    +--- 07_Performance
     |    +--- 08_Security
     |    +--- 09_Accessibility
     |    +--- 10_SEO
     |    +--- 11_PWA
     |    +--- 12_MobileWeb
     |    +--- 13_Ecommerce
     |    +--- 14_CMS
     |    +--- 15_Compression
     |    +--- 16_Caching
     |    +--- 17_CDN
     |    +--- 18_PageWeight
     |    +--- 19_ResourceHints
     |    +--- 20_HTTP2
     +--- 2020
     |    +--- 01_

The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.
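As a minimal sketch of the naming convention in the tree above, the Python helper below builds a query path from a year, chapter number, chapter name, and metric number. The function name `query_path` and its parameters are hypothetical and used only for illustration; the default root `src/sql` simply mirrors the tree as drawn.

```python
from pathlib import PurePosixPath


def query_path(year: int, chapter: int, chapter_name: str, metric: int,
               root: str = "src/sql") -> PurePosixPath:
    """Build a path such as src/sql/2019/05_ThirdParties/05_01.sql.

    Hypothetical helper: chapter and metric numbers are zero-padded to two
    digits, following the directory sketch above.
    """
    chapter_dir = f"{chapter:02d}_{chapter_name}"
    filename = f"{chapter:02d}_{metric:02d}.sql"
    return PurePosixPath(root) / str(year) / chapter_dir / filename


print(query_path(2019, 5, "ThirdParties", 1))
# prints: src/sql/2019/05_ThirdParties/05_01.sql
```

Passing a different `root` (for example `"sql"`) changes only the top-level location, so the same helper would still work if the directory is moved.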

rviscomi commented 5 years ago

That looks perfect! Thanks for the thoughtful approach.

> I don't think the metric IDs will sort in order the way they are written now. Adding leading zeros to the metric number should fix sorting issues.

Very good catch.
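The sorting problem can be illustrated with a short Python sketch. The metric IDs below are made-up examples in a chapter.metric style, and `pad` is a hypothetical helper, not something from the repo.

```python
# Illustrative metric IDs written without leading zeros.
unpadded = ["5.1", "5.2", "5.10", "5.3"]

# Plain string sorting compares character by character,
# so "5.10" lands before "5.2".
print(sorted(unpadded))  # ['5.1', '5.10', '5.2', '5.3']


def pad(metric_id: str) -> str:
    """Zero-pad both parts to two digits, e.g. '5.10' -> '05_10'."""
    chapter, metric = metric_id.split(".")
    return f"{int(chapter):02d}_{int(metric):02d}"


padded = [pad(m) for m in unpadded]

# With fixed-width numbers, string order matches numeric order.
print(sorted(padded))  # ['05_01', '05_02', '05_03', '05_10']
```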

> I think it would be helpful to have the chapter number and name in the directory name. This way, if someone is looking at the list of chapter directories, it doesn't matter if they know just the chapter number or just the chapter name...they can still find what they are looking for.

Good suggestion.

> Will the same queries be used every year?

This won't be the case because the 2019 queries will explicitly reference the 2019_07_01 dataset.

> The metric IDs look a bit different than they do in the spreadsheet. Not sure how much room, if any, there is to stray from how we write the numbers.

Not a problem at all.

I only have two small follow-up questions: since this isn't used directly by the web server, can we move it out of src? And for chapters with multi-word names, could we delimit the words with _?

Bonus question: for queries that could possibly be used in multiple chapters, how do you think they should be named?

KJLarson commented 5 years ago

Yes and yes.

Bonus question: Hmmm...I will have to think about this one. Does multiple mean a couple chapters, most chapters, or somewhere in between? I suppose it wouldn't be ideal to save these queries in different directories with different names. Is there value added knowing that the same query was used in multiple chapters?

rviscomi commented 5 years ago

I think at most there will be overlap for 10 metrics, but probably closer to 2 or 3. Thinking more about this, we should probably still name them according to the chapter they appear in and keep duplicates. Casual readers who want to explore the queries won't care whether a query is used somewhere else; they just want to find the corresponding query.
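If duplicated queries are kept under each chapter, one way to keep track of them would be a small script that groups identical files. The following is only a sketch under that assumption: `find_duplicate_queries` is a hypothetical helper, and it treats two queries as duplicates only when their files are byte-identical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_queries(root: Path) -> list[list[Path]]:
    """Group .sql files under root whose contents are byte-identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*.sql")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        by_digest[digest].append(path)
    # Keep only groups with more than one file, i.e. actual duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_duplicate_queries(Path("sql/2019"))` would return one list per set of identical files, which makes it easier to update every copy when a shared query changes. Queries that differ only in whitespace or comments would not be grouped by this approach.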

rviscomi commented 5 years ago

PRs with queries for all metrics have been merged or are being reviewed! 🎉