foxdavidj commented 4 years ago

Part I Chapter 3: Markup

Content team

Authors	Reviewers	Analysts	Draft	Queries	Results
@j9t @catalinred @iandevlin	@zcorpan @matuzo @bkardell	@Tiggerito	Doc	*.sql	Sheet

Content team lead: @j9t

Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.

The content team is made up of the following contributors:

New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.

Note: To ensure that you get notifications when tagged, you must be "watching" this repository.

Milestones

0. Form the content team

[x] Jul 6th: Project owners have selected an author to be the content team lead
[x] Jul 13th: The content team has at least one author, reviewer, and analyst (minimally viable team formed)

1. Plan content

[x] Jul 20th: The content team has completed the chapter outline in the draft doc
[x] Jul 27th: Analysts have triaged the feasibility of all proposed metrics

2. Gather data

[x] Jul 27th: Analysts have added all necessary custom metrics and drafted a PR to track query progress
Aug 1 - 31: August crawl
[x] Sep 7th: Analysts have queried all metrics and saved the output to the results sheet

3. Validate results

[x] Sep 14th: The content team has reviewed the results sheet

4. Draft content

[x] Nov 12th: Authors have completed the first draft in the doc
[x] Nov 26th: The content team has prototyped all data visualizations

5. Publication

[x] Nov 26th: The content team has reviewed the final draft, converted to markdown, and filed a PR to add it to the 2020 content directory
Dec 9th: Target launch date

borisschapira commented 4 years ago

I'm afraid I'm not fluent enough in English to create the content. I can help reviewing, though.

j9t commented 4 years ago

I’m happy to contribute to this. With my current workload the least I can commit to is reviewing—I look forward to coordinating with whoever would be driving this section.

ibnesayeed commented 4 years ago

I can review this chapter.

catalinred commented 4 years ago

I'd love to help in any way.

zcorpan commented 4 years ago

I can contribute with ideas for things to query, and review.

iandevlin commented 4 years ago

I would be happy to contribute and/or review.

foxdavidj commented 4 years ago

@j9t thank you for agreeing to be the lead author for the Markup chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.

The immediate next steps for this chapter are:

Establish the rest of your content team. Several other people were interested or nominated (see below), so that's a great place to start. The larger the scope of the chapter, the more people you'll want to have on board.
Start sketching out ideas in your draft doc.
Catch up on last year's chapter and the project methodology to get a sense for what's possible.

There's a ton of info in the top comment, so check that out and feel free to ping myself or @rviscomi with any questions!

foxdavidj commented 4 years ago

@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!

@iandevlin @catalinred @ibnesayeed @zcorpan I've put you down as reviewers for now, and will leave it to @j9t to reassign at his discretion.

j9t commented 4 years ago

Thanks @obto—I’m excited to work on this together with all of you who have also expressed interest! 🙏

👉 @iandevlin, @catalinred, @ibnesayeed, @zcorpan, @iandevlin, @matuzo, can you confirm that and how you’d like to be involved? Who would also like to write and co-author, who would like to cover analysis? I like the idea of forming a really strong team together. (Feel free to respond here but also directly through email, as per my profile.)

(If everyone of you is aboard, and if we can split the responsibilities well I think we already have a good setup. I’d wait until all of you confirmed to decide with you whether that’s the case or whether we need more support.)

👉 Do you have preferences for how to coordinate? Not everything will be useful to discuss in this thread; I’m not sure there is or that we need a Slack channel; maybe an email list would do; what do you think and prefer?

—For my status, I’m going to take a few days to review what we have (notably docs and 2019 chapter), and will follow up here. (I’m off from July 5–9, then, when I’m going to be slow or unavailable to respond to messages.)

iandevlin commented 4 years ago

Hi @j9t! I would also like to co-author, if possible. As for communication, happy to use whatever, although I do find Slack very useful (I use it all the time these days).

catalinred commented 4 years ago

Hi @j9t,

I made a study on markup in the past and I'd like to help with research and writing if that's possible as well.

Slack is part of my workflow already but I'm open to any alternative.

tunetheweb commented 4 years ago

Hope you don't mind me jumping in here.

I'm one of the core contributors for the almanac, working on development and translations, but I also wrote one of the chapters last year, and also copy edited a lot of the chapters last year.

First up, I want to say: use whatever works for the team, so take all that I'm going to say with that in mind. However, I would strongly encourage GitHub and Google Docs over Slack and email for a lot of the comms. While we want to make collaboration as easy as possible for you, we should also bear in mind that people might join and drop out of this year's project, and future years'.

For example last year the Markup Chapter had these links:

5 - the equivalent to this issue where we picked the team and decided what metrics to look at
84 - where we worked on the metrics
The SQL we finally came up with for the metrics
The results
133 - where we tracked the writing of the chapters
The chapter in a Google Docs as it was worked on and comments were added

(Most of these are tracked in @rviscomi 's excellent PM sheet from last year).

As you can see, there is a wealth of information available to you, the 2020 authors, analysts and reviewers, to help with this year's chapters and potentially answer questions like why certain metrics weren't looked at last year (were they not considered? Or not possible? Or were they looked at but produced no interesting data, so never made it into the chapter?). You can look at all the metrics from last year, the results, and, perhaps more importantly, the discussions around them, and then decide which ones to look at again this year, which to drop, and what new ones to add, using the above links to help inform those decisions.

That wouldn't be possible, and a lot of valuable reasoning might be lost, if these discussions happened over less linkable, searchable and +1-able mediums like Slack and/or email.

It also means random people (like me here!) can stick our big noses in to try to help. Or you can @ people outside of the immediate chapter team (like @rviscomi @obto or myself) or pull in other people outside of the Web Almanac if you've a question for them.

On the other hand, there is a lot to be said for the interactivity for chat so totally understand if you want to go that direction. Just ask you to bear above in mind if you do.

A few other resources to be aware of:

We have a #web-almanac channel on slack. I encourage you to join, and I presume it's possible to open another channel dedicated to this chapter on that if needs be?
You can also have a discussion room on GitHub in the team space. Though we tried that for some of the translator teams last year and it wasn't used much to be honest. Not sure I see the benefit over direct discussion in the issues to be honest. But thought I'd throw it out there as an option.

Anyway, will let you all decide as a team but thought I'd throw my 2 cents in based on last year's experience - hope it helps!

zcorpan commented 4 years ago

@j9t

can you confirm that and how you’d like to be involved?

I sign up as reviewer.

I can't commit to the analyst role or author role, but I can help discuss ideas for things to include.

j9t commented 4 years ago

Thanks @bazzadp—this is excellent feedback! Thanks for jumping in :) This is good context and good information to review.

For communication, I signed up for httparchive Slack, and maybe we can indeed just open a channel there to coordinate.

@iandevlin, @catalinred, @zcorpan, great to hear—I’ll update the intro accordingly. @catalinred, would “research” reflect the analyst role?

catalinred commented 4 years ago

@catalinred, would “research” reflect the analyst role?

Actually, I was thinking about writing or reviewing. Can't help with the analyst role. Let me know where you want me.

ibnesayeed commented 4 years ago

@j9t please count me in as a reviewer. I would have offered to be a co-author, but my plate is too full this year.

j9t commented 4 years ago

Excellent! Thanks for confirming and clarifying, Catalin, Sawood. As we heard from everyone but him I reached out to Manuel directly to check on his interests.

As mentioned before, I’ll be off for a few days now but will use that time to review last year’s chapter as well as the docs. Maybe this is a good time for us all to do that? I’ll have a look particularly at the introductory references as well as Barry’s comment.

Everyone, also @obto, @bazzadp, do you have ideas on the analyst role, and who could help with that? I can imagine that’s a bit of a special role in that maybe analysts for other topics could potentially help with, too, if they have bandwidth and are interested in the subject and helping us out?

tunetheweb commented 4 years ago

I’ve seen @rviscomi reaching out to the HTTP Archive community and others to try to get some more analysts. We had same issue last year and had to share them across chapters. If you know anyone with SQL skills willing to help out then give them a nudge 🙂

What you all can do to help in the meantime is to do the review as you suggested and come up with what metrics you would like to see. Then once we have an analyst they can write the actual SQL to query the HTTP Archive. The good thing is we have all of last year’s queries already so, assuming a lot of those are going to be reused, it hopefully won’t be as much work as last year.

Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.

What is more important is that if you want anything new which is not in the current dataset, we need that asked for before the August crawl we will be using. Of course having an analyst dedicated to this chapter who knows the current dataset would help you in figuring out if you need anything added, but pipe up here and @rviscomi , @obto , myself and anyone else that fancies can hopefully help answer any questions in the meantime.

Hope that helps!

ibnesayeed commented 4 years ago

@rviscomi some milestone entries in this and other chapters do not have checkboxes, is this intentional?

rviscomi commented 4 years ago

@ibnesayeed yes those are informational dates to explain why the other milestones are when they are. For example, there's nothing the content team needs to do to make sure the August crawl happens, but it explains why metrics must be prepared by July 27, and why querying metrics can't begin until September.

j9t commented 4 years ago

Hey everyone—after a quick vacation and reviewing the major docs, an update. I’ll keep it brief:

👉 I’ve added a first outline and metrics in the Markup draft doc. Can everyone please check they have access, and review, add, and comment, notably on what you think is missing metrics-wise?
👉 Does anyone of you feel we aren’t enough people (except for analysts, which we don’t have any yet)? Do you have particular people in mind we should contact? Feel free to just reach out to them or coordinate with me if that makes it easier.
I’ve pinged @matuzo again to check on his availability, but will take him off the list (intro section ↑) if we don’t hear anything affirmative. (If Manuel still finds the time that would be awesome, so I think we can still add him to the team later.)
👉 @catalinred, @iandevlin, do you have bandwidth to help me with chapter coordination? As this would be useful to discuss directly, I’ve also sent you an email, Catalin (I can’t spot yours, Ian). (If anyone else likes to help with this please let me know—I’ve simply thought to start with asking the co-authors :)
Regarding communication, it seems we can work like this right now but maybe it would still benefit us if we had a Markup channel in Slack. I’m on the fence but might as well open one—just so that you aren’t surprised if that happens :)
Is there anything else that comes to your mind right now?

iandevlin commented 4 years ago

@j9t Hello! My email is ian@iandevlin.com. And yes, I should have the bandwidth to help.

catalinred commented 4 years ago

Hi,

I reread @bkardell's awesome piece from last year and I think there is a lot of precious info to keep from (and compare to) last year's chapter, e.g.:

Top elements comparison: 2005/2019 and now 2020
Elements per page: frequency, average numbers
Top deprecated elements

Besides that, here are some of my thoughts, random things that I'd like to see in this new Markup chapter:

A doctype breakdown, would love to see how the latest HTML doctype hopefully crushes the other obsolete doctypes.
<link rel="icon"> stats - this will help us know how people are using the favicon nowadays: SVG or PNG? If it's missing, we may assume they use the favicon.ico in the root.
- <link rel="icon" href="/favicon.svg" type="image/svg+xml">
- <link rel="icon" href="/favicon.png" type="image/png">
When it comes to links, knowing that a missing rel="noopener" when using target="_blank" is considered a security vulnerability, we may show some stats on when and how these are used in the wild.
noscript usage stats - is this number affected by popular scripts such as GTM, as it recommends you to paste a noscript element containing an iframe immediately after the opening <body> tag?
rel=amphtml - might be interesting to know how many pages are linking to an AMP version using <link rel="amphtml">
the old and wrong way to stop links navigating by using href="javascript:void(0)" - would love to find nothing here, though.
How many pages are correctly adding a lang attribute for the html element and are they using the <link hreflang=""> to specify the language of the document too?
I'm deeply interested to see how many buttons are being used without a specified type. Also, what are the stats regarding the other native <button>, <input type=image> or <input type=button>?
How about the popular data-* custom attributes? Curious to see the numbers, and based on the naming we find, we might draw some interesting conclusions about their purpose.
<meta name="generator" content="WordPress"> - people don't remove this from the <head> so we might find out how many of the pages we analyze are actual WordPress pages.
There is some confusion when it comes to HTML replaced and void elements, and seeing some actual stats might help understand them better. On the void elements, maybe some stats whether people are closing them or not, e.g. <img src=""> or <img src=""/>.
Besides the already popular custom elements stats, I'd like to see how many people are using <h7> or <h8> in their HTML. Last time I checked within ~8mil pages, I found more than 20K <h7> elements.
Video and audio autoplay is considered a bad practice so it would be interesting to see the elements and values for the autoplay attribute in the wild: autoplay, autoplay=true and autoplay=false
Ways of including SVG in HTML, inline SVG, as an <img>, as an <object>, as an <embed>, as an <iframe>. Count the top five or top ten SVG elements (when speaking of inline SVG).
Knowing that a perfectly valid HTML page doesn't need a body at all, curious if we can find any HTML without a body? It would be interesting to see the results here.
HTML document size, the average, lowest, and highest size we can find in the HTTPArchive set.
Last but not least, would we be able to HTML validate the whole data set? I'm guessing this would be cumbersome but would love to see numbers, I bet a lot of the pages are not valid, e.g. https://validator.w3.org/nu/?doc=https://twitter.com
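To make a few of these ideas concrete, here is a minimal sketch in Python of how some of the patterns above could be counted for a single page. It is only an illustration under assumptions: the Almanac gathers its data through custom metrics and SQL queries, not through this code, and the names `MarkupPatternCounter` and `count_patterns` as well as the sample markup are made up for the example. It also treats rel="noreferrer" as equivalent to rel="noopener", since browsers treat it that way.

```python
from html.parser import HTMLParser


class MarkupPatternCounter(HTMLParser):
    """Counts a few markup patterns in one HTML document (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.blank_links_without_noopener = 0
        self.buttons_without_type = 0
        self.javascript_void_links = 0
        self.generator = None

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "a":
            rel_tokens = set((attributes.get("rel") or "").lower().split())
            target = (attributes.get("target") or "").lower()
            # Assumption: noreferrer counts as safe because it implies noopener.
            if target == "_blank" and not rel_tokens & {"noopener", "noreferrer"}:
                self.blank_links_without_noopener += 1
            href = (attributes.get("href") or "").strip().lower()
            if href.startswith("javascript:void"):
                self.javascript_void_links += 1
        elif tag == "button" and "type" not in attributes:
            self.buttons_without_type += 1
        elif tag == "meta" and (attributes.get("name") or "").lower() == "generator":
            self.generator = attributes.get("content")


def count_patterns(html):
    """Parse an HTML string and return the filled-in counter."""
    counter = MarkupPatternCounter()
    counter.feed(html)
    counter.close()
    return counter


# Made-up sample page, not real crawl data.
sample = """
<head><meta name="generator" content="WordPress"></head>
<body>
<a href="https://example.com" target="_blank">opens without noopener</a>
<a href="https://example.com" target="_blank" rel="noopener">fine</a>
<a href="javascript:void(0)">goes nowhere</a>
<button>no type</button>
<button type="submit">typed</button>
</body>
"""
result = count_patterns(sample)
print(result.blank_links_without_noopener)  # 1
print(result.buttons_without_type)          # 1
print(result.javascript_void_links)         # 1
print(result.generator)                     # WordPress
```

Running something like this over millions of pages is a different problem, of course, which is where the analysts and the crawl's custom metrics come in.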

Let me know your thoughts!

foxdavidj commented 4 years ago

Hey @j9t, looks like things are moving along pretty smoothly. Is there anything you need from me to keep things moving forward, and have the chapter outline and metrics settled on by the end of the week?

j9t commented 4 years ago

Everyone, please share your thoughts on the first outline as well as metrics to look at in our Markup doc—thank you!

@iandevlin, excellent! I’ve followed up per email.

@catalinred, I love it! I’ve synced your ideas with the doc’s “Metrics” section, adding what was not there yet under “Patterns” (we can rename). Please edit/comment where you see fit!

@obto, thanks for checking in! We do have a skeleton available with the draft doc—do you have any feedback on that? It doesn’t seem we have an analyst coming out of our group—do you have thoughts on that, to look into closer coordination?

zcorpan commented 4 years ago

I've requested edit access. Writing here for now.

On doctypes, I think an interesting metric is an accurate number of how many pages are in quirks mode (and how it has changed over time). There is a custom metric for this already, see https://github.com/HTTPArchive/httparchive.org/issues/186#issuecomment-655611526
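As a rough illustration of a doctype breakdown, here is a small Python sketch that sorts a page into one of three buckets. It is not the custom metric linked above, and the function name `classify_doctype` and the bucket labels are assumptions for the example. It also does not decide quirks mode: a missing doctype does put a page in quirks mode, but for the "other" bucket the result depends on the exact legacy doctype, and in a browser `document.compatMode` is the reliable way to ask.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first doctype declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = " ".join(decl.split())  # normalize whitespace


def classify_doctype(html):
    """Return 'html5', 'other', or 'none' for an HTML string."""
    finder = DoctypeFinder()
    finder.feed(html)
    finder.close()
    if finder.doctype is None:
        return "none"
    if finder.doctype.lower() == "doctype html":
        return "html5"
    return "other"


print(classify_doctype("<!DOCTYPE html><title>x</title>"))
print(classify_doctype(
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd"><title>x</title>'
))
print(classify_doctype("<html><body>no doctype</body></html>"))
```

The three calls print html5, other, and none respectively.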

foxdavidj commented 4 years ago

@j9t The document looks great. I especially like the categorization of the metrics you've put together.

As for finding an analyst, it's looking like we'll have to share them between chapters like we did last year. We are actively looking though and @rviscomi has reached out in a few places.

I'm happy to be a stand-in if we can't find some more analysts soon however :)

matuzo commented 4 years ago

@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!

Hey, sorry, I was on vacation and I didn't check my mails. I'd love to contribute as a reviewer. :)

Thanks!

j9t commented 4 years ago

@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)

ibnesayeed commented 4 years ago

I have left a few comments and suggestions for inclusion in the draft.

catalinred commented 4 years ago

@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)

That’s a bit odd because I did reply to you :)

j9t commented 4 years ago

That’s a bit odd because I did reply to you :)

Strange. Checked again, received literally nothing. Could you forward your response to jens@meiert.org?

bkardell commented 4 years ago

Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.

Yeah I should add we can get new data dumps and integrate them for that tool rather quickly, it's still currently manual - but a lot more can be done if anyone is interested in developing this tool further either now or over time... For example, one thing I added since was an endpoint that will let you analyze the differences between two dataset dumps http://rainy-periwinkle.glitch.me/delta/desktop/aug_2019/march_2020

j9t commented 4 years ago

Quick status:

➕ The main document is up and has already received some feedback. Thanks everyone who has taken a look—everyone who hasn’t yet, please check it out, too, and share your thoughts!
➖ For the moment I still consider us a bit weak on the analysis side. This may be nothing to be worried about yet given that there seems to be help from the project (and we’re watched over 🙏); however, if anyone could have a closer look at this, including Brian’s and others’ feedback, and maybe drive this, that could be a great help.
Catalin, Ian, and I are going to meet up tomorrow (Monday) to sync up and discuss the doc in person. We’re going to follow up again here.

j9t commented 4 years ago

Quick status update:

We have a private Slack channel #markup-2020 now. @zcorpan, @matuzo, I couldn’t find you in Slack yet but please hit me up so I can add you; the same applies to everyone who would work with us. (It’s a private channel as public feedback can already be covered here.)
The metrics section seemed in need of refinement. I set up a spreadsheet to be clearer about the questions we try to answer and the data to retrieve for that. I’ve begun to reflect our doc’s metrics there; @catalinred, since the “patterns” section includes most of your suggestions, can you help and transfer those over (to complete the “Question to answer” column)?

Once we have moved the metrics I hope we can have better conversation around the data needed as well as what we deem the most important items to cover (as Catalin’s own analyses show, element popularity alone is already a huge topic 😊).

rviscomi commented 4 years ago

@HTTPArchive/analysts this chapter is in need of your help!

Tiggerito commented 4 years ago

I'm doing analytics for SEO and noticed there is quite an overlap in the metrics needed. I could help by including what you need in the metrics we gather. e.g. we already get all link and meta tags which would cover some of your requirements. We also gather data on links, hXs and images.

rviscomi commented 4 years ago

Thanks @Tiggerito! Can I put you down as this chapter's designated analyst, or are you only able to help with custom metrics? The custom metrics are the highest priority so even that would be greatly appreciated.

Tiggerito commented 4 years ago

Put me down. With my metrics and SQL pull requests (getting big), should I add in what's needed with this chapter?

rviscomi commented 4 years ago

Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.

Tiggerito commented 4 years ago

Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.

Great, I'll rename the pull requests to reflect that.

j9t commented 4 years ago

Great to have you here, @Tiggerito!

Going through the metrics sheet I think we transferred almost all data, which I crossed out in the doc. @ibnesayeed, @catalinred, can you check on and move the few items that don’t seem to have been transferred over yet (javascript links, link targets, boilerplate)?

I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?

Unless I’m missing something important I’d propose proceeding from two ends:

@catalinred, @iandevlin, can you help me clean up the metrics sheet, like adding information on data needed, regrouping, tweaking comments? I’d focus on that over the next two days (I can’t invest much time at once right now).
@Tiggerito, can you share your view on the metrics in terms of how suitable they are to work with? That can inform our normalization work and maybe put our minds a bit at ease that we’ll get a look at the data we need.

Call me out on and excuse anything I’m missing please, and then thanks everyone who has invested in reviewing and documenting over the last few days!

Tiggerito commented 4 years ago

Hi @j9t and all,

I'll read through stuff to get myself up to speed and see if anything was missed.

I've already pulled over the old SQL scripts. That helped me get a grasp what was done last year, and an idea on what can be done.

At first glance at the metrics sheet, I think we may have a few that are not viable, mainly the ones checking validity and style of the HTML. Most of our data comes from the DOM of a rendered crawl, which means we are looking at parsed and cleaned up HTML. Syntax errors have been handled, and formatting details are gone, such as the type of quote used or whether a tag was written as self-closing or as an empty open/close pair.
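A small Python sketch can show why such style questions cannot be answered from parsed markup. It uses the standard library's `html.parser` as a stand-in for a browser's parser, which is an assumption made for illustration only; the crawl itself works on a real rendered DOM. The names `StartTagCollector` and `parsed_view` are hypothetical.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records each start tag as (name, attributes), the way a parser sees it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags also end up here, because the default
        # handle_startendtag() calls handle_starttag() and handle_endtag().
        self.tags.append((tag, attrs))


def parsed_view(html):
    """Return the list of start tags found in an HTML string."""
    collector = StartTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Three different ways of writing the same element in source code.
double_quoted = parsed_view('<img src="logo.png">')
single_quoted_self_closing = parsed_view("<img src='logo.png' />")
upper_case = parsed_view('<IMG SRC="logo.png">')

print(double_quoted)
print(double_quoted == single_quoted_self_closing == upper_case)
```

All three spellings come out as the same tag name and attribute list, so quote style, letter case, and the trailing slash would have to be measured on the raw HTML instead.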

The rest look like they can be gathered from the DOM, WebPageTest data, or other data gathered during the crawl.

We can peek at the raw html, however that query is very expensive (I think $80 a go). And it's very hard to process it as we only have basic string manipulation like regex to work with. It's recommended that we avoid these sorts of queries. I don't think we would be able to do syntax type tests on it anyhow.

The critical bit at the moment is to make sure we get the almanac.js file updated so it captures all the data needed. We have a week to get that code implemented, tested, and approved. I'll focus on the sheet and add notes about viability and, if almanac.js is involved, what state that is in.

We also have access to Lighthouse data. I'm not sure what it reports on related to markup at this time.

And a Technologies table that may come in use. e.g. to identify WordPress for us.

Tiggerito commented 4 years ago

I've gone through the spreadsheet and added a few new columns and comments/questions.

catalinred commented 4 years ago

I've gone through the spreadsheet and added a few new columns and comments/questions.

Thanks, @Tiggerito, I'll add my thoughts/answers to the spreadsheet as well.

j9t commented 4 years ago

Thanks @Tiggerito, @catalinred!

What's the most pressing we need right now?

(I’m a bit behind in terms of reviewing and working on the metrics again but it’s #1 on my list.)

rviscomi commented 4 years ago

I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?

Yes, if I understand the question correctly, you can use the results of the metrics triaging to revisit the chapter outline and adjust as needed. The biggest rush is accounting for any custom metrics that need to be added before the August 1 crawl. You could continue revising the outline throughout August before the analysis is done. It's a good idea to lock in the scope of the chapter for the sake of not introducing any newly required metrics, but if you and @Tiggerito agree to a bit of scope creep and it doesn't require going back and adding custom metrics, that's ok too. Hope that answers your question!

catalinred commented 4 years ago

What's the most pressing we need right now?

In my opinion, we should first decide on the approximate chapter categorization.
After deciding on the matter above, we need to carefully study all the suggested metrics in the sheet and decide which ones are really important, as we might need extra help from the @HTTPArchive/analysts team because @Tiggerito said he might not have enough time to invest in all the metrics we're looking for.

Tiggerito commented 4 years ago

I've got what I think is an almost complete version of the script to get the data. I think there are still a few open items, but most things have been implemented.

Here is a zipped up example of the data:

simonelectric.zip

If you provide me with test URLs and what you expect, I can run the tests.

j9t commented 4 years ago

In my opinion, we should first decide on the approximate chapter categorization.

I added categories and made some other updates to the metrics spreadsheet that I hope are useful here, and help us get a more detailed view at what we try to do. (This is not to distract from @Tiggerito’s and @catalinred’s work, who have both added more substantial value over the last few days.)

I also adjusted the doc outline a little bit, aligning it with the metrics and thereby simplifying the structure (general, elements, attributes). I’m not sure we need subsections at this point; I think these depend on what we can measure and then decide to interpret.

(Completely open to discuss and revisit all of this—please share your views.)

@Tiggerito, @catalinred, it seems we won’t get data for everything we would like to look at, but can already cover a good number of metrics. Can you help assess whether we should spend more time on refining, or go for it?

HTTPArchive / almanac.httparchive.org

Markup 2020 #899
