HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Markup 2020 #899

Closed foxdavidj closed 4 years ago

foxdavidj commented 4 years ago

Part I Chapter 3: Markup

Content team

Authors: @j9t, @catalinred, @iandevlin
Reviewers: @zcorpan, @matuzo, @bkardell
Analysts: @Tiggerito
Draft: Doc
Queries: *.sql
Results: Sheet

Content team lead: @j9t

Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.

The content team is made up of the following contributors:

New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.

Note: To ensure that you get notifications when tagged, you must be "watching" this repository.

Milestones

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication

borisschapira commented 4 years ago

I'm afraid I'm not fluent enough in English to create the content. I can help with reviewing, though.

j9t commented 4 years ago

I’m happy to contribute to this. With my current workload the least I can commit to is reviewing—I look forward to coordinating with whoever would be driving this section.

ibnesayeed commented 4 years ago

I can review this chapter.

catalinred commented 4 years ago

I'd love to help in any way.

zcorpan commented 4 years ago

I can contribute with ideas for things to query, and review.

iandevlin commented 4 years ago

I would be happy to contribute and/or review.

foxdavidj commented 4 years ago

@j9t thank you for agreeing to be the lead author for the Markup chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.

The immediate next steps for this chapter are:

  1. Establish the rest of your content team. Several other people were interested or nominated (see below), so that's a great place to start. The larger the scope of the chapter, the more people you'll want to have on board.
  2. Start sketching out ideas in your draft doc.
  3. Catch up on last year's chapter and the project methodology to get a sense for what's possible.

There's a ton of info in the top comment, so check that out and feel free to ping myself or @rviscomi with any questions!

foxdavidj commented 4 years ago

@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!

@iandevlin @catalinred @ibnesayeed @zcorpan I've put you down as reviewers for now, and will leave it to @j9t to reassign at his discretion.

j9t commented 4 years ago

Thanks @obto—I’m excited to work on this together with all of you who have also expressed interest! 🙏

👉 @iandevlin, @catalinred, @ibnesayeed, @zcorpan, @matuzo, can you confirm that and how you’d like to be involved? Who would also like to write and co-author, who would like to cover analysis? I like the idea of forming a really strong team together. (Feel free to respond here but also directly through email, as per my profile.)

(If every one of you is aboard and we can split the responsibilities well, I think we already have a good setup. I’d wait until all of you have confirmed to decide with you whether that’s the case or whether we need more support.)

👉 Do you have preferences for how to coordinate? Not everything will be useful to discuss in this thread; I’m not sure there is a Slack channel or that we need one; maybe an email list would do. What do you think, and what do you prefer?

As for my status, I’m going to take a few days to review what we have (notably the docs and the 2019 chapter) and will follow up here. (I’m off from July 5–9, when I’m going to be slow or unavailable to respond to messages.)

iandevlin commented 4 years ago

Hi @j9t! I would also like to co-author, if possible. As for communication, happy to use whatever, although I do find Slack very useful (I use it all the time these days).

catalinred commented 4 years ago

Hi @j9t,

I made a study on markup in the past and I'd like to help with research and writing if that's possible as well.

Slack is part of my workflow already but I'm open to any alternative.

tunetheweb commented 4 years ago

Hope you don't mind me jumping in here.

I'm one of the core contributors for the almanac, working on development and translations. I also wrote one of the chapters last year and copy edited a lot of the others.

First up, I want to say: use whatever works for the team, and take everything I'm about to say with that in mind. However, I would strongly encourage GitHub and Google Docs over Slack and email for a lot of the comms. While we want to make collaboration as easy as possible for you, we should also bear in mind that people might join and drop out of this, both this year and in future years.

For example last year the Markup Chapter had these links:

(Most of these are tracked in @rviscomi 's excellent PM sheet from last year).

As you can see, there is a wealth of information available to you, the 2020 authors, analysts and reviewers, to help with this year's chapters. It can answer questions like why certain metrics weren't looked at last year (were they not considered? Or not possible? Or were they looked at but produced no interesting data, so never made it into the chapter?). You can look at all the metrics from last year, the results, and, perhaps more importantly, the discussions around them, and then decide which ones to look at again this year, which to drop, and which new ones to add, using the above links to inform those decisions.

That wouldn't be possible, and a lot of valuable reasoning might be lost, if these discussions happened over less linkable, searchable and +1-able media like Slack and/or email.

It also means random people (like me here!) can stick our big noses in to try to help. Or you can @ people outside of the immediate chapter team (like @rviscomi @obto or myself) or pull in other people outside of the Web Almanac if you've a question for them.

On the other hand, there is a lot to be said for the interactivity of chat, so I totally understand if you want to go in that direction. I'd just ask you to bear the above in mind if you do.

A few other resources to be aware of:

Anyway, will let you all decide as a team but thought I'd throw my 2 cents in based on last year's experience - hope it helps!

zcorpan commented 4 years ago

@j9t

can you confirm that and how you’d like to be involved?

I sign up as reviewer.

I can't commit to the analyst role or author role, but I can help discuss ideas for things to include.

j9t commented 4 years ago

Thanks @bazzadp—this is excellent feedback! Thanks for jumping in :) This is good context and good information to review.

For communication, I signed up for httparchive Slack, and maybe we can indeed just open a channel there to coordinate.

@iandevlin, @catalinred, @zcorpan, great to hear—I’ll update the intro accordingly. @catalinred, would “research” reflect the analyst role?

catalinred commented 4 years ago

@catalinred, would “research” reflect the analyst role?

Actually, I was thinking about writing or reviewing. Can't help with the analyst role. Let me know where you want me.

ibnesayeed commented 4 years ago

@j9t please count me in as a reviewer. I would have offered to be a co-author, but my plate is too full this year.

j9t commented 4 years ago

Excellent! Thanks for confirming and clarifying, Catalin, Sawood. As we heard from everyone but him I reached out to Manuel directly to check on his interests.

As mentioned before, I’ll be off for a few days now but will use that time to review last year’s chapter as well as the docs. Maybe this is a good time for us all to do that? I’ll have a look particularly at the introductory references as well as Barry’s comment.

Everyone, also @obto, @bazzadp, do you have ideas on the analyst role, and who could help with that? I can imagine that’s a bit of a special role, in that analysts for other topics could potentially help with it, too, if they have bandwidth and are interested in the subject and in helping us out?

tunetheweb commented 4 years ago

I’ve seen @rviscomi reaching out to the HTTP Archive community and others to try to get some more analysts. We had the same issue last year and had to share them across chapters. If you know anyone with SQL skills willing to help out then give them a nudge 🙂

What you all can do to help in the meantime is to do the review as you suggested and come up with the metrics you would like to see. Then once we have an analyst they can write the actual SQL to query the HTTP Archive. The good thing is we have all of last year’s queries already so, assuming a lot of those are going to be reused, it hopefully won’t be as much work as last year.

Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.

What is more important is that if you want anything new which is not in the current dataset, it needs to be requested before the August crawl we will be using. Of course, having an analyst dedicated to this chapter who knows the current dataset would help you figure out whether you need anything added, but pipe up here and @rviscomi, @obto, myself and anyone else that fancies can hopefully help answer any questions in the meantime.

Hope that helps!

ibnesayeed commented 4 years ago

@rviscomi some milestone entries in this and other chapters do not have checkboxes. Is this intentional?

rviscomi commented 4 years ago

@ibnesayeed yes those are informational dates to explain why the other milestones are when they are. For example, there's nothing the content team needs to do to make sure the August crawl happens, but it explains why metrics must be prepared by July 27, and why querying metrics can't begin until September.

j9t commented 4 years ago

Hey everyone—after a quick vacation and reviewing the major docs, an update. I’ll keep it brief:

iandevlin commented 4 years ago

@j9t Hello! My email is ian@iandevlin.com. And yes, I should have the bandwidth to help.

catalinred commented 4 years ago

Hi,

I reread @bkardell's awesome piece from last year and I think there is a lot of precious info to keep from (and compare to) last year's chapter, e.g.:

Besides that, here are some of my thoughts, random things that I'd like to see in this new Markup chapter:

Let me know your thoughts!

foxdavidj commented 4 years ago

Hey @j9t, looks like things are moving along pretty smoothly. Is there anything you need from me to keep things moving forward, and have the chapter outline and metrics settled on by the end of the week?

j9t commented 4 years ago

Everyone, please share your thoughts on the first outline as well as metrics to look at in our Markup doc—thank you!

@iandevlin, excellent! I’ve followed up per email.

@catalinred, I love it! I’ve synced your ideas with the doc’s “Metrics” section, adding what was not there yet under “Patterns” (we can rename). Please edit/comment where you see fit!

@obto, thanks for checking in! We do have a skeleton available with the draft doc—do you have any feedback on that? It doesn’t seem we have an analyst coming out of our group—do you have thoughts on that, to look into closer coordination?

zcorpan commented 4 years ago

I've requested edit access. Writing here for now.

On doctypes, I think an interesting metric is an accurate number of how many pages are in quirks mode (and how it has changed over time). There is a custom metric for this already, see https://github.com/HTTPArchive/httparchive.org/issues/186#issuecomment-655611526
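As a rough illustration of how such a metric could be turned into a share of pages, here is a minimal sketch. It assumes (this is not confirmed by the linked issue) that each page contributes the value of `document.compatMode`, which is `"BackCompat"` in quirks mode and `"CSS1Compat"` otherwise (limited-quirks mode also reports `"CSS1Compat"`). The function name and the sample values are made up for the example.

```python
from collections import Counter


def quirks_share(compat_modes):
    """Return the fraction of pages whose compatMode indicates quirks mode."""
    counts = Counter(compat_modes)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # "BackCompat" is the value document.compatMode reports in quirks mode.
    return counts["BackCompat"] / total


# Hypothetical per-page values, not real crawl data.
sample = ["CSS1Compat", "BackCompat", "CSS1Compat", "CSS1Compat"]
print(f"{quirks_share(sample):.1%}")
```

Running the same calculation over several crawls would give the change over time.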

foxdavidj commented 4 years ago

@j9t The document looks great. I especially like the categorization of the metrics you've put together.

As for finding an analyst, it's looking like we'll have to share them between chapters like we did last year. We are actively looking though and @rviscomi has reached out in a few places.

I'm happy to be a stand-in if we can't find some more analysts soon, however :)

matuzo commented 4 years ago

@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!

Hey, sorry, I was on vacation and I didn't check my email. I'd love to contribute as a reviewer. :)

Thanks!

j9t commented 4 years ago

@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)

ibnesayeed commented 4 years ago

I have left a few comments and suggestions for inclusion in the draft.

catalinred commented 4 years ago

@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)

That’s a bit odd because I did reply to you :)

j9t commented 4 years ago

That’s a bit odd because I did reply to you :)

Strange. Checked again, received literally nothing. Could you forward your response to jens@meiert.org?

bkardell commented 4 years ago

Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.

Yeah I should add we can get new data dumps and integrate them for that tool rather quickly, it's still currently manual - but a lot more can be done if anyone is interested in developing this tool further either now or over time... For example, one thing I added since was an endpoint that will let you analyze the differences between two dataset dumps http://rainy-periwinkle.glitch.me/delta/desktop/aug_2019/march_2020
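For readers wondering what a comparison between two dumps might boil down to, here is a minimal sketch. It is not how the Glitch app computes its delta; the function name, the dictionary shape (element name mapped to a count) and the numbers are all assumptions made for illustration.

```python
def element_delta(before, after):
    """Return the change in count per element between two dumps."""
    elements = set(before) | set(after)
    # Elements missing from one dump are treated as having a count of zero.
    return {
        name: after.get(name, 0) - before.get(name, 0)
        for name in sorted(elements)
    }


# Made-up counts, not real HTTP Archive data.
aug_2019 = {"div": 10, "center": 3}
march_2020 = {"div": 12, "main": 1}
print(element_delta(aug_2019, march_2020))
```

Positive values mean an element appeared more often in the later dump, negative values mean it appeared less often.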

j9t commented 4 years ago

Quick status:

j9t commented 4 years ago

Quick status update:

Once we have moved the metrics I hope we can have better conversation around the data needed as well as what we deem the most important items to cover (as Catalin’s own analyses show, element popularity alone is already a huge topic 😊).

rviscomi commented 4 years ago

@HTTPArchive/analysts this chapter is in need of your help!

Tiggerito commented 4 years ago

I'm doing analytics for SEO and noticed there is quite an overlap in the metrics needed. I could help by including what you need in the metrics we gather. e.g. we already get all link and meta tags which would cover some of your requirements. We also gather data on links, hXs and images.

rviscomi commented 4 years ago

Thanks @Tiggerito! Can I put you down as this chapter's designated analyst, or are you only able to help with custom metrics? The custom metrics are the highest priority so even that would be greatly appreciated.

Tiggerito commented 4 years ago

Put me down. With my metrics and SQL pull requests (getting big), should I add in what's needed with this chapter?

rviscomi commented 4 years ago

Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.

Tiggerito commented 4 years ago

Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.

Great, I'll rename the pull requests to reflect that.

j9t commented 4 years ago

Great to have you here, @Tiggerito!

Going through the metrics sheet I think we transferred almost all data, which I crossed out in the doc. @ibnesayeed, @catalinred, can you check on and move the few items that don’t seem to have been transferred over yet (javascript links, link targets, boilerplate)?

I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?

Unless I’m missing something important I’d propose proceeding from two ends:

  1. @catalinred, @iandevlin, can you help me clean up the metrics sheet, like adding information on data needed, regrouping, tweaking comments? I’d focus on that over the next two days (I can’t invest much time at once right now).
  2. @Tiggerito, can you share your view on the metrics in terms of how suitable they are to work with? That can inform our normalization work and maybe put our minds a bit at ease that we’ll get a look at the data we need.

Please call me out on, and excuse, anything I’m missing, and thanks to everyone who has invested in reviewing and documenting over the last few days!

Tiggerito commented 4 years ago

Hi @j9t and all,

I'll read through stuff to get myself up to speed and see if anything was missed.

I've already pulled over the old SQL scripts. That helped me get a grasp of what was done last year, and an idea of what can be done.

At first glance at the metrics sheet, I think we may have a few metrics that are not viable, mainly those checking the validity and style of the HTML. Most of our data comes from the DOM of a rendered crawl, which means we are looking at parsed and cleaned-up HTML: syntax errors have been handled, and formatting details such as the type of quote used, or whether a tag was written as self-closing or as an empty open/close pair, are gone.
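To make the point about lost formatting concrete, here is a small sketch. It uses Python's standard `html.parser` as a stand-in for the browser parser that the crawl actually relies on, so it only illustrates the general effect: two differently written tags become indistinguishable once parsed. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags as the parser reports them, not as they were typed."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attribute quotes are gone.
        self.tags.append((tag, tuple(attrs)))


def parsed_view(markup):
    """Return the list of (tag, attributes) pairs the parser saw."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


single_quoted = parsed_view("<IMG SRC='a.png' ALT=logo>")
double_quoted = parsed_view('<img src="a.png" alt="logo">')
print(single_quoted == double_quoted)
```

Once both inputs compare equal, a metric such as "which quote style do authors prefer" can no longer be answered from the parsed result.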

The rest look like they can be gathered from the DOM, WebPageTest data, or other data gathered during the crawl.

We can peek at the raw HTML; however, that query is very expensive (I think $80 a go). It's also very hard to process, as we only have basic string manipulation like regex to work with. It's recommended that we avoid these sorts of queries. I don't think we would be able to do syntax-type tests on it anyhow.

The critical bit at the moment is to make sure we get the almanac.js file updated so it captures all the data needed. We have a week to get that code implemented, tested and approved. I'll focus on the sheet and add notes about viability and, if almanac.js is involved, what state that is in.
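As a sketch of what "basic string manipulation like regex" can and cannot do on raw HTML, the following pulls the doctype declaration from the start of a document. The pattern and function name are assumptions for illustration, not an existing query. It deliberately stays simple and misses cases such as a comment or byte order mark before the doctype, which is the kind of brittleness that makes syntax checks impractical this way.

```python
import re

# Matches a doctype declaration at the very start of the document only.
DOCTYPE_RE = re.compile(r"^\s*<!doctype\s+([^>]*)>", re.IGNORECASE)


def sniff_doctype(raw_html):
    """Return the normalized doctype contents, or None if none is found."""
    match = DOCTYPE_RE.match(raw_html)
    if match is None:
        return None
    # Collapse whitespace and lowercase so variants compare equal.
    return " ".join(match.group(1).split()).lower()


print(sniff_doctype("<!DOCTYPE html>\n<title>Example</title>"))
print(sniff_doctype("<html><title>No doctype</title></html>"))
```

A pattern like this can count doctype variants, but it cannot tell whether the rest of the markup is valid.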

We also have access to Lighthouse data. I'm not sure what it reports on related to markup at this time.

There is also a Technologies table that may come in useful, e.g. to identify WordPress for us.

Tiggerito commented 4 years ago

I've gone through the spreadsheet and added a few new columns and comments/questions.

catalinred commented 4 years ago

I've gone through the spreadsheet and added a few new columns and comments/questions.

Thanks, @Tiggerito, I'll add my thoughts/answers to the spreadsheet as well.

j9t commented 4 years ago

Thanks @Tiggerito, @catalinred!

What's the most pressing thing we need right now?

(I’m a bit behind in terms of reviewing and working on the metrics again but it’s #1 on my list.)

rviscomi commented 4 years ago

I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?

Yes, if I understand the question correctly, you can use the results of the metrics triaging to revisit the chapter outline and adjust as needed. The biggest rush is accounting for any custom metrics that need to be added before the August 1 crawl. You could continue revising the outline throughout August before the analysis is done. It's a good idea to lock in the scope of the chapter for the sake of not introducing any newly required metrics, but if you and @Tiggerito agree to a bit of scope creep and it doesn't require going back and adding custom metrics, that's ok too. Hope that answers your question!

catalinred commented 4 years ago

What's the most pressing thing we need right now?

Tiggerito commented 4 years ago

I've got what I think is an almost complete version of the script to get the data. I think there are still a few open items, but most things have been implemented.

Here is a zipped up example of the data:

simonelectric.zip

If you provide me with test URLs and what you expect, I can run the tests.

j9t commented 4 years ago

In my opinion, we should first decide on the approximate chapter categorization.

I added categories and made some other updates to the metrics spreadsheet that I hope are useful here, and help us get a more detailed view of what we are trying to do. (This is not to distract from @Tiggerito’s and @catalinred’s work, who have both added more substantial value over the last few days.)

I also adjusted the doc outline a little bit, aligning it with the metrics and thereby simplifying the structure (general, elements, attributes). I’m not sure we need subsections at this point; whether we do depends on what we can measure and then decide to interpret.

(Completely open to discuss and revisit all of this—please share your views.)

@Tiggerito, @catalinred, it seems we won’t get data for everything we would like to look at, but can already cover a good number of metrics. Can you help assess whether we should spend more time on refining, or go for it?