Closed foxdavidj closed 4 years ago
I'm afraid I'm not fluent enough in English to create the content. I can help reviewing, though.
I’m happy to contribute to this. With my current workload the least I can commit to is reviewing—I look forward to coordinating with whoever would be driving this section.
I can review this chapter.
I'd love to help in any way.
I can contribute with ideas for things to query, and review.
I would be happy to contribute and/or review.
@j9t thank you for agreeing to be the lead author for the Markup chapter! As the lead, you'll be responsible for driving the content planning and writing phases in collaboration with your content team, which will consist of yourself as lead, any coauthors you choose as needed, peer reviewers, and data analysts.
The immediate next steps for this chapter are:
There's a ton of info in the top comment, so check that out and feel free to ping myself or @rviscomi with any questions!
@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!
@iandevlin @catalinred @ibnesayeed @zcorpan @iandevlin I've put you down as reviewers for now, and will leave it to @j9t to reassign at his discretion
Thanks @obto—I’m excited to work on this together with all of you who have also expressed interest! 🙏
👉 @iandevlin, @catalinred, @ibnesayeed, @zcorpan, @iandevlin, @matuzo, can you confirm that and how you’d like to be involved? Who would also like to write and co-author, who would like to cover analysis? I like the idea of forming a really strong team together. (Feel free to respond here but also directly through email, as per my profile.)
(If everyone of you is aboard, and if we can split the responsibilities well I think we already have a good setup. I’d wait until all of you confirmed to decide with you whether that’s the case or whether we need more support.)
👉 Do you have preferences for how to coordinate? Not everything will be useful to discuss in this thread; I’m not sure there is or that we need a Slack channel; maybe an email list would do; what do you think and prefer?
—For my status, I’m going to take a few days to review what we have (notably docs and 2019 chapter), and will follow up here. (I’m off from July 5–9, then, when I’m going to be slow or unavailable to respond to messages.)
Hi @j9t! I would also like to co-author, if possible. As for communication, happy to use whatever, although I do find Slack very useful (I use it all the time these days)
Hi @j9t,
I made a study on markup in the past and I'd like to help with research and writing if that's possible as well.
Slack is part of my workflow already but I'm open to any alternative.
Hope you don't mind me jumping in here.
I'm one of the core contributors for the almanac, working on development and translations, but I also wrote one of the chapters last year, and also copy edited a lot of the chapters last year.
First up I want to say that use whatever works for the team, so take all that I'm going to say with that in mind. However I would strongly encourage GitHub and GoogleDocs over Slack and Email for a lot of the comms. Because while we want to make collaboration as easy as possible for you, we also should bear in mind that people might join and drop out of this, and future years.
For example last year the Markup Chapter had these links:
(Most of these are tracked in @rviscomi 's excellent PM sheet from last year).
As you can see there is a wealth of information that is available to you, 2020 authors, analysts and reviewers to help you for this year's chapters to potentially answer questions like why certain metrics weren't looked at last year (were they not considered? Or not possible? Or they were looked at but no interesting data so never made it into the chapter?). You can look at all the metrics from last year, and the results, and – perhaps more importantly – the discussions around them and then decide which ones to look at again this year, which to drop, and what new ones to add - using the above links to help inform you of those decisions.
That wouldn't be possible and a lot of valuable reasoning might be lost if these discussions happened over less linkable, searchable and +1-able mediums like Slack and/or Email.
It also means random people (like me here!) can stick our big noses in to try to help. Or you can @ people outside of the immediate chapter team (like @rviscomi @obto or myself) or pull in other people outside of the Web Almanac if you've a question for them.
On the other hand, there is a lot to be said for the interactivity of chat, so I totally understand if you want to go that direction. I'd just ask you to bear the above in mind if you do.
A few other resources to be aware of:
#web-almanac channel on Slack. I encourage you to join, and I presume it's possible to open another channel dedicated to this chapter on that if needs be?
Anyway, will let you all decide as a team but thought I'd throw my 2 cents in based on last year's experience - hope it helps!
@j9t
can you confirm that and how you’d like to be involved?
I sign up as reviewer.
I can't commit to the analyst role or author role, but I can help discuss ideas for things to include.
Thanks @bazzadp—this is excellent feedback! Thanks for jumping in :) This is good context and good information to review.
For communication, I signed up for httparchive Slack, and maybe we can indeed just open a channel there to coordinate.
@iandevlin, @catalinred, @zcorpan, great to hear—I’ll update the intro accordingly. @catalinred, would “research” reflect the analyst role?
@catalinred, would “research” reflect the analyst role?
Actually, I was thinking about writing or reviewing. Can't help with the analyst role. Let me know where you want me.
@j9t please count me in as a reviewer. I would have offered to be a co-author, but my plate is too full this year.
Excellent! Thanks for confirming and clarifying, Catalin, Sawood. As we heard from everyone but him I reached out to Manuel directly to check on his interests.
As mentioned before, I’ll be off for a few days now but will use that time to review last year’s chapter as well as the docs. Maybe this is a good time for us all to do that? I’ll have a particular look at the introductory references as well as Barry’s comment.
Everyone, also @obto, @bazzadp, do you have ideas on the analyst role, and who could help with that? I can imagine that’s a bit of a special role in that maybe analysts for other topics could potentially help with, too, if they have bandwidth and are interested in the subject and helping us out?
I’ve seen @rviscomi reaching out to the HTTP Archive community and others to try to get some more analysts. We had the same issue last year and had to share them across chapters. If you know anyone with SQL skills willing to help out then give them a nudge 🙂
What you all can do to help in the meantime is to do the review as you suggested and come up with what metrics you would like to see. Then once we have an analyst they can write the actual SQL to query the HTTP Archive. The good thing is we have all of last year’s queries already so, assuming a lot of those are going to be reused, it hopefully won’t be as much work as last year.
Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.
What is more important is that if you want anything new which is not in the current dataset, we need it requested before the August crawl we will be using. Of course having an analyst dedicated to this chapter who knows the current dataset would help you in figuring out if you need anything added, but pipe up here and @rviscomi , @obto , myself and anyone else that fancies can hopefully help answer any questions in the meantime.
Hope that helps!
@rviscomi some milestone entries in this and other chapters do not have checkboxes, is this intentional?
@ibnesayeed yes those are informational dates to explain why the other milestones are when they are. For example, there's nothing the content team needs to do to make sure the August crawl happens, but it explains why metrics must be prepared by July 27, and why querying metrics can't begin until September.
Hey everyone—after a quick vacation and reviewing the major docs, an update. I’ll keep it brief:
👉 I’ve added a first outline and metrics in the Markup draft doc. Can everyone please check they have access, and review, add, and comment, notably on what you think is missing metrics-wise?
👉 Does anyone of you feel we aren’t enough people (except for analysts, which we don’t have any yet)? Do you have particular people in mind we should contact? Feel free to just reach out to them or coordinate with me if that makes it easier.
I’ve pinged @matuzo again to check on his availability, but will take him off the list (intro section ↑) if we don’t hear anything affirmative. (If Manuel still finds the time that would be awesome, so I think we can still add him to the team later.)
👉 @catalinred, @iandevlin, do you have bandwidth to help me with chapter coordination? As this would be useful to discuss directly, I’ve also sent you an email, Catalin (I can’t spot yours, Ian). (If anyone else likes to help with this please let me know—I’ve simply thought to start with asking the co-authors :)
Regarding communication, it seems we can work like this right now but maybe it would still benefit us if we had a Markup channel in Slack. I’m on the fence but might as well open one—just so that you aren’t surprised if that happens :)
Is there anything else that comes to your mind right now?
@j9t Hello! My email is ian@iandevlin.com. And yes, I should have the bandwidth to help.
Hi,
I reread @bkardell's awesome piece from last year, and I think there is a lot of precious info to keep from (and compare to) last year's chapter, e.g.:
Besides that, here are some of my thoughts, random things that I'd like to see in this new Markup chapter:
doctype breakdown, would love to see how the latest HTML doctype hopefully crushes the other obsolete doctypes.
<link rel="icon"> stats - this will help us know how people are using the favicon nowadays. SVG or PNG? If missing then we may assume they use the favicon.ico in the root maybe.
<link rel="icon" href="/favicon.svg" type="image/svg+xml">
<link rel="icon" href="/favicon.png" type="image/png">
Omitting rel="noopener" when using target="_blank" is considered a security vulnerability, we may show some stats on when and how these are used in the wild.
noscript usage stats - is this number affected by popular scripts such as GTM, as it recommends you to paste a noscript element containing an iframe immediately after the opening tag?
rel=amphtml - might be interesting to know how many pages are linking to an AMP version using <link rel="amphtml">
href="javascript:void(0)" - would love to find nothing here, though.
lang attribute for the html element and are they using the <link hreflang=""> to specify the language of the document too?
buttons are being used without a specified type. Also, what are the stats regarding the other native <button>, <input type=image> or <input type=button>?
data-* custom attributes? Curious to see the numbers, and based on the naming we find we might draw some interesting conclusions about their purpose.
<meta name="generator" content="WordPress"> - people don't remove this from the <head> so we might find out how many of the pages we analyze are actual WordPress pages.
<img src=""> or <img src=""/>.
<h7> or <h8> in their HTML. Last time I checked within ~8mil pages, I found more than 20K <h7> elements.
autoplay is considered a bad practice so it would be interesting to see the elements and values for the autoplay attribute in the wild: autoplay, autoplay=true and autoplay=false
<img>, as an <object>, as an <embed>, as an <iframe>. Count the top five or top ten SVG elements (when speaking of inline SVG).
body at all, curious if we can find any HTML without a body? It would be interesting to see the results here.
Let me know your thoughts!
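To make a few of the patterns above concrete, here is a minimal sketch of how they could be counted on a single page. It uses Python's standard html.parser purely for illustration; the class name, counter keys, and sample page are invented here, and the actual Almanac metrics would come from custom metrics and SQL, not from this code. The noopener check is also simplified (it ignores rel="noreferrer", which implies the same protection).

```python
from collections import Counter
from html.parser import HTMLParser


class PatternCounter(HTMLParser):
    """Tallies a handful of markup patterns on one page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; an attribute
        # without a value (e.g. `autoplay`) arrives with the value None.
        attributes = dict(attrs)

        if tag == "a":
            href = (attributes.get("href") or "").strip().lower()
            if href.startswith("javascript:"):
                self.counts["javascript_href"] += 1

            target = (attributes.get("target") or "").lower()
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if target == "_blank" and "noopener" not in rel_tokens:
                self.counts["blank_without_noopener"] += 1

        if tag == "button" and "type" not in attributes:
            self.counts["button_without_type"] += 1

        if tag in ("h7", "h8"):
            self.counts["nonstandard_heading"] += 1


def count_patterns(html):
    """Parse an HTML string and return a Counter of the patterns found."""
    parser = PatternCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical sample page, made up for this example.
SAMPLE = """<!DOCTYPE html>
<html lang="en">
<body>
  <a href="https://example.com/" target="_blank">external</a>
  <a href="https://example.com/" target="_blank" rel="noopener">safe</a>
  <a href="javascript:void(0)">noop</a>
  <button>Send</button>
  <button type="button">Cancel</button>
  <h7>Too deep</h7>
</body>
</html>"""

if __name__ == "__main__":
    print(count_patterns(SAMPLE))
```

Aggregating such per-page counters over many pages would give the kind of "how often does this occur in the wild" numbers asked for in the list.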
Hey @j9t, looks like things are moving along pretty smoothly. Is there anything you need from me to keep things moving forward, and have the chapter outline and metrics settled on by the end of the week?
Everyone, please share your thoughts on the first outline as well as metrics to look at in our Markup doc—thank you!
@iandevlin, excellent! I’ve followed up per email.
@catalinred, I love it! I’ve synced your ideas with the doc’s “Metrics” section, adding what was not there yet under “Patterns” (we can rename). Please edit/comment where you see fit!
@obto, thanks for checking in! We do have a skeleton available with the draft doc—do you have any feedback on that? It doesn’t seem we have an analyst coming out of our group—do you have thoughts on that, to look into closer coordination?
I've requested edit access. Writing here for now.
On doctypes, I think an interesting metric is an accurate number of how many pages are in quirks mode (and how it has changed over time). There is a custom metric for this already, see https://github.com/HTTPArchive/httparchive.org/issues/186#issuecomment-655611526
@j9t The document looks great. I especially like the categorization of the metrics you've put together.
As for finding an analyst, it's looking like we'll have to share them between chapters like we did last year. We are actively looking though and @rviscomi has reached out in a few places.
I'm happy to be a stand-in if we can't find some more analysts soon however :)
@matuzo we'd still love to have you contribute as a peer reviewer or coauthor as needed. Let us know if you're still interested!
Hey, sorry, I was on vacation and I didn't check my mails. I'd love to contribute as a reviewer. :)
Thanks!
@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)
I have left a few comments and suggestions for inclusion in the draft.
@catalinred—I had contacted you through email; could you check or otherwise let me know how to best coordinate directly? Thanks :)
That’s a bit odd because I did reply to you :)
That’s a bit odd because I did reply to you :)
Strange. Checked again, received literally nothing. Could you forward your response to jens@meiert.org?
Last year’s author of this chapter @bkardell also created a glitch app to allow you to explore the data without requiring SQL skills so that would be worth checking out too to research and validate ideas.
Yeah I should add we can get new data dumps and integrate them for that tool rather quickly, it's still currently manual - but a lot more can be done if anyone is interested in developing this tool further either now or over time... For example, one thing I added since was an endpoint that will let you analyze the differences between two dataset dumps http://rainy-periwinkle.glitch.me/delta/desktop/aug_2019/march_2020
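For readers wondering what comparing two dumps amounts to, here is a minimal sketch. It is not how the Glitch tool is implemented (that is not described in this thread); it only assumes that each dump can be reduced to a mapping from element name to count. The function name and the numbers are invented for illustration.

```python
def count_deltas(before, after):
    """Return {name: after - before} for every name whose count changed.

    Names missing from one dump are treated as having a count of zero.
    """
    names = set(before) | set(after)
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in names}
    return {name: delta for name, delta in deltas.items() if delta != 0}


# Invented counts for illustration only; these are not HTTP Archive data.
AUG_2019 = {"div": 120, "span": 80, "center": 5}
MARCH_2020 = {"div": 130, "span": 80, "dialog": 2}

if __name__ == "__main__":
    for name, delta in sorted(count_deltas(AUG_2019, MARCH_2020).items()):
        print(f"{name}: {delta:+d}")
```

Elements that appear in only one dump show up with their full count as a positive or negative delta, which makes newly adopted or disappearing elements easy to spot.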
Quick status:
➕ The main document is up and has already received some feedback. Thanks everyone who has taken a look—everyone who hasn’t yet, please check it out, too, and share your thoughts!
➖ For the moment I still consider us a bit weak on the analysis side. This may be nothing to worry about yet given that there seems to be help by the project (and we’re watched over 🙏), however if anyone could have a closer look at this, including Brian’s and others’ feedback, and maybe drive this, that could be a great help.
Catalin, Ian, and I are going to meet up tomorrow (Monday) to sync up and discuss the doc in person. We’re going to follow up again here.
Quick status update:
We have a private Slack channel #markup-2020 now. @zcorpan, @matuzo, I couldn’t find you in Slack yet but please hit me up so I can add you; the same applies to everyone who would work with us. (It’s a private channel as public feedback can already be covered here.)
The metrics section seemed in need of refinement. I set up a spreadsheet to be clearer about the questions we try to answer and the data to retrieve for that. I’ve begun to reflect our doc’s metrics there; @catalinred, since the “patterns” section includes most of your suggestions, can you help and transfer those over (to complete the “Question to answer” column)?
Once we have moved the metrics I hope we can have better conversation around the data needed as well as what we deem the most important items to cover (as Catalin’s own analyses show, element popularity alone is already a huge topic 😊).
@HTTPArchive/analysts this chapter is in need of your help!
I'm doing analytics for SEO and noticed there is quite an overlap in the metrics needed. I could help by including what you need in the metrics we gather. e.g. we already get all link and meta tags which would cover some of your requirements. We also gather data on links, hXs and images.
Thanks @Tiggerito! Can I put you down as this chapter's designated analyst, or are you only able to help with custom metrics? The custom metrics are the highest priority so even that would be greatly appreciated.
Put me down. With my metrics and SQL pull requests (getting big), should I add in what's needed with this chapter?
Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.
Great thank you! Yes, feel free to bundle SEO and Markup custom metrics into a single PR if needed.
Great, I'll rename the pull requests to reflect that.
Great to have you here, @Tiggerito!
Going through the metrics sheet I think we transferred almost all data, which I crossed out in the doc. @ibnesayeed, @catalinred, can you check on and move the few items that don’t seem to have been transferred over yet (javascript links, link targets, boilerplate)?
I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?
Unless I’m missing something important I’d propose proceeding from two ends:
Call me out on and excuse anything I’m missing please, and then thanks everyone who has invested in reviewing and documenting over the last few days!
Hi @j9t and all,
I'll read through stuff to get myself up to speed and see if anything was missed.
I've already pulled over the old SQL scripts. That helped me get a grasp of what was done last year, and an idea of what can be done.
At first glance at the metrics sheet, I think we may have a few that are not viable, mainly the ones checking validity and style of the HTML. Most of our data comes from the DOM of a rendered crawl, which means we are looking at parsed and cleaned-up HTML: syntax errors have been handled, and formatting details are gone, such as the type of quote used or whether a tag was written as self-closing or as an empty open/close pair.
The rest look like they can be gathered from the DOM, WebPageTest data, or other data gathered during the crawl.
We can peek at the raw html, however that query is very expensive (I think $80 a go). And it's very hard to process it as we only have basic string manipulation like regex to work with. It's recommended that we avoid these sorts of queries. I don't think we would be able to do syntax type tests on it anyhow.
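A small sketch of why these style questions cannot be answered from parsed data. It uses Python's html.parser as a stand-in for the browser's parser (an assumption for illustration; the crawl works with the rendered DOM), and the helper names and sample strings are made up. Three differently written img tags produce the same parsed result, while a deliberately crude regex on the raw string can still tell the quote styles apart.

```python
import re
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Records (tag, attributes) pairs, roughly what a parsed view retains."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def parsed_view(raw_html):
    """Return the start tags and attributes found in a raw HTML string."""
    collector = StartTagCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.tags


# Crude on purpose: it would also match `='...'` inside text or scripts,
# which hints at how limited regex checks on raw HTML are.
SINGLE_QUOTED_ATTR = re.compile(r"=\s*'[^']*'")


def uses_single_quotes(raw_html):
    """Guess from the raw string whether any attribute is single-quoted."""
    return bool(SINGLE_QUOTED_ATTR.search(raw_html))


# Hypothetical inputs: same element, three writing styles.
RAW_DOUBLE = '<img src="a.png" alt="">'
RAW_SINGLE = "<img src='a.png' alt=''>"
RAW_SELF_CLOSING = '<img src="a.png" alt=""/>'

if __name__ == "__main__":
    print(parsed_view(RAW_DOUBLE) == parsed_view(RAW_SINGLE))
    print(parsed_view(RAW_DOUBLE) == parsed_view(RAW_SELF_CLOSING))
    print(uses_single_quotes(RAW_SINGLE), uses_single_quotes(RAW_DOUBLE))
```

Once the markup has been parsed, the quote style and the trailing slash are simply no longer there to measure, so metrics of that kind would need the raw response body.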
The critical bit at the moment is to make sure we get the almanac.js file updated so it captures all the data needed. We have a week to get that code implemented, tested, and approved. I'll focus on the sheet and add notes about viability and, if almanac.js is involved, what state that is in.
We also have access to Lighthouse data. I'm not sure what it reports on related to markup at this time.
And a Technologies table that may come in use. e.g. to identify WordPress for us.
I've gone through the spreadsheet and added a few new columns and comments/questions.
I've gone through the spreadsheet and added a few new columns and comments/questions.
Thanks, @Tiggerito, I'll add my thoughts/answers to the spreadsheet as well.
Thanks @Tiggerito, @catalinred!
What's the most pressing we need right now?
(I’m a bit behind in terms of reviewing and working on the metrics again but it’s #1 on my list.)
I see us somewhere between reviewing and normalizing that sheet and “just” starting with analysis—is that a view you’d share, especially @rviscomi given your experience?
Yes, if I understand the question correctly, you can use the results of the metrics triaging to revisit the chapter outline and adjust as needed. The biggest rush is accounting for any custom metrics that need to be added before the August 1 crawl. You could continue revising the outline throughout August before the analysis is done. It's a good idea to lock in the scope of the chapter for the sake of not introducing any newly required metrics, but if you and @Tiggerito agree to a bit of scope creep and it doesn't require going back and adding custom metrics, that's ok too. Hope that answers your question!
What's the most pressing we need right now?
In my opinion, we should first decide on the approximate chapter categorization.
After deciding on the matter above, we need to carefully study all the suggested metrics in the sheet and decide which ones are really important, as we might need extra help from the @HTTPArchive/analysts team because @Tiggerito said he might not have enough time to invest in all the metrics we're looking for.
I've got what I think is an almost complete version of the script to get the data. I think there are still a few open items, but most things have been implemented.
Here is a zipped up example of the data:
If you provide me with test URLs and what you expect, I can run the tests.
In my opinion, we should first decide on the approximate chapter categorization.
I added categories and made some other updates to the metrics spreadsheet that I hope are useful here, and help us get a more detailed view of what we try to do. (This is not to distract from @Tiggerito’s and @catalinred’s work, who have both added more substantial value over the last few days.)
I also adjusted the doc outline a little bit, aligning it with the metrics and thereby simplifying the structure (general, elements, attributes). I’m not sure we need subsections at this point; that depends on what we can measure and then decide to interpret.
(Completely open to discuss and revisit all of this—please share your views.)
@Tiggerito, @catalinred, it seems we won’t get data for everything we would like to look at, but can already cover a good number of metrics. Can you help assess whether we should spend more time on refining, or go for it?
Part I Chapter 3: Markup
Content team
Content team lead: @j9t
Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.
The content team is made up of the following contributors:
New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.
Note: To ensure that you get notifications when tagged, you must be "watching" this repository.
Milestones
0. Form the content team
1. Plan content
2. Gather data
3. Validate results
4. Draft content
5. Publication