HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

WebAssembly 2021 #2168

Closed rviscomi closed 2 years ago

rviscomi commented 3 years ago

Part I Chapter 6: WebAssembly

If you're interested in contributing to the WebAssembly chapter of the 2021 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.

Content team

| Lead | Authors | Reviewers | Analysts | Editors | Coordinator |
| --- | --- | --- | --- | --- | --- |
| @RReverser | @RReverser | @jsoverson @carlopi | @RReverser | - | @rviscomi |
More information about each role:

- The **[content team lead](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Content-Team-Leads'-Guide)** is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
- **[Authors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Authors'-Guide)** are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
- **[Reviewers](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Reviewers'-Guide)** are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
- **[Analysts](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Analysts'-Guide)** are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- **[Editors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Editors'-Guide)** are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
- The **[section coordinator](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Section-Leads'-Guide)** is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

_Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors._

For an overview of how the roles work together at each phase of the project, see the [Chapter Lifecycle](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Chapter-Lifecycle) doc.

Milestone checklist

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication

Chapter resources

Refer to these 2021 WebAssembly resources throughout the content creation process:

- 📄 Google Docs for outlining and drafting content
- 🔍 SQL files for committing the queries used during analysis
- 📊 Google Sheets for saving the results of queries
- 📝 Markdown file for publishing content and managing public metadata

RReverser commented 3 years ago

I suspect I'll be an author :)

rviscomi commented 3 years ago

Safe to say so, I think! I've added you to the team. 😁

rviscomi commented 3 years ago

@RReverser thanks for your interest in authoring this chapter! As the content team lead, you'll be responsible for the scope and direction of the chapter and keeping it on schedule. We automatically monitor the staffing and progress of each chapter based on the state of the initial comment so please keep that updated as you add new contributors and meet each milestone.

We've created a Google Doc for this chapter, which you're encouraged to use to collaborate with the content team on the initial outline, metrics, and ultimately the final draft.

Next steps for this chapter are:

There's not currently a section coordinator for this chapter, so I'll be periodically checking in with you directly to make sure the chapter is staying on schedule. Reach out here in this issue if you have any questions about the process.

More information about the content team lead and author roles and responsibilities is available for reference in the wiki if needed.

To anyone else interested in contributing to this chapter, please comment below to join the team!

rviscomi commented 3 years ago

Hi @RReverser just checking in. Here are some tips to help keep the chapter on track:

jsoverson commented 3 years ago

I'm following up on @RReverser's request for reviewers from this Twitter thread.

I'm coming from the WebAssembly developer side, not the web stats side. I'm unfamiliar with the dataset. If you're looking for reviewers who know the scope of a technology and can ramp-up on stats then I can be helpful. If you want the opposite then I'm probably not the best candidate.

rviscomi commented 3 years ago

Welcome @jsoverson! I'll defer to @RReverser as content team lead to bring you up to speed and do the onboarding.

carlopi commented 3 years ago

Hi, I found @RReverser's call for reviewers on Twitter, and would like to help out.

I work on compilers to WebAssembly, I'm familiar with the standardization / toolchain side, and I would be very interested to learn more about the stats collection side.

RReverser commented 3 years ago

Thanks @jsoverson @carlopi! I've added you to the list. Let's see if more people want to join, and I'll start on a shared doc within the next two weeks.

foxdavidj commented 3 years ago

> Thanks @jsoverson @carlopi! I've added you to the list. Let's see if more people want to join, and I'll start on a shared doc within the next two weeks.

Can you add @jsoverson @carlopi to the corresponding roles in the top comment?

RReverser commented 3 years ago

Actually that's what I did when I left that comment, but I don't see them there now... so weird. Added again.

rviscomi commented 3 years ago

📟 attn content team (@RReverser @jsoverson @carlopi) reminder that the next milestone (complete the chapter outline) is due on June 15. So please request edit access to the WebAssembly chapter doc if you haven't already and brainstorm the contents you'd like to see added to the chapter outline. This doesn't have to be super detailed, you can think of it more like sketching out the table of contents. It's important to get this done on time so that we can make any necessary changes to the test runner before it starts on July 1, for example if we're currently unable to measure something you need for your chapter. WASM is new to the Web Almanac so I don't know what we can and can't support currently, making it all the more important to get this done sooner rather than later. Let me know if you have any questions.

@RReverser could you also tick the checkbox to mark the 0th milestone as completed in the top comment? This helps us track chapter progress at a glance in #2179.

RReverser commented 3 years ago

@jsoverson @carlopi Can you please request access to the doc as outlined above? I've added a few ideas I had to the outline, but review & more ideas are welcome!

carlopi commented 3 years ago

@RReverser: I wrote down some possible ideas. It's very unclear to me what the means of collecting the data are (do you have any reference?); knowing that would be helpful in shaping what questions are worth exploring. I am not sure how you want to coordinate all this. One idea could be to also brainstorm over a call at some point, I would look forward to it. (I am based in CET, we can figure out a time that works for everyone)

RReverser commented 3 years ago

@carlopi Generally we just have HTTPArchive + BigQuery data, this can be used for reference and instructions: https://github.com/HTTPArchive/httparchive.org/blob/main/docs/gettingstarted_bigquery.md. That is, we mostly have data about payloads - their compressed & uncompressed sizes, content types and so on. That's what we can extract info from most easily.

However, as mentioned in the doc, due to the relatively small number of Wasm resources among the top 1M websites (which are included in the dataset), we can do some simple binary analysis as well - that's why I included things like "how many modules use this feature" for SIMD and threads.

I think control flow analysis would be a bit too expensive to run, and I'm not sure how relevant it is for post-optimisation modules on the Web due to the effect of inlining and other passes that significantly change the structure. Instead, I think we should focus on data that shows adoption of Wasm & its new features and how it is actually used in the wild (so e.g. "means of delivery" is definitely an interesting addition).
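
As a minimal sketch of the simplest possible "binary analysis" step mentioned above, the snippet below checks whether a downloaded payload is a WebAssembly binary by looking at its 8-byte header rather than trusting the reported content type. The helper name `looks_like_wasm` is hypothetical and is not taken from any tool in this thread; the magic number and version bytes come from the WebAssembly binary format.

```rust
/// The 4-byte magic number that opens every WebAssembly binary: "\0asm".
const WASM_MAGIC: [u8; 4] = [0x00, 0x61, 0x73, 0x6D];
/// Binary format version 1, stored as a little-endian u32.
const WASM_VERSION_1: [u8; 4] = [0x01, 0x00, 0x00, 0x00];

/// Hypothetical helper (an assumption for illustration): returns true if
/// `body` starts with the Wasm magic number followed by the version-1 marker.
/// It does not validate anything past the 8-byte header.
pub fn looks_like_wasm(body: &[u8]) -> bool {
    body.len() >= 8 && body[..4] == WASM_MAGIC && body[4..8] == WASM_VERSION_1
}
```

A check like this could be used to filter candidate URLs before running any heavier per-module analysis.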

RReverser commented 3 years ago

> one idea could be also brainstorm over a call at some point, I would look forward to it. (I am based in CET, we can figure out a time that works for everyone)

I was thinking we were going to just sync over the doc to account for different timezones more easily, but we can certainly do a call as well.

jsoverson commented 3 years ago

> @jsoverson @carlopi Can you please request access to the doc as outlined above? I've added a few ideas I had to the outline, but review & more ideas are welcome!

Requested!

RReverser commented 3 years ago

I guess let's do a meeting after all, it might be easier to talk through the main points. @jsoverson I think you're in a different timezone than us, what time range and dates work for you next week?

jsoverson commented 3 years ago

@RReverser I'm on EST time (UTC -5) and can do Monday, Wednesday, or Thursday (earlier in the week is better)

carlopi commented 3 years ago

I can't on Wednesdays, but generally available between 15-18 CET = 14-17 UK = 9-12 EST. Any slot in that interval that works for everyone (either Monday or Thursday)? Otherwise in the evening, like between 21-23 CET = 20-22 UK = 15-17 EST.

carlopi commented 3 years ago

Content-wise, another area that could be worth adding is a general explanation of what WebAssembly is / can do / is useful for, for example taking the list here and adding live examples for each use case.

Or simpler, taking a single use case (e.g. web-based video calls) and going in depth about how some solutions (e.g. the web clients of Google Meet / Zoom / Microsoft Teams / Jitsi? / fill in relevant ones) leverage WebAssembly, basically doing a case-by-case analysis (if there are only a few cases it might be justified to just look at the dev tools while doing a call).

RReverser commented 3 years ago

EST isn't too far, we should find a suitable time. I've sent out an invite for today, but I know it's a bit close so if you can't make it, let's reschedule for either a later time or, if that doesn't work either, Thursday.

RReverser commented 3 years ago

As mentioned in the meeting, here are some queries I played with: https://gist.github.com/RReverser/b5e9cff5c4a7ac1eba2c64ab6d01c4d8

rviscomi commented 3 years ago

@RReverser the outline is looking great so far and you seem to be on track for the June 15 milestone! 🎉

I left a few questions in the doc to get you thinking about implementing any necessary custom metrics before the crawl starts. Those would need to be merged no later than June 30 to be included in the dataset.

rviscomi commented 3 years ago

Hey @RReverser, could you give a brief update on the status of the outline? It's looking really good, just not sure if you're still working on it. If you're done, could you check off Milestone 1 above? Thanks!

RReverser commented 3 years ago

I think we're good in terms of outline - my only concern, as mentioned in DMs, is whether it should be finalized (removing entries that we might end up excluding in metrics), but sounds like that shouldn't be an issue, so we're good to go.

RReverser commented 3 years ago

@carlopi @jsoverson I've started on a Rust tool to collect Wasm content stats here. Feel free to contribute if you'd like to write some Rust or have further ideas: https://github.com/RReverser/wasm-stats

This is what the debug output looks like for Google Earth Wasm for now:

Stats {
    funcs: 56469,
    instructions: InstructionStats {
        total: 7531295,
        refs: RefStats {
            global: 41265,
            local: 2725521,
            table: 62601,
            mem: 1159779,
        },
        proposals: ProposalStats {
            atomics: 31672,
            ref_types: 0,
            simd: 0,
            tail_calls: 0,
            bulk: 6338,
        },
        categories: InstructionCategoryStats {
            load_store: 1148714,
            control_flow: 1063652,
            direct_calls: 241407,
            indirect_calls: 62601,
        },
    },
    size: SizeStats {
        code: 16180782,
        init: 4162601,
        externals: 3319,
        total: 20405582,
    },
    imports: ExternalStats {
        funcs: 420,
        memories: 1,
        globals: 0,
        tables: 0,
    },
    exports: ExternalStats {
        funcs: 59,
        memories: 0,
        globals: 0,
        tables: 1,
    },
}
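
To make the size numbers in the dump above easier to read, here is a small sketch that mirrors only the `SizeStats` part and derives two figures from it. The struct layout is an assumption modelled on the field names in the dump, not the actual wasm-stats source, and the helper methods are hypothetical.

```rust
/// Mirrors the `SizeStats` part of the debug output above. Only the field
/// names and numbers come from the dump; the struct itself is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SizeStats {
    pub code: u64,
    pub init: u64,
    pub externals: u64,
    pub total: u64,
}

impl SizeStats {
    /// Bytes of `total` not covered by the three named categories.
    pub fn uncategorized(&self) -> u64 {
        self.total
            .saturating_sub(self.code + self.init + self.externals)
    }

    /// Share of `total` taken by `part`, as a percentage (0.0 if total is 0).
    pub fn percent_of_total(&self, part: u64) -> f64 {
        if self.total == 0 {
            0.0
        } else {
            part as f64 * 100.0 / self.total as f64
        }
    }
}

/// The Google Earth numbers quoted in the dump above.
pub fn google_earth_sizes() -> SizeStats {
    SizeStats {
        code: 16_180_782,
        init: 4_162_601,
        externals: 3_319,
        total: 20_405_582,
    }
}
```

With these numbers, `code` is roughly 79% of the module and `init` roughly 20%, and the three named categories leave 58,880 bytes of `total` unaccounted for. Formatting the struct with `{:#?}` produces the same indented layout as the dump.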

In principle this is looking good, but I'm getting worried because this is already quite a lot of info that needs to be collected and meaningfully analyzed afterwards, yet it's still not everything that we wrote down in the outline plan.

Perhaps we should try and limit the scope a bit further. For example, do we really want to add stats by type of instruction operand / output - when aggregated, would it communicate something useful to developers? I'm not sure yet.

OTOH stats like section sizes, imports/exports, proposals and instruction categories do seem useful so perhaps we should just keep those.

WDYT?
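
For the "section sizes" stat mentioned above, a rough sketch of how per-section byte counts can be read straight from the binary: after the 8-byte header, a Wasm module is a sequence of sections, each made of a one-byte id, a LEB128-encoded length, and that many bytes of content. The function names below are hypothetical, and the code only walks section headers; it does not validate section contents and makes no attempt to reproduce how wasm-stats groups sizes into `code`, `init`, and `externals`.

```rust
use std::collections::BTreeMap;

/// Decodes an unsigned LEB128 u32 starting at `*pos` and advances `*pos`.
/// Returns None on truncated input or if more than 5 bytes are used.
/// (Simplification: bits overflowing 32 in the fifth byte are dropped.)
fn read_u32_leb128(bytes: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result: u32 = 0;
    for shift in (0..35).step_by(7) {
        let byte = *bytes.get(*pos)?;
        *pos += 1;
        result |= u32::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
    }
    None
}

/// Human-readable names for the standard section ids.
pub fn section_name(id: u8) -> &'static str {
    match id {
        0 => "custom",
        1 => "type",
        2 => "import",
        3 => "function",
        4 => "table",
        5 => "memory",
        6 => "global",
        7 => "export",
        8 => "start",
        9 => "element",
        10 => "code",
        11 => "data",
        12 => "data count",
        _ => "unknown",
    }
}

/// Hypothetical helper: sums content bytes per section id.
/// Returns None if the header is missing or a section is truncated.
pub fn section_sizes(module: &[u8]) -> Option<BTreeMap<u8, usize>> {
    // Header: "\0asm" magic plus a 4-byte version.
    if module.len() < 8 || &module[..4] != b"\0asm" {
        return None;
    }
    let mut sizes = BTreeMap::new();
    let mut pos = 8;
    while pos < module.len() {
        let id = module[pos];
        pos += 1;
        let len = read_u32_leb128(module, &mut pos)? as usize;
        let end = pos.checked_add(len)?;
        if end > module.len() {
            return None; // section claims more bytes than the file has
        }
        *sizes.entry(id).or_insert(0) += len;
        pos = end;
    }
    Some(sizes)
}
```

Custom sections (id 0) may appear several times in one module, which is why the sizes are accumulated per id rather than stored once.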

RReverser commented 3 years ago

@carlopi @jsoverson Ping. What do you think about those metrics / changes?

carlopi commented 3 years ago

I am generally on board with keeping the scope manageable (= cutting stuff). Later I will check wasm-stats and see whether I can be of use.

RReverser commented 3 years ago

FWIW I'm on vacation next week, but would appreciate any feedback meanwhile; next month will be busy as we'll start downloading Wasm files and analyzing all the data :)

RReverser commented 3 years ago

Hmm I'm guessing I'll just have to go ahead with my best judgment...

jsoverson commented 3 years ago

I must not have submitted my comment before traveling, sorry.

I don't think the operand details are going to be valuable enough, and the stats around security settings are probably niche enough to ignore. I had some work done on the Rust project but didn't get to a useful stopping point before I left.

RReverser commented 3 years ago

Thanks for the response. Meanwhile I wrote a small script and downloaded most of the Wasm files - looks like out of ~2.7K in the dataset only ~2.2K are unique URLs + reachable, so that's what I'll analyze next using the Rust repo above.

RReverser commented 3 years ago

After some retries I got to ~2.3K unique URLs, which, interestingly, results in only 713 unique Wasm files (many are copies under different URLs).

That's fewer than I hoped but in itself it's also an interesting stat related to reusability of Wasm.
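
A minimal sketch of how the "unique Wasm files" count above could be derived from a set of downloads: group URLs by the exact bytes they served and count the groups. The function name and the in-memory `(url, body)` layout are assumptions for illustration; comparing whole byte buffers is used here for simplicity, and a content hash would be the usual substitute at larger scale.

```rust
use std::collections::HashMap;

/// Hypothetical helper: groups URLs by the exact bytes they served.
/// Each key is one unique file; each value lists the URLs serving it.
pub fn group_by_content<'a>(
    downloads: &'a [(String, Vec<u8>)],
) -> HashMap<&'a [u8], Vec<&'a str>> {
    let mut groups: HashMap<&[u8], Vec<&str>> = HashMap::new();
    for (url, body) in downloads {
        groups
            .entry(body.as_slice())
            .or_default()
            .push(url.as_str());
    }
    groups
}
```

The number of map entries is the number of unique files, and the length of each value shows how widely one file is copied across URLs.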

RReverser commented 3 years ago

One interesting question that arises from this level of reusability is: do we want to aggregate stats by pages, by websites (domains), or by unique Wasm modules?

E.g. is "5% of unique Wasm modules rely on SIMD" more or less valuable than "5% of all pages using Wasm rely on SIMD" or "5% of websites using Wasm rely on SIMD"?

It's tempting to do all of it, but multiplied by the number of stats it's just impractical.

@carlopi @jsoverson Thoughts welcome.

rviscomi commented 3 years ago

I think stats in terms of # or % of pages are most easily understood by readers.

RReverser commented 3 years ago

> I think stats in terms of # or % of pages are most easily understood by readers.

Maybe, but then if the same library is included on lots of pages, it can "drown" stats from Wasm used on a single popular website. The balance seems tricky...
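
To illustrate how much the choice of denominator discussed above can move a number, here is a hedged sketch that computes the share of one feature by pages, by sites, and by unique modules from the same request list. All type names, field names, and the `module_id` scheme are assumptions made up for this example, not the schema used in the analysis.

```rust
use std::collections::HashSet;

/// One Wasm request observed on a page. All fields are illustrative.
pub struct WasmRequest<'a> {
    pub page: &'a str,
    pub site: &'a str,
    /// Identifies the unique Wasm file served (e.g. after deduplication).
    pub module_id: u32,
}

/// Percentage of pages, sites, and unique modules that use a feature.
#[derive(Debug)]
pub struct Shares {
    pub by_page: f64,
    pub by_site: f64,
    pub by_module: f64,
}

fn percent(part: usize, whole: usize) -> f64 {
    if whole == 0 {
        0.0
    } else {
        part as f64 * 100.0 / whole as f64
    }
}

/// A page or site counts as using the feature if any module it loads does.
pub fn feature_shares(requests: &[WasmRequest<'_>], uses_feature: &HashSet<u32>) -> Shares {
    let (mut pages, mut sites, mut modules) = (HashSet::new(), HashSet::new(), HashSet::new());
    let (mut hit_pages, mut hit_sites, mut hit_modules) =
        (HashSet::new(), HashSet::new(), HashSet::new());
    for r in requests {
        pages.insert(r.page);
        sites.insert(r.site);
        modules.insert(r.module_id);
        if uses_feature.contains(&r.module_id) {
            hit_pages.insert(r.page);
            hit_sites.insert(r.site);
            hit_modules.insert(r.module_id);
        }
    }
    Shares {
        by_page: percent(hit_pages.len(), pages.len()),
        by_site: percent(hit_sites.len(), sites.len()),
        by_module: percent(hit_modules.len(), modules.len()),
    }
}
```

For example, with one shared library loaded on three single-page sites and one SIMD module loaded on two pages of a fourth site, the same feature comes out as 40% by page, 25% by site, and 50% by unique module.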

rviscomi commented 3 years ago

👋 Hey @RReverser, just checking in on each chapter's progress. It looks like you're all set but let me know if you run into any issues.

RReverser commented 3 years ago

Yeah no new issues right now. Chatted a bit more, we're going with breakdown by pages then.

RReverser commented 3 years ago

FWIW I've analyzed the downloaded Wasms using the current state of wasm-stats repo above, saved results to JSON and imported to BigQuery, so now it's possible to join them with the summary_requests and the list of Wasm URLs to do any kinds of aggregations.

I'm happy to share access if anyone wants it (and if I figure out how to do that in BigQuery...)


rviscomi commented 3 years ago

@RReverser let's coordinate to get this data imported into the public httparchive.almanac dataset, so the results can be backed by publicly runnable queries.

rviscomi commented 3 years ago

Note: I unchecked "Milestone 2" in the top comment as I'm not seeing the draft PR in the list of open PRs. @RReverser I know you're working on it so feel free to update it whenever available. Let me know if you run into any blockers.

RReverser commented 3 years ago

Oh, I misunderstood that milestone upon first read, I thought it was for adding custom metrics to the crawler.

RReverser commented 3 years ago

@carlopi @jsoverson FWIW I've added a bunch of metrics to the spreadsheet already, if you want to take a look before they're turned into graphs and into a post.

rviscomi commented 2 years ago

@RReverser @jsoverson @carlopi

🎉 This chapter is fully written, reviewed, edited, and ready to be launched on Wednesday! Thank you to all of the contributors who put in the time and effort to make this a great chapter.

When you get 5 minutes, I'd really appreciate if you could fill out our contributor survey to tell us (the project leads) about your experience. It's super helpful to hear what went well or what could be improved for next time. 🙏

Congratulations and thank you all again. I'm excited for this to launch soon!