Closed rviscomi closed 2 years ago
Happy to contribute as a peer reviewer here.
Hi Rick - I would like to sign up as a data analyst here. Thanks!
📟 paging 2019/2020 contributors: @samdutton @alankent @voltek62 @wizardlyhel @rockeynebhwani @jrharalson @drewzboto
Would any of you be interested in contributing to the 2021 chapter? I'd especially like to see more 2019/2020 authors become 2021 reviewers to help ease the transition, and similarly I think prior reviewers would make great 2021 authors, being familiar with the process already. And prior analysts would make excellent 2021 analysts 😁
Or is there anyone new you'd like to see?
@soulcorrosion did you have interest in being an author or peer reviewer for this chapter?
Happy to review!
Hi, I'd be interested in putting my name forward as an author.
Hi @rviscomi I can review. Since I'm reviewing in another chapter as well, authoring can be too much.
I am happy to review as well of course (I offered via email but had not noted it here as well).
@bobbyshaw thanks for your interest in authoring this chapter! As the content team lead, you'll be responsible for the scope and direction of the chapter and keeping it on schedule. We automatically monitor the staffing and progress of each chapter based on the state of the initial comment so please keep that updated as you add new contributors and meet each milestone.
We've created a Google Doc for this chapter, which you're encouraged to use to collaborate with the content team on the initial outline, metrics, and ultimately the final draft.
Next steps for this chapter are:
There's not currently a section coordinator for this chapter, so I'll be periodically checking in with you directly to make sure the chapter is staying on schedule. Reach out here in this issue if you have any questions about the process.
More information about the content team lead and author roles and responsibilities is available for reference in the wiki if needed.
To anyone else interested in contributing to this chapter, please comment below to join the team!
Hello to the Ecomm team! I'm here to assist you in any way possible and to help keep the project on track. To that end, please let me know if I can assist along the way, and thank you for volunteering your time and effort into making the next release the best ever!
I can contribute as an editor if it's needed :)
Awesome, @shantsis. Thank you 👍
I've got a first draft of a content plan in the Google Doc https://docs.google.com/document/d/1LQjpsaWx-5ZtHQGRnHlPnekkxuap50KzJZJTIaSX4B4/edit#. I don't think there's anything too surprising or novel that will require lots of new metrics at this point but I'll continue thinking about it and reviewing. I'm also going to take a look at Wappalyzer signatures to see if there's anything else I can contribute to ensure it's representative of current trends in e-commerce platforms.
I'm not sure how involved reviewers get at this stage but tagging you now in case you want to have a read through of this first draft of the content plan - cc @fili @samdutton @alankent @soulcorrosion
Hi @rrajiv, I'll take a look too but would you like to start having a think about metrics that we might need or want? I appreciate it's you and me as author & analyst. I don't have any experience querying the underlying datasets but happy to dig in and learn if it's something you think you'll need support with.
@bobbyshaw - I'll check out last year's metrics first to get an idea of what was done. I don't have much experience either, as this is my first time being an analyst ;)
@bobbyshaw - Based on last year's experience, I made some observations about eCommerce platform detection and have raised a couple of issues for Wappalyzer. You may want to get these sorted to get quality data -
https://github.com/AliasIO/wappalyzer/issues/3983 https://github.com/AliasIO/wappalyzer/issues/3984
Raised one more issue with respect to platform detection (this time for BigCommerce) - https://github.com/AliasIO/wappalyzer/issues/3986
@rrajiv How's your investigation into metrics going? Looking at the schedule we should have a pretty good idea of what we need by the end of this month.
@bobbyshaw @rrajiv - If you look at last year's chapter, we used 'Google Analytics Enhanced Ecommerce' as a signal to identify eCommerce sites (which helped in cases where Wappalyzer didn't have platform detection or the site was built using a headless architecture).
This year, you can further increase the eCommerce coverage by looking at another Wappalyzer category which was not present last year: 'Payment Processors' (category id 41). Any site with a payment processor can be classified as an eCommerce site. Sometimes it's possible to detect the payment processor but not the eCommerce platform.
For example, in the screenshot below, you can see 'Amazon Pay' as the payment processor (in this case the eCommerce platform is also detected, but you can easily find examples where the eCommerce platform is not detected at all).
If you decide to use this as a signal, you will have to modify last year's queries to reflect this.
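A minimal sketch of what that modification might look like, not a query from the 2020 chapter. It assumes the `httparchive.technologies.2021_07_01_*` tables used elsewhere in this thread, and the lower-cased category name `payment processors` is an assumption about how the category is spelled in the data. It does not include the Google Analytics Enhanced Ecommerce signal, and it does not address whether every site with a payment processor is really an ecommerce site.

```sql
#standardSQL
-- Sketch: how many pages would count as ecommerce if a detected payment
-- processor were accepted as an additional signal.
SELECT
  _TABLE_SUFFIX AS client,
  -- Pages where an ecommerce platform was detected (the existing signal).
  COUNT(DISTINCT IF(LOWER(category) = 'ecommerce', url, NULL)) AS platform_pages,
  -- Pages with an ecommerce platform and/or a payment processor.
  COUNT(DISTINCT url) AS platform_or_payment_pages
FROM
  `httparchive.technologies.2021_07_01_*`
WHERE
  -- Compared case-insensitively because the exact category spelling is an assumption.
  LOWER(category) IN ('ecommerce', 'payment processors')
GROUP BY
  client
ORDER BY
  client
```

The gap between the two counts per client would indicate how much extra coverage the payment processor signal adds before any manual validation.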
@bobbyshaw - I reviewed the 2020 ecommerce chapter & the associated queries. I can't think of any additional metrics as of now. There are still a few days till EOM so I will look to see if there is anything else to capture.
@rockeynebhwani - yes I remember this from the 2020 chapter and I went through the Wappalyzer queries. As I am not familiar with Wappalyzer, I went through this post: https://discuss.httparchive.org/t/using-wappalyzer-to-analyze-cpu-times-across-js-frameworks/1336/ and got a sense of how Wappalyzer can help in this case.
So in the case where no Ecommerce is detected but a payment processor is detected, would it still matter (or be interesting to write about)?
I can say this without any worry about how hard it is to go off and do (!), but is it worth manually inspecting a few sites that fall into the bucket of "has payments, not detected as ecomm site" to see if they really are ecommerce sites? I have no idea how many ecommerce sites would have payment processing hints on the home page (which is what HTTP Archive samples) vs the payments page, so no idea how many extra sites this would find, and then no idea how many of those sites are what we would consider ecommerce (e.g. vs subscribing to a newspaper where you can sign up on the home page).
I didn't get notified about the comments on the google doc. Let me review those over the weekend and then we can consider the outline completed.
Draft outline comments reviewed, I'm happy to officially mark the outline as complete. The actual content will depend on what metrics can be achieved and what we find in the results
@rrajiv As I understand it, the next step is to document all of the metrics/queries that will be used. I believe we should be looking to add the queries to this repo https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/sql/2021/ecommerce and reference them in the metrics section of the Google Doc. From your time researching last year's queries, are you ready to pull them across and add/update for any new queries we might need, please?
Hi @bobbyshaw - yes, this is on my list and I'll start by going through the 2020 queries this week. As I am a new contributor to this repository, I will also figure out how to commit to the 2021 page.
Hi @bobbyshaw - I went through the 2020 queries. They can certainly be updated to 2021 and I'll start working on those.
Should I just update all of them to 2021, and then we figure out what we can use and what needs to be written new?
@rrajiv Yep, that sounds like a good place to start. If you can pull them all in and add a reference to the google doc, we can then do a mapping between topic and metric to see where we might need some other queries written.
@bobbyshaw - I got the ball rolling. Based on me going through the 2020 queries and updating them to 2021, I put a list out of what we have so far, after replacing 2020 with 2021. I hope to have some time early this week to go through the outline to find the gaps in our 2021 list
Link to PR: https://github.com/HTTPArchive/almanac.httparchive.org/pull/2300
@alankent - Lately, I have noticed 3 things -
@rrajiv - I think you should query Sites where category = 'Payment Processors' but at the same time category != 'Ecommerce'. You will need a self join on technologies table but I think you are going to find Ecommerce sites which are not caught by our current queries... Can't say how many though but this should be a simple query.
It took me a while to come up with a query that I think works.. I tested it on the sample table first as I am still waiting for access to the GCP project. Seems ok to me but I'll let everyone weigh in.
#standardSQL
SELECT
  DISTINCT(url)
FROM
  `httparchive.sample_data.technologies_*` ha1
WHERE
  ha1.category = "Payment processors"
  AND NOT EXISTS (
    SELECT
      url,
      category
    FROM
      `httparchive.sample_data.technologies_*` ha2
    WHERE
      ha1.url = ha2.url
      AND (ha2.category = 'Ecommerce') )
GROUP BY
  url
A few examples based on the sample data set:
https://estudiafeliz.com/ https://academy.kiwanis.eu/ https://chinawok-schwetzingen.de/ https://freebirds.com/
@rrajiv are you on the Slack channel? Please DM me so I can get you added to the GCP project.
Hi all - I have been a little busy so I will start running the queries from this week onwards. As a start, I plan on starting with what we used in 2020 and run the same queries for 2021.
@rockeynebhwani - the above is a sample just to get a bunch of sites as examples which match payment but not ecomm. Is there anything you want to explore here?
@bobbyshaw - anything else you are looking for that was perhaps not addressed in the 2020 edition?
@rrajiv Thank you for the update. Much appreciated 😊
@rrajiv - I ran your query on the main tables and found some issues with our crawl: I was seeing sites with payment processors where the Ecommerce category was missing. @pmeenan spotted an issue with our crawl when we looked at some examples. Issue raised here - https://github.com/HTTPArchive/cwv-tech-report/issues/25. Basically, cookie-based detection in Wappalyzer was not working. This was the case for all technologies, not only for Ecommerce platforms. For example, OpenCart was never detected as a platform because we only have cookie-based detection for it. This has possibly impacted stats for all other platforms as well, but any revised stats after this fix will only be available in the Sep-2021 crawl. I don't think we should wait until Sep-2021; we should just carry on with the July-2021 stats (knowing that cookie-based detection doesn't work).
I was also able to spot many additional platforms from that list and have raised multiple issues with Wappalyzer (which will be useful for next year). I think that checking for 'Payment Processors' will give you increased coverage, but you will have to modify all the queries and they will become more complex. It's your call how you want to proceed.
But this is no different than last year, is it @rockeynebhwani? Or is this a new problem only affecting this year's stats?
Couldn't we analyze the cookie data in the requests table to reproduce the missing detection in July?
@tunetheweb - Yes .. after going through Pat's fix, I can say that this problem affected last year's stats also (Not only for Ecommerce but for CMS chapter also possibly).
@rviscomi - It would be very cumbersome to do this using the requests table.. there are too many Ecommerce platforms, and many have cookie-based detection in addition to other types.. I think we should analyse based on July-2021 and maybe run key queries for Sep-2021 to see if any key findings change, and adjust if necessary. It's very likely that even with this bug, most findings remain the same.
SGTM.
Looking at Wappalyzer's ecommerce stats, it's likely OpenCart is in the top 10, so we may want to consider whether we should still show the top 10 list like we did last year without it (e.g. maybe show a top 5 instead?), or continue with the top 10 and explain its absence (which might cause more confusion than it avoids!).
Hi all,
I have completed an initial run and updated the queries for 2021.
spreadsheet: https://docs.google.com/spreadsheets/d/1HCfrXJ52lV46UvxOvDVjJj70fOeFVrTTD8DUm0tPVXE/edit#gid=1519192867
I'll start filling in the comments into the cell this week onwards.
I am using the 2020 Ecommerce diagram titles as a reference for 2021 so that we can connect what was done in 2020 to the new data in 2021.
For reference, this is the 2020 chapter: https://almanac.httparchive.org/en/2020/ecommerce
Here is a description of sheet and the data it covers with reference to the previous year's figure number
@bobbyshaw - let me know if there are other areas to explore and I can start figuring out the queries for those.
@rviscomi - While looking at the results put together by @rrajiv, I noticed that the number of origins for eCommerce platforms like WooCommerce, Shopify etc. doesn't match the Core Web Vitals Technology Report.
These are results from @rrajiv's spreadsheet (queries were for July-2021 table)
Number of origins for top platforms from Core Web Vitals technology report
Any idea why the difference?
I don't have access to @rrajiv's queries so I'll defer to him. The CWV Tech Report query is available here.
@rviscomi - This is @rrajiv's query.. He is just using the 2020 queries for now... I just changed the table to 2021 July and ran it, and I get the same results that @rrajiv has in his spreadsheet
SELECT
  _TABLE_SUFFIX AS client,
  app,
  COUNT(DISTINCT url) AS freq,
  total,
  COUNT(DISTINCT url) / total AS pct
FROM
  `httparchive.technologies.2021_07_01_*`
JOIN
  (SELECT
    _TABLE_SUFFIX,
    COUNT(DISTINCT url) AS total
  FROM
    `httparchive.summary_pages.2021_07_01_*`
  GROUP BY
    _TABLE_SUFFIX)
USING (_TABLE_SUFFIX)
WHERE
  category = 'Ecommerce'
GROUP BY
  client,
  app,
  total
ORDER BY
  client DESC,
  pct DESC
LIMIT 1000
All the queries for 2021 are currently on this branch: https://github.com/HTTPArchive/almanac.httparchive.org/tree/ecommerce-sql-2021/sql/2021/ecommerce
I think the source of the discrepancy is that the July 2021 CrUX dataset was released a couple of weeks ago and we won't run the technology detections on it until the September 2021 HTTP Archive crawl. So the 100k difference may be attributed to websites that fell out of the CrUX dataset between May and July. In other words, the detections HTTP Archive ran in the July crawl were based on websites in the May CrUX dataset.
So for this chapter's purposes, you're correctly measuring the number of mobile ecommerce pages as 444,390. If you want to join that with the latest CrUX dataset though, for example to report on real-user Core Web Vitals performance, the intersection only contains 334,171 pages. Alternatively you could join it with the older May CrUX dataset, which should have data for 100% of the pages.
The CWV Tech Report dashboard favors the latest real-user CWV data rather than the latest lab-based technology detections, so it trades off coverage.
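The intersection described above could be measured with something like the following sketch. It is not the CWV Tech Report query. The CrUX table `chrome-ux-report.materialized.device_summary`, its `date`, `origin` and `device` columns, and the matching of an origin to a home page URL by appending a trailing slash are all assumptions here, so check them against the real schema before relying on the numbers.

```sql
#standardSQL
-- Sketch: share of detected mobile ecommerce pages that also appear
-- in a given CrUX release.
SELECT
  COUNT(DISTINCT url) AS ecommerce_pages,
  COUNT(DISTINCT IF(origin IS NOT NULL, url, NULL)) AS pages_in_crux,
  COUNT(DISTINCT IF(origin IS NOT NULL, url, NULL)) / COUNT(DISTINCT url) AS pct_in_crux
FROM (
  SELECT DISTINCT
    url
  FROM
    `httparchive.technologies.2021_07_01_mobile`
  WHERE
    category = 'Ecommerce'
)
LEFT JOIN (
  SELECT DISTINCT
    -- HTTP Archive home page URLs end with a slash; CrUX origins do not (assumed).
    CONCAT(origin, '/') AS url,
    origin
  FROM
    `chrome-ux-report.materialized.device_summary`
  WHERE
    -- Swap in '2021-05-01' to compare against the May release instead.
    date = '2021-07-01'
    AND device = 'phone'
)
USING (url)
```

For reference, the figures quoted above (334,171 of 444,390 pages) work out to roughly 75% coverage against the July CrUX release.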
Hi all - let me know if there are any more queries to write/research/run. Thanks
Thanks @rrajiv. I'll try and get back to you in the next week on this.
Hey @rrajiv. Thanks for all your work on the metrics. I've been through the google doc and annotated where each one will be used. This is helpful to see where there might be some gaps.
If it's not too much bother, I'd be interested in getting stats on a few more technology categories:
There were a couple of other minor potential discussion points in the outline. Are either of these feasible?
If I wanted to get an idea of how ecommerce sites compare to the wider web, e.g. with regards to lighthouse scores and number of technologies detected, is this a metric easily available to us? Given how any performance or technology decision has a direct effect on revenue of ecommerce sites, it'd be interesting to see how they compared.
I've spotted a couple of other metrics that other chapter teams have created, e.g. payment request API but I assume we can just re-use those once they've got the results.
If you have any questions/comments just shout. Otherwise I look forward to getting the result data!
@bobbyshaw - for the following, I assume you just mean sites that are "Ecommerce" && one of the following? I.e. what you are looking for is the top technologies in each of these categories for Ecommerce sites:
- Top "JavaScript frameworks" category
- Top "JavaScript libraries" category
- Top CMS technology category
- Top "Page Builders" technology category
- Top "A/B testing" technology category
- Top "Personalisation" technology category
- Top "Loyalty & Rewards" technology category
- Top "reviews" technology category
- Top "Translation" technology category
- Top "Buy now / pay later" technology category
I will look into these once I get the above stats. Likely we can re-use something from the security section regarding the content-security-policy header since they covered it last year: https://almanac.httparchive.org/en/2020/security#content-security-policy
There were a couple of other minor potential discussion points in the outline. Are either of these feasible?
- Use of link hreflang tags (to indicate international ecommerce)
- Use of content-security-policy header set? (report only/enforce)
I am not aware of something easily available, but I am thinking along the lines of crafting a query that compares, say, lighthouse scores for ecommerce and non-ecommerce. I don't know yet if this will capture what we want.
@rrajiv
What you are looking for is "Top for Ecommerce sites"? Yes, please.
I am not aware of something easily available, but I am thinking along the lines of crafting a query that compares, say, lighthouse scores for ecommerce and non-ecommerce
I think this would be enough to be a point in the article. Basically this query without the category filter would tell us whether ecommerce sites are on average over or under performing the rest of the web.
Thank you!
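One possible shape for the comparison discussed above, as a rough sketch rather than the query that ended up in the chapter. It assumes a `httparchive.lighthouse.2021_07_01_mobile` table with `url` and `report` columns, and that the performance score sits at `$.categories.performance.score` in the Lighthouse report JSON. The `detected` and `is_ecommerce` names are made up for illustration.

```sql
#standardSQL
-- Sketch: median Lighthouse performance score for ecommerce pages
-- versus all other pages in the mobile crawl.
SELECT
  IFNULL(detected, FALSE) AS is_ecommerce,
  COUNT(0) AS pages,
  APPROX_QUANTILES(performance_score, 1000)[OFFSET(500)] AS median_performance_score
FROM (
  SELECT
    url,
    -- Lighthouse reports category scores on a 0 to 1 scale.
    SAFE_CAST(JSON_EXTRACT_SCALAR(report, '$.categories.performance.score') AS FLOAT64) AS performance_score
  FROM
    `httparchive.lighthouse.2021_07_01_mobile`
)
LEFT JOIN (
  -- One row per page with at least one ecommerce technology detected.
  SELECT DISTINCT
    url,
    TRUE AS detected
  FROM
    `httparchive.technologies.2021_07_01_mobile`
  WHERE
    category = 'Ecommerce'
)
USING (url)
GROUP BY
  is_ecommerce
```

Comparing the two rows would show whether ecommerce pages sit above or below the rest of the web on this one metric. The same join could be reused for other per-page measures, such as the number of technologies detected.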
@bobbyshaw - thanks! I'll work on these over the next few days and get back to you.
Part III Chapter 17: Ecommerce
If you're interested in contributing to the Ecommerce chapter of the 2021 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.
Content team
- The **[content team lead](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Content-Team-Leads'-Guide)** is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
- **[Authors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Authors'-Guide)** are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
- **[Reviewers](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Reviewers'-Guide)** are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
- **[Analysts](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Analysts'-Guide)** are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- **[Editors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Editors'-Guide)** are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
- The **[section coordinator](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Section-Leads'-Guide)** is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

_Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors._ For an overview of how the roles work together at each phase of the project, see the [Chapter Lifecycle](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Chapter-Lifecycle) doc.

Milestone checklist
0. Form the content team
1. Plan content
2. Gather data
3. Validate results
4. Draft content
5. Publication
Chapter resources
Refer to these 2021 Ecommerce resources throughout the content creation process:
📄 Google Docs for outlining and drafting content
🔍 SQL files for committing the queries used during analysis
📊 Google Sheets for saving the results of queries
📝 Markdown file for publishing content and managing public metadata