HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Ecommerce 2021 #2155

Closed rviscomi closed 2 years ago

rviscomi commented 3 years ago

Part III Chapter 17: Ecommerce

If you're interested in contributing to the Ecommerce chapter of the 2021 Web Almanac, please reply to this issue and indicate which role or roles best fit your interest and availability: author, reviewer, analyst, and/or editor.

Content team

Lead: @bobbyshaw
Authors: @bobbyshaw
Reviewers: @rockeynebhwani @fili @samdutton @alankent @soulcorrosion
Analysts: @rrajiv
Editors: @shantsis
Coordinator: @logicalphase
Expand for more information about each role

- The **[content team lead](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Content-Team-Leads'-Guide)** is the chapter owner and responsible for setting the scope of the chapter and managing contributors' day-to-day progress.
- **[Authors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Authors'-Guide)** are subject matter experts and lead the content direction for each chapter. Chapters typically have one or two authors. Authors are responsible for planning the outline of the chapter, analyzing stats and trends, and writing the annual report.
- **[Reviewers](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Reviewers'-Guide)** are also subject matter experts and assist authors with technical reviews during the planning, analyzing, and writing phases.
- **[Analysts](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Analysts'-Guide)** are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- **[Editors](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Editors'-Guide)** are technical writers who have a penchant for both technical and non-technical content correctness. Editors have a mastery of the English language and work closely with authors to help wordsmith content and ensure that everything fits together as a cohesive unit.
- The **[section coordinator](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Section-Leads'-Guide)** is the overall owner for all chapters within a section like "User Experience" or "Page Content" and helps to keep each chapter on schedule.

_Note: The time commitment for each role varies by the chapter's scope and complexity as well as the number of contributors._

For an overview of how the roles work together at each phase of the project, see the [Chapter Lifecycle](https://github.com/HTTPArchive/almanac.httparchive.org/wiki/Chapter-Lifecycle) doc.

Milestone checklist

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication

Chapter resources

Refer to these 2021 Ecommerce resources throughout the content creation process:

📄 Google Docs for outlining and drafting content
🔍 SQL files for committing the queries used during analysis
📊 Google Sheets for saving the results of queries
📝 Markdown file for publishing content and managing public metadata

fili commented 3 years ago

Happy to contribute as a peer reviewer here.

rrajiv commented 3 years ago

Hi Rick - I would like to sign up as a data analyst here. Thanks!

rviscomi commented 3 years ago

📟 paging 2019/2020 contributors: @samdutton @alankent @voltek62 @wizardlyhel @rockeynebhwani @jrharalson @drewzboto

Would any of you be interested to contribute to the 2021 chapter? I'd especially like to see more 2019/2020 authors become 2021 reviewers to help ease the transition and similarly I think prior reviewers would make great 2021 authors, being familiar with the process already. And prior analysts would make excellent 2021 analysts 😁

Or is there anyone new you'd like to see?

rviscomi commented 3 years ago

@soulcorrosion did you have interest in being an author or peer reviewer for this chapter?

samdutton commented 3 years ago

Happy to review!

bobbyshaw commented 3 years ago

Hi, I'd be interested in putting my name forward as an author.

soulcorrosion commented 3 years ago

> @soulcorrosion did you have interest in being an author or peer reviewer for this chapter?

Hi @rviscomi I can review. Since I'm reviewing in another chapter as well, authoring can be too much.

alankent commented 3 years ago

I am happy to review as well of course (I offered via email but had not noted it here as well).

rviscomi commented 3 years ago

@bobbyshaw thanks for your interest in authoring this chapter! As the content team lead, you'll be responsible for the scope and direction of the chapter and keeping it on schedule. We automatically monitor the staffing and progress of each chapter based on the state of the initial comment so please keep that updated as you add new contributors and meet each milestone.

We've created a Google Doc for this chapter, which you're encouraged to use to collaborate with the content team on the initial outline, metrics, and ultimately the final draft.

Next steps for this chapter are:

There's not currently a section coordinator for this chapter, so I'll be periodically checking in with you directly to make sure the chapter is staying on schedule. Reach out here in this issue if you have any questions about the process.

More information about the content team lead and author roles and responsibilities is available for reference in the wiki if needed.

To anyone else interested in contributing to this chapter, please comment below to join the team!

logicalphase commented 3 years ago

Hello to the Ecomm team! I'm here to assist you in any way possible and to help keep the project on track. To that end, please let me know if I can assist along the way, and thank you for volunteering your time and effort into making the next release the best ever!

shantsis commented 3 years ago

I can contribute as an editor if it's needed :)

logicalphase commented 3 years ago

Awesome, @shantsis. Thank you 👍

bobbyshaw commented 3 years ago

I've got a first draft of a content plan in the Google Doc https://docs.google.com/document/d/1LQjpsaWx-5ZtHQGRnHlPnekkxuap50KzJZJTIaSX4B4/edit#. I don't think there's anything too surprising or novel that will require lots of new metrics at this point but I'll continue thinking about it and reviewing. I'm also going to take a look at Wappalyzer signatures to see if there's anything else I can contribute to ensure it's representative of current trends in e-commerce platforms.

I'm not sure how involved reviewers get at this stage but tagging you now in case you want to have a read through of this first draft of the content plan - cc @fili @samdutton @alankent @soulcorrosion

Hi @rrajiv, I'll take a look too but would you like to start having a think about metrics that we might need or want? I appreciate it's you and me as author & analyst. I don't have any experience querying the underlying datasets but happy to dig in and learn if it's something you think you'll need support with.

rrajiv commented 3 years ago

@bobbyshaw - I'll check out last year's metrics first to get an idea of what was done. I don't have much experience either, as this is my first time being an analyst ;)

rockeynebhwani commented 3 years ago

@bobbyshaw - Based on last year's experience, I made some observations with respect to eCommerce platform detection and have raised a couple of issues for Wappalyzer. You may want to get these sorted to get quality data -

https://github.com/AliasIO/wappalyzer/issues/3983
https://github.com/AliasIO/wappalyzer/issues/3984

rockeynebhwani commented 3 years ago

Raised one more issue with respect to platform detection (this time for BigCommerce) - https://github.com/AliasIO/wappalyzer/issues/3986

bobbyshaw commented 3 years ago

@rrajiv How's your investigation into metrics going? Looking at the schedule we should have a pretty good idea of what we need by the end of this month.

rockeynebhwani commented 3 years ago

@bobbyshaw @rrajiv - As you may have noticed in last year's chapter, we used 'Google Analytics Enhanced Ecommerce' as a signal to identify eCommerce sites (which helped in cases where Wappalyzer didn't have platform detection or the site was built using a headless architecture).

This year, you can further increase the eCommerce coverage by looking for another category in Wappalyzer which was not present last year. There is a new category called 'Payment Processors' (category id - 41) in Wappalyzer. Any sites with a payment processor can be classified as an eCommerce site. Sometimes, it's possible to detect the payment processor but not possible to detect eCommerce platform.

For example, in the screenshot below, you can see 'Amazon Pay' as the payment processor (in this case the eCommerce platform is also being detected, but you can easily find examples where the eCommerce platform is not detected at all).

image

If you decide to use this as a signal, you will have to modify last year's queries to reflect this.
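
As a rough illustration of combining these signals, here is a minimal sketch in Python that uses an in-memory SQLite table as a stand-in for the technologies table. The rows, the three-column layout, and the exact category and app strings are assumptions for illustration only; the real queries run on BigQuery and would need the strings checked against the actual data.

```python
import sqlite3

# Toy stand-in for the technologies table (url, category, app).
# All rows below are made up for illustration.
rows = [
    ("https://shop-a.example/", "Ecommerce", "WooCommerce"),
    ("https://shop-a.example/", "Payment processors", "PayPal"),
    ("https://shop-b.example/", "Payment processors", "Amazon Pay"),
    ("https://shop-c.example/", "Analytics", "Google Analytics Enhanced eCommerce"),
    ("https://blog.example/", "CMS", "WordPress"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE technologies (url TEXT, category TEXT, app TEXT)")
conn.executemany("INSERT INTO technologies VALUES (?, ?, ?)", rows)

# A URL counts as ecommerce if ANY of the three signals fires:
# a detected platform, a detected payment processor, or the
# Enhanced Ecommerce analytics app (string assumed, not verified).
ECOMMERCE_SIGNAL_SQL = """
SELECT DISTINCT url
FROM technologies
WHERE category = 'Ecommerce'
   OR category = 'Payment processors'
   OR app = 'Google Analytics Enhanced eCommerce'
ORDER BY url
"""

ecommerce_urls = [url for (url,) in conn.execute(ECOMMERCE_SIGNAL_SQL)]
print(ecommerce_urls)
```

In this toy data, shop-b and shop-c are only picked up through the payment processor and analytics signals, which is the extra coverage described above.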

rrajiv commented 3 years ago

@bobbyshaw - I reviewed the 2020 ecommerce chapter & the associated queries. I can't think of any additional metrics as of now. There is still a few days till EOM so I will look to see if there is anything else to capture.

rrajiv commented 3 years ago

@rockeynebhwani - yes, I remember this from the 2020 chapter and I went through the Wappalyzer queries. As I am not familiar with Wappalyzer, I went through this post: https://discuss.httparchive.org/t/using-wappalyzer-to-analyze-cpu-times-across-js-frameworks/1336/ and got a sense of how Wappalyzer can help in this case.

So in the case where no Ecommerce is detected but a payment processor is detected, would it still matter (or be interesting to write about)?

alankent commented 3 years ago

I can write this without any worry about how hard to go off and do (!), but is it worth manually inspecting a few sites that fall into the bucket of "has payments, not detected as ecomm site" to see if really an ecommerce site? I have no idea how many ecommerce sites would have payment processing hints on the home page (which is what HTTP Archive samples) vs the payments page, so no idea how many extra sites this would find, then no idea how many of those sites are what we would consider ecommerce (e.g. vs subscribing to a newspaper where you can sign up on the home page).

rviscomi commented 3 years ago

Hi @bobbyshaw, how can we help get the outline into a final state?

bobbyshaw commented 3 years ago

I didn't get notified about the comments on the google doc. Let me review those over the weekend and then we can consider the outline completed.

bobbyshaw commented 3 years ago

Draft outline comments reviewed, I'm happy to officially mark the outline as complete. The actual content will depend on what metrics can be achieved and what we find in the results

@rrajiv As I understand it, the next step is to document all of the metrics/queries that will be used. I believe we should be looking to add the queries to this repo https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/sql/2021/ecommerce and reference them in the metrics section of the Google Doc. From your time researching last year's queries, are you ready to pull them across and add/update for any new queries we might need, please?

rrajiv commented 3 years ago

Hi @bobbyshaw - yes, this is on my list and I'll start by going through the 2020 queries this week. As I am a new contributor to this repository, I will also figure out how to commit to the 2021 page.

rrajiv commented 3 years ago

Hi @bobbyshaw - I went through the 2020 queries. They can certainly be updated to 2021 and I'll start working on those.

Should I just update all of them to 2021, and then we figure out what we can use and what needs to be written new?

bobbyshaw commented 3 years ago

@rrajiv Yep, that sounds like a good place to start. If you can pull them all in and add a reference to the google doc, we can then do a mapping between topic and metric to see where we might need some other queries written.

rrajiv commented 3 years ago

@bobbyshaw - I got the ball rolling. Based on me going through the 2020 queries and updating them to 2021, I put a list out of what we have so far, after replacing 2020 with 2021. I hope to have some time early this week to go through the outline to find the gaps in our 2021 list

Link to PR: https://github.com/HTTPArchive/almanac.httparchive.org/pull/2300

rockeynebhwani commented 3 years ago

> I can write this without any worry about how hard to go off and do (!), but is it worth manually inspecting a few sites that fall into the bucket of "has payments, not detected as ecomm site" to see if really an ecommerce site? I have no idea how many ecommerce sites would have payment processing hints on the home page (which is what HTTP Archive samples) vs the payments page, so no idea how many extra sites this would find, then no idea how many of those sites are what we would consider ecommerce (e.g. vs subscribing to a newspaper where you can sign up on the home page).

@alankent - Lately, I have noticed 3 things -

@rrajiv - I think you should query sites where category = 'Payment Processors' but which have no row with category = 'Ecommerce'. You will need a self join on the technologies table, but I think you are going to find Ecommerce sites which are not caught by our current queries... Can't say how many though, but this should be a simple query.

rrajiv commented 3 years ago

It took me a while to come up with a query that I think works. I tested it on the sample table first as I am still waiting for access to the GCP project. Seems OK to me but I'll let everyone weigh in.

#standardSQL
# URLs with a detected payment processor but no detected ecommerce platform.
SELECT DISTINCT
  ha1.url
FROM
  `httparchive.sample_data.technologies_*` AS ha1
WHERE
  ha1.category = 'Payment processors'
  # Exclude any URL that also has a row in the Ecommerce category.
  AND NOT EXISTS (
    SELECT
      1
    FROM
      `httparchive.sample_data.technologies_*` AS ha2
    WHERE
      ha2.url = ha1.url
      AND ha2.category = 'Ecommerce')

Few examples based on the sample data set:

https://estudiafeliz.com/
https://academy.kiwanis.eu/
https://chinawok-schwetzingen.de/
https://freebirds.com/

rviscomi commented 3 years ago

@rrajiv are you on the Slack channel? Please DM me so I can get you added to the GCP project.

rrajiv commented 3 years ago

Hi all - I have been a little busy so I will start running the queries from this week onwards. As a start, I plan on starting with what we used in 2020 and run the same queries for 2021.

@rockeynebhwani - the above is a sample just to get a bunch of sites as examples which match payment but not ecomm. Is there anything you want to explore here?

@bobbyshaw - anything else you are looking for that was perhaps not addressed in the 2020 edition?

logicalphase commented 3 years ago

@rrajiv Thank you for the update. Much appreciated 😊

rockeynebhwani commented 3 years ago

@rrajiv - I ran your query on the main tables and found an issue with our crawl: I was seeing sites with payment processors where the Ecommerce category was missing. @pmeenan spotted the cause when we looked at some examples. Issue raised here - https://github.com/HTTPArchive/cwv-tech-report/issues/25. Basically, cookie-based detection in Wappalyzer was not working. This was the case for all technologies, not only for Ecommerce platforms. For example, OpenCart as a platform was never detected, as we have only cookie-based detection for this platform. This has possibly impacted stats for all other platforms too, but any revised stats after this fix will only be available in the Sep-2021 crawl. I don't think we should wait until Sep-2021; we should just carry on with July-2021 stats (knowing that cookie-based detection doesn't work).

I was also able to spot many additional platforms from that list and have raised multiple issues with Wappalyzer (which will be useful for next year). I think that checking for 'Payment Processors' will give you increased coverage, but you will have to modify all queries and they will become more complex. Take a call on how you want to proceed.

tunetheweb commented 3 years ago

But this is no different than last year, is it @rockeynebhwani? Or is this a new problem only affecting this year's stats?

rviscomi commented 3 years ago

Couldn't we analyze the cookie data in the requests table to reproduce the missing detection in July?
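
A minimal sketch of what that could look like, written in Python rather than against the requests table itself. The signature map is hypothetical (the `OCSESSID` name for OpenCart is an assumption, and `ExamplePlatform` is invented); real Wappalyzer signatures are more numerous and can also match on cookie values.

```python
from http.cookies import SimpleCookie

# Hypothetical cookie-name signatures, keyed by platform.
# These names are assumptions for illustration, not the real signature set.
COOKIE_SIGNATURES = {
    "OpenCart": {"OCSESSID"},
    "ExamplePlatform": {"example_session"},
}


def detect_platforms(set_cookie_headers):
    """Return the sorted platforms whose signature cookie names appear."""
    seen_names = set()
    for header in set_cookie_headers:
        cookie = SimpleCookie()
        cookie.load(header)  # parses "name=value; attr=..." strings
        seen_names.update(cookie.keys())
    return sorted(
        platform
        for platform, names in COOKIE_SIGNATURES.items()
        if names & seen_names
    )


print(detect_platforms(["OCSESSID=abc123; Path=/; HttpOnly", "lang=en"]))
```

Each platform needs its own entry in the map, so the effort grows with the number of platforms that rely on cookie-based detection.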

rockeynebhwani commented 3 years ago

@tunetheweb - Yes .. after going through Pat's fix, I can say that this problem affected last year's stats also (Not only for Ecommerce but for CMS chapter also possibly).

@rviscomi - It will be very cumbersome to do this using the requests table: there are too many Ecommerce platforms, and many have cookie-based detection in addition to other types. I think we should analyse based on July-2021, and maybe run key queries for Sep-2021 to see if any key findings change and adjust if necessary. It's very likely that even with this bug, most findings remain the same.

tunetheweb commented 3 years ago

SGTM.

Looking at Wappalyzer's ecommerce stats it's likely OpenCart is in the top 10, so we may want to consider whether we should still show the top 10 list like we did last year without it (e.g. maybe show a top 5 instead?), or continue with the top 10 and explain its absence (which might cause more confusion than it avoids!).

rrajiv commented 3 years ago

Hi all,

I have completed an initial run and updated the queries for 2021.

spreadsheet: https://docs.google.com/spreadsheets/d/1HCfrXJ52lV46UvxOvDVjJj70fOeFVrTTD8DUm0tPVXE/edit#gid=1519192867

I'll start filling in the comments in the cells from this week onwards.

I am using the 2020 Ecommerce diagram titles as a reference for 2021 so that we can connect what was done in 2020 to the new data in 2021.

For reference, this is the 2020 chapter: https://almanac.httparchive.org/en/2020/ecommerce

Here is a description of sheet and the data it covers with reference to the previous year's figure number

@bobbyshaw - let me know if there are other areas to explore and I can start figuring out the queries for those.

rockeynebhwani commented 3 years ago

@rviscomi - While looking at the results put together by @rrajiv, I noticed that the number of origins for eCommerce platforms like WooCommerce, Shopify etc. doesn't match the Core Web Vitals Technology report.

These are results from @rrajiv's spreadsheet (queries were for July-2021 table)

image

Number of origins for top platforms from Core Web Vitals technology report

image

Any idea why the difference?

rviscomi commented 3 years ago

I don't have access to @rrajiv's queries so I'll defer to him. The CWV Tech Report query is available here.

rockeynebhwani commented 3 years ago

@rviscomi - This is @rrajiv's query. He is just using the 2020 queries for now. I just changed the table to July 2021 and ran it, and I get the same results that @rrajiv has in his spreadsheet.

SELECT
  _TABLE_SUFFIX AS client,
  app,
  COUNT(DISTINCT url) AS freq,
  total,
  COUNT(DISTINCT url) / total AS pct
FROM
  `httparchive.technologies.2021_07_01_*`
JOIN
  (SELECT 
      _TABLE_SUFFIX,
      COUNT(DISTINCT url) AS total
    FROM 
      `httparchive.summary_pages.2021_07_01_*` 
    GROUP BY 
      _TABLE_SUFFIX)
USING (_TABLE_SUFFIX)
WHERE
  category = 'Ecommerce'
GROUP BY
  client,
  app,
  total
ORDER BY
  client DESC,
  pct DESC
LIMIT 1000

rrajiv commented 3 years ago

All the queries for 2021 are currently on this branch: https://github.com/HTTPArchive/almanac.httparchive.org/tree/ecommerce-sql-2021/sql/2021/ecommerce

rviscomi commented 3 years ago

I think the source of the discrepancy is that the July 2021 CrUX dataset was released a couple of weeks ago and we won't run the technology detections on it until the September 2021 HTTP Archive crawl. So the 100k difference may be attributed to websites that fell out of the CrUX dataset between May and July. In other words, the detections HTTP Archive ran in the July crawl were based on websites in the May CrUX dataset.

So for this chapter's purposes, you're correctly measuring the number of mobile ecommerce pages as 444,390. If you want to join that with the latest CrUX dataset though, for example to report on real-user Core Web Vitals performance, the intersection only contains 334,171 pages. Alternatively you could join it with the older May CrUX dataset, which should have data for 100% of the pages.

The CWV Tech Report dashboard favors the latest real-user CWV data rather than the latest lab-based technology detections, so it trades off coverage.
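
To make the trade-off concrete, here is the arithmetic behind those two figures as a small Python sketch. The variable names are made up for illustration; the counts are the ones quoted above.

```python
# Counts quoted above for mobile ecommerce pages.
detected_pages = 444_390        # detected in the July HTTP Archive crawl
pages_in_latest_crux = 334_171  # of those, also present in the July CrUX dataset

# Pages that drop out when joining with the latest CrUX dataset.
missing = detected_pages - pages_in_latest_crux
# Share of detected pages that still have real-user data after the join.
coverage = pages_in_latest_crux / detected_pages

print(f"{missing:,} pages missing, {coverage:.1%} coverage")
```

So joining with the latest CrUX dataset keeps roughly three quarters of the detected pages.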

rrajiv commented 3 years ago

Hi all - let me know if there are any more queries to write/research/run. Thanks

bobbyshaw commented 3 years ago

Thanks @rrajiv. I'll try and get back to you in the next week on this.

bobbyshaw commented 3 years ago

Hey @rrajiv. Thanks for all your work on the metrics. I've been through the google doc and annotated where each one will be used. This is helpful to see where there might be some gaps.

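If it's not too much bother, I'd be interested in getting stats on a few more technology categories:

- Top "JavaScript frameworks" category
- Top "JavaScript libraries" category
- Top CMS technology category
- Top "Page Builders" technology category
- Top "A/B testing" technology category
- Top "Personalisation" technology category
- Top "Loyalty & Rewards" technology category
- Top "reviews" technology category
- Top "Translation" technology category
- Top "Buy now / pay later" technology category

There were a couple of other minor potential discussion points in the outline. Are either of these feasible?

- Use of link hreflang tags (to indicate international ecommerce)
- Use of content-security-policy header set? (report only/enforce)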

If I wanted to get an idea of how ecommerce sites compare to the wider web, e.g. with regards to lighthouse scores and number of technologies detected, is this a metric easily available to us? Given how any performance or technology decision has a direct effect on revenue of ecommerce sites, it'd be interesting to see how they compared.

I've spotted a couple of other metrics that other chapter teams have created, e.g. payment request API but I assume we can just re-use those once they've got the results.

If you have any questions/comments just shout. Otherwise I look forward to getting the result data!

rrajiv commented 3 years ago

@bobbyshaw - for the following, I assume you just mean sites that are "Ecommerce" && one of the following? I.e. what you are looking for is "Top for Ecommerce sites"?

> - Top "JavaScript frameworks" category
> - Top "JavaScript libraries" category
> - Top CMS technology category
> - Top "Page Builders" technology category
> - Top "A/B testing" technology category
> - Top "Personalisation" technology category
> - Top "Loyalty & Rewards" technology category
> - Top "reviews" technology category
> - Top "Translation" technology category
> - Top "Buy now / pay later" technology category

I will look into these once I get the above stats. Likely we can re-use something from the security section regarding the content-security-policy header since they covered it last year: https://almanac.httparchive.org/en/2020/security#content-security-policy

> There were a couple of other minor potential discussion points in the outline. Are either of these feasible?
> - Use of link hreflang tags (to indicate international ecommerce)
> - Use of content-security-policy header set? (report only/enforce)

I am not aware of something easily available, but I am thinking along the lines of crafting a query that compares, say, Lighthouse scores for ecommerce and non-ecommerce. I don't know yet if this will capture what we want.

> If I wanted to get an idea of how ecommerce sites compare to the wider web, e.g. with regards to lighthouse scores and number of technologies detected, is this a metric easily available to us? Given how any performance or technology decision has a direct effect on revenue of ecommerce sites, it'd be interesting to see how they compared.

bobbyshaw commented 3 years ago

@rrajiv

> What you are looking for is "Top for Ecommerce sites"?

Yes, please.

> I am not aware of something easily available but I am thinking along the lines of crafting a query that compares say lighthouse scores for ecommerce and non-ecommerce

I think this would be enough to be a point in the article. Basically this query without the category filter would tell us whether ecommerce sites are on average over or under performing the rest of the web.
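
For what it's worth, a minimal sketch of that comparison in Python, using in-memory SQLite tables as stand-ins for the real datasets. The table layouts, the `performance_score` column, and all rows are assumptions for illustration; the real query would run on BigQuery against the crawl tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE technologies (url TEXT, category TEXT);
CREATE TABLE lighthouse (url TEXT, performance_score REAL);
""")
# Made-up rows: two ecommerce pages and two other pages.
conn.executemany("INSERT INTO technologies VALUES (?, ?)", [
    ("https://shop-a.example/", "Ecommerce"),
    ("https://shop-a.example/", "CMS"),
    ("https://shop-b.example/", "Ecommerce"),
    ("https://blog.example/", "CMS"),
])
conn.executemany("INSERT INTO lighthouse VALUES (?, ?)", [
    ("https://shop-a.example/", 0.40),
    ("https://shop-b.example/", 0.60),
    ("https://blog.example/", 0.80),
    ("https://news.example/", 0.70),
])

# Flag each page as ecommerce or not, then average the score per group.
COMPARE_SQL = """
WITH ecommerce AS (
  SELECT DISTINCT url FROM technologies WHERE category = 'Ecommerce'
)
SELECT
  e.url IS NOT NULL AS is_ecommerce,
  COUNT(*) AS pages,
  AVG(l.performance_score) AS avg_score
FROM lighthouse AS l
LEFT JOIN ecommerce AS e ON e.url = l.url
GROUP BY is_ecommerce
ORDER BY is_ecommerce
"""

results = {
    bool(flag): (pages, avg_score)
    for flag, pages, avg_score in conn.execute(COMPARE_SQL)
}
print(results)
```

The `LEFT JOIN` keeps pages with no detected technologies in the non-ecommerce group, so the comparison covers the wider web and not just pages with at least one detection.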

Thank you!

rrajiv commented 3 years ago

@bobbyshaw - thanks! I'll work on these over the next few days and get back to you.