HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

SEO 2020 #908

Closed by foxdavidj 3 years ago

foxdavidj commented 4 years ago

Part II Chapter 7: SEO

Content team

Content team (authors, reviewers, analysts): @aleyda @ipullrank @fellowhuman1101 @clarkeclark @natedame @catalinred @aysunakarsu @ashleyish @dsottimano @dwsmart @en3r0 @Gathea @rachellcostello @ibnesayeed @max-ostapenko @Tiggerito @antoineeripret
Links: Draft (Doc), Queries (*.sql), Results (Sheet)

Content team lead: @aleyda

Welcome chapter contributors! You'll be using this issue throughout the chapter lifecycle to coordinate on the content planning, analysis, and writing stages.

The content team is made up of the contributors listed above.

New contributors: If you're interested in joining the content team for this chapter, just leave a comment below and the content team lead will loop you in.

Note: To ensure that you get notifications when tagged, you must be "watching" this repository.

Milestones

0. Form the content team

1. Plan content

2. Gather data

3. Validate results

4. Draft content

5. Publication

aleyda commented 4 years ago

@rockeynebhwani btw, I actually think we should be good already and somehow didn't realize it when writing my previous response (!!) :D Based on what you were asking: we expect to collect all structured data following info...

So, we expect to gather all this information on one hand (which at this point is the most important aspect).

Then another matter is whether we decide to highlight something further or not in the SEO chapter due to its Search prominence (we definitely want to give an overview of it all); we will decide that later on in the process and can share it with you when we're at that point of the writing. If you see it's not enough from an ecommerce perspective and you would like to show something more/different, you will still have the data to access and feature in the ecommerce chapter :) I think this will be the easiest process.

Thanks again!

rockeynebhwani commented 4 years ago

sounds good @aleyda .. @rviscomi pointed me to https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/2019/10_SEO/10_05.sql

Most likely, this data is already being gathered. It should be a matter of just querying it.

aleyda commented 4 years ago

Amazing! That's great to know. Thanks @rockeynebhwani :)

rockeynebhwani commented 4 years ago

Correction. I did a run with all the almanac custom metrics, but this information about social channels is not being gathered currently. Sample run (look for '10.5') - https://www.webpagetest.org/custom_metrics.php?test=200717_WK_030e8ad4e4b08550d361cd871b9db617&run=1&cached=0

It would be good if somebody on the team here could pick up a PR to almanac.js (https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js) to add this custom metric.
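For illustration, a minimal sketch of what such a custom metric could look like, assuming it runs in the page context like the other almanac.js metrics (the function name, the list of social hosts, and the top-level-only lookup are illustrative assumptions, not the actual almanac.js code):

```js
// Hypothetical custom metric: count social profiles declared via sameAs in JSON-LD.
// Assumes it runs in the page context, like other almanac.js custom metrics.
function getSocialSameAsCount() {
  const socialHosts = ['facebook.com', 'twitter.com', 'instagram.com', 'linkedin.com', 'youtube.com'];
  const scripts = document.querySelectorAll('script[type="application/ld+json"]');
  const links = new Set();

  for (const script of scripts) {
    let data;
    try {
      data = JSON.parse(script.textContent);
    } catch (e) {
      continue; // invalid JSON-LD, skip
    }
    // Naive: only looks at sameAs on the top-level object(s).
    const nodes = Array.isArray(data) ? data : [data];
    for (const node of nodes) {
      const sameAs = node && node.sameAs;
      if (!sameAs) continue;
      const urls = Array.isArray(sameAs) ? sameAs : [sameAs];
      for (const url of urls) {
        if (socialHosts.some(host => String(url).includes(host))) {
          links.add(url);
        }
      }
    }
  }
  return links.size;
}
```

As noted further down in this thread, a top-level-only check like this would miss sites (e.g. Yoast) that nest their entities in an @graph, so a real metric would need to walk the whole object tree.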

Tiggerito commented 4 years ago

@rockeynebhwani @aleyda

A good idea to collaborate on this.

I agree that sameAs was dropped by Google, so it's less significant now.

searchAction is still supported and I think the SD is used a lot. However, getting a search box is independent of adding the markup, and very few sites get them. The markup is there to tell Google how to directly use the site's internal search system. A thought I had was to verify that the searchAction matches how the site's search actually works, but that may be difficult.

Logo is also limited-value markup, only used if the site has a knowledge panel entry.

Something important to mention is that this analysis is on home pages only, which is very limiting with SD as most types are only of value on internal pages (Product, Review, Recipe etc). Outside the above features you will likely never see a rich result on a home page. So if we do detect other entity types, it's probably a mistake, of no value, or an attempt to manipulate the system. Currently we use JS to identify all entity types present on a page (no properties yet, just the types used), and there is a report that checks that list against the ones used in the Google Gallery.

I agree that reporting on the different formats and maybe vocabularies used would be of value. I also think I can determine if SD was added via JavaScript (or that it changed).

I could probably pull out some specifics from the SD, like the search action URL and the logo. That could be used to verify things. I.e. is the logo on the page.

ibnesayeed commented 4 years ago

I see that this is a chapter with a handful of reviewers already, but crawlability of a site is an indicator of its archivability (which is my field of expertise), so I would be interested in reviewing this.

ibnesayeed commented 4 years ago

A quick suggestion, this chapter may also want to analyze soft-404s, which are an indication of poor application design and may hurt SEO. These can be hilarious at times.

Tiggerito commented 4 years ago

A quick suggestion, this chapter may also want to analyze soft-404s, which are an indication of poor application design and may hurt SEO. These can be hilarious at times.

I like the idea. Unfortunately we can only analyse home pages which are probably not candidates for soft 404 status.

It would be interesting to know how we could determine whether a page should be classed as a soft 404. Google probably uses a few signals, like headings, lack of content, and redirects, to work them out.

ibnesayeed commented 4 years ago

It would be interesting to know how we could determine whether a page should be classed as a soft 404. Google probably uses a few signals, like headings, lack of content, and redirects, to work them out.

In the web science community there are two very well known approaches to identify soft-404s and both have their pros and cons:

  1. Train a simple text-based classifier using page title and body text (all markup removed, and ideally the boilerplate template removed as well) from legitimate 200 and 404 pages as training sets, then use it to classify unknown 200 responses. This relies on the idea that there will be some common phrases in the main content area suggesting that the item/page/resource/product is not available.
    • Pro: Works offline on crawled datasets without any extra requests to the live site.
    • Con: Needs an annotated gold data set for training in each language.
  2. Nudge the last segment of a URL (usually the last path segment, but in some cases it can be a query parameter as well) and make another request to the new URL (e.g., if one wants to know whether example.com/foo/bar is a soft-404, make another request to something like example.com/foo/blah), then observe whether the two responses are exactly/almost the same; if so, the original is likely a soft-404 or some other error page, even though the response is 200 in both cases.
    • Pro: Works across languages without any prior knowledge or classifier model.
    • Con: Increases the number of requests made to a server, hence it is best suited to large-scale crawlers that are already checking many URLs of a site and can compare their contents.

Note: There are implementations available for both techniques, and some tools built on the first technique already come with a pre-trained model (but allow training new models) to work out of the box, especially on English-language web pages. I once worked on a small project with people who archived Arabic-language pages and wanted to classify soft-404s, but training sets were not available and they were analyzing archived data, so the sites might not be live to try the second approach. I suggested they use machine translation to translate the Arabic pages into English and then use classifier models built for English. I am not aware of any research on this "translate then classify" approach for soft-404 detection, but I did not have enough time to explore it further and publish a paper if it was indeed novel work.
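A rough sketch of the second (URL-nudging) approach, written in JavaScript using the standard fetch API; the random-slug strategy and the similarity threshold are illustrative assumptions, not a tested implementation:

```js
// Heuristic soft-404 check: fetch the candidate URL and a deliberately bogus
// sibling URL, then compare the responses. If a nonsense path returns nearly
// the same 200 page, the candidate is likely a soft-404 / generic error page.
async function looksLikeSoft404(url, similarityThreshold = 0.9) {
  const original = new URL(url);
  const nudged = new URL(url);
  // Replace the last path segment with a random slug that almost certainly doesn't exist.
  const segments = nudged.pathname.split('/');
  segments[segments.length - 1] = 'soft404-probe-' + Math.random().toString(36).slice(2);
  nudged.pathname = segments.join('/');

  const [resA, resB] = await Promise.all([fetch(original), fetch(nudged)]);
  if (resB.status === 404) return false; // site returns proper 404s
  const [bodyA, bodyB] = await Promise.all([resA.text(), resB.text()]);

  // Very crude similarity: proportion of shared lines. A real implementation
  // would strip boilerplate and use shingling or edit distance instead.
  const linesA = new Set(bodyA.split('\n'));
  const linesB = bodyB.split('\n');
  const shared = linesB.filter(line => linesA.has(line)).length;
  return shared / Math.max(linesB.length, 1) >= similarityThreshold;
}
```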

ibnesayeed commented 4 years ago

I like the idea. Unfortunately we can only analyse home pages which are probably not candidates for soft 404 status.

Yes, you are right, home pages are less likely to be soft-404s.

Tiggerito commented 4 years ago

That's me done for the day 🍷

I've created SQL to work out Core Web Vitals per device. This is actual data for last month. It should be simple to do one per country:

(screenshot: Core Web Vitals results segmented by device)

I've also been playing with processing the raw HTML. This shows how many JSON-LD script tags are added per page (10k sample). The site with 50 is marking up every image in a separate script tag:

(screenshot: distribution of JSON-LD script tag counts in the 10k sample)

And you might like this. I managed to parse the JSON-LD scripts from the raw HTML to see if they were valid JSON. 6 failed. I was pleasantly surprised.

(screenshot: JSON-LD parse results, 6 invalid)

Many more ideas. I need to work on microdata etc., as well as gathering the same info from the rendered pages. Finally, a comparison of counts between raw and rendered would be a good clue that they are altering SD via JS.
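Roughly, the parse check (and the raw vs. rendered comparison) could look something like the sketch below; the regex-based extraction and function name are illustrative assumptions rather than the actual code used:

```js
// Sketch: count JSON-LD blocks in an HTML string and how many fail to parse.
// Regex-based extraction is a rough approximation of what a custom metric or
// a UDF over the raw HTML might do; it is not a full HTML parser.
function auditJsonLd(html) {
  const scriptRe = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let total = 0;
  let invalid = 0;
  for (const match of html.matchAll(scriptRe)) {
    total++;
    try {
      JSON.parse(match[1]);
    } catch (e) {
      invalid++; // malformed JSON-LD block
    }
  }
  return { total, invalid };
}

// Comparing the counts for the raw vs. rendered HTML of the same page hints
// at structured data being added or changed by JavaScript:
// const raw = auditJsonLd(rawHtml), rendered = auditJsonLd(renderedHtml);
// const likelyInjectedViaJs = rendered.total !== raw.total;
```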

aleyda commented 4 years ago

I see that this is a chapter with a handful of reviewers already, but crawlability of a site is an indicator of its archivability (which is my field of expertise), so I would be interested in reviewing this.

Hi @ibnesayeed, there are many already, but I will definitely add you. In the "pings" I sent above asking whether people wanted to stay on as reviewers or switch to analyst, many didn't answer in the end, so I'm a bit concerned about the availability of everyone who had initially registered :D If everyone ends up answering and collaborating we can even split areas/topics, but if not, it would be great to have more available people! Thanks :)

aleyda commented 4 years ago

@Tiggerito This is amazing! Regarding the Core Web Vitals: would it also be possible to add "tablet" as one of the devices, or is it only possible to get it for mobile and desktop?

About the JSON-LD parsing: This is great! We can then specify how many of the websites implementing their SD with this method have "basic" script configuration issues.

And then: "a comparison of numbers between raw and rendered would be a good clue that they are altering SD via JS" - this would be amazing, and also very much worth doing for the meta robots and metadata configurations: the difference between the raw vs. rendered HTML!

aleyda commented 4 years ago

I'd be happy to help as an analyst. If there is enough analysts for this chapter, happy to help with another one as well :)

Hi @antoineeripret - I just saw this message you sent, which I had overlooked (sorry) and hadn't replied to. I think an additional analyst would be amazing, since @Tiggerito has started going through the data and we're looking to have a data-rich chapter (we already started to define the topics here). Let me know if you're still interested in being an analyst for the chapter and I can add you too, so you can also start assessing the viability of the metrics connected with the topics we're proposing to include in the chapter with @Tiggerito :) Looking forward to your response - thanks again!

aleyda commented 4 years ago

I can contribute as an analyst too if there is still need at that part. Thanks.

Hi @aysunakarsu! Would you still be available as an analyst too? Let me know and, if so, I will add you as one (instead of a reviewer), as I think we will need more people there and we already have many reviewers! @Tiggerito has started to go through the data for the chapter topics we're specifying here. Please let me know if you're still interested and, if so, share your email so I can add you to the Google Docs to contribute too :) Thanks again!

Tiggerito commented 4 years ago

@aysunakarsu @antoineeripret I'd love backup on this. I'm not an analyst, I just play one on TV.

aleyda commented 4 years ago

@aysunakarsu @antoineeripret I'd love backup on this. I'm not an analyst, I just play one on TV.

@antoineeripret has just confirmed directly with me; I've added him to the Google Docs and to our private conversation too! He'll be able to start digging in tomorrow :) Looking forward to @aysunakarsu's response to see if she's interested, and then to getting @max-ostapenko, who had already confirmed as an analyst, involved too! Thanks for all your effort, @Tiggerito :D

aleyda commented 4 years ago

Hi @obto @rviscomi,

One question: when going through some of the information we want to show in the outline & metrics with @Tiggerito, like the percentage of sites implementing certain configurations, we saw that instead of sharing only the % of all sites implementing something in X or Y way, it could be useful in certain cases to provide a comparison with the top X / highest-traffic sites on the Web, to give a perspective of usage among the "most important" sites. We could grab this "top websites" data from third-party tools like SEMrush or SimilarWeb, although I don't know if it's OK to refer to third-party tools as "sources" for this type of information. If not, would there be another way to do it that complies with the Web Almanac's best practices, taking into consideration what we want to achieve?

Thanks for your feedback :)

foxdavidj commented 4 years ago

@aleyda Should be possible. We can import data from the Majestic Million and segment various metrics based on the websites' ranking 👍

rockeynebhwani commented 4 years ago

@obto Majestic Million is paid... isn't it?

foxdavidj commented 4 years ago

@rockeynebhwani Nope. The download of all the site rankings is available right on the homepage of their site I linked.

Tiggerito commented 4 years ago

Analysts @max-ostapenko @antoineeripret: I'm finishing early today (it's noon here) and want to get things out there so others can play. Where I'm at:

I've created a draft pull request where we can add all our new database queries. You should be able to see them here, and hopefully contribute. This is my first time using GitHub with others, so I'm not sure of the dynamics.

At the moment it is mostly last year's scripts, renamed. The pull request is where I'm chatting with Rick and learning how to do this. He's already provided some change suggestions, like updating the main comment and changing the file names.

I've also created a fork to edit the almanac.js file (data from the rendered page). I've added some starter properties and may do a pull soon so we can get feedback from the bosses.

I've been testing it with this simple page. Open the developer console, refresh, and you should see an object, which is what will be used. You could copy and edit the file to test different things, or copy the whole script section into the console of any page. I just did it with Twitter. 😎

(screenshot: the custom metric's output object in the developer console)

I think it would be worth working out a console script to speed up testing using WebPageTest, which powers the real crawl. Instructions to use that are here.

Note that we have to get this script completed and merged before the end of the month. So we need to tie down the metrics we will get from it.

antoineeripret commented 4 years ago

@Tiggerito: Thank you for the update. That's great!

I will have a look at your updated almanac.js this afternoon and check, against the Google Docs document we have, whether we are missing any metrics that @aleyda needs. I recall seeing yesterday that she wanted to compare h1 and title, for instance, so we need to get that data for her beforehand.

You should have a pull request from me on your fork when you wake up tomorrow, with the work done today.

aleyda commented 4 years ago

@aleyda Should be possible. We can import data from the Majestic Million and segment various metrics based on the websites' ranking 👍

This is great and will serve well for what we want to do. In this case the sites are ranked by the number of linking subnets, but that should be OK for giving a sense of the "most popular" sites too. I will now also specify in the metrics where we want this type of additional "top sites" segmentation. Thanks again!

aleyda commented 4 years ago

Hi @rviscomi - Regarding the rendering constraints: I see that @Tiggerito wanted to ask you something, and today I also saw that @ipullrank had some ideas about the rendering constraints, but it seems that we can't tag you in the Google Docs comment here. Could you please check it out? Thanks again for all your help on this :)

Tiggerito commented 4 years ago

@Tiggerito This is amazing! Regarding the Core Web Vitals: would it also be possible to add "tablet" as one of the devices, or is it only possible to get it for mobile and desktop?

I don't think I replied to this one (🍷 warning): the data does actually segment by tablet. I removed it because that's what they did before; I've now added all devices back. They may have removed it because all other data sources only segment by mobile/desktop.

aleyda commented 4 years ago

@Tiggerito thank you! Yes, if it's possible it would be great to have the tablet segment for the Core Web Vitals :)

aleyda commented 4 years ago

Hi again @obto - when going through video inclusion/optimization we were wondering about the best method to identify usage since, unlike images, the video tag is not used in most cases; videos are inserted using third-party embeds like YouTube, Wistia, Vimeo, etc. Should we look for these script inclusions to verify usage? We were also wondering how this is done in the Media chapter when measuring video usage. Thanks in advance!

cc @fellowhuman1101 @ipullrank @Tiggerito @antoineeripret

antoineeripret commented 4 years ago

@aleyda Should be possible. We can import data from the Majestic Million and segment various metrics based on the websites' ranking 👍

Hi @obto, another question for you: can we define where (in your BQ dataset) this data will be imported? Just so we know how to access this information and include it in our queries with @Tiggerito.

Thanks.

foxdavidj commented 4 years ago

@antoineeripret I'm in talks with Rick and Paul about adding it, and if we should use Majestic Million or Cisco's Umbrella Million. The data would live somewhere like httparchive.almanac.majestic_million though.

Does that help?

foxdavidj commented 4 years ago

@aleyda If I recall correctly, we did not detect videos from the HTML precisely for this reason. Instead we analyzed the requests the browser made and checked whether the MIME type of the file was a known video format. Here's a link to the Media chapter's queries.

aleyda commented 4 years ago

@aleyda If I recall correctly, we did not detect videos from the HTML precisely for this reason. Instead we analyzed the requests the browser made and checked whether the MIME type of the file was a known video format. Here's a link to the Media chapter's queries.

Thanks @obto! This is interesting indeed. @antoineeripret @Tiggerito, could you please see if we could do the same for validating the existence of videos? And besides this, it would be good to check the usage of the VideoObject structured data :)

rviscomi commented 4 years ago

Thanks for the ping. I've replied to the rendering thread. I already get notifications for all comments so that might be why you couldn't @ me in the doc.

Regarding Majestic Millions and other ranked datasets, I want to urge caution that new datasets added to the methodology should be reviewed first. We've avoided approximating site popularity in the past due to incompatibilities with the HTTP Archive's sample set, which is based on real-user data in the unranked Chrome UX Report. It's worth investigating Majestic's efficacy by seeing how many HTTP Archive URLs are covered. There's also a question of compatibility because HTTP Archive tests the home pages of full origins, but some site ranking datasets may only provide domain-level data like google.com as opposed to the full origin like https://maps.google.com. Given all that, I'd encourage you to assume that the data will be unranked. In parallel we can evaluate the efficacy of the ranked datasets.

Tiggerito commented 4 years ago

@aleyda If I recall correctly, we did not detect videos from the HTML precisely for this reason. Instead we analyzed the requests the browser made and checked whether the MIME type of the file was a known video format. Here's a link to the Media chapter's queries.

Thanks @obto! This is interesting indeed. @antoineeripret @Tiggerito, could you please see if we could do the same for validating the existence of videos? And besides this, it would be good to check the usage of the VideoObject structured data :)

Looks like they collected a video count from Lighthouse data (04_01), as @obto said, via MIME types. I could duplicate that.

Tiggerito commented 4 years ago

For the eCommerce chapter 2020, I was considering pulling a custom metric to find out how many social channels sites are using on average and publishing this information via Schema.org. I was planning to do this using a custom metric (thanks to @savsav)

Hi @rviscomi,

How far did you get with this? I've just submitted a pull request with a new structured-data property. At the moment it just counts JSON-LD scripts and parses them for errors, but my plan is to dig in a bit more as well as gather data from other formats like microdata.

I know a bit about SD and processing it with JavaScript (my tool), so I could probably pull together something more reliable for anything you want. E.g. your example code assumed sameAs would be in the top-level object; that would fail for Yoast, which uses a graph to define all entities.

With the limited resources we can't include a full-blown structured data parser/validator, so we will have to fudge things. E.g. as a simple solution I could parse the whole object tree (already done) and create an array of all sameAs properties in it. You won't know the context for them, but you could assume that a sameAs pointing to a social website is related to that old Google feature. And it would be easy to pull in all microdata-based sameAs at the same time.

I'm personally interested in pulling in the few properties that Google currently supports in their guidelines for home-page structured data: logo, sitelinks search box, and maybe local business properties like address and hours. Again, it will be quite dumb: the presence of the property will be taken to mean it is in the right place for what Google requires. I may include some very basic validation where it is easy.
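A minimal sketch of the "parse the whole object tree" idea for sameAs mentioned above, assuming JSON-LD that may nest entities in an @graph (as Yoast does); the function name is illustrative:

```js
// Collect every sameAs value found anywhere in a parsed JSON-LD document,
// regardless of nesting (top-level object, arrays, or an @graph of entities).
function collectSameAs(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach(item => collectSameAs(item, found));
  } else if (node && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      if (key === 'sameAs') {
        (Array.isArray(value) ? value : [value]).forEach(v => found.push(String(v)));
      } else {
        collectSameAs(value, found); // recurse into @graph, nested entities, etc.
      }
    }
  }
  return found;
}

// Example: works for both flat JSON-LD and Yoast-style @graph markup.
// collectSameAs(JSON.parse(scriptTag.textContent));
```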

Maybe the pull request is a good place to continue the conversation?

rviscomi commented 4 years ago

@Tiggerito where did that quote come from, and was it from me? 😅

Tiggerito commented 4 years ago

@Tiggerito where did that quote come from, and was it from me? 😅

@rviscomi I tagged the wrong name. It was from @rockeynebhwani 🤦‍♂️

@rockeynebhwani could you read this post... https://github.com/HTTPArchive/almanac.httparchive.org/issues/908#issuecomment-662211100

Tiggerito commented 4 years ago

FYI, I just volunteered to be the analyst for the Markup chapter 🤦‍♂️. There's a lot of overlap in the metrics, so it makes sense to add what they need to what we already get. It may also make me aware of other things of interest to us.

Tiggerito commented 4 years ago

So far I have added a lot of code to gather the metrics asked for. I'm not sure how we are going to review/test it to make sure it covers the requirements and fully works. There's a lot of logic and chosen data based on my own understanding of things. 7 days to go 😲

The pull request now indicates what's implemented and what I think I need to do:

https://github.com/HTTPArchive/legacy.httparchive.org/pull/171

catalinred commented 4 years ago

HTTPS pages commonly suffer from a problem called mixed content, where subresources on the page are loaded insecurely over http://. https://blog.chromium.org/2019/10/no-more-mixed-messages-about-https.html

Would mixed content be a good addition to this chapter?

I made a quick search of the doc and wasn't able to find anything related. We know that Chrome blocks mixed content so this can affect UX, CTR, and therefore SEO. Oh, and we have this Reddit answer from John Mueller.

Nevertheless, I'd still want to see how many HTTPS pages suffer from mixed content. What do you think?

AFAIK, this is provided by Lighthouse's Best Practices, within the "Uses HTTPS" section, so it might be easy to get stats for. cc @Tiggerito

aleyda commented 4 years ago

This is a great call @catalinred - and something I thought I had added along with the HTTPS usage but in the end didn't! Thanks for pointing it out. Let's see what @Tiggerito and @antoineeripret say about the viability of getting this data and, if it's possible, we'll add it too :)

tunetheweb commented 4 years ago

Was measured in Security chapter last year: https://almanac.httparchive.org/en/2019/security#mixed-content so definitely viable.

Probably still belongs there more so than this chapter IMHO as, AFAIK, there is no direct SEO impact of mixed content per se.

Fine to have it in both chapters if it's important to both but, at the same time, we do want to concentrate on what's really important to each chapter's topic if something is only partially relevant, and also to avoid duplicating lots of content.

On the other hand if you do feel it belongs then don’t let me hold you back! Query will likely be written for Security chapter and, if not, then have last year’s query as a basis so should be easy to get the data.

aleyda commented 4 years ago

Thanks @bazzadp - Really appreciate your feedback!

What do you think about this @ipullrank @fellowhuman1101 @Tiggerito @antoineeripret?

Tiggerito commented 4 years ago

I'm happy to let the Security guys work it out 😉

Tiggerito commented 4 years ago

I'm trying to test a few things. Can anyone provide example URLs for the following:

• Using a canonical link in the HTTP header
• Using x-robots-tag in the HTTP header
• Using a googlebot-specific x-robots-tag in the HTTP header
• Using a googlebot-specific robots meta tag

etc. Especially complex examples of robots and canonical use.
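For reference, a sketch of how these header-based signals could be detected once the response headers are available to the script; the header forms shown are standard, but the headers array and detection logic are illustrative assumptions, not the actual almanac.js code:

```js
// Sketch: given an array of {name, value} response headers, flag SEO-relevant
// directives delivered via HTTP headers rather than in the HTML.
// Example header forms being looked for:
//   Link: <https://example.com/page>; rel="canonical"
//   Link: <https://example.com/fr/>; rel="alternate"; hreflang="fr"
//   X-Robots-Tag: noindex, nofollow
//   X-Robots-Tag: googlebot: noindex
function inspectSeoHeaders(headers) {
  const result = { canonical: false, hreflang: false, xRobots: false, googlebotXRobots: false };
  for (const { name, value } of headers) {
    const lowerName = name.toLowerCase();
    const lowerValue = value.toLowerCase();
    if (lowerName === 'link') {
      if (/rel=["']?canonical/.test(lowerValue)) result.canonical = true;
      if (/hreflang=/.test(lowerValue)) result.hreflang = true;
    }
    if (lowerName === 'x-robots-tag') {
      result.xRobots = true;
      if (lowerValue.includes('googlebot:')) result.googlebotXRobots = true;
    }
  }
  return result;
}
```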

rviscomi commented 4 years ago

Just popping in to say that you all are doing an amazing job on this chapter! Really looking forward to seeing the first draft! 🚀

Tiggerito commented 4 years ago

And hreflang in the header.

antoineeripret commented 4 years ago

@Tiggerito : You can have a look at https://www.simonelectric.com/ for canonical & hreflang in the HTTP Header. I'm asking around for the other cases because I'm not aware of any.

Tiggerito commented 4 years ago

@antoineeripret that's a great example. Helped me test quite a few things.

I now have what I think is a complete script, but I need help validating it and double-checking that it's the data we need.

Here's the zipped up json output for https://www.simonelectric.com/

simonelectric.zip

Some things about it:

We managed to work out how to get the raw HTML without needing to access the expensive table. I get it injected into the script 😎 Not only that, but I also have access to the headers. 🤯

In the file you will see references to raw/rendered/headers versions of the data.

This is the data everyone gets, so there will be stuff we don't care about.

I can add logic to make it easier to extract information. E.g. I added a canonical_missmatch bool to tell us if two different canonical values were specified via the headers/raw vs. rendered content. Tell me if there is anything you want.
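As a sketch of that kind of check (the argument names are illustrative; the canonical_missmatch spelling mirrors the property mentioned above):

```js
// Compare the canonical URL reported by each source and flag disagreements.
// headerCanonical:   from the Link response header (rel="canonical")
// rawCanonical:      from the <link rel="canonical"> in the raw HTML
// renderedCanonical: from the rendered (post-JavaScript) DOM
function canonicalMismatch(headerCanonical, rawCanonical, renderedCanonical) {
  const values = [headerCanonical, rawCanonical, renderedCanonical]
    .filter(v => v)                         // ignore sources that didn't declare one
    .map(v => v.trim().replace(/\/$/, '')); // normalize trailing slashes
  return new Set(values).size > 1;          // true if the declared canonicals disagree
}

// e.g. canonical_missmatch = canonicalMismatch(headerValue, rawValue, renderedValue);
```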

Maybe we add a new status to the almanac.js column of the metric tables to indicate the data source has been tested and verified?

I can run tests, and teach others how to do it. There is a lot of logic going on here and less than a week to perfect it. There's a good chance some of the data ends up being wrong, but we have a lot of it to work from.

There will be other changes in the structure as I have to fold this in with the main script. But those changes should only add information.

I'm ready for a 🍷 😁

antoineeripret commented 4 years ago

Wow @Tiggerito, that's pretty impressive!! Amazing work you did!!

Maybe we add a new status to the almanac.js column of the metric tables to indicate the data source has been tested and verified?

100%, or use a color code, as you prefer :) We can start building SQL queries once the data source is defined and tested!