facebook / docusaurus

Easy to maintain open source documentation websites.
https://docusaurus.io
MIT License
56.73k stars 8.54k forks

Add lastmod to sitemap #2604

Closed jdevalk closed 8 months ago

jdevalk commented 4 years ago

🐛 Bug Report

The XML sitemaps currently output loc, changefreq and priority for every url set. I would propose dropping the changefreq and priority fields, as none of the search engines use these, and instead adding the lastmod field, with the last modification date of the file.

Have you read the Contributing Guidelines on issues?

Yes.

To Reproduce

(Write your steps here:)

  1. Open any Docusaurus v2 sitemap :)

Expected behavior

The current output would be:

<url>
        <loc>https://developer.yoast.com/features/canonical-urls/api</loc>
        <changefreq>weekly</changefreq>
        <priority>0.5</priority>
</url>

(Write what you thought would happen.)

Actual Behavior

I propose changing it to:

<url>
        <loc>https://developer.yoast.com/features/canonical-urls/api</loc>
        <lastmod>2020-04-14T11:22:05+00:00</lastmod>
</url>
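
The proposed entry shape can be sketched as a small rendering function. This is only an illustration of the output format, not the sitemap plugin's code; `SitemapEntry`, `escapeXml` and `renderUrlEntry` are hypothetical names introduced here.

```typescript
// Hypothetical shape for one sitemap entry; not a Docusaurus type.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // date or datetime string, omitted when unknown
}

// Escape the five characters that are special in XML text.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Render a <url> element with <loc> and, only when present, <lastmod>.
function renderUrlEntry(entry: SitemapEntry): string {
  const lines = ['<url>', `  <loc>${escapeXml(entry.loc)}</loc>`];
  if (entry.lastmod !== undefined) {
    lines.push(`  <lastmod>${escapeXml(entry.lastmod)}</lastmod>`);
  }
  lines.push('</url>');
  return lines.join('\n');
}
```

Leaving `lastmod` undefined produces an entry with only `<loc>`, so no `changefreq` or `priority` is emitted in either case.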

Your Environment

RDIL commented 4 years ago

I think this would be a good addition, but I do know web crawlers that use the priority field.

jdevalk commented 4 years ago

@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.

RDIL commented 4 years ago

Fair enough.

yangshun commented 4 years ago

Great idea! Thanks for the suggestion!

AvroraPolnareff commented 4 years ago

Hello! I want to help solve this issue. As I can see, there are several implementation options here:

  1. Should I leave the old tags and add new ones or replace them?
  2. Which date should be specified in the "lastmod" tag: the date of the last build of the project or the date of the last page change? If the second, are there any easier ways to do it?

RDIL commented 4 years ago

Most likely the last build time, since even tiny changes end up changing the chunk hashes, so it's constantly being modified.

slorber commented 4 years ago

@RDIL FYI Webpack 5 might help to make the js chunks more "stable" (see my recent comment in https://github.com/facebook/docusaurus/issues/3383), we may try to migrate after i18n is ready.

Not sure what we should do for this date. Also not sure how the sitemaps plugin could access the "last modification date" of the page, as this plugin is decoupled from the others.

Is it mandatory to add it to the sitemaps? It could likely be easier to handle this by adding a meta directly on the page, otherwise, we'd have to find a way to provide such metadata per path to the sitemap plugin.

Asking this, because for my work on i18n I'll also have to think about how to set up useful headers for localization (hreflang), and thought about adding them to the page directly instead of the sitemaps.

@jdevalk as it seems you know more about SEO than the rest of us, can you give us some insights?

jdevalk commented 4 years ago

Last modified is somewhat of a must for XML sitemaps indeed.

I think for hreflang I'd go for adding it to the page instead of the XML sitemaps as that makes debugging a lot easier and maybe even makes it accessible to other features within docusaurus, like a language switcher.

slorber commented 4 years ago

Thanks, will do that.

About lastModified, some plugins already read git history to get the last modified date. We could also allow hardcoding it through front matter.

I think we should:

If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?

We agree that this date should rather be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?

jdevalk commented 4 years ago

If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?

I would not add it then. Having it change all the time when it's actually not changing is also not beneficial.

We agree that this date should rather be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?

Agreed.
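
The rule agreed on here (prefer an explicit value, otherwise use the source file's history, otherwise omit the field rather than fall back to build time) could be sketched as below. The `LastUpdateSources` shape and the `resolveLastModified` name are hypothetical, and the git timestamp is assumed to have been read elsewhere.

```typescript
// Hypothetical inputs: an optional front matter override and an optional
// timestamp (ms since epoch) taken from the source file's git history.
interface LastUpdateSources {
  frontMatterDate?: string;
  gitTimestampMs?: number;
}

// Returns a timestamp in ms, or undefined when nothing reliable is known.
// Build time is deliberately never used as a fallback.
function resolveLastModified(sources: LastUpdateSources): number | undefined {
  if (sources.frontMatterDate !== undefined) {
    const parsed = Date.parse(sources.frontMatterDate);
    if (!Number.isNaN(parsed)) {
      return parsed;
    }
  }
  return sources.gitTimestampMs;
}
```

A caller would skip the `<lastmod>` element entirely whenever this returns `undefined`.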

Ali-Shafiyev commented 1 year ago

Hi! Make the suggested changes to the code that generates the XML sitemaps. Test the changes locally to ensure the desired structure with the lastmod field is generated.

saul-data commented 1 year ago

This would be super useful as we are busy automating spell checking and grammar using AI. I was hoping to use the lastmod to understand when a page has changed to do a spell check and grammar check before deploying to live. I wouldn't want to do this for the entire website.

I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.

Maybe it can be an input in Layout tag:

<Layout title="Dataplane Data &amp; Automation Platform | Open Source" lastmod="2020-04-14T11:22:05+00:00">

jdevalk commented 1 year ago

While I understand @saul-data has different needs, for SEO / crawl efficiency reasons I’d only change the lastmod when the content changes. I’d say basing it on the lastmod date of the underlying source document is probably easiest.

Note that search engines are putting more emphasis on adding lastmod as of recently, so I’d prioritize this issue a bit higher.

saul-data commented 1 year ago

Would this be linked to https://docusaurus.io/docs/blog#blog-post-date ?

I couldn't see a date reference for pages and docs (only versions).

I feel this should be an input by the user when the content or page has changed.

slorber commented 1 year ago

Note: there's a related issue to add an explicit last update date for blog posts, that could be used as the sitemap lastmod

https://github.com/facebook/docusaurus/issues/8657

pmarschik commented 1 year ago

I have a prototype for adding <lastmod> to the sitemap.xml here https://github.com/facebook/docusaurus/pull/9234/files.

@slorber Is this how you envisioned the feature in https://github.com/facebook/docusaurus/issues/2604#issuecomment-715414977?

johnnyreilly commented 9 months ago

I solved this problem for my own site with a post build script; I blogged about it here: https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date

scaleoutsean commented 9 months ago

@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.

https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date#updated-12th-november-2023-googles-view-on-lastmod-changefreq-and-priority

jdevalk commented 9 months ago

Yeah I’m sorry it’s basically a requirement now.

slorber commented 8 months ago

Hey

We have merged support for git/front matter last update metadata for blog posts (https://github.com/facebook/docusaurus/issues/8657) which now means both blog and docs have unified support for this feature. (note that the pages plugin doesn't have support, although we could also add it there)

Now is a good time to add "lastmod" to the sitemap as well.

I'll review your PR soon @pmarschik, sorry for the delay.

In the meantime let's decide what should be implemented exactly here, using the Google sitemap doc as a ref: https://developers.google.com/search/blog/2023/06/sitemaps-lastmod-ping#the-lastmod-element


I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.

@saul-data this is not what we will implement because it's not what Google recommends:

And when we say "last modification", we actually mean "last significant modification". If your CMS changed an insignificant piece of text in the sidebar or footer, you don't have to update the lastmod value for that page.


I would propose dropping the changefreq and priority fields

@jdevalk I'd rather keep them for now, and maybe we'll remove those later. I guess we can consider the removal as a breaking change? 🤷‍♂️


I solved this problem for my own site with a post build script; I blogged about it here: johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date

@johnnyreilly note that your solution filters pages from the sitemap such as the tags and paginated lists pages, since they do not match your regexp pattern.

To implement this feature properly, we should also consider that there isn't always a Markdown document per sitemap URL, and some pages are also displaying multiple documents at once.

It's more difficult to define a "lastmod" date for those URLs for example:

My suggestion is to initially keep things simple, and only add a "lastmod" date when the page is backed by a Markdown document.

The Google doc says:

You can use a lastmod element for all the pages in your sitemap, or just the ones you're confident about. For instance, some site software may not be able to easily tell the last modification date of the homepage or a category page because it just aggregates the other pages on the site. In these cases it's fine to leave out lastmod for those pages.


Do we agree on this plan?

slorber commented 8 months ago

Something important to also consider: reading the file history from git is quite expensive (particularly for large sites), and we probably shouldn't do this by default unless the user wants to.

We only read from git when the showLastUpdateTime: true plugin option is provided, which means only in that case we would add the "lastmod" field to the sitemap.

Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?

I'd like to refactor the APIs and do breaking changes to make things less confusing, but I wonder if having the behavior above (a bit awkward) can be a problem to some of you?

wparad commented 8 months ago

Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?

If you are using either the sitemap OR showLastUpdateTime, then it should work. It doesn't make sense to require showLastUpdateTime to be set: that property has nothing to do with RSS feeds/SEO, and coupling them together will just be confusing for everyone.

johnnyreilly commented 8 months ago

Decent plan - happy with it. Do the breaking changes - good default

slorber commented 8 months ago

Thanks for your feedback

Agree @wparad, will try to find a solution so that the sitemap lastmod can be used independently from the docs/blog plugin options. We still need to avoid reading the lastmod date from Git twice, for performance reasons (this can be expensive for thousands of files).

slorber commented 8 months ago

New sitemap options are implemented in PR, ready to review: https://github.com/facebook/docusaurus/pull/9954

{
  lastmod: null | 'date' | 'datetime',
  priority: null,
  changefreq: null,
}
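
As a rough sketch, the options above might be set like this in a site config. It assumes they are passed through the classic preset's `sitemap` key; the `title`, `url` and `baseUrl` values are placeholders.

```typescript
// Sitemap options taken from the list above.
const sitemapOptions = {
  lastmod: 'date', // or 'datetime', or null to omit <lastmod>
  changefreq: null, // drop <changefreq>
  priority: null, // drop <priority>
};

// Sketch of a docusaurus.config.ts fragment (placeholder site values).
// Assumption: the classic preset forwards `sitemap` to the sitemap plugin.
const config = {
  title: 'My Site',
  url: 'https://example.com',
  baseUrl: '/',
  presets: [['@docusaurus/preset-classic', {sitemap: sitemapOptions}]],
};
```

In a real config file the `config` object would be the default export.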

Example with our Docusaurus website sitemap: https://deploy-preview-9954--docusaurus-2.netlify.app/sitemap.xml

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
    xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
    <url>
        <loc>https://docusaurus.io/blog/</loc>
    </url>
    <url>
        <loc>https://docusaurus.io/blog/2017/12/14/introducing-docusaurus/</loc>
        <lastmod>2023-01-05</lastmod>
    </url>
    <url>
        <loc>https://docusaurus.io/blog/2018/04/30/How-I-Converted-Profilo-To-Docusaurus/</loc>
        <lastmod>2023-01-04</lastmod>
    </url>
    <url>
        <loc>https://docusaurus.io/blog/2018/09/11/Towards-Docusaurus-2/</loc>
        <lastmod>2023-04-21</lastmod>
    </url>
    <url>
        <loc>https://docusaurus.io/docs/versioning/</loc>
        <lastmod>2024-01-04</lastmod>
    </url>
    <url>
        <loc>https://docusaurus.io/</loc>
        <lastmod>2023-10-31</lastmod>
    </url>

    <!-- ... Other URLs, this is just a sample -->
</urlset>

You will notice that not all the URLs have a lastmod element (e.g. /blog/). This is on purpose, according to the Google guidelines above.

For now, I'm not changing defaults in Docusaurus v3, and the base sitemap for existing sites will stay the same as before. However, these options should help you remove priority + changefreq, and add lastmod. I do agree that according to Google recommendations, using the exact same priority and changefreq for all the URLs is kind of an anti-pattern, and we are likely to remove these options in v4.

The sitemap plugin will preferably use the lastModifiedAt route metadata provided by plugins (and our 3 content plugins eventually add that metadata).

But the sitemap plugin can also work in isolation, and will read the git history itself when lastmod !== null and plugins did not provide the lastModifiedAt route metadata. This way, we read the git history at most once per source file, instead of potentially making the same expensive call twice.
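
The lookup order described above can be sketched as follows. This is not the code from the PR: `RouteInfo`, `GitReader`, `memoizeReader` and `getRouteLastModified` are hypothetical names, only `lastModifiedAt` comes from the discussion, and the reader is kept synchronous for brevity (a real git call would likely be asynchronous).

```typescript
// Hypothetical route shape; only `lastModifiedAt` is named in the thread.
interface RouteInfo {
  path: string;
  sourceFilePath?: string; // absent for pages not backed by a file
  lastModifiedAt?: number; // ms since epoch, provided by a content plugin
}

// Stand-in for an expensive "last commit date of this file" lookup.
type GitReader = (filePath: string) => number | undefined;

// Wrap a reader so each source file is looked up at most once.
function memoizeReader(read: GitReader): GitReader {
  const cache = new Map<string, number | undefined>();
  return (filePath) => {
    if (!cache.has(filePath)) {
      cache.set(filePath, read(filePath));
    }
    return cache.get(filePath);
  };
}

// Prefer plugin-provided metadata, then git history, else no lastmod at all.
function getRouteLastModified(
  route: RouteInfo,
  readGit: GitReader,
): number | undefined {
  if (route.lastModifiedAt !== undefined) {
    return route.lastModifiedAt;
  }
  if (route.sourceFilePath === undefined) {
    return undefined;
  }
  return readGit(route.sourceFilePath);
}
```

Routes such as tag or paginated list pages have no single source file in this sketch, so they resolve to `undefined` and would get no `<lastmod>`.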

Does it look good to you, or do you see any issues with the implementation above?

johnnyreilly commented 8 months ago

This seems pretty good. I note that lastmod is date only, not datetime. I used datetime on my handrolled implementation:

<url>
<loc>https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date</loc>
<lastmod>2023-11-12T08:33:51+00:00</lastmod>
</url>

I suspect the time portion isn't that important. Most blogs won't be meaningfully updated more than once a day and crawlers may run less frequently than that.

Looks good!

slorber commented 8 months ago

Thanks for the review

You can choose either date or datetime plugin option, formatted differently:

const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
  date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

That date is "relative" and only helps Google prioritize page crawls within your own site, so I will probably use "date" as a default in v4. datetime takes more space, and I doubt the default Docusaurus sites are updated often enough for time to be useful. So if you want datetime, it will remain opt-in.
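
To make the difference between the two formats concrete, here is a self-contained variant of the formatters. The two type aliases are assumptions, since the snippet above references `LastModOption` and `LastModFormatter` without showing their definitions.

```typescript
// Assumed definitions for the types referenced in the snippet above.
type LastModOption = 'date' | 'datetime';
type LastModFormatter = (timestamp: number) => string;

const formatters: Record<LastModOption, LastModFormatter> = {
  // Keep only the YYYY-MM-DD part of the ISO string.
  date: (timestamp) => new Date(timestamp).toISOString().slice(0, 10),
  // Full ISO 8601 string, always in UTC.
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

// 12 November 2023, 08:33:51 UTC (months are zero-based in Date.UTC).
const sampleTimestamp = Date.UTC(2023, 10, 12, 8, 33, 51);

const asDate = formatters.date(sampleTimestamp); // "2023-11-12"
const asDatetime = formatters.datetime(sampleTimestamp); // "2023-11-12T08:33:51.000Z"
```

Note that `toISOString()` always emits UTC with a `Z` suffix and milliseconds, so the datetime form differs slightly from the `+00:00` style used in the hand-rolled example earlier in the thread.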

johnnyreilly commented 8 months ago

I think I'll stick with the default of date - nice to have options though.

slorber commented 7 months ago

Hey, not related to lastmod, but should Docusaurus support sitemap images?

Apparently, this is a thing:

johnnyreilly commented 7 months ago

Oh wow! Never heard of this. Despite all the links, I can't work out if there's a compelling reason to have them. Hmmmmm

slorber commented 7 months ago

Yes 😄 TIL there are also video and news sitemaps in @stefanjudis' article: https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/

I'm not sure it's worth supporting officially or by default, but we could do like the blog plugin and let users provide a createSitemapItem hook to add extra attributes if they want to? 🤷‍♂️

johnnyreilly commented 7 months ago

I think the hook is a good idea - I already manually amend my sitemap to exclude tags and pagination pages. Having a hook in the box would support that use case as well as this.
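
For illustration, a hook of this kind covering the tags/pagination use case might look like the sketch below. The `SitemapItem` and `SitemapItemsHook` shapes are hypothetical, since the thread only floats the idea, and a list-level hook is just one possible design.

```typescript
// Hypothetical item and hook shapes; not an existing Docusaurus API.
interface SitemapItem {
  url: string;
  lastmod?: string;
}

type SitemapItemsHook = (items: SitemapItem[]) => SitemapItem[];

// Matches tag pages (/tags, /tags/foo) and paginated lists (/page/2).
const excludedUrlPattern = /\/tags(\/|$)|\/page\/\d+\/?$/;

// Drop tag pages and paginated list pages, keep everything else as-is.
const excludeTagsAndPagination: SitemapItemsHook = (items) =>
  items.filter((item) => !excludedUrlPattern.test(item.url));
```

The same hook shape could also add extra attributes to each item, which is the image/video/news use case mentioned above.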

johnnyreilly commented 7 months ago

This made me laugh BTW: 🤣

Will I now drop everything and add these to all my sites? Naaaah, I think I'm fine.

https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/