Closed jdevalk closed 8 months ago
I think this would be a good addition, but I do know web crawlers that use the priority field.
@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.
Fair enough.
Great idea! Thanks for the suggestion!
Hello! I want to help solve this issue. As I can see, there are several implementation options here:
Most likely the last build time, since even tiny changes end up changing the chunk hashes, so it's constantly being modified.
@RDIL FYI Webpack 5 might help to make the js chunks more "stable" (see my recent comment in https://github.com/facebook/docusaurus/issues/3383), we may try to migrate after i18n is ready.
Not sure what we should do for this date. Also not sure how the sitemaps plugin could access the "last modification date" of the page, as this plugin is decoupled from the others.
Is it mandatory to add it to the sitemaps? It could likely be easier to handle this by adding a meta directly on the page, otherwise, we'd have to find a way to provide such metadata per path to the sitemap plugin.
Asking this, because for my work on i18n I'll also have to think about how to set up useful headers for localization (hreflang), and thought about adding them to the page directly instead of the sitemaps.
@jdevalk as it seems you know more about SEO than the rest of us, can you give us some insights?
Last modified is somewhat of a must for XML sitemaps indeed.
I think for hreflang I'd go for adding it to the page instead of the XML sitemaps as that makes debugging a lot easier and maybe even makes it accessible to other features within docusaurus, like a language switcher.
Thanks, will do that.
About lastModified, some plugins already read git history to get the last modified date. We could also allow hardcoding it through front matter.
I think we should:
addRoute APIs with lastModified: lastModifiedFrontmatter || lastModifiedGit || lastModifiedFS || undefined
If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?
We agree that this date should be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?
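The `lastModified` fallback chain proposed above could be sketched roughly as follows. This is only an illustration, not the Docusaurus implementation: `LastModifiedSources` and `resolveLastModified` are hypothetical names, and timestamps are assumed to be milliseconds since the epoch. It uses `??` instead of `||` so that only missing values fall through to the next source.

```typescript
// Hypothetical shape: each source may or may not know a date (ms since epoch).
type LastModifiedSources = {
  frontMatter?: number;
  git?: number;
  fileSystem?: number;
};

// Returns the first known date in priority order, or undefined when no source
// knows one. In that case no lastmod entry would be emitted for the route.
function resolveLastModified(
  sources: LastModifiedSources,
): number | undefined {
  return sources.frontMatter ?? sources.git ?? sources.fileSystem ?? undefined;
}
```

Returning `undefined` rather than the build time matches the option of simply leaving the entry out when the date is unknown.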
If this info can't be obtained (pages might not be generated from FS files), is it better to not add the lastmod entry, or to fall back to build time (which is likely to be a recent value if the site is built often)?
I would not add it then. Having it change all the time when it's actually not changing is also not beneficial.
We agree that this date should be updated when the content changes, but not when the code (i.e. the layout rendering the content etc.) changes?
Agreed.
Hi! Make the suggested changes to the code that generates the XML sitemaps. Test the changes locally to ensure the desired structure with the lastmod field is generated.
This would be super useful as we are busy automating spell checking and grammar using AI. I was hoping to use the lastmod to understand when a page has changed to do a spell check and grammar check before deploying to live. I wouldn't want to do this for the entire website.
I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.
Maybe it can be an input in the Layout tag:
<Layout title="Dataplane Data & Automation Platform | Open Source" lastmod="2020-04-14T11:22:05+00:00">
While I understand @saul-data has different needs, for SEO / crawl efficiency reasons I’d only change the lastmod when the content changes. I’d say basing it on the lastmod date of the underlying source document is probably easiest.
Note that search engines are putting more emphasis on adding lastmod as of recently, so I’d prioritize this issue a bit higher.
Would this be linked to https://docusaurus.io/docs/blog#blog-post-date ?
I couldn't see a date reference for pages and docs (only versions).
I feel this should be an input by the user when the content or page has changed.
Note: there's a related issue to add an explicit last update date for blog posts, that could be used as the sitemap lastmod
I have a prototype for adding <lastmod> to the sitemap.xml here https://github.com/facebook/docusaurus/pull/9234/files.
@slorber Is this how you envisioned the feature in https://github.com/facebook/docusaurus/issues/2604#issuecomment-715414977?
I solved this problem for my own site with a post build script; I blogged about it here: https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date
@RDIL such as? Honestly I’ve been doing SEO for well over a decade, not seen it used in the last 5 years.
Yeah I’m sorry it’s basically a requirement now.
Hey
We have merged support for git/front matter last update metadata for blog posts (https://github.com/facebook/docusaurus/issues/8657) which now means both blog and docs have unified support for this feature. (note that the pages plugin doesn't have support, although we could also add it there)
Now is a good time to add "lastmod" to the sitemap as well.
I'll review your PR soon @pmarschik, sorry for the delay.
In the meantime let's decide what should be implemented exactly here, using the Google sitemap doc as a ref: https://developers.google.com/search/blog/2023/06/sitemaps-lastmod-ping#the-lastmod-element
I don't think there should be a distinction between content change and layout change. If a specific page has changed then the lastmod should be updated with that date.
@saul-data this is not what we will implement because it's not what Google recommends:
And when we say "last modification", we actually mean "last significant modification". If your CMS changed an insignificant piece of text in the sidebar or footer, you don't have to update the lastmod value for that page.
I would propose dropping the changefreq and priority fields
@jdevalk I'd rather keep them for now, and maybe we'll remove those later. I guess we can consider the removal as a breaking change? 🤷♂️
I solved this problem for my own site with a post build script; I blogged about it here: johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date
@johnnyreilly note that your solution filters pages from the sitemap such as the tags and paginated lists pages, since they do not match your regexp pattern.
To implement this feature properly, we should also consider that there isn't always a Markdown document per sitemap URL, and some pages are also displaying multiple documents at once.
It's more difficult to define a "lastmod" date for those URLs for example:
My suggestion is to initially keep things simple, and only add a "lastmod" date when the page is backed by a Markdown document.
The Google doc says:
You can use a lastmod element for all the pages in your sitemap, or just the ones you're confident about. For instance, some site software may not be able to easily tell the last modification date of the homepage or a category page because it just aggregates the other pages on the site. In these cases it's fine to leave out lastmod for those pages.
Do we agree on this plan?
Something important to also consider: reading the file history from git is quite expensive (particularly for large sites), and we probably shouldn't do this by default unless the user wants to.
We only read from git when the showLastUpdateTime: true plugin option is provided, which means only in that case we would add the "lastmod" field to the sitemap.
Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?
I'd like to refactor the APIs and do breaking changes to make things less confusing, but I wonder if having the behavior above (a bit awkward) can be a problem to some of you?
Is it a problem? Are some of you looking to have lastmod in the sitemap, and yet do not want to use the showLastUpdateTime: true option?
If you are using either the sitemap OR showLastUpdateTime then it should work. It doesn't make sense to require showLastUpdateTime to be set; that property has nothing to do with RSS feeds/SEO, and coupling those together will just be confusing for everyone.
Decent plan - happy with it. Do the breaking changes - good default
Thanks for your feedback
Agree @wparad, will try to find a solution so that the sitemap lastmod can be used independently from the docs/blog plugin options. We still need to avoid reading the lastmod date from Git twice for performance reasons (this can be expensive for thousands of files).
New sitemap options are implemented in PR, ready to review: https://github.com/facebook/docusaurus/pull/9954
{
lastmod: null | 'date' | 'datetime',
priority: null,
changefreq: null,
}
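As a hedged illustration of how these options might be set: the sketch below assumes the sitemap plugin is configured through the classic preset's `sitemap` key, and it trims the config down to the relevant part, so it is not a complete config file.

```typescript
// Sketch of the relevant part of a Docusaurus config file.
// Assumption: the sitemap plugin is configured via the classic preset.
const config = {
  presets: [
    [
      'classic',
      {
        sitemap: {
          lastmod: 'date', // or 'datetime', or null to omit <lastmod>
          priority: null, // null drops <priority> from every <url>
          changefreq: null, // null drops <changefreq> from every <url>
        },
      },
    ],
  ],
} as const;
```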
Example with our Docusaurus website sitemap: https://deploy-preview-9954--docusaurus-2.netlify.app/sitemap.xml
<urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
<url>
<loc>https://docusaurus.io/blog/</loc>
</url>
<url>
<loc>https://docusaurus.io/blog/2017/12/14/introducing-docusaurus/</loc>
<lastmod>2023-01-05</lastmod>
</url>
<url>
<loc>https://docusaurus.io/blog/2018/04/30/How-I-Converted-Profilo-To-Docusaurus/</loc>
<lastmod>2023-01-04</lastmod>
</url>
<url>
<loc>https://docusaurus.io/blog/2018/09/11/Towards-Docusaurus-2/</loc>
<lastmod>2023-04-21</lastmod>
</url>
<url>
<loc>https://docusaurus.io/docs/versioning/</loc>
<lastmod>2024-01-04</lastmod>
</url>
<url>
<loc>https://docusaurus.io/</loc>
<lastmod>2023-10-31</lastmod>
</url>
<!-- ... Other URLs, this is just a sample -->
</urlset>
You will notice that not all the URLs have a lastmod element (e.g. /blog/). This is on purpose, following the Google guidelines above.
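To make the "omit lastmod when unknown" behavior concrete, here is a minimal sketch of rendering one `<url>` entry. It is not the plugin's actual serializer: `SitemapItem`, `escapeXml` and `renderUrlEntry` are hypothetical names, and `lastmod` is assumed to be an already formatted string.

```typescript
// Hypothetical item shape: lastmod is an already formatted string, or absent.
type SitemapItem = {loc: string; lastmod?: string};

// Escape the characters that are special in XML text content.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Render one <url> entry, leaving <lastmod> out when no date is known.
function renderUrlEntry(item: SitemapItem): string {
  const lines = ['<url>', `<loc>${escapeXml(item.loc)}</loc>`];
  if (item.lastmod !== undefined) {
    lines.push(`<lastmod>${escapeXml(item.lastmod)}</lastmod>`);
  }
  lines.push('</url>');
  return lines.join('\n');
}
```

An aggregate page such as /blog/ would be passed without `lastmod` and come out with only its `<loc>`.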
For now, I'm not changing defaults in Docusaurus v3, and the base sitemap for existing sites will stay the same as before. However, these options should help you remove priority + changefreq, and add lastmod. I do agree that according to Google recommendations, using the exact same priority and changefreq for all the URLs is kind of an anti-pattern, and we are likely to remove these options in v4.
The sitemap plugin will prioritize the route metadata lastModifiedAt provided by plugins (our 3 content plugins will eventually add that metadata).
But the sitemap plugin can also work in isolation, and will read the git history when lastmod !== null and plugins did not provide the lastModifiedAt route metadata. This way, we read the git history at most once per source file, instead of potentially making the same expensive call twice.
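The lookup order described above might look something like the sketch below. It is not the code from the PR: `GitDateReader`, `readGitDate`, `createCachedReader`, `getRouteLastModified` and the `sourceFilePath` field are hypothetical names, and the caller is assumed to invoke this only when `lastmod !== null`. The git call uses `git log -1 --format=%ct`, which prints the committer date of the last commit touching a file as a Unix timestamp.

```typescript
import {execFileSync} from 'node:child_process';

// Hypothetical reader type: returns ms since epoch, or undefined if unknown.
type GitDateReader = (filePath: string) => number | undefined;

// Default reader: asks git for the committer date of the last commit that
// touched the file. Returns undefined if git fails or prints nothing.
const readGitDate: GitDateReader = (filePath) => {
  try {
    const out = execFileSync(
      'git',
      ['log', '-1', '--format=%ct', '--', filePath],
      {encoding: 'utf8'},
    ).trim();
    return out ? Number(out) * 1000 : undefined;
  } catch {
    return undefined;
  }
};

// Wraps a reader so each source file triggers at most one underlying call.
function createCachedReader(reader: GitDateReader): GitDateReader {
  const cache = new Map<string, number | undefined>();
  return (filePath) => {
    if (!cache.has(filePath)) {
      cache.set(filePath, reader(filePath));
    }
    return cache.get(filePath);
  };
}

// Route metadata wins; the reader is only consulted when metadata is missing.
function getRouteLastModified(
  route: {sourceFilePath?: string; lastModifiedAt?: number},
  reader: GitDateReader,
): number | undefined {
  if (route.lastModifiedAt !== undefined) {
    return route.lastModifiedAt;
  }
  return route.sourceFilePath ? reader(route.sourceFilePath) : undefined;
}
```

In real use the reader would be `createCachedReader(readGitDate)`; passing the reader in as a parameter also makes it easy to swap in a fake one for tests.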
Does it look good to you, or do you see any issues with the implementation above?
This seems pretty good. I note that lastmod is date only, not datetime. I used datetime on my handrolled implementation:
<url>
<loc>https://johnnyreilly.com/adding-lastmod-to-sitemap-git-commit-date</loc>
<lastmod>2023-11-12T08:33:51+00:00</lastmod>
</url>
I suspect the time portion isn't that important. Most blogs won't be meaningfully updated more than once a day and crawlers may run less frequently than that.
Looks good!
Thanks for the review
You can choose either the date or datetime plugin option, which are formatted differently:
const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
datetime: (timestamp) => new Date(timestamp).toISOString(),
};
That date is "relative" and only helps Google prioritize page crawls within your own site, so I will probably use "date" as a default in v4. datetime takes more space, and I doubt the default Docusaurus sites are updated enough for time to be useful. So if you want datetime, it will remain opt-in.
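For anyone wanting to see the difference between the two formats, here is a small self-contained run of the formatters above. The `LastModOption` and `LastModFormatter` type definitions are my assumption of what the real ones look like; the timestamp is built with `Date.UTC` so the result does not depend on the local timezone.

```typescript
// Assumed type definitions; the originals are not shown in this thread.
type LastModOption = 'date' | 'datetime';
type LastModFormatter = (timestamp: number) => string;

const LastmodFormatters: Record<LastModOption, LastModFormatter> = {
  date: (timestamp) => new Date(timestamp).toISOString().split('T')[0]!,
  datetime: (timestamp) => new Date(timestamp).toISOString(),
};

// 2023-11-12 08:33:51 UTC (months are zero-based in Date.UTC).
const timestamp = Date.UTC(2023, 10, 12, 8, 33, 51);

console.log(LastmodFormatters.date(timestamp)); // 2023-11-12
console.log(LastmodFormatters.datetime(timestamp)); // 2023-11-12T08:33:51.000Z
```

Note that `toISOString()` always reports UTC, so the `date` value is the UTC calendar day, and the `datetime` value includes milliseconds and a `Z` suffix rather than a `+00:00` offset.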
I think I'll stick with the default of date; nice to have options though.
Hey, not related to lastmod, but should Docusaurus support sitemap images?
Apparently, this is a thing:
Oh wow! Never heard of this. Despite all the links, I can't work out if there's a compelling reason to have them. Hmmmmm
Yes 😄 TIL there are also video and news sitemap in @stefanjudis article: https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/
I'm not sure it's worth supporting officially or by default, but we could do like the blog plugin and let users provide a createSitemapItem hook to add extra attributes if they want to? 🤷♂️
I think the hook is a good idea - I already manually amend my sitemap to exclude tags and pagination pages. Having a hook in the box would support that use case as well as this.
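A rough sketch of what the body of such a hook could do for the tags/pagination use case. Everything here is hypothetical: no hook signature has been decided in this thread, `SitemapItem` and `filterSitemapItems` are made-up names, and the URL patterns are assumptions about how tag and paginated list pages are routed on a given site.

```typescript
// Hypothetical item shape, loosely following the <url> entries shown earlier.
type SitemapItem = {url: string; lastmod?: string | null};

// Assumed URL patterns for tag pages and paginated lists; adjust per site.
const EXCLUDED_PATTERNS: RegExp[] = [/\/tags(\/|$)/, /\/page\/\d+\/?$/];

// Hypothetical hook body: receives the default items, returns the ones to keep.
function filterSitemapItems(items: SitemapItem[]): SitemapItem[] {
  return items.filter(
    (item) => !EXCLUDED_PATTERNS.some((pattern) => pattern.test(item.url)),
  );
}
```

The same function could just as well add extra attributes to the items it keeps, which is the other use case mentioned for the hook.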
This made me laugh BTW: 🤣
Will I now drop everything and add these to all my sites? Naaaah, I think I'm fine.
https://www.stefanjudis.com/today-i-learned/image-video-news-sitemaps/
🐛 Bug Report
The XML sitemaps currently output loc, changefreq and priority for every url set. I would propose dropping the changefreq and priority fields, as none of the search engines use these, and instead adding the lastmod field, with the last modification date of the file.
Have you read the Contributing Guidelines on issues?
Yes.
To Reproduce
Expected behavior
The current output would be:
Actual Behavior
I propose changing it to:
Your Environment