getzola / zola

A fast static site generator in a single binary with everything built-in. https://www.getzola.org
MIT License
12.96k stars, 919 forks

Paginated pages should be absent from sitemap.xml #2527

Open pranitbauva1997 opened 2 weeks ago

pranitbauva1997 commented 2 weeks ago

Bug Report

I am using pagination to show blog posts, and I have enabled sitemap.xml. I use ahrefs for SEO site audits, and it reports an issue because my sitemap contains links to the paginated pages. It shouldn't, since the links to all blog posts and the link to the blog list page are already there.

Environment

Zola version: 0.18.0

Expected Behavior

The /blog/ URL should be a part of sitemap.xml along with all the blog articles but not /blog/page/1/.
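As a rough illustration of the expected result (a sketch only, not Zola's implementation), the rule can be written as a filter over sitemap URLs. It assumes paginated URLs end in /page/<number>/, which matches the /blog/page/1/ example above; the function names and the example.com URLs are made up for the illustration.

```python
import re

# Assumption: paginated URLs end in /page/<number>/, as in /blog/page/1/.
PAGINATED_URL = re.compile(r"/page/\d+/?$")


def is_paginated(url: str) -> bool:
    """Return True when the URL ends in a /page/<number>/ segment."""
    return PAGINATED_URL.search(url) is not None


def without_paginated(urls: list[str]) -> list[str]:
    """Keep section and post URLs, drop the paginated ones."""
    return [url for url in urls if not is_paginated(url)]


# Hypothetical sitemap contents for a site with a paginated /blog/ section.
urls = [
    "https://example.com/blog/",
    "https://example.com/blog/page/1/",
    "https://example.com/blog/page/2/",
    "https://example.com/blog/first-post/",
]
print(without_paginated(urls))
```

Only /blog/ and the individual post remain; both paginated URLs are dropped.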

Current Behavior

The paginated pages, i.e. /blog/page/1/, are a part of sitemap.xml, which makes search engines' lives difficult, and it could possibly lead to issues in indexing.

Steps to reproduce

Switch on pagination and sitemap generation.

Keats commented 2 weeks ago

Isn't the sitemap meant to have links to all pages of a site, including the paginated pages?

pranitbauva1997 commented 1 week ago

@Keats I use ahrefs to check how well my website is optimised for SEO. After I implemented pagination, it started showing this error and my SEO score dropped from 100 to 97: https://help.ahrefs.com/en/articles/2652498-non-canonical-page-in-sitemap-error-in-site-audit

The paginated pages are non-canonical, and the sitemap contains the "blog list" page as well as the paginated pages. I can see that Google has already indexed a paginated page.

I feel pagination exists only for the benefit of end users (so they are not overwhelmed by too many blog posts) and not for search engines. In my opinion, the search engine should index the "blog list" page (/blog/ for me) and all individual blog posts, and leave out the paginated pages, since they are not useful as search results.

I use site:bauva.com to see all the pages that are indexed by the search engine.

I am happy to contribute a PR once the community decides whether to go forward with this and agrees on the approach. I am leaning towards a configuration variable in config.toml that specifies whether paginated pages should be part of sitemap.xml. It would default to including them, so the current behaviour doesn't change. I would also like to hear about other approaches for introducing this feature.
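To make the proposal concrete, here is a minimal sketch of how such a flag could gate the sitemap contents. It is written in Python purely for illustration and is not Zola code: the option name `include_paginated_pages`, the `SitemapConfig` class, and the `sitemap_urls` helper are all hypothetical, and the /page/<number>/ pattern is an assumption based on the URLs in this issue.

```python
import re
from dataclasses import dataclass

# Assumption: paginated URLs end in /page/<number>/.
PAGINATED_URL = re.compile(r"/page/\d+/?$")


@dataclass
class SitemapConfig:
    # Hypothetical option name. Defaults to True so that existing sites
    # keep their current sitemap unless they opt out.
    include_paginated_pages: bool = True


def sitemap_urls(all_urls, config):
    """Return the URLs that should be written to sitemap.xml."""
    if config.include_paginated_pages:
        return list(all_urls)
    return [url for url in all_urls if not PAGINATED_URL.search(url)]


all_urls = [
    "https://example.com/blog/",
    "https://example.com/blog/page/1/",
    "https://example.com/blog/first-post/",
]

# Default config: nothing changes.
print(sitemap_urls(all_urls, SitemapConfig()))
# Opting out removes only the paginated URL.
print(sitemap_urls(all_urls, SitemapConfig(include_paginated_pages=False)))
```

Keeping the default at "include" means the flag is purely opt-in, which is the backwards-compatible behaviour described above.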

Apologies for the late reply. I will be more prompt in my replies going forward.

Keats commented 1 week ago

You can already use a different template for sitemap.xml but in that case isn't the issue that the template should declare itself as self-canonical? Looking at https://seranking.com/blog/pagination/ we could also set some HTML headers for the paginated pages to ignore them.

pranitbauva1997 commented 1 week ago

@Keats The article you shared has detailed information regarding this. Learnt a few new things as well. Thanks.

Regarding a custom sitemap.xml: I think excluding paginated pages is something most people would want, because most of them care about SEO, but they wouldn't want to customise sitemap.xml themselves. Also, I think we would still have to introduce some variables so that the custom sitemap can tell whether a particular page should be included. One website can have many paginated categories, like "blogs" or "annual reports" (in my case), and I would have to make the change for each of them. I also can't think of a case where I would want it for "blogs" but not for "annual reports".

I think the HTML tags and robots.txt are also crucial, since all three (including sitemap.xml) affect search engine indexing. I have a _base.html template in which every page is currently marked as indexable using the meta tag. I am not sure how to single out the paginated pages that have to be marked as noindex, though I am sure this also needs to be part of the PR.
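The decision itself is simple to state, even if the template wiring is still open. The sketch below shows it in Python for illustration only; it is not a Zola template, the `robots_meta` helper is hypothetical, and it assumes paginated paths end in /page/<number>/. Whether "noindex, follow" is the right directive for paginated pages is part of what this issue still needs to settle.

```python
import re

# Assumption: paginated paths end in /page/<number>/.
PAGINATED_PATH = re.compile(r"/page/\d+/?$")


def robots_meta(path: str) -> str:
    """Build the robots meta tag for a page, given its URL path."""
    if PAGINATED_PATH.search(path):
        # Paginated listing: ask crawlers not to index it, but still
        # follow its links to the individual posts.
        content = "noindex, follow"
    else:
        content = "index, follow"
    return f'<meta name="robots" content="{content}">'


print(robots_meta("/blog/"))
print(robots_meta("/blog/page/2/"))
```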

The robots.txt change is trivial and there is no source code change required in this repo. Each user has to make changes at their end.

To document this feature, I am thinking of introducing an "SEO" section on getzola.org and also listing the variable on the configuration page. In the SEO section, we can explain to users why this feature exists and how to use it along with the HTML and robots.txt changes.

What are your thoughts?

Keats commented 1 week ago

That could be added to the config, but we would need to find a good name for it, or maybe make a new section in it.

pranitbauva1997 commented 1 week ago

@Keats Let me start work on this and come up with a draft PR soon.