adobe / helix-home

The home of Project Helix

SEO: Automated XML & HTML sitemap generation #220

Open audi5 opened 2 years ago

audi5 commented 2 years ago

Site indexation (the discoverability of content) is an important SEO impact area: content has to be crawled before it can show up in search results, and XML sitemaps are a good centralized way to achieve that. We also need an HTML sitemap to manage site structure, clean up content, etc.

We need an XML and HTML sitemap generator for Adobe.com, HelpX, Acrobat, and other sites currently on the Dexter platform, consolidating URLs from all AEM versions.

That will help the SEO and authoring teams see all the pages currently published, analyze the site structure, and familiarize themselves with every page on the site.

For the XML format, which is needed for SEO purposes, we need: http://adobe-consulting-services.github.io/acs-aem-commons/features/simple-sitemap.html

Please also provide the capability to add a URL manually in case the page is not hosted on AEM.

We need an XML format for SEO and an HTML format for Design.

HTML format: production publish instance URLs.

URLs outside of AEM 6.0 need to be manually added to the list.

We need to figure out a way to show only public URLs rather than full-path URLs (/sitemap.xml).

Can the sitemaps be split by geo? Yes: http://www.adobe.com/robots.txt
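As a rough illustration of the geo split, the sketch below groups URLs by a leading geo path segment and produces one robots.txt `Sitemap:` line per geo. The names `KNOWN_GEOS`, `DEFAULT_GEO`, `geo_of`, `group_by_geo` and `robots_sitemap_lines` are hypothetical, and the assumption that URLs without a geo prefix belong to a default "us" geo is mine, not something stated in this issue.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: a geo shows up as the first path segment (e.g. /ca/, /uk/).
# This set is illustrative only, not the real list of geos.
KNOWN_GEOS = {"ca", "uk"}
DEFAULT_GEO = "us"  # assumption: URLs without a geo prefix belong here


def geo_of(url):
    """Return the geo a URL belongs to, based on its first path segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if segments and segments[0] in KNOWN_GEOS:
        return segments[0]
    return DEFAULT_GEO


def group_by_geo(urls):
    """Map each geo to the alphabetically sorted URLs that belong to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[geo_of(url)].append(url)
    return {geo: sorted(group) for geo, group in groups.items()}


def robots_sitemap_lines(host, geos):
    """Build one robots.txt 'Sitemap:' line per geo sitemap."""
    lines = []
    for geo in sorted(geos):
        prefix = "" if geo == DEFAULT_GEO else f"/{geo}"
        lines.append(f"Sitemap: https://{host}{prefix}/sitemap.xml")
    return lines
```

Each group could then be written to its own sitemap file, with the generated lines added to robots.txt so crawlers can find every per-geo file.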

Current sitemap URLs on Adobe.com:
https://www.stage.adobe.com/content/acom/us/en.sitemap.html?allowfullpath=true
http://www.stage.adobe.com/content/acom/us/en.sitemap.xml?allowfullpath=true

Need to validate that we can generate regular XML Sitemap files with AEM that are designed to improve site indexation.

Requirements:

- Auto-generation of the XML sitemap files.
- One sitemap file per geo: www.adobe.com/ca/sitemap.xml, www.adobe.com/uk/sitemap.xml
- Auto-publish new URLs into a sitemap file (within approximately 1 hour, the cache refresh time).
- Auto-removal of non-canonical URLs (3XX and 404s should be wiped out from XML sitemaps).
- Provide a way for authors to override the page URL value that should show up in the sitemap instead of an absolute path. For example, the URL that should show up for the home page should be www.env.adobe.com instead of www.env.adobe.com/index.html. For such pages, the current workaround is to modify the XML manually, but it would be nice to have a field in which authors could enter the URL that should be shown in the XML.
- Separate implementation of the "Remove from sitemap" checkbox for HTML and XML sitemaps.
- Automatically exclude non-HTML: https://www.adobe.com/1, https://www.adobe.com/1/creative-2015-07-20-mascha
- Enforce removal for pages with meta robots noindex, e.g. http://www.adobe.com/confirmation.html, https://www.adobe.com/search.html
- Include rewrite paths per the canonical tag, e.g. http://www.adobe.com/leaders.html (per canonical), not http://www.adobe.com/about-adobe/leaders.html (the actual resolving URL).
- List URLs in alphabetic order.
- Possible to verify DNS to exclude pages that 404 on the live site? https://www.adobe.com/qa_test_020.html
- The generated sitemap should be http, but it should also be possible to generate it on https.
- The generated sitemap should also take into account floodgated content that's available to visitors.
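To make the filtering rules above concrete, here is a minimal sketch of how they could be combined when building one XML sitemap. It is not how AEM or Helix implements this. The `Page` record and its field names are hypothetical, and the precedence used for the listed URL (author override first, then canonical, then the resolving URL) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical page record; all field names are assumptions."""
    url: str                           # URL the page actually resolves at
    status: int = 200                  # HTTP status seen on the live site
    content_type: str = "text/html"
    noindex: bool = False              # meta robots contains "noindex"
    canonical: Optional[str] = None    # value of the canonical tag, if any
    sitemap_url: Optional[str] = None  # author-supplied override, if any
    lastmod: Optional[str] = None      # W3C date, e.g. "2021-03-01"


def sitemap_entries(pages):
    """Return {loc: lastmod} for the pages that belong in the XML sitemap."""
    entries = {}
    for page in pages:
        if page.status != 200:
            continue  # drop 3XX redirects and 404s
        if not page.content_type.startswith("text/html"):
            continue  # drop non-HTML resources
        if page.noindex:
            continue  # drop pages marked noindex
        # Assumed precedence: author override, then canonical, then raw URL.
        loc = page.sitemap_url or page.canonical or page.url
        entries.setdefault(loc, page.lastmod)
    return entries


def build_sitemap(pages):
    """Serialize the filtered pages, with URLs in alphabetical order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    entries = sitemap_entries(pages)
    for loc in sorted(entries):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if entries[loc]:
            ET.SubElement(url, "lastmod").text = entries[loc]
    return ET.tostring(urlset, encoding="unicode")
```

With the example URLs from the list, a home page carrying an override would be listed as its short URL, the leaders page would be listed under its canonical URL, and the noindex, 404 and non-HTML pages would be left out.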

Acceptance Criteria:

- Sitemaps are generated / refreshed on the fly, when the page is accessed.
- Verify new pages get added to author sitemaps when created.
- Verify pages get updated on author sitemaps when moved or renamed.
- Verify timestamps get properly updated in the XML sitemap when a page is updated/activated (author & publish).
- Verify pages are added to publish sitemaps when activated and cache flushed.
- Verify name update makes it to publish sitemaps.
- Verify deactivated pages are removed from publish sitemaps.
- Pages can be excluded from the sitemap via page property (or a similar place in Helix).
- Verify new pages are included in sitemaps by default.
- Verify that authors can remove pages from the sitemap.
- Verify that a page can be deactivated and removed from the publish sitemap.
- Verify child pages are also removed from the sitemap (config is inherited; overriding inheritance was NOT tested).
- Verify that the fragments folder can be excluded from the sitemap at the folder level, and that the exclusion is properly inherited.
- Verify that the config is available on the FW, Lobby, Lobby tab, and fragment templates.
- Verify the HTML sitemap has meta: (???)
- Verify that the sitemaps are Floodgate aware and also consider floodgated content that's visible to the end user.
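The inherited exclusion described above (a flag on a page or folder also removing everything below it) could be sketched as follows. The function names and the idea of keeping the flags as a set of paths are assumptions for illustration. Overriding an inherited exclusion is deliberately not modelled, since the criteria note it was not tested.

```python
from pathlib import PurePosixPath


def is_excluded(path, excluded_paths):
    """True if the page itself or any ancestor folder is flagged as excluded."""
    node = PurePosixPath(path)
    return any(str(p) in excluded_paths for p in (node, *node.parents))


def sitemap_paths(all_paths, excluded_paths):
    """Pages are included by default; flagged pages and their children are not."""
    return sorted(p for p in all_paths if not is_excluded(p, excluded_paths))
```

For example, flagging a fragments folder removes every page under it, while sibling pages outside that folder stay in the sitemap.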

rofe commented 2 years ago

References:

rofe commented 2 years ago

@dominique-pfister do you see anything missing in the current implementation?

dominique-pfister commented 2 years ago

> @dominique-pfister do you see anything missing in the current implementation?

Looking at the list of Requirements above, most are already built-in or can be done by using a separate helix-sitemap sheet in the index. The following, though, are not available:

And we don't generate HTML sitemaps (yet)

rofe commented 2 years ago

> Sitemap generated should be http

I don't think we should do anything other than https. It's 2021 😉