az-digital / az_quickstart

UArizona's web content management system built with Drupal 10.
https://quickstart.arizona.edu
GNU General Public License v2.0

News Importer - Prevent node from crawl, setup canonical(s) #3157

Open RyanDool opened 5 months ago

RyanDool commented 5 months ago

Problem/Motivation

This issue was discovered on arizona.edu, where the news importer creates a publicly accessible node and URL. Because the node exactly matches the story posted on the news site, it could be seen as duplicate content.

Describe the bug

When a news story is imported, a node is created on the subdomain with the same title, subtitle, image, and page URL as the story on news.arizona.edu. An example can be seen here: Original Story posted on news.arizona.edu vs Imported Story

Proposed resolution

Possible resolutions include adding a canonical tag to the node upon creation that points to the source of truth (news.arizona.edu/story/[article title]), or making the node inaccessible to users and bots.

trackleft commented 5 months ago

Possible duplicate of https://github.com/az-digital/az_quickstart/issues/2357

ewlyman commented 5 months ago

Thank you, @trackleft. I'll touch base with @bberndt-uaz, since I see #2357 went into the 2.10.0 minor release.

bberndt-uaz commented 5 months ago

Looking at the HTML for both example pages linked in the description, I can see that the header on both pages already includes a canonical element pointing to the news.arizona.edu URL:

<link rel="canonical" href="https://news.arizona.edu/story/uarizona-leadership-presents-next-steps-financial-action-plan-focus-collaboration">
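For anyone who wants to repeat this check on other imported stories, here is a minimal sketch in Python (standard library only) that pulls the canonical URL out of a page's HTML. The names `CanonicalFinder` and `find_canonicals` are hypothetical helpers made up for illustration, not part of Quickstart, and the sample markup is just the element quoted above wrapped in a bare page.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, so split before matching.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonicals.append(attr_map["href"])


def find_canonicals(html_text):
    """Return all canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonicals


# Sample page built around the element quoted in this comment.
SAMPLE_HTML = (
    "<html><head>"
    '<link rel="canonical" href="https://news.arizona.edu/story/'
    "uarizona-leadership-presents-next-steps-financial-action-plan-focus-"
    'collaboration">'
    "</head><body></body></html>"
)

canonical_urls = find_canonicals(SAMPLE_HTML)
```

Running the same function against the HTML of the original story and the imported story should return the same news.arizona.edu URL for both if the canonical element is set as described.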

This Google documentation says:

While we encourage you to use these methods [of specifying a canonical preference], none of them are required; your site will likely do just fine without specifying a canonical preference. That's because if you don't specify a canonical URL, Google will identify which version of the URL is objectively the best version to show to users in Search.

Another method that the documentation recommends is including the canonical URL in a sitemap. The example news story is already included in news.arizona.edu's sitemap, and the imported news story on arizona.edu is NOT included in arizona.edu's sitemap. I believe flexible pages are the only content type included in the sitemap by default in Quickstart.
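The sitemap comparison can be scripted along the same lines. The sketch below assumes the sitemap follows the standard sitemaps.org `urlset` format and that its XML has already been downloaded into a string; `sitemap_urls` is a hypothetical helper, and the `example.edu` URL is a placeholder rather than the real imported story URL.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml):
    """Return the set of URLs listed in <loc> elements of a sitemap string."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    }


# Tiny stand-in sitemap containing only the story URL quoted in this thread.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://news.arizona.edu/story/uarizona-leadership-presents-next-steps-financial-action-plan-focus-collaboration</loc>
  </url>
</urlset>
"""

listed = sitemap_urls(SAMPLE_SITEMAP)
story_url = (
    "https://news.arizona.edu/story/"
    "uarizona-leadership-presents-next-steps-financial-action-plan-focus-"
    "collaboration"
)
story_is_listed = story_url in listed
# Placeholder URL standing in for an imported copy that is not in the sitemap.
copy_is_listed = "https://example.edu/news/imported-story" in listed
```

A large site may split its sitemap into an index with several child files, in which case each child file would need to be checked separately.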

Finally, I searched for the story on Google and I only see the news.arizona.edu link in the first ten results:

[Image: screenshot of the Google search results]

It seems to me that the imported news nodes are not causing an SEO issue, but I'm curious what others think.