Canadian-Geospatial-Platform / geo.ca

This is the production release content for GEO.CA. For a development release of GEO.CA, please visit the link below: https://canadian-geospatial-platform.github.io/geo.ca/
https://geo.ca

Enable search engine indexing #55

Open jvanulde opened 1 year ago

jvanulde commented 1 year ago

In preparation for the official launch of Geo.ca in mid-November, we need to ensure that search engine indexing is enabled and functioning.

indiciumx commented 1 year ago

@PVautour after our meeting with Google, did you ever look further into the SEO kit?

PVautour commented 1 year ago

I don't know about an SEO kit, but I believe the solution we concluded would be suitable was to manually create a robots.txt containing URLs for every public page that we want crawled.

Would that suit our needs?

jvanulde commented 1 year ago

I will defer to @indiciumx for direction.

PVautour commented 1 year ago

Ok so it's not actually the robots.txt file that is appropriate for this. It is the sitemap.xml.

I am throwing together a quick function to fetch IDs from the database and generate sitemap files pointing to those pages.

One sitemap is limited to 50MB (uncompressed) and 50,000 URLs, but you can create one root sitemap that points to the others. In theory this means that it should not be much of an issue to list all of our pages.
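The root-plus-children layout described above can be sketched as follows. This is a minimal illustration, not the actual Lambda: the function names, the `sitemap-N.xml` file naming, and the `sitemap_dir` location are assumptions, and only the 50,000-URL limit is enforced (the 50MB size limit is not checked).

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit from the sitemap protocol
NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split the URL list into consecutive groups of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Render one <urlset> document for a group of URLs."""
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<urlset xmlns="{NAMESPACE}">\n{entries}</urlset>\n')


def build_sitemap_index(sitemap_urls):
    """Render the root <sitemapindex> that points at each child sitemap."""
    entries = "".join(
        f"<sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<sitemapindex xmlns="{NAMESPACE}">\n{entries}</sitemapindex>\n')


def generate(record_ids,
             base="https://app.geo.ca",
             sitemap_dir="https://app.geo.ca/Sitemaps"):  # hypothetical location
    """Return (root index XML, {file name: child sitemap XML})."""
    urls = [f"{base}/result?lang=en&id={rid}" for rid in record_ids]
    files = {}
    for n, group in enumerate(chunk(urls), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(group)  # hypothetical naming
    index = build_sitemap_index([f"{sitemap_dir}/{name}" for name in files])
    return index, files
```

`escape` turns the `&` in each query string into `&amp;`, which the sitemap format requires inside `<loc>`.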

For now I'm using a Lambda and will drop the sitemaps in S3. We can then use WebSub to inform search engines of any updates to our sitemap.

Here is Google's documentation that lists everything we need to do: Build and Submit a Sitemap

PVautour commented 1 year ago

@indiciumx

Hello,

I have a lambda generating the sitemaps for all files in a bucket here:

https://ca-central-1.console.aws.amazon.com/lambda/home?region=ca-central-1#/functions/pascal-generate-sitemap?tab=code

I just have to set the correct permissions and environment variables to generate the files and store them in S3.

Two questions:

Feel free to give me a call this morning or Monday afternoon.

jvanulde commented 1 year ago

@indiciumx @PVautour should we put the Lambda code in this repo?

PVautour commented 1 year ago

I don't think so, no. It should probably be either:

Though we could also do a monorepo-style system, in which case the source code/CloudFormation for the Lambda could be put in a folder within the monorepo.

I don't think we want to deploy the resulting sitemaps in this repo. I will discuss deployment of the sitemaps with Chris this morning. I expect we would rather add the sitemaps to the root of the site without managing them in git (server redirects, automatic build process, etc.).

Also, an interesting thing to note: I think this repo currently contains build output, but no actual source code.

On a side note, AWS has always fought me when trying to use git to manage it. At my current level of knowledge, I expect versioning release CloudFormation templates is a realistic middle ground, but if you know of a successful way of decoupling the dev process from the AWS web UI/live environment, I would be happy to learn from you!

I know CloudFormation technically can do that, but it hasn't been super practical for me irl, unfortunately.

PVautour commented 1 year ago

@indiciumx Ok so the sitemaps are here:

s3://webpresence-geocore-misc-stage/Sitemaps/

There is a sitemap.xml at the root that you would want at the root of the site. Check the content of the root sitemap to figure out where to place the rest.

@jvanulde There are still a few improvements to make to the Lambda before we can close the issue, but this should be suitable for our immediate needs.

jvanulde commented 1 year ago

@indiciumx how are we going to test this? I suppose we can put it in a public directory and have Google index it.

bo-lu commented 1 year ago

@PVautour @jvanulde

It looks like this work was done on dev/stage, which has records that only exist on dev/stage.

The lambda should be run on prod.

For example, here are links that work on prod (green, marked +) and links that don't exist on prod (red, marked -):



<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
+  <loc>https://app.geo.ca/result?lang=en&amp;id=0b229ec0-da50-4b29-88da-49c85a5944e2</loc>
</url>
<url>
+  <loc>https://app.geo.ca/result?lang=en&amp;id=0b2303be-ef05-49a8-8082-44a3eabcfa57</loc>
</url>
<url>
-  <loc>https://app.geo.ca/result?lang=en&amp;id=0b258202-2271-4ad2-a44b-f9d8c9281342</loc>
</url>
<url>
-  <loc>https://app.geo.ca/result?lang=en&amp;id=0b346aa1-3090-4223-ac84-ed7287bc78a9</loc>
</url>
<url>
-  <loc>https://app.geo.ca/result?lang=en&amp;id=0b35fc92-9e28-49c7-b1eb-607d2e608509</loc>
</url>
<url>
+  <loc>https://app.geo.ca/result?lang=en&amp;id=0b399378-eff8-4cea-97b8-b307c9b2398a</loc>
</url>
<url>
+  <loc>https://app.geo.ca/result?lang=en&amp;id=0b442f1b-1951-45c8-80ee-cfb8bceb1d72</loc>
</url>
<url>
+  <loc>https://app.geo.ca/result?lang=en&amp;id=0b50b49e-aadc-24c4-ec85-148df785fe5e</loc>
</url>

PVautour commented 1 year ago

The reason this was not run in prod is that we wanted to be able to release the sitemaps as content and not as infra/code.

The plan was to generate the files in staging and then release them, as the records were expected to be the same.

We could either reassess and deploy the code to prod on a timer, or realign content on staging and prod and rerun it.

@bo-lu @indiciumx

bo-lu commented 1 year ago

@PVautour

For this time, I think it is okay. Let's see if Google is able to pick up the sitemap.

Next time, let me know and I will sync staging before the XML files are generated.

PVautour commented 1 year ago

Ok cool thanks bo!

jvanulde commented 1 year ago

@PVautour please close if indexing is successful.

PVautour commented 1 year ago

@indiciumx can you check if indexing is working?

PVautour commented 1 year ago

Indexing seems to be correct for resources in the Google Search Console. I am still waiting for actual indexing, as the status is currently "discovered - not indexed", i.e. pending.

The Lambda that generates the indexes now needs to be tweaked to replicate the manual fixes made to the root sitemap.

PVautour commented 1 year ago

All the sitemaps are currently indexed. The resources within them are all pending indexing. All seems good.

PVautour commented 1 year ago

@jvanulde I was hoping the pages would be indexed by now, but they are not. I'm not sure what we should do about this, but here are some options:

Here is what I did this morning:

Tell me if you wanna chat about this issue or have any ideas.

PVautour commented 1 year ago

Update: The sitemap generation Lambda has been updated to a more permanent solution. It's now production ready and should not require any manual intervention.

Still waiting on Google to index, though. I've looked at the documentation about submitting sitemaps and at the console, and unfortunately I still don't see any option other than waiting for indexing to happen.

jaredkinger commented 1 year ago

Currently there is no sitemap for https://geo.ca URLs and there should be. The sitemap that was submitted contains only https://app.geo.ca URLs, and those pages aren't very search engine friendly. It might be better to focus on getting the more SEO/user friendly URLs on https://geo.ca indexed before moving on to indexing the map results across the various subdomains. If https://geo.ca is going to be the main entry point for users, then we should focus our attention there.

Before trying to get those subdomains indexed, we should try to get some of the SEO fundamentals sorted out too. For example, the https://app.geo.ca results are missing the <meta name="description"> tag, the <title> tag isn't updated on render, and none of them have canonical tags.

It is unlikely that you'll be able to get Google to expedite indexing. Google will basically crawl it when it feels like it. Google has historically had trouble crawling and indexing sites built with React or other JavaScript libraries that rely on client side rendering of content. While this has gotten better, it could be part of the problem. I have some suggestions that may not directly impact indexing but would be helpful for SEO. I'd like to get to know more about app.geo.ca so I can make appropriate suggestions.

@jvanulde @PVautour If there is a chat to be had about this issue please include me.

jaredkinger commented 1 year ago

@PVautour Is it possible to exclude URLs that return "No results." Example: https://app.geo.ca/result?lang=en&id=a8c04b4f-8c62-4d47-b41f-ab81c9865b09

These are going to return a soft 404 and if Google is only crawling X amount of URLs each time it hits the sitemap then we are prolonging the process with URLs that don't add value.

PVautour commented 1 year ago

Thanks for the input, we can certainly have a chat.

I'm pretty sure we want to get the app.geo.ca stuff indexed.

For sure the pages themselves can be improved, though.

I can show you the search console. The pages are actually marked as queued for indexing. It just hasn't been done yet.

I'd be interested in picking your brain about those issues to see what we can do.

Maybe we can have a meeting tomorrow?

jaredkinger commented 1 year ago

Pascal and I had a very productive meeting to hash out some details on how to best move forward with getting the app.geo.ca datasets indexed.

To start, we should remove the currently submitted sitemap. It's not being crawled, it's generating a large number of indexing errors, and the results that are being indexed are of poor quality, with little chance of being found by users or generating clicks. Results that don't generate clicks will rank poorly, further reducing the chances of them being found. The errors, along with the poor quality results generated from the sitemap, are reducing the site's crawl budget and limiting the ability for these datasets to be found and indexed.

Please note that this is not to say that the datasets themselves are of poor quality, just that in Google's eyes what we are currently serving makes for poor quality search results, or content that they have deemed not valuable for search users.

Google determines the amount of crawling resources to give each site, based on the popularity, user value, uniqueness, and serving capacity. The only ways to increase your crawl budget are to increase your serving capacity for crawls, and (more importantly) to increase the value of the content on your site to searchers.

Learn More about Crawl Budget

At the end of the day, Google is a third party and they move at their own pace. To get them indexing faster we need to give them what they want.

Geo.ca Sitemap

A sitemap should be generated for Geo.ca and submitted to Search Console. The Yoast SEO plugin takes care of this sitemap generation, and we will just need to apply a find and replace on the domain in the URLs to move it through the different deployment stages. This sitemap will be a lot smaller and should hopefully be processed quickly. There shouldn't be an issue getting any of the Geo.ca URLs indexed, and while those pages still need some SEO work, they do have the basics and should hopefully be able to start ranking for some long tail searches. If users can find Geo.ca, then they will hopefully click on and follow links to the featured datasets on app.geo.ca and start exploring the data there.
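The find-and-replace step could look like the sketch below. The function name and the staging origin used in the usage note are hypothetical. Anchoring the replacement on `<loc>` keeps the `xmlns` namespace URL and any other text in the file untouched.

```python
def retarget_sitemap(xml_text, source_origin, target_origin):
    """Swap the origin of every <loc> URL, leaving the rest of the file as-is."""
    return xml_text.replace(f"<loc>{source_origin}", f"<loc>{target_origin}")
```

For example, `retarget_sitemap(xml, "https://staging.example.org", "https://geo.ca")` would rewrite a sitemap generated on a (hypothetical) staging host for production.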

Geo.ca links to several datasets on app.geo.ca, and as part of the crawl process Google will organically find and follow those links. While this is only a very small subset of the data available, the context provided by Geo.ca surrounding those links may help with indexing and assist Google in generating better search results for those featured datasets.

Before resubmitting a new sitemap for app.geo.ca, the following items should be done.

The Automatically Extracted Buildings dataset will be used as the basis for the examples below.

Results Page Titles

The <title> tag should be updated on render to reflect the title of the current dataset. Currently all results display GEO.CA Viewer instead of Automatically Extracted Buildings - GEO.CA Viewer, resulting in indexed search results that are confusing and won't garner clicks. Google will sometimes attempt to set a more appropriate title, but not always. Search site:app.geo.ca to see the currently indexed results.

Example of an updated title: <title>Automatically Extracted Buildings - GEO.CA Viewer</title>

Meta Descriptions

The <meta name="description"> tag should be added on render to the <head>. The description being set for og:description meta tag would work well here though we may want to truncate it to 155-160 characters as that is all that will be shown in the search results. This description may not ultimately be shown in the search result as Google may deem another paragraph of content on the page as more descriptive and use that instead. This behaviour can change per key phrase for the same page.

Example: <meta name="description" content="Automatically Extracted Buildings is a raw digital product in vector format created by NRCan. It consists of a single topographical feature class that delineates">

Note that I stripped the line breaks \n and the curly/smart quotes from this description.
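A minimal sketch of that cleanup, assuming the source text is the same string used for og:description. The function name is hypothetical, and handling literal backslash-n sequences is an assumption about how the line breaks appear in the data.

```python
import html
import re

# Curly/smart quotes to drop from the description.
SMART_QUOTES = str.maketrans("", "", "\u2018\u2019\u201c\u201d")


def meta_description(text, limit=160):
    """Build a <meta name="description"> tag from a raw dataset description."""
    cleaned = text.replace("\\n", " ")            # literal "\n" sequences (assumed)
    cleaned = cleaned.translate(SMART_QUOTES)     # strip curly quotes
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse real line breaks
    if len(cleaned) > limit:
        # Cut at the last word boundary that fits within the limit.
        cleaned = cleaned[:limit + 1].rsplit(" ", 1)[0][:limit]
    return f'<meta name="description" content="{html.escape(cleaned, quote=True)}">'
```

Truncating on a word boundary avoids ending the snippet mid-word. Google may still choose a different passage from the page for the search result.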

Permalink Structure

The current query strings are not user or search friendly and should, if possible, be converted to a pretty permalink structure. This will allow keywords to be part of the URL, which is a ranking signal, provide more human readable URLs, and can also increase the shareability of the URLs. The ID may need to be added to the end, or a unique slug system created, to prevent page title collisions. The pretty permalink should be served in the sitemap and should be set as the canonical version.

Example: https://app.geo.ca/result?id=7a5cda52-c7df-427f-9ced-26f19a8a64d6&lang=en should be converted to https://app.geo.ca/result/en/automatically-extracted-buildings/ or https://app.geo.ca/result/en/automatically-extracted-buildings/7a5cda52-c7df-427f-9ced-26f19a8a64d6/ if we are unable to implement unique slugs.

A unique slug system would save a sanitized, URL friendly version of the dataset title (automatically-extracted-buildings) in addition to the unique ID of each record to avoid collisions. In scenarios where two or more pages have the same title, the slugs can be kept unique by appending a number to the end of the sanitized title. The first result to have its slug generated would have no number. This would be recommended over appending the ID, for shorter, more friendly URLs, though there may be another identifier that could be used to keep these unique, and this is open to suggestions.

Example:

Ideally the query string versions would redirect to their permalink. When linking to the results from other sites or sharing URLs the permalink should be used.

Slug Definition: A URL slug refers to the end part of a URL after the last slash ("/") that identifies the specific page or post.
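The unique slug idea can be sketched as follows. The function names are hypothetical, and starting the numeric suffix at 2 is an assumption (the proposal only says the first record gets no number). In practice the assigned slugs would need to be stored alongside the record IDs so they stay stable between runs.

```python
import re
import unicodedata


def slugify(title):
    """Lowercase, strip accents, and replace runs of other characters with '-'."""
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore")
                   .decode("ascii"))
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def assign_slugs(records):
    """records: iterable of (record_id, title). Returns {record_id: unique slug}."""
    used = set()
    slugs = {}
    for record_id, title in records:
        base = slugify(title)
        slug, n = base, 1
        while slug in used:          # collision: try base-2, base-3, ...
            n += 1
            slug = f"{base}-{n}"
        used.add(slug)
        slugs[record_id] = slug
    return slugs


def permalink(lang, slug, origin="https://app.geo.ca"):
    """Build the pretty permalink in the /result/<lang>/<slug>/ shape proposed above."""
    return f"{origin}/result/{lang}/{slug}/"
```

Checking against the set of already used slugs, rather than just counting titles, also covers the case where a numbered slug happens to match another dataset's real title.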

Canonical Tag

The canonical tag allows us to explicitly tell search engines the version of a URL we want indexed. This prevents multiple versions of the same page from being indexed, crawled or flagged as duplicate content. The pretty permalink should be used for this. This would also come into play should we start attempting to track any additional data via a query string that has no bearing on the content. An absolute URL should be used for the canonical.

Example: <link rel="canonical" href="https://app.geo.ca/result/en/automatically-extracted-buildings/">

Examples of URLs that would all be the same:

Learn More about Canonical URLs

Alternate Links for Localized versions

Datasets are available in both English and French. To assist search engines, and ultimately our users, in finding content in their preferred language, we can take advantage of <link rel="alternate"> tags to explicitly indicate our translated versions. Alternate links should be added for both languages to each dataset, and the URLs should be absolute.

Example:
<link rel="alternate" hreflang="en-ca" href="https://app.geo.ca/result/en/automatically-extracted-buildings/">
<link rel="alternate" hreflang="fr-ca" href="https://app.geo.ca/result/fr/batiments-extraits-automatiquement/">

Learn More about Localized Page Versions
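As a rough sketch, the canonical and alternate tags for a dataset page could be generated from its per-language permalinks like this. The function name and the input shape are assumptions, and the permalinks are the proposed ones, not URLs that exist today.

```python
import html


def head_links(permalinks, current_lang):
    """permalinks: {"en": absolute URL, "fr": absolute URL}.

    Returns the canonical tag for the current language followed by one
    alternate tag per language version (including the current one).
    """
    tags = [
        f'<link rel="canonical" href="{html.escape(permalinks[current_lang])}">'
    ]
    for lang, url in permalinks.items():
        tags.append(
            f'<link rel="alternate" hreflang="{lang}-ca" href="{html.escape(url)}">'
        )
    return tags
```

Each language version lists every alternate, itself included, so the English and French pages reference each other.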

On a side note the <html lang="en"> should be updated to use the Canadian English subset <html lang="en-ca">. When serving French content the lang attribute should be updated to reflect that the document is in French using <html lang="fr-ca">. This is for accessibility and helps signal to screen readers the language of the document.

Another side note to consider with translation is having the root URL include the language for at least the French version. When toggling between languages, this gives the user a defined URL for their preferred language that they can bookmark, link to or share. This may have an effect on the permalink structure, as language should potentially always follow the domain in the URL.

Example: https://app.geo.ca defaults to English; https://app.geo.ca/fr/ displays French

Lastmod tag

The <lastmod> tag should be added to each sitemap entry with a date that signals the last time the content on that page was updated. Crawlers read sitemaps regularly, and this lets them know when there is fresh content on a page and that they should crawl it again. See W3C Datetime for acceptable timestamp formats.
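A minimal sketch of a sitemap entry with a W3C Datetime timestamp. The function name is hypothetical, and the caller is assumed to pass a timezone-aware datetime taken from the record's last update.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def url_entry(loc, last_modified):
    """Render one <url> element with <lastmod> in W3C Datetime format (UTC)."""
    stamp = last_modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    return (f"<url>\n  <loc>{escape(loc)}</loc>\n"
            f"  <lastmod>{stamp}</lastmod>\n</url>")
```

A date-only value (YYYY-MM-DD) is also valid W3C Datetime if the update time is not known precisely.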

No results and soft 404s

Wasting server resources on unnecessary pages can reduce crawl activity from pages that are important to you, which may cause a significant delay in discovering great new or updated content on a site.

Datasets that return "No results." should not be included in the sitemap. These pages create soft 404 indexing errors and have a negative impact on the crawl budget. This was caused by staging and production being out of sync. Care should be taken to ensure there is a 1:1 match between the environments when generating the sitemap. Additionally, if possible, checks should be made to exclude missing datasets.

Ideally a 404 http status code should be returned for missing results though this may not be possible with the current architecture.
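One way to exclude missing datasets is to compare the candidate IDs against the IDs known to exist in production before writing the sitemap. This sketch assumes both lists are available to the generator; the function name is hypothetical and how the production IDs are fetched is left out.

```python
def split_by_availability(candidate_ids, production_ids):
    """Return (ids to include in the sitemap, ids missing from production)."""
    available = set(production_ids)
    keep = [rid for rid in candidate_ids if rid in available]
    missing = [rid for rid in candidate_ids if rid not in available]
    return keep, missing
```

Logging the `missing` list would also give an early signal that staging and production have drifted apart.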

Server Side Rendering

This would be a nice to have and is something to consider for down the road. Server Side Rendering could help improve page performance which in turn would help with crawl efficiency. Page performance is also used as a ranking signal.

Summary (TLDR)

The current app.geo.ca sitemap is hurting more than helping and should be removed. Some basic SEO should be done and permalinks implemented on app.geo.ca so we can resubmit a clean, search engine friendly sitemap. A sitemap for geo.ca should be generated and added as soon as possible to get it indexed and ranking, and to provide a path for search users to find geo.ca and in turn discover app.geo.ca.

app.geo.ca Tasks

geo.ca tasks

@PVautour would handle changes on the app.geo.ca side and I would handle changes for geo.ca.

@jvanulde can you please provide your approval for this path forward?

@sean-eagles please assign me to this task. Thank you.

PVautour commented 1 year ago

Impressive write-up, Jared. I'm glad we got you on the team!

In my opinion:

What is proposed > Waiting longer for my sitemap to be indexed

sean-eagles commented 1 year ago

Thanks Jared, very nice writeup, you have been assigned.

PVautour commented 1 year ago

Problematic sitemaps have been deleted in prod.

PVautour commented 1 year ago

We now set lastmod in sitemaps. This is in staging and should be reflected when we do redeploy the sitemaps.

jaredkinger commented 1 year ago

I've set up Simply Static to use absolute URLs. The sitemap index and sitemaps are now included in the static site generation. Additionally, I've added all of the /Viewer/ links to the sitemap. Next time we redeploy the static site, we can submit the sitemap to Search Console.

jaredkinger commented 1 year ago

The sitemap index has been submitted to search console.

[image: sitemap-index]

jaredkinger commented 1 year ago

It's been almost two weeks (13 days) since I submitted the sitemap, and Google is slowly but surely indexing the links within it.

Of the 73 links currently in the sitemap, we have gone from 17 -> 28 -> 34 pages now indexed. There are some indexing issues currently flagged for the sitemap URLs, but those are slowly resolving themselves as Google recrawls the pages.

[image: indexing-progress]