Google Dataset search fields added for dataset pages (automatically)

rufuspollock commented 4 months ago

Add some special fields to DataHub dataset pages so they get indexed better by Google.

See here for instructions https://developers.google.com/search/docs/appearance/structured-data/dataset

Should be pretty simple to do from the metadata we already have for datasets ...

### Tasks
- [x] Shape this piece of work e.g. research what fields to add, how we could add them ⏲️2h
- [ ] TODO: add implementation steps ...

rufuspollock commented 4 months ago

I think the value of this is very high and I suspect the effort of doing it is very low. We just need to add some fields to the HTML <head>.

gradedSystem commented 4 months ago

Situation

Enhancing dataset page indexing in Google Search is crucial for improving visibility and accessibility of our content.

Problem

Currently, our dataset pages lack structured data fields required for optimal indexing according to schema.org standards.

Solution

Implement structured data fields using JSON-LD to provide search engines with detailed metadata about our datasets.

Appetite

Implementation of JSON-LD structured data should be completed within 2-3 days, including testing and adjustments.

Rabbit-holes

Ensuring all required fields are correctly populated in JSON-LD.
Testing and validating the impact on search rankings may require ongoing monitoring in Google Search Console.
Handling potential discrepancies between schema.org guidelines and actual search engine algorithms.
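
To guard against the first rabbit-hole, the generated object could be checked before it is rendered. A minimal sketch, assuming `name` and `description` are the required properties and that a description should be 50 to 5000 characters long, which is how the Google guidelines linked in this issue appear to read (worth re-checking there). `checkDatasetJsonLd` is a hypothetical helper, not existing DataHub code:

```typescript
// Hypothetical pre-render check for a schema.org Dataset object.
// Returns a list of problems; an empty list means the basic checks pass.
function checkDatasetJsonLd(ld: { [key: string]: unknown }): string[] {
  const problems: string[] = [];

  if (ld["@type"] !== "Dataset") {
    problems.push('"@type" should be "Dataset"');
  }

  // Assumption: these two fields are the ones Google treats as required.
  for (const field of ["name", "description"]) {
    const value = ld[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`missing required field "${field}"`);
    }
  }

  // Assumption: the 50-5000 character range comes from Google's guidelines.
  const description = ld["description"];
  if (
    typeof description === "string" &&
    description.trim() !== "" &&
    (description.length < 50 || description.length > 5000)
  ) {
    problems.push("description should be 50 to 5000 characters long");
  }

  return problems;
}
```

This only catches empty or missing values; it does not replace testing real pages with Google's own tools.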

No-goes

Avoid implementing incomplete or incorrect JSON-LD structures that could potentially harm search engine indexing.

Appendix

Example JSON-LD script and suggestions for testing on specific dataset pages like Air Pollution Collection. Regular monitoring through Google Search Console is recommended for evaluating effectiveness.

olayway commented 4 months ago

@gradedSystem

Can you create a draft of the JSON-LD that specifies exactly which fields we'd include, and which part of the Data Package each one would come from? Something like:

{
  ...
  name: datapackage.title,
  description: datapackage.description,
  license: datapackage.licenses[0],
  ...
}

olayway commented 4 months ago

This may also be helpful when it comes to implementation: https://nextjs.org/docs/app/building-your-application/optimizing/metadata#json-ld
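
Whichever rendering approach is used, the payload ends up as a script tag in the page. A framework-independent sketch of that serialization step, assuming the JSON-LD object is already built; `jsonLdScriptTag` is a hypothetical helper name:

```typescript
// Serialize a JSON-LD object into a <script> tag for the page <head>.
function jsonLdScriptTag(jsonLd: object): string {
  // Escape "<" so a value containing "</script>" cannot close the tag early.
  // "\u003c" is a valid JSON escape, so parsers still read it as "<".
  const json = JSON.stringify(jsonLd).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">${json}</script>`;
}
```

The escaping matters because dataset titles and descriptions come from user-supplied metadata.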

gradedSystem commented 4 months ago

Here is a JSON-LD draft in which I tried to incorporate everything from the metadata that is available here: https://specs.frictionlessdata.io/data-package/#metadata

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": "datapackage.description",
  "name": "datapackage.title",
  "alternateName": "datapackage.name",
  "url": "datapackage.homepage",
  "identifier": [
    "datapackage.id[0]",
    "datapackage.id[1]",
    ...
  ],
  "isAccessibleForFree": true,
  "license": [
    {
      "@type": "CreativeWork",
      "name": "datapackage.licenses[0].title",
      "identifier": "datapackage.licenses[0].name",
      "url": "datapackage.licenses[0].path"
    },
    {
      "@type": "CreativeWork",
      "name": "datapackage.licenses[1].title",
      "identifier": "datapackage.licenses[1].name",
      "url": "datapackage.licenses[1].path"
    },
    ...
  ],
  "creator": [
    {
      "@type": "Person",
      "url": "datapackage.contributors[0].path",
      "name": "datapackage.contributors[0].title",
      "affiliation": {
        "@type": "Organization",
        "name": "datapackage.contributors[0].organization"
      },
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[0].email"
      }
    },
    {
      "@type": "Person",
      "url": "datapackage.contributors[1].path",
      "name": "datapackage.contributors[1].title",
      "affiliation": {
        "@type": "Organization",
        "name": "datapackage.contributors[1].organization"
      },
      "contactPoint": {
        "@type": "ContactPoint",
        "email": "datapackage.contributors[1].email"
      }
    },
    ...
  ],
  "isPartOf": [
    "datapackage.sources[0].path",
    "datapackage.sources[1].path",
    ...
  ],
  "dateCreated": "datapackage.created",
  "dateModified": "datapackage.updated",
  "citation": "datapackage.id",
  "version": "datapackage.version"
}
</script>
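
The draft above could be generated from a parsed `datapackage.json` roughly as follows. This is only a sketch covering a subset of the fields: the `DataPackage` interface is a hypothetical minimal shape, `toDatasetJsonLd` and `compact` are illustrative names, and treating every contributor as a `Person` is an assumption. Note that `@type` takes literal schema.org type names rather than values from the Data Package.

```typescript
// Hypothetical minimal shape of the Data Package fields used below.
interface License { name?: string; title?: string; path?: string }
interface Contributor { title: string; email?: string; path?: string; organization?: string }
interface DataPackage {
  name: string;
  title?: string;
  description?: string;
  homepage?: string;
  version?: string;
  created?: string;
  licenses?: License[];
  contributors?: Contributor[];
}

type JsonObject = { [key: string]: unknown };

// Drop undefined values and empty arrays so the output has no blank fields.
function compact(obj: JsonObject): JsonObject {
  const out: JsonObject = {};
  for (const key of Object.keys(obj)) {
    const value = obj[key];
    if (value === undefined) continue;
    if (Array.isArray(value) && value.length === 0) continue;
    out[key] = value;
  }
  return out;
}

// Map a Data Package descriptor to a schema.org Dataset object.
function toDatasetJsonLd(dp: DataPackage): JsonObject {
  return compact({
    "@context": "https://schema.org/",
    "@type": "Dataset",
    // Prefer the human-readable title; fall back to the slug-like name.
    name: dp.title ?? dp.name,
    alternateName: dp.title ? dp.name : undefined,
    description: dp.description,
    url: dp.homepage,
    version: dp.version,
    dateCreated: dp.created,
    license: (dp.licenses ?? []).map((l) =>
      compact({ "@type": "CreativeWork", name: l.title ?? l.name, url: l.path })
    ),
    // Assumption: contributors are people; organizations would need "Organization".
    creator: (dp.contributors ?? []).map((c) =>
      compact({ "@type": "Person", name: c.title, url: c.path, email: c.email })
    ),
  });
}
```

Omitting missing fields, instead of emitting empty strings, keeps the output free of empty values when a dataset has sparse metadata.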

cc @olayway

olayway commented 4 months ago

The only question I have is whether we can also use other fields listed in schema.org. I'll try to find out. But I think we're good to go.

olayway commented 3 months ago

@gradedSystem what's the status of this?

olayway commented 3 months ago

The script is now being added to the HTML successfully.

But when testing any of our core site URLs, it seems they can't even be accessed.

This is because our dataset pages still return 500 initially. It's an old issue that we thought was fixed (or rather, for which we found a workaround): https://github.com/datopian/datahub-next/issues/275

FIXED. I will open a new issue for the 500 errors.

datopian / datahub