ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

Additional hints in sitemaps to support efficient harvesting #200

Open datadavev opened 2 years ago

Some collections may have large numbers of records describing different kinds of information (e.g. Datasets, Awards, and People) that may each have landing pages, and each landing page may have an entry in the sitemap.

An indexer interested only in Datasets would need to inspect every entry advertised in the sitemap to find the Dataset entries, which is inefficient and a needless use of resources.

Sitemaps are extensible, and one option may be to provide type hints in the <url> section of the sitemap. For example:

<?xml version="1.0" encoding="UTF-8"?>
<urlset 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <url>
    <loc>https://arcticdata.io/catalog/view/doi%3A10.18739%2FA2ST7DZ2Q</loc>
    <lastmod>2021-12-07T12:15:05Z</lastmod>
    <rdf:type>http://schema.org/Dataset</rdf:type>
  </url>
</urlset>

An obvious challenge is that a single landing page may express many types, so which one should the hint specify? That choice would be up to the provider: if the page referenced by <loc> is clearly intended to present a specific type, the provider can supply a hint, and consumers may use it.
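To illustrate the consumer side, here is a minimal sketch of how an indexer might use such hints to select only Dataset entries without fetching every landing page. It assumes the hint is an `rdf:type` child of `<url>` exactly as in the example above and that types are compared as exact strings. The function name `urls_with_type_hint`, the `include_unhinted` flag, and the `example.org` URLs are hypothetical and not part of any existing tool or agreed convention.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def urls_with_type_hint(sitemap_xml, wanted_type, include_unhinted=False):
    """Return <loc> values whose <url> entry carries an rdf:type hint equal to wanted_type.

    Hypothetical helper: hints are optional, so include_unhinted=True also
    returns entries with no hint, which the caller would then inspect the slow way.
    """
    root = ET.fromstring(sitemap_xml)
    matches = []
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        loc = url.findtext(f"{{{SITEMAP_NS}}}loc")
        if loc is None:
            continue
        # A provider could emit zero, one, or several hints per entry.
        hints = {
            el.text.strip()
            for el in url.findall(f"{{{RDF_NS}}}type")
            if el.text
        }
        if wanted_type in hints or (include_unhinted and not hints):
            matches.append(loc.strip())
    return matches


# Illustrative sitemap with made-up URLs (assumption, not real data).
SAMPLE = """<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <url>
    <loc>https://example.org/dataset/1</loc>
    <rdf:type>http://schema.org/Dataset</rdf:type>
  </url>
  <url>
    <loc>https://example.org/person/1</loc>
    <rdf:type>http://schema.org/Person</rdf:type>
  </url>
  <url>
    <loc>https://example.org/unhinted</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for loc in urls_with_type_hint(SAMPLE, "http://schema.org/Dataset"):
        print(loc)
```

Because the hint is advisory, a consumer would probably still confirm the type from the landing page's JSON-LD after fetching it. The sketch also does not reconcile `http://schema.org/` with `https://schema.org/` identifiers, which a real harvester would likely need to normalize.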