I-GUIDE / data-catalog-archive

The IGUIDE data catalog

RFC: Can Schema.org be used to capture more than just "Core" metadata? #1

Closed. Castronova closed this issue 7 months ago

Castronova commented 2 years ago

@aaraney @sblack-usu @horsburgh

HydroShare uses Schema.org to capture "core" resource metadata using Thing/CreativeWork/Dataset. We can adopt this pattern for the IGUIDE data catalog. However, can Schema.org also be used to describe data-type-specific metadata? For instance:

Spatial Data

Temporal

Software

Moreover, can we aggregate datasets (or resources) that consist of more than one type of dataset using https://schema.org/DataCatalog ?

{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",
  "name": "Nitrous oxide concentrations from the R/V Falkor expedition FK160115 in the Central Pacific from January to February 2016",
  "dataset": [{
    "@type": "Dataset",
    "name": "dataset 1"
  },
  {
    "@type": "Dataset",
    "name": "dataset 2"
  },
  {
    "@type": "Dataset",
    "name": "dataset 3"
  }]
}

Or perhaps using the distribution element to provide spatial and temporal metadata for individual files.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://darchive.mblwhoilibrary.org/bitstream/1912/28977/1/dataset-775849_proteomz-nitrous-oxide-data__v1.tsv",
      "encodingFormat": "text/tab-separated-values",
      "contentSize": "15077 bytes"
    }
  ]
}

It may be advantageous for us to align closely with Schema.org, rather than creating custom JSON-LD, because it's gaining traction in our community and is already used in several scientific data repositories. This could potentially enable us to ingest (or harvest) directly from existing data catalogs in the future.

Castronova commented 2 years ago

This is a great resource that dives deeper into the Schema.org recommendations from ESIP: https://github.com/ESIPFed/science-on-schema.org/blob/226-esip-summer-mtg-2022-tutorial/tutorials/esip-summer-mtg-2022/README.md

horsburgh commented 2 years ago

HydroShare's Schema.org implementation followed the ESIP Science on Schema.org recommendations that were available at the time we did the mapping. We probably need to revisit the mapping to stay current.

horsburgh commented 2 years ago

@Castronova - spatial and temporal are properties of a Schema.org Dataset. Technically, they are properties of a CreativeWork, and a Dataset is a CreativeWork. They aren't data types themselves. The spatial property is expected to be of type Place, and the temporal property is expected to be of type DateTime or Text.
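
As a rough illustration of spatial and temporal information as properties of a Dataset (not HydroShare's actual markup), the sketch below builds a JSON-LD record as a Python dict and prints it. It uses the more specific spatialCoverage and temporalCoverage properties, which Schema.org also defines on CreativeWork. The dataset name, bounding box, and date range are made-up placeholders.

```python
import json

# Hypothetical record: the name, bounding box, and date range are placeholders.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    # spatialCoverage expects a Place; its extent is given here as a GeoShape box.
    "spatialCoverage": {
        "@type": "Place",
        "geo": {
            "@type": "GeoShape",
            # Two corner points, lower corner then upper corner,
            # written here as "latitude longitude" pairs.
            "box": "39.3 -112.2 41.5 -110.9",
        },
    },
    # temporalCoverage accepts text, here an ISO 8601 start/end interval.
    "temporalCoverage": "2016-01-15/2016-02-15",
}

# Serialize to the JSON-LD text that would be embedded in a landing page.
print(json.dumps(dataset, indent=2))
```

Both properties sit directly on the Dataset node, so they describe the dataset as a whole rather than any individual file within it.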

Yes, Schema.org could be used to help with some data-type specific metadata. I hadn't looked at SoftwareSourceCode or WebApplication yet, but both seem useful. We have a granularity issue that we would have to work through, though. The semantics of the Schema.org metadata embedded in the landing page for a HydroShare resource are such that the metadata describe the resource as a whole. In other words - HydroShare's Schema.org metadata is designed to facilitate discovery of resources (and not specific content within resources).

horsburgh commented 2 years ago

With regard to DataCatalog - Are you suggesting that a HydroShare resource could be considered a DataCatalog? The definition is simply "A collection of datasets", so I guess that works semantically, but I'm not sure what the implications are for crawlers of Schema.org metadata. For instance, if Google's crawlers for Google Dataset Search focus on the Dataset object, then would a DataCatalog also show up in Google Dataset Search? I'm not sure. We don't want to ruin HydroShare's current functionality with Google Dataset Search.

This documentation indicates that DataCatalog should be used for describing an "overall collection". DataCatalog has a property "dataset" to indicate a dataset contained within the DataCatalog. This documentation also suggests that a Dataset object may itself be a collection of data about something. The definition of Dataset is "A body of structured information describing some topic(s) of interest."

If we want to embed information about the content aggregations within a HydroShare resource, I suggest we keep our existing Schema.org implementation, but use the property hasPart. This would indicate that a HydroShare resource, which is a Schema.org Dataset, has multiple parts that are themselves Schema.org CreativeWork objects that could be further described in the embedded JSON.
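
A minimal sketch of what the hasPart approach might look like, again built as a Python dict and printed as JSON-LD. This is not the existing HydroShare implementation; the resource name, the part names, and the choice of part types are hypothetical and only meant to show parts of different CreativeWork subtypes nested under one Dataset.

```python
import json

# Hypothetical resource: all names and values below are placeholders.
resource = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example HydroShare resource",
    # hasPart expects CreativeWork objects; Dataset and SoftwareSourceCode
    # are both CreativeWork subtypes, so mixed content types can be listed.
    "hasPart": [
        {
            "@type": "Dataset",
            "name": "Time series aggregation",
            "temporalCoverage": "2016-01-15/2016-02-15",
        },
        {
            "@type": "SoftwareSourceCode",
            "name": "Processing script",
            "programmingLanguage": "Python",
        },
    ],
}

# Collect the type of each part to show the mix of content types.
part_types = [part["@type"] for part in resource["hasPart"]]

print(json.dumps(resource, indent=2))
print(part_types)
```

The top-level node stays a Dataset, so resource-level discovery metadata is unchanged, while each part carries its own type-specific properties.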

With regard to your link about containers - If we are going to use Schema.org, we should probably stick to what is accepted in Schema.org right now. Otherwise, crawlers may not know how to interpret other stuff we might put in there.

Castronova commented 2 years ago

Thanks Jeff, that's kind of what I was thinking, but I'm still struggling to identify best practices. It seems like we can leverage Schema.org elements in some interesting ways, such as HydroShare resources being "DataCatalogs," but then we don't know if web crawlers will scrape the metadata. There seems to be a disconnect between the schema definitions and web crawler implementation, and I'm not sure the latter is documented anywhere (let me know if you know of docs somewhere).

Regarding the containers, I completely agree. It looks very cool, but it's just a one-off prototype and is not part of the standard. I guess my motivation for trying to have the IGUIDE catalog "harvest" metadata from strictly Schema.org definitions (if possible) is so that we aren't creating one-off solutions for metadata that doesn't fit within the Dataset entity.

horsburgh commented 2 years ago

@Castronova - I don't know of anywhere that documents what the crawlers actually do. Google has a structured data testing tool that will tell you if your Schema.org implementation is valid, but it doesn't tell you what they actually do with what you provide. But, let's start with Schema.org and see what we can do!

aaraney commented 2 years ago

Thanks for the above discourse. In general, I think I am on the same page with both of you. It seems that the conversation has shifted a bit towards how a search engine might use these metadata. And right, that is the founding purpose of Schema.org, but I think it is important for us to think about how our own search engine(s) will use these metadata.

Hopefully more concretely, it seems apparent that you can specify a thing's elements using different properties. Do we have a preference based on the search and discovery approach we plan on taking? Do we envision that our search and discovery mechanisms are driven by something like an RDF triple store or a property graph approach? Based on this answer, does this affect our usage of Schema.org? If so, how? As an example, @horsburgh mentioned the hasPart property. Do we have a preference on the predicates (the predicate here being hasPart) that we suggest users (and metadata-generating services) use when describing a thing, with the intent of simplifying or improving our search and discovery mechanisms?
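
To make the "predicate" framing concrete, here is a simplified sketch of how a nested JSON-LD-like record could be flattened into (subject, predicate, object) triples, the shape a triple store would index. It is not a real JSON-LD processor: it does no context expansion, and the to_triples helper, the blank-node labels, and the example record are all made up for illustration.

```python
from itertools import count


def to_triples(node, counter=None):
    """Flatten a nested JSON-LD-like dict into (subject, predicate, object) tuples.

    Returns (subject_label, triples). Nodes without an "@id" get a made-up
    blank-node label (_:b0, _:b1, ...). "@context" is ignored entirely.
    """
    counter = counter if counter is not None else count()
    subject = node.get("@id", f"_:b{next(counter)}")
    triples = []
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        predicate = "rdf:type" if key == "@type" else key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                # Nested node: link to it, then include its own triples.
                child, child_triples = to_triples(item, counter)
                triples.append((subject, predicate, child))
                triples.extend(child_triples)
            else:
                triples.append((subject, predicate, item))
    return subject, triples


# Hypothetical record using hasPart as the linking predicate.
resource = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example resource",
    "hasPart": [
        {"@type": "Dataset", "name": "part 1"},
        {"@type": "Dataset", "name": "part 2"},
    ],
}

root, triples = to_triples(resource)
# Find everything the root node links to through the hasPart predicate.
parts = [o for s, p, o in triples if s == root and p == "hasPart"]

for triple in triples:
    print(triple)
```

Whichever predicate is chosen (hasPart, dataset, distribution) becomes the edge label a query has to match, which is why settling on a preferred predicate would simplify search.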

Castronova commented 2 years ago

Josh directed me to a couple of other efforts that we may also want to explore in addition to Schema.org:

  1. https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System
  2. https://www.ogc.org/def-server

Just dropping these here so we don't forget them.