ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

how to specify identifier for the metadata record #245

Open smrgeoinfo opened 1 year ago

smrgeoinfo commented 1 year ago

Building on #210, here's a discussion that's come up in a CODATA WG. SOSO should weigh in on this issue in the guidelines.

A metadata record has two parts; one part is about the metadata record itself, the other part is the content about the resource that the metadata documents. The part about the record specifies the identifier for the metadata record, agents with responsibility for the record, when it was last updated, what specification or profiles the metadata serialization conforms to, and other optional properties that are deemed useful. The metadata about the resource has properties about the resource like title, description, responsible parties, spatial or temporal extent (as outlined in the Metadata Content Requirements section).

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata (sdDatePublished, sdLicense, sdPublisher), but it lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself. There are two patterns that could be used to structure the two parts of the metadata record:

Option 1. The root object is the described resource:

{    "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "title": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": "https://example.org/cdif-metadataSpec"
        },
        "about":{"@id":"ex:URIforDescribedResource"}
    }
}

Option 2: root object is the metadata record

{   "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": {
        "@type": "MediaObject",
        "dcterms:conformsTo": "https://example.org/cdif-metadataSpec"
    },
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "title": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"}
    }
}

The RDF triples generated by these two approaches are identical, so if the metadata are always harvested into a triple store it makes no difference. However, allowing either approach would create interoperability problems for harvesters that parse the metadata as JSON: the paths to the same metadata elements differ between the two approaches. It is our judgment that option 1 above (root object is the described resource) is the more widely used serialization, commonly either without any properties specific to the metadata record, or with the schema.org 'sd...' properties providing some of the metadata about the metadata.
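To make the path problem concrete, here is a minimal Python sketch. It uses trimmed copies of the two options (the encoding node is omitted), and the helper names `value_at_path` and `find_node` are hypothetical, introduced only for illustration. It treats the documents as plain JSON and is not a JSON-LD processor.

```python
# Trimmed copies of option 1 and option 2 (encoding node omitted).
option1 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
    },
}
option2 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
    },
}


def value_at_path(doc, path):
    """Follow a fixed list of keys; return None if any key is missing."""
    current = doc
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def find_node(value, node_id):
    """Depth-first search for the first object with this @id that has
    properties beyond the @id itself (bare references are skipped)."""
    if isinstance(value, dict):
        if value.get("@id") == node_id and len(value) > 1:
            return value
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return None
    for child in children:
        found = find_node(child, node_id)
        if found is not None:
            return found
    return None


# A fixed path that works for option 1 finds nothing in option 2.
print(value_at_path(option1, ["subjectOf", "dateModified"]))
print(value_at_path(option2, ["subjectOf", "dateModified"]))

# Looking the record up by its @id works for both serializations.
print(find_node(option1, "ex:URIforTheMetadata")["dateModified"])
print(find_node(option2, "ex:URIforTheMetadata")["dateModified"])
```

The fixed path only succeeds on the serialization it was written for, while the lookup by `@id` does not depend on nesting. A harvester that needs to be robust would more likely hand the document to a real JSON-LD processor than rely on a walker like this.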

What should be the recommended serialization?

datadavev commented 1 year ago

JSON-LD is a serialization of RDF, so it describes a graph. I'm not sure there's a definitive "root object" in these examples. Instead there's a graph of related nodes, any of which could be considered a root to start traversal (except perhaps the anonymous MediaObject node). So it is pretty much always incorrect to treat a JSON-LD document as a plain JSON document: semantics inferred from JSON, such as list ordering, don't apply, and conversely the semantics provided by JSON-LD parsing rules, such as graph structure, are unknown to a JSON parser. It's much the same as trying to use XPath to process RDF/XML documents: it will work for specific examples but fails in the general case.

datadavev commented 1 year ago

I believe this is the type of issue that is addressed by JSON-LD Framing ^1. So instead of suggesting a preferred pattern of serialization (generally cumbersome when serializing from an RDF source), it may be more appropriate to suggest a frame document to apply when inspecting the graph from a particular perspective. If the desired form is option 1 above, then use a frame like:

{
  "@context": {
    "@vocab":"http://schema.org/",
    "subjectOf":{"@reverse":"about"}
  },
  "@type": "ImageObject",
  "subjectOf": {}
}

This places the ImageObject at the top level which may make parsing with plain JSON a little more tractable.
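As a rough illustration of what "placing the ImageObject at the top level" means for a plain-JSON consumer, the sketch below re-roots an option 2 style document by hand. This is not the JSON-LD Framing algorithm, and `reroot_on_about` is a hypothetical helper; it assumes a single embedded `about` node with an `@id`. In practice the frame above would be applied with a conforming JSON-LD processor.

```python
import copy

# Option 2 style input: the metadata record is the outer object.
option2 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "title": "Picture of analytical setup",
        "subjectOf": {"@id": "ex:URIforTheMetadata"},
    },
}


def reroot_on_about(record):
    """Return a copy shaped like option 1: the node under 'about'
    becomes the outer object and the record is nested under 'subjectOf'."""
    record = copy.deepcopy(record)       # leave the caller's document untouched
    context = record.pop("@context", None)
    resource = record.pop("about")
    # The record keeps a bare reference back to the resource it describes.
    record["about"] = {"@id": resource["@id"]}
    # The bare subjectOf reference is replaced by the full record.
    resource["subjectOf"] = record
    rooted = {}
    if context is not None:
        rooted["@context"] = context
    rooted.update(resource)
    return rooted


rooted = reroot_on_about(option2)
print(rooted["@type"])
print(rooted["subjectOf"]["dateModified"])
```

The statements expressed are unchanged; only the nesting differs, which is the point made above about there being no definitive root in the graph.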

smrgeoinfo commented 1 year ago

Makes sense to address the alternate serializations; it only matters if you're parsing the metadata as JSON. That there needs to be an identifier for the metadata digitalObject distinct from the thing it describes is what I should have emphasized.

datadavev commented 10 months ago

Reading through the examples above, it strikes me that neither option 1 nor option 2 provides an identifier for the described resources. So although this issue seems to be about the preferred serialization pattern for nested documents, it may be helpful to make the example a little more complete by adding schema:identifier properties to the DigitalDocument and the ImageObject. That makes a clear statement that the entities described by the graph have those identifiers. Otherwise one might infer that the @id values (i.e. the graph node identifiers) are equivalent to the identifiers for the objects being described, which seems incorrect. The @id value is an identifier for a node in the graph; schema:identifier is an identifier for the thing being described by the graph.

So rewriting option 1:

{    
    "@context": "https://schema.org",
    "@id": "ex:URIforImageObjectNode",
    "@type": "ImageObject",
    "title": "Picture of analytical setup",
    "description": "Description of the resource",
    "identifier": "ex:URIforDescribedResource",
    "subjectOf": {
        "@id": "ex:URIforDigitalDocumentNode",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "identifier": "ex:URIforTheMetadata",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": "https://example.org/cdif-metadataSpec"
        },
        "about":{"@id":"ex:URIforImageObjectNode"}
    }
}

makes it clear that ex:URIforImageObjectNode is the node identifier for the graph about the ImageObject with identifier ex:URIforDescribedResource, and that ImageObject is the subject of a DigitalDocument. That DigitalDocument is in turn described by the graph with node identifier ex:URIforDigitalDocumentNode and the document itself has an identifier ex:URIforTheMetadata.
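A small sketch of how a consumer might read the two kinds of identifier separately, assuming the rewritten option 1 above (encoding node omitted here). The helper `node_identifiers` is hypothetical and walks the document as plain JSON rather than as parsed JSON-LD.

```python
# Rewritten option 1, trimmed (encoding node omitted).
doc = {
    "@context": "https://schema.org",
    "@id": "ex:URIforImageObjectNode",
    "@type": "ImageObject",
    "identifier": "ex:URIforDescribedResource",
    "subjectOf": {
        "@id": "ex:URIforDigitalDocumentNode",
        "@type": "DigitalDocument",
        "identifier": "ex:URIforTheMetadata",
        "about": {"@id": "ex:URIforImageObjectNode"},
    },
}


def node_identifiers(value, found=None):
    """Map each node's @id to its 'identifier' value (None if absent)."""
    if found is None:
        found = {}
    if isinstance(value, dict):
        node_id = value.get("@id")
        # Bare references such as {"@id": "..."} carry no properties; skip them.
        if node_id is not None and len(value) > 1:
            found[node_id] = value.get("identifier")
        for child in value.values():
            node_identifiers(child, found)
    elif isinstance(value, list):
        for child in value:
            node_identifiers(child, found)
    return found


for node_id, identifier in node_identifiers(doc).items():
    print(node_id, "->", identifier)
```

Under this reading the keys are node identifiers in the graph and the values are the identifiers asserted for the described things, so the two are kept apart rather than assumed equal.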

datadavev commented 10 months ago

Note that conceptually, the graph ex:URIforImageObjectNode and the document identified by ex:URIforTheMetadata (i.e. the document itself, not the ex:URIforDigitalDocumentNode node in the graph) fill the same role. They both describe the image identified by ex:URIforDescribedResource. In fact, the contents of the graph ex:URIforImageObjectNode would perhaps ideally be generated from the content of the document ex:URIforTheMetadata, since that document is presumably the authoritative source of information about the image ex:URIforDescribedResource.

smrgeoinfo commented 10 months ago

I don't get the distinction between the 'graph' and the 'document'. In my understanding the digitalDocument (in my example) is the metadata describing the resource (image object in the example). This metadata is represented using RDF, a logical graph. The document and graph are the same thing.

The resource (image object) is described by RDF triples: statements in which the image is the subject, some property is the predicate, and a value is the object. I expect the URI for the described resource to be the subject of these statements.

Converting the JSON-LD in @datadavev's example:

<ex:URIforDigitalDocumentNode> sdo:about <ex:URIforImageObjectNode> .
<ex:URIforDigitalDocumentNode> sdo:dateModified "2017-05-23"^^sdo:Date .
<ex:URIforDigitalDocumentNode> sdo:encoding _:b0 .
<ex:URIforDigitalDocumentNode> sdo:identifier "ex:URIforTheMetadata" .
<ex:URIforDigitalDocumentNode> rdf:type sdo:DigitalDocument .
<ex:URIforImageObjectNode> sdo:description "Description of the resource" .
<ex:URIforImageObjectNode> sdo:identifier "ex:URIforDescribedResource" .
<ex:URIforImageObjectNode> sdo:subjectOf <ex:URIforDigitalDocumentNode> .
<ex:URIforImageObjectNode> sdo:title "Picture of analytical setup" .
<ex:URIforImageObjectNode> rdf:type sdo:ImageObject .
_:b0 dcterms:conformsTo "https://example.org/cdif-metadataSpec" .
_:b0 rdf:type sdo:MediaObject .

What does ex:URIforImageObjectNode actually identify? Do the statements about ImageObjectNode make sense? The digitalDocumentNode is the metadata record, a digital document.

smrgeoinfo commented 10 months ago

Here are the triples for my example 2 (syntax buggered up to make the lines shorter):

<ex:URIforDescribedResource> sdo:description "Description of the resource" .
<ex:URIforDescribedResource> sdo:subjectOf <ex:URIforTheMetadata> .
<ex:URIforDescribedResource> sdo:title "Picture of analytical setup" .
<ex:URIforDescribedResource> rdf:type sdo:ImageObject .
<ex:URIforTheMetadata> sdo:about <ex:URIforDescribedResource> .
<ex:URIforTheMetadata> sdo:dateModified "2017-05-23"^^sdo:Date .
<ex:URIforTheMetadata> rdf:type sdo:DigitalDocument .
<ex:URIforTheMetadata> sdo:encoding _:b0 .
_:b0 dcterms:conformsTo "https://example.org/cdif-metadataSpec" .
_:b0 rdf:type sdo:MediaObject .

This seems a lot clearer to me: the graph nodes have the same identifier as the thing they represent.
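The claim that both serializations yield the same statements can be checked with a rough Python sketch. The helper `naive_triples` is hypothetical and deliberately minimal: it assumes every node has an @id and that values are strings or nested nodes, it ignores datatypes such as sdo:Date, and the encoding blank node is left out. It is not a substitute for a JSON-LD to RDF conversion.

```python
# Option 1 and option 2 from the original post, without the encoding node.
option1 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "title": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "about": {"@id": "ex:URIforDescribedResource"},
    },
}
option2 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "title": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"},
    },
}


def naive_triples(node, triples=None):
    """Collect (subject, predicate, object) tuples from nested nodes."""
    if triples is None:
        triples = set()
    subject = node["@id"]
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        if key == "@type":
            triples.add((subject, "rdf:type", "sdo:" + value))
        elif isinstance(value, dict):
            # Nested node or bare reference: the object is its @id.
            triples.add((subject, "sdo:" + key, value["@id"]))
            naive_triples(value, triples)
        else:
            # Quote literals so they cannot be confused with identifiers.
            triples.add((subject, "sdo:" + key, f'"{value}"'))
    return triples


triples1 = naive_triples(option1)
triples2 = naive_triples(option2)
print(triples1 == triples2)
for triple in sorted(triples1):
    print(*triple, ".")
```

Because the result is a set keyed on node identifiers, the nesting order of the JSON has no effect on it, which is why the choice of root only matters to consumers that parse the metadata as JSON.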

datadavev commented 10 months ago

Sure, you could do that, but I think it is incorrect to always infer that @id has the same purpose as schema:identifier. To me at least it is much clearer that @id refers to statements about a thing and schema:identifier refers specifically to the thing.

datadavev commented 10 months ago

@smrgeoinfo In your example, what is returned when resolving ex:URIforDescribedResource?

smrgeoinfo commented 10 months ago

I'd argue that what you get when you resolve a URI depends on what it identifies, and the conventions of the identifier scheme. In the example above, the resource is typed as an sdo:ImageObject, defined as "An image file", so the resource the URI identifies is a Digital Object. In general, I'd expect the default URI resolution to get that digital object. Content negotiation or signposting links might provide access to metadata. The metadata example doesn't include any distribution information, but the convention I like is that if the metadata is about a DigitalObject, then the sdo:URL in the 'about' section would get that digitalObject.

Things are much more interesting if the metadata is about a non-digital object that might have multiple digital representations. Then the distribution section is critical.

datadavev commented 10 months ago

So I managed to get myself confused about a subject of the graph (i.e. ex:URIforDescribedResource) and the graph itself (i.e. the entire JSON-LD document). Steve is correct that the @id property (the Node ID) is the subject of the various statements contained therein.

Perhaps one approach is to consider that we are making statements about the graph ex:URIforDescribedResource. That is, we want to make statements about the graph that describes the described resource. The JSON-LD spec § 4.9 Named Graphs describes this scenario, and following that pattern the structure would be like:

{    
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": {
        "@type": "MediaObject",
        "dcterms:conformsTo": "https://example.org/cdif-metadataSpec"
    },
    "about":{"@id":"ex:URIforDescribedResource"},
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "ImageObject",
            "title": "Picture of analytical setup",
            "description": "Description of the resource",
            "subjectOf": {"@id": "ex:URIforTheMetadata"}
        }
    ]
}

This results in quad statements like:

<ex:URIforDescribedResource> <http://schema.org/description> "Description of the resource" <ex:URIforTheMetadata> .
<ex:URIforDescribedResource> <http://schema.org/subjectOf> <ex:URIforTheMetadata> <ex:URIforTheMetadata> .
<ex:URIforDescribedResource> <http://schema.org/title> "Picture of analytical setup" <ex:URIforTheMetadata> .
<ex:URIforDescribedResource> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> <ex:URIforTheMetadata> .
<ex:URIforTheMetadata> <http://schema.org/about> <ex:URIforDescribedResource> .
<ex:URIforTheMetadata> <http://schema.org/dateModified> "2017-05-23"^^<http://schema.org/Date> .
<ex:URIforTheMetadata> <http://schema.org/encoding> _:b0 .
<ex:URIforTheMetadata> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/DigitalDocument> .
_:b0 <http://purl.org/dc/terms/conformsTo> "https://example.org/cdif-metadataSpec" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/MediaObject> .

This establishes that statements about ex:URIforDescribedResource are being made in the context of the named graph ex:URIforTheMetadata, and that the named graph has some properties describing it (e.g. http://schema.org/dateModified).
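The split between default-graph and named-graph statements can be sketched in plain Python. The helper `naive_quads` is hypothetical and handles only this shape: an outer object with an @id and an @graph array of nodes that each have an @id, with string values or references. The encoding blank node is omitted, and property names are left unexpanded.

```python
doc = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "about": {"@id": "ex:URIforDescribedResource"},
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "ImageObject",
            "title": "Picture of analytical setup",
            "description": "Description of the resource",
            "subjectOf": {"@id": "ex:URIforTheMetadata"},
        }
    ],
}


def naive_quads(document):
    """Return (subject, predicate, object, graph) tuples.

    Properties of the outer object go in the default graph (graph is None);
    nodes listed under @graph go in the graph named by the outer @id.
    """
    quads = []

    def emit(node, graph):
        subject = node["@id"]
        for key, value in node.items():
            if key in ("@context", "@id", "@graph"):
                continue
            obj = value["@id"] if isinstance(value, dict) else value
            quads.append((subject, key, obj, graph))

    emit(document, None)
    for member in document.get("@graph", []):
        emit(member, document["@id"])
    return quads


quads = naive_quads(doc)
for quad in quads:
    print(quad)
```

Statements about the image carry the graph name ex:URIforTheMetadata, while statements about the record itself carry none, mirroring the quad listing above. Note the assumption that the outer object has an @id: an outer object holding only @context and @graph denotes the default graph rather than a named one.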

I don't think there are defined semantics for the intent of a named graph other than providing a context for statements to be made about the container of the graphs. Using the so:about and so:subjectOf statements makes the relationship between the two subjects clearer.

Does this approach provide any benefit over the arguably simpler alternative construct?

{
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": {
        "@type": "MediaObject",
        "dcterms:conformsTo": "https://example.org/cdif-metadataSpec"
    },
    "about":{
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "title": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"}
    }
}

<ex:URIforDescribedResource> <http://schema.org/description> "Description of the resource" .
<ex:URIforDescribedResource> <http://schema.org/subjectOf> <ex:URIforTheMetadata> .
<ex:URIforDescribedResource> <http://schema.org/title> "Picture of analytical setup" .
<ex:URIforDescribedResource> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/ImageObject> .
<ex:URIforTheMetadata> <http://schema.org/about> <ex:URIforDescribedResource> .
<ex:URIforTheMetadata> <http://schema.org/dateModified> "2017-05-23"^^<http://schema.org/Date> .
<ex:URIforTheMetadata> <http://schema.org/encoding> _:b0 .
<ex:URIforTheMetadata> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/DigitalDocument> .
_:b0 <http://purl.org/dc/terms/conformsTo> "https://example.org/cdif-metadataSpec" .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/MediaObject> .

ksonda commented 10 months ago

The latter approach is for the most part what we do in https://geoconnex.us, which is based on SELFIE

smrgeoinfo commented 10 months ago

After discussion at the monthly group meeting, I'll edit the guide text and create a PR, based on the option 2 approach (same as the second approach in @datadavev's comment above). I think it should go in GETTING-STARTED.md because it applies to any SOSO metadata.