Cross-Domain-Interoperability-Framework / Discovery

Repository for work on CDIF Discoverability workstream
Creative Commons Zero v1.0 Universal

metadata about the metadata record-- subjectOf, about.... #13

Open smrgeoinfo opened 1 month ago

smrgeoinfo commented 1 month ago

The CDIF discovery document asserts that a metadata record has two parts: one part is about the metadata record itself, the other part is the content about the resource that the metadata documents. The part about the record specifies the identifier for the metadata record, agents with responsibility for the record, when it was last updated, what specification or profiles the metadata serialization conforms to, and other optional properties of the metadata that are deemed useful.

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata: sdDatePublished, sdLicense, sdPublisher. It lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself.

smrgeoinfo commented 1 month ago

There are two patterns that could be used to structure the two parts of the metadata record:

Option 1. The root object is the described resource:

{   "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": {"@id":"ex:cdif-metadataSpec"}
          },
        "about":{"@id":"ex:URIforDescribedResource"}
    }  }

Option 2. The root object is the metadata record:

{   "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": {
          "@type": "MediaObject",
          "dcterms:conformsTo": {"@id":"ex:cdif-metadataSpec"}
          },
    "about": {
         "@id": "ex:URIforDescribedResource",
         "@type": "ImageObject",
         "identifier":"identifier for thing in the world (e.g. doi)",
         "name": "Picture of analytical setup",
         "description": "Description of the resource",
         "subjectOf":{"@id":"ex:URIforTheMetadata"}
       }   }

The RDF triples generated by these two approaches are identical, apart from the extra "identifier" property on the described resource in Option 2.
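A rough way to compare the two options without a full JSON-LD processor is to walk the nested objects and collect subject/property/value tuples. The sketch below is only an approximation: `to_triples` is a hypothetical helper, it ignores @context expansion, and it labels nodes without an @id by their parent and property. Under those assumptions the only difference is the extra identifier statement in Option 2.

```python
def to_triples(node, triples=None, blank_label="_:root"):
    """Collect (subject, property, value) tuples from nested JSON-LD-style dicts.

    Simplifications: @context is ignored (property names are not expanded),
    and a node without @id is labelled from its parent and property name.
    """
    if triples is None:
        triples = set()
    subject = node.get("@id", blank_label)
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                obj = item.get("@id", f"_:{subject}/{key}")
                triples.add((subject, key, obj))
                to_triples(item, triples, obj)
            else:
                triples.add((subject, key, item))
    return triples


encoding = {
    "@type": "MediaObject",
    "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"},
}

# Option 1: the root object is the described resource.
option1 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "encoding": encoding,
        "about": {"@id": "ex:URIforDescribedResource"},
    },
}

# Option 2: the root object is the metadata record.
option2 = {
    "@context": "https://schema.org",
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "encoding": encoding,
    "about": {
        "@id": "ex:URIforDescribedResource",
        "@type": "ImageObject",
        "identifier": "identifier for thing in the world (e.g. doi)",
        "name": "Picture of analytical setup",
        "description": "Description of the resource",
        "subjectOf": {"@id": "ex:URIforTheMetadata"},
    },
}

only_in_option1 = to_triples(option1) - to_triples(option2)
only_in_option2 = to_triples(option2) - to_triples(option1)
print(only_in_option1)
print(only_in_option2)
```

A real comparison would run both documents through a JSON-LD processor and compare the resulting N-Quads.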

smrgeoinfo commented 1 month ago

The metadata about the metadata is important in harvesting/federated catalog systems to keep track of where metadata came from, what format/profile it uses (harvesters need this to process it), and update dates. Many people use the approach in which the root of the schema.org record has "@id": "ex:URIforDescribedResource" (the first approach above); this is generally the information that goes into search indexes.

From that point of view, the first approach makes more sense because it follows common practice. The pitfall is that the 'subjectOf' property is widely used for all kinds of things, so care is necessary to find the 'subjectOf' that provides the 'self' information about the metadata record. I suggest modifying the above encoding to include a description string that clearly identifies the subjectOf link to the metadata digital document.

{   "@context": "https://schema.org",
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "name": "Picture of analytical setup",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforTheMetadata",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "description":"this metadata document",
        "encoding": {
            "@type": "MediaObject",
            "dcterms:conformsTo": {"@id":"ex:cdif-metadataSpec"}
          },
        "about":{"@id":"ex:URIforDescribedResource"}
    }  }

Including the 'about' property with the back link to ex:URIforDescribedResource is useful, but it could also be calculated as the inverse of the subjectOf property.
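As a minimal sketch of how a harvester might apply this, the code below picks out the subjectOf entry that describes the record itself and fills in 'about' as the inverse of subjectOf when it is absent. The helper name `find_self_metadata`, the matching rule (type plus the marker description string), and the second subjectOf entry are assumptions for illustration, not part of any CDIF specification.

```python
SELF_DESCRIPTION = "this metadata document"  # marker string from the example above


def find_self_metadata(record):
    """Return the subjectOf entry describing the record itself, or None."""
    entries = record.get("subjectOf", [])
    if isinstance(entries, dict):
        entries = [entries]
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        if (entry.get("@type") == "DigitalDocument"
                and entry.get("description") == SELF_DESCRIPTION):
            # 'about' is the inverse of 'subjectOf': fill it in when it is absent.
            about = entry.get("about", {"@id": record.get("@id")})
            return {**entry, "about": about}
    return None


record = {
    "@id": "ex:URIforDescribedResource",
    "@type": "ImageObject",
    "subjectOf": [
        # An unrelated use of subjectOf (hypothetical), which must be skipped.
        {"@type": "ScholarlyArticle", "name": "A paper that uses the image"},
        {
            "@id": "ex:URIforTheMetadata",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "description": "this metadata document",
        },
    ],
}

self_metadata = find_self_metadata(record)
print(self_metadata["@id"], self_metadata["about"])
```

Matching on a free-text description is fragile; a typed marker would be more robust, but the string match follows the suggestion as written.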

pbuttigieg commented 1 month ago

xref WorldFAIR D11.3, Section 1.2

In the JSON-LD specification, the @id is understood as a reference node in a graph, which has a value that leads to a serialisation of a metadata graph (i.e. the JSON-LD document itself). The example given is an @id value of http://me.markus-lanthaler.com/

That URL resolves to a JSON-LD document that (as of the timestamp on this comment), is:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "image": {
      "@type": "@id"
    }
  },
  "@id": "http://me.markus-lanthaler.com/",
  "@type": "Person",
  "name": "Markus Lanthaler",
  "honorificPrefix": "Dr.",
  "image": "http://www.markus-lanthaler.com/images/markus-lanthaler.jpg",
  "url": "http://www.markus-lanthaler.com/",
  "nationality": {
    "@type": "Country",
    "name": "Italy"
  },
  "jobTitle": "Software Engineer",
  "affiliation": {
    "@type": "Organization",
    "name": "Google",
    "url": "http://www.google.com/"
  },
  "worksFor": {
    "@type": "Organization",
    "name": "Google",
    "url": "http://www.google.com/"
  },
  "sameAs": [
    "https://twitter.com/MarkusLanthaler",
    "https://www.google.com/+MarkusLanthaler",
    "https://www.linkedin.com/in/markuslanthaler"
  ]
}

This example uses schema.org semantics, and is typed as @Person. The @type and @id are JSON-LD parameters, that give way to the semantics of the @context once you're in the graph defined by the JSON-LD file. This is the approach that ODIS uses, which does not conform to either option above, and should be offered as another option.

To identify the resource described by the metadata, the schema:identifier property should be used. This is a lean way to discover and combine metadata graphs without either a) having to disentangle the identifier of the metadata from the entity described or b) having to - persistently, potentially hundreds of thousands of times - strip away content that is not about the entities of interest (the DigitalDocument encapsulation in Option 2).

In the schema.org world, metadata about the metadata record itself can be provided by properties like schema:sdPublisher or using additionalProperty properties with PropertyValue types as values. However, in a recent CDIF WG call, @smrgeoinfo correctly raised that the use of schema.org's sd properties steps outside the logic above: they are not about the typed entity described by the subgraph, but about the metadata record (the JSON-LD file, in this case). One has to "just know" that, as all other properties are about the typed entity described by the JSON-LD file.

This is resolvable by having another file (F2) containing metadata about the JSON-LD record above (F1). If one wants to serialise F2 in JSON-LD with schema.org semantics, one has to be a little careful.

F2 would look something like this (with as many properties from Dataset as one needs; note values are fictional):

{
  "@context": {
    "@vocab": "http://schema.org/"
  },
  "@id": "http://metadata.about.me.markus-lanthaler.com/",
  "@type": "Dataset",
  "name": "Metadata about the Person, Markus Lanthaler, in JSON-LD",
  "identifier": "http://me.markus-lanthaler.com/",
  "encodingFormat": "application/ld+json",
  "dateCreated": "2024-02-13",
  "dateModified": "2024-05-23",
  "datePublished": "2024-05-23",
  "creator": {
    "@type": "Organization",
    "name": "Some Org",
    "url": "http://www.someorg.com/"
  },
  "maintainer": {
    "@type": "Organization",
    "name": "Some Org",
    "url": "http://www.someorg.com/"
  }
}

Naturally, this could end up being metadata inception, so one has to draw the line on explicit representation of metadata about (meta)data somewhere and then rely on, e.g., a file system's native metadata functions.

smrgeoinfo commented 1 month ago

@pbuttigieg thanks for the response. I think the solution I suggested is pretty much equivalent to your F2, except I am proposing to include it inline. I don't see a big problem for processors who don't care about it to ignore it.

There are a couple of issues in F2. For instance, it generates this triple: <http://metadata.about.me.markus-lanthaler.com/> <http://schema.org/identifier> "http://me.markus-lanthaler.com/" . This says that http://me.markus-lanthaler.com/ identifies a metadata record, when I think the intention is that it identify a person.

What is missing is the 'about' link from F2 to the node it describes, which is the "about":{"@id":"ex:URIforDescribedResource"} statement in my suggestion. I'd expect "about":{"@id":"http://me.markus-lanthaler.com/"} in the JSON-LD.

I was proposing a more specific statement of the encoding format, pointing to a specific profile. In the wild, it would probably be useful to make this an array including generic and specific formats along the lines of [json, json-ld, specificprofile]
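One possible reading of the [json, json-ld, specificprofile] idea, sketched under stated assumptions: the metadata node lists its conformance targets from generic to specific, and a harvester picks the most specific one it supports. The identifiers ex:json and ex:json-ld and the helper `pick_profile` are hypothetical placeholders; only ex:cdif-metadataSpec comes from the examples above. JSON-LD arrays are unordered unless declared as @list, so relying on the order is itself an assumption of this sketch.

```python
metadata_node = {
    "@id": "ex:URIforTheMetadata",
    "@type": "DigitalDocument",
    # Listed from generic to specific (an assumed convention).
    "dcterms:conformsTo": [
        {"@id": "ex:json"},
        {"@id": "ex:json-ld"},
        {"@id": "ex:cdif-metadataSpec"},
    ],
}


def pick_profile(node, supported):
    """Return the last (most specific) declared profile the harvester supports."""
    declared = node.get("dcterms:conformsTo", [])
    if isinstance(declared, dict):
        declared = [declared]
    ids = [d["@id"] for d in declared if isinstance(d, dict) and "@id" in d]
    for profile in reversed(ids):
        if profile in supported:
            return profile
    return None


# A generic harvester falls back to plain JSON-LD handling.
generic_choice = pick_profile(metadata_node, {"ex:json", "ex:json-ld"})
# A CDIF-aware harvester can use the specific profile.
cdif_choice = pick_profile(metadata_node, {"ex:json-ld", "ex:cdif-metadataSpec"})
print(generic_choice, cdif_choice)
```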

I also propose that the metadata record is better represented as a "DigitalDocument" than "Dataset", since it's a single digital object.

Perhaps using "name": "Metadata about ..." instead of "description": "Description of the resource" is a better approach. Thoughts?

aside: what is a 'node'-- when I parse https://www.w3.org/TR/rdf11-concepts/#dfn-node literally, I end up with: ...a node is a subject or object... a subject is an IRI or a blank node; an object is an IRI, a literal or a blank node. Alternate interpretations -- is the node the IRI (string), blank node (digital object), literal (string), or the thing those symbols represent; does the IRI, blank node, or literal represent a thing in the world or a digital representation (one of many possible) for that thing? If we're not clear which interpretation is under discussion, misunderstanding results.

pbuttigieg commented 1 month ago

> @pbuttigieg thanks for the response. I think the solution I suggested is pretty much equivalent to your F2, except I am proposing to include it inline.

I'm not sure it is - the way the aboutness is handled is quite different.

The A2 sub-principle of the FAIR principles states that metadata should remain accessible even when the data they describe are no longer available. Thus ODIS will recommend keeping (meta)metadata separate.

> I don't see a big problem for processors who don't care about it to ignore it.

I do. It seems like an entirely avoidable issue, and at scale it's many operations, thus avoidable energy use. At any rate, CDIF guidance shouldn't prescribe one or the other.

> There are a couple of issues in F2. For instance, it generates this triple: <http://metadata.about.me.markus-lanthaler.com/> <http://schema.org/identifier> "http://me.markus-lanthaler.com/" . This says that http://me.markus-lanthaler.com/ identifies a metadata record, when I think the intention is that it identify a person.

I think that's consistent: the identifier value space is different from the @id value space. This is consistent with the RDF guidance you pasted below.

> What is missing is the 'about' link from F2 to the node it describes, which is the "about":{"@id":"ex:URIforDescribedResource"}

As mentioned and explained above, I think this isn't correct. F2 isn't about the described resource.

> statement in my suggestion. I'd expect "about":{"@id":"http://me.markus-lanthaler.com/"} in the JSON-LD.

As above, I don't think that's right. F2 is about F1, not the thing F1 describes.

> I was proposing a more specific statement of the encoding format, pointing to a specific profile. In the wild, it would probably be useful to make this an array including generic and specific formats along the lines of [json, json-ld, specificprofile]

I'm not sure what is meant here.

> I also propose that the metadata record is better represented as a "DigitalDocument" than "Dataset", since it's a single digital object.

Both work; Dataset is more useful and accurate IMO.

> Perhaps using "name": "Metadata about ..." instead of "description": "Description of the resource" is a better approach. Thoughts?

Risky - names are fickle.

> aside: what is a 'node'-- when I parse https://www.w3.org/TR/rdf11-concepts/#dfn-node literally, I end up with: ...a node is a subject or object... a subject is an IRI or a blank node; an object is an IRI, a literal or a blank node. Alternate interpretations -- is the node the IRI (string), blank node (digital object), literal (string), or the thing those symbols represent; does the IRI, blank node, or literal represent a thing in the world or a digital representation (one of many possible) for that thing? If we're not clear which interpretation is under discussion, misunderstanding results.

I'll think a bit more, but my first approximation is that it's a lot about the predicate.

smrgeoinfo commented 4 weeks ago

Next iteration. @id can identify a JSON-LD object or the thing that that object is about. It's ambiguous as defined in the various specs. Proposed CDIF solution:

Convention - include schema:identifier property that identifies a thing in the world that is the subject of a JSON-LD graph node. First guess default is then that @id identifies the 'representation' -- the JSON object that contains the @id element.

Longer explanation from draft CDIF handbook:

In a harvesting/federated catalog system, some metadata about the metadata is important to keep track of where metadata came from, what format/profile it uses (harvesters need this to process it), and update dates (see Metadata Content Requirements). Unambiguous expression of this information requires making statements about a metadata record distinct from the thing in the world that the metadata describes (see GitHub issues 1, 2). In an RDF framework, this requires a distinct identifier for the metadata record object that will serve as the subject for these triples.

Schema.org includes several properties that can be used to embed information about the metadata record in the resource metadata: sdDatePublished, sdLicense, sdPublisher. It lacks a way to provide an identifier for the metadata record distinct from the resource it describes, to specify agents responsible for the metadata other than the publisher, or to assert specification or profile conformance for the metadata record itself.

In the RDF serialization, Schema.org metadata records are JSON-LD node objects, and include an "@id" keyword with a value that identifies the node. This identifier can be interpreted to represent a thing in the world that the metadata record (the 'node') is about, or to represent the metadata record (a JSON object) itself. Here is a short example record (other '@' properties are explained below):

{   "@context": "https://schema.org",
    "@id": "ex:URIforResource",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    "dateModified": "2017-05-23"
}

When this JSON-LD is converted to RDF triples (e.g. using the JSON-LD playground), the result is:

<ex:URIforResource> <http://schema.org/description> "Description of the resource" .
<ex:URIforResource> <http://schema.org/name> "unique title for the resource" .
<ex:URIforResource> <http://schema.org/dateModified> "2017-05-23"^^<http://schema.org/Date> .

The first two triples would be interpreted as statements about the thing in the world that the metadata record is about. The third triple is ambiguous: was the metadata content modified, or the described resource in the world? There does not seem to be any recognized best practice or consensus for dealing with this issue, so CDIF defines these conventions.

Use the schema.org identifier property to identify a thing in the world that is the subject of the JSON-LD node. The identified thing might be physical, imaginary, abstract, or a digital object. The JSON-LD @id property identifies a node in a graph, and can be interpreted in different ways; as a URI it is expected to dereference to produce the same JSON-LD object in which it is defined. Given this convention, when the metadata record is processed, the processor should use the schema:identifier as subject of triples about the subject of the metadata record to avoid ambiguity. In addition, this convention would suggest that if a schema:identifier property is present, the @id property should be interpreted to identify the JSON object that is the representation of the node in the knowledge graph.
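A minimal sketch of how a processor might apply this convention, assuming a plain string value for schema:identifier (real records may use PropertyValue objects or arrays, which this does not handle). The helper names `resource_subject` and `record_object_id` are hypothetical.

```python
def resource_subject(node):
    """Subject to use for statements about the described thing in the world.

    Prefers schema:identifier; falls back to @id when no identifier is given.
    """
    identifier = node.get("identifier")
    if isinstance(identifier, str):
        return identifier
    return node.get("@id")


def record_object_id(node):
    """Identifier of the JSON object itself, or None when that is ambiguous.

    Under the convention, @id is read as identifying the JSON object only
    when a schema:identifier is also present.
    """
    if isinstance(node.get("identifier"), str):
        return node.get("@id")
    return None


with_identifier = {
    "@id": "ex:URIforNode1",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
}
without_identifier = {
    "@id": "ex:URIforResource",
    "name": "unique title for the resource",
}

print(resource_subject(with_identifier), record_object_id(with_identifier))
print(resource_subject(without_identifier), record_object_id(without_identifier))
```

In the second record there is no way to tell what @id refers to, so the sketch reports the record identifier as unknown rather than guessing.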

Statements about the metadata record as a distinct entity should be made using a separate identified node object. This node object can be embedded in the metadata record about the resource in the world (Example 1 below), or published as a separate node (Example 2 below).

{   "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex":"https://example.com/99152/"
        }
    ],
    "@id": "ex:URIforNode1",
    "@type": "appropriate schema.org type",
    "identifier":"ex:URIforDescribedResource",
    "name": "unique title for the resource",
    "description": "Description of the resource",
    "subjectOf": {
        "@id": "ex:URIforNode2",
        "@type": "DigitalDocument",
        "dateModified": "2017-05-23",
        "identifier":"ex:URIforNode1",
        "description":"metadata about documentation for ex:URIforDescribedResource",
        "dcterms:conformsTo": {"@id":"ex:cdif-metadataSpec"}
    }
}

Example 1. Metadata about the metadata embedded.

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforNode1",
            "@type": "Dataset",
            "identifier": "ex:URIforDescribedResource",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "identifier": "ex:URIforNode1",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        }
    ]
}

Example 2. Metadata about metadata as a separate graph node.

Including the schema:description with the string "metadata about documentation for ex:URIforDescribedResource" will allow disambiguating different usages of the subjectOf property. The ex namespace in the example above is only included so the example is valid; actual metadata would likely have its own namespace for resource and metadata URIs. The distinct identifier for the metadata record (ex:URIforNode1) allows statements to be made about the metadata separately from statements about the resource it describes.
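To illustrate that a harvester can treat both layouts the same way, here is a rough sketch that returns the resource node and the metadata-about-metadata node from either an embedded record (Example 1) or a @graph record (Example 2). The function name `split_record` and the matching rule (a DigitalDocument whose identifier equals the @id of another node) are assumptions drawn from the examples, not a defined CDIF algorithm.

```python
def split_record(doc):
    """Return (resource_node, metadata_node) for embedded or @graph layouts."""
    nodes = doc.get("@graph", [doc])
    # Embedded layout: the metadata node sits under subjectOf.
    for node in nodes:
        embedded = node.get("subjectOf")
        if isinstance(embedded, dict) and embedded.get("identifier") == node.get("@id"):
            return node, embedded
    # @graph layout: match a DigitalDocument to the node its identifier names.
    by_id = {n["@id"]: n for n in nodes if "@id" in n}
    for node in nodes:
        ident = node.get("identifier")
        target = by_id.get(ident) if isinstance(ident, str) else None
        if node.get("@type") == "DigitalDocument" and target is not None:
            return target, node
    return None, None


metadata_node = {
    "@id": "ex:URIforNode2",
    "@type": "DigitalDocument",
    "dateModified": "2017-05-23",
    "identifier": "ex:URIforNode1",
    "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"},
}
resource_fields = {
    "@id": "ex:URIforNode1",
    "@type": "Dataset",
    "identifier": "ex:URIforDescribedResource",
    "name": "unique title for the resource",
}

embedded_doc = {**resource_fields, "subjectOf": metadata_node}
graph_doc = {"@graph": [resource_fields, metadata_node]}

for doc in (embedded_doc, graph_doc):
    resource, metadata = split_record(doc)
    print(resource["@id"], metadata["@id"], metadata["dateModified"])
```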

smrgeoinfo commented 3 weeks ago

another possible solution:

{
    "@context": [
        "https://schema.org",
        {"dcterms": "http://purl.org/dc/terms/",
         "ex": "https://example.com/99152/"}
    ],
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "Dataset",
            "name": "unique title for the resource",
            "description": "Description of the resource"
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "dateModified": "2017-05-23",
            "url": "ex:URIforDescribedResource",
            "description": "metadata about documentation for ex:URIforDescribedResource",
            "dcterms:conformsTo": {"@id": "ex:cdif-metadataSpec"}
        }
    ]
}

The metadata record can use @id with the identifier for the described resource, so the generated triples with @id make sense. The node with information about the metadata record links to its target metadata using the sdo:url property, under the interpretation that dereferencing a node identifier should return the JSON object that has that @id. This seems less divergent from common usage than the sdo:identifier approach suggested above.

better yet instead of "url": ...

"about":{
     "@type":"DigitalDocument",
     "url":"ex:URIforDescribedResource"} 

seems clearer to me.

including an "additionalType":"ex:metadataDocumentation" or something like that in the metadata about metadata node would also help clarify things.
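Putting the last two suggestions together, a harvester could locate the metadata-about-metadata node by its additionalType and read its target from about.url. This is only a sketch: "ex:metadataDocumentation" is the placeholder value floated above, and `metadata_links` is a hypothetical helper.

```python
doc = {
    "@graph": [
        {
            "@id": "ex:URIforDescribedResource",
            "@type": "Dataset",
            "name": "unique title for the resource",
        },
        {
            "@id": "ex:URIforNode2",
            "@type": "DigitalDocument",
            "additionalType": "ex:metadataDocumentation",
            "dateModified": "2017-05-23",
            "about": {
                "@type": "DigitalDocument",
                "url": "ex:URIforDescribedResource",
            },
        },
    ]
}


def metadata_links(document, marker="ex:metadataDocumentation"):
    """Return (metadata node @id, documented record URL) pairs."""
    links = []
    for node in document.get("@graph", []):
        types = node.get("additionalType", [])
        if isinstance(types, str):
            types = [types]
        if marker not in types:
            continue
        about = node.get("about")
        target = about.get("url") if isinstance(about, dict) else None
        links.append((node.get("@id"), target))
    return links


links = metadata_links(doc)
print(links)
```

Selecting on a type marker avoids the free-text description match, at the cost of needing an agreed identifier for that type.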