ESIPFed / science-on-schema.org

science-on-schema.org - providing guidance for publishing schema.org as JSON-LD for the sciences
Apache License 2.0

Best practices for @ID usage for everything #48

Open rduerr opened 4 years ago

rduerr commented 4 years ago

At the Polar Data Forum it was suggested that a best practice guidance section on minimizing blank nodes by using @id properly for all @type declarations is needed.

rduerr commented 4 years ago

For example:

{
  "@context": { "@vocab": "http://schema.org/" },
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. "
}

should be something like:

{
  "@context": { "@vocab": "http://schema.org/" },
  "@type": "Dataset",
  "@id": "https://doi.org/10.0000/0000/plr-stf.123456",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "description": "This dataset includes results of laboratory experiments which measured dissolved organic carbon (DOC) usage by natural bacteria in seawater at different pCO2 levels. Included in this dataset are; bacterial abundance, total organic carbon (TOC), what DOC was added to the experiment, target pCO2 level. "
}

datadavev commented 4 years ago

See also #41

smrgeoinfo commented 4 years ago

The guidelines need to be clear that the @id identifies the object in the JSON-LD document, not a thing in the world that the JSON-LD is about.

dblodgett-usgs commented 4 years ago

How would you encode information about real world things then? Over in the SELFIE project, we are expecting a single @id as the subject of the whole document you get back after a 303 redirect.

For example, http://geoconnex.us/SELFIE/usgs/huc/huc12obs/070900020601 will resolve to a resource about that ID that doesn't reference itself at all. Maybe I'm missing the point of this issue?

smrgeoinfo commented 4 years ago

From https://json-ld.org/spec/latest/json-ld/#node-identifiers : In JSON-LD, a node is identified using the @id keyword, to be able to externally reference nodes in a graph. I understand this to mean @id is like the primary key on a database record. The node in the graph is a representation of some other thing (of type @type). In schema.org my suggestion would be to use schema:identifier to provide a URI for the thing, as opposed to the node about the thing. In the end, what @id and/or schema:identifier identify needs to be clearly explained in the documentation for your JSON-LD profile and understood by users of that profile.

datadavev commented 4 years ago

Yeah, I think the intent of #48 is to clarify that everything with a @type should also have an @id to avoid blank nodes in the json-ld graph (i.e. nodes without a URI), and so make parts of the graph reference-able from other graphs.

Agree that the use of @id and identifier needs to be clear. See also #13 and #41.
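For anyone wanting to check their own documents against this recommendation, here is a minimal sketch using only the Python standard library. It walks the compact JSON as plain data rather than running a real JSON-LD processor, so it assumes the keywords are not aliased in the context; the function name `find_typed_blank_nodes` and the sample document are made up for illustration.

```python
import json

def find_typed_blank_nodes(node, path="$"):
    """Yield (path, type) for every object that has @type but no @id."""
    if isinstance(node, dict):
        if "@value" in node:
            return  # typed literal such as {"@value": ..., "@type": ...}, not a node
        if "@type" in node and "@id" not in node:
            yield path, node["@type"]
        for key, value in node.items():
            if key == "@context":
                continue  # the context defines terms; it is not part of the graph
            yield from find_typed_blank_nodes(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_typed_blank_nodes(item, f"{path}[{i}]")

# Illustrative document: the Dataset has an @id, the nested identifier does not.
doc = json.loads("""
{
  "@context": {"@vocab": "http://schema.org/"},
  "@type": "Dataset",
  "@id": "https://doi.org/10.0000/0000/plr-stf.123456",
  "name": "Example dataset",
  "identifier": {
    "@type": "PropertyValue",
    "propertyID": "doi",
    "value": "10.0000/0000/plr-stf.123456"
  }
}
""")

blank_nodes = list(find_typed_blank_nodes(doc))
print(blank_nodes)  # [('$.identifier', 'PropertyValue')]
```

A report like this is only a prompt for review: whether a flagged node deserves a URI is still a judgment call.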

rduerr commented 4 years ago

datadavev has the original intent precisely! No blank nodes! The whole graph should be easily queryable (e.g., via SPARQL).

dr-shorthair commented 4 years ago

Not sure there can be a blanket ban on blank nodes. For example, qualified associations are typically a one-off.

dblodgett-usgs commented 4 years ago

Yeah. Seems heavy-handed to me. What if you want to associate to a dataset that doesn't have a linked data ID but is available "out of band" at some URL?

A blank node with information (name, etc.) and a schema:url property is really useful. Otherwise, we set up this pattern where people will put any old URL in as the @id and we'll have a mess of in-band and out-of-band linked data on our hands.

datadavev commented 4 years ago

Agreed that this can't be a requirement, more a recommendation. It would be absolutely useless to assign a bunch of short-lived or otherwise fragile URIs to the @ids just to meet a requirement. Note though that @id values can be relative URIs (e.g. ./metadata), which are then resolved relative to the absolute location of the document containing the JSON-LD.
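A rough illustration of that resolution, using `urllib.parse.urljoin` from the Python standard library, which follows the same RFC 3986 reference-resolution rules. The document URL is a hypothetical placeholder, `resolve_id` is just a helper name for this sketch, and it assumes the context does not set an explicit `@base` (which would take precedence over the document location).

```python
from urllib.parse import urljoin

def resolve_id(document_url, node_id):
    """Resolve an @id value against the URL the JSON-LD was retrieved from."""
    return urljoin(document_url, node_id)

# Hypothetical location of the JSON-LD document.
doc_url = "https://example.org/datasets/123/index.jsonld"

print(resolve_id(doc_url, "./metadata"))
# https://example.org/datasets/123/metadata

# Absolute identifiers are unaffected by the document location.
print(resolve_id(doc_url, "https://doi.org/10.0000/0000/plr-stf.123456"))
# https://doi.org/10.0000/0000/plr-stf.123456
```

The same relative @id expands to a different absolute URI whenever the document is served from a different location, which is why the longevity of the hosting location matters.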

Consideration should be given to the longevity of these values, especially when parts of a dataset may be used as a component of a workflow or some such. Will a reference to some component of a Dataset be valid 10 years from now? What happens when the host has gone through a few name changes and the data may be located in several different repositories?

datadavev commented 4 years ago

Perhaps a general recommendation could be to have the JSON-LD present at the location where the Dataset identifier resolves to, and relative URIs for components of the dataset (unless components also had reliable URIs)? Including the DOI, ARK or other persistent URI for the identifier would provide a stable reference point for clients. So then the guideline may be something like: if the component @id is a relative URI, then it should be resolvable relative to the identifier value of the root node of the graph.

Using Ruth's example from above, I think it would be something like:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "datacite": "http://purl.org/spar/datacite/"
  },
  "@type": "Dataset",
  "@id": "./",
  "identifier": {
    "@id":"./identifier",
    "propertyID":"datacite:doi",
    "value":"10.0000/0000/plr-stf.123456",
    "url": "https://doi.org/10.0000/0000/plr-stf.123456"
  },
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016"
}

Then a client could determine that regardless of where it actually retrieved the json-ld from, the reliable reference to it and its components is relative to the identifier url.

dblodgett-usgs commented 4 years ago

I see the merit to this approach but it kind of pushes the problem elsewhere. You still have to have whatever the . represents be resolvable to something sem-webby if it's going to be worth anything. Paste the above example into the playground and you get expanded @id like: "https://json-ld.org/playground/identifier"

I think the "no blank nodes" ideal is a good one but it should be an ideal not a requirement.

In the example above, we are saying here's a dataset we know something about. It has an identifier that doesn't have a semantic web identifier but we do know something about. That's a pretty valid and realistic way to go.

datadavev commented 4 years ago

The relative URI is a valid semantic web identifier. The URI is relative to the location of the document - since the document was retrieved to be viewed, it must therefore be resolved.

Should the host change in the future, it can be located from the url of the identifier. If that url is no longer valid, then it can be located from some resolution service that understands what to do with a datacite:doi with that value.

smrgeoinfo commented 4 years ago

I agree, as far as the @id to identify graph nodes that might or might not be reused, using relative URIs makes abundant sense. It gets more interesting with graph nodes that will likely be reused (e.g. organization, person...)

rduerr commented 4 years ago

@dblodgett-usgs I agree that there should be a difference between the specification (if that is what we are calling it) and the best practices. The best practices need to be much stricter than the specification if we are to facilitate the I and R of FAIR more generally. Otherwise, we end up trying to build systems that can deal with totally inhomogeneous metadata like the various ISO 19115 profiles + all the other metadata (non)standards.

@smrgeoinfo I agree with you, though the list of things that need to be reused also includes instruments, sensors, protocols, tools, APIs, etc. (all the things that the ESIP Research Object Citation cluster and the EarthCube Resource Registry are dealing with), which all need URIs. That would be the longer-term goal, even if some repositories haven't reached that state yet.

valentinedwv commented 1 year ago

Need a warning about reusing identifiers. I was trying to figure out why the @type definition was getting expanded to ['Dataset', 'PropertyValue'] when I flattened or framed a document.

It turns out the @id on Dataset and the @id on the identifier are the same. The main issue is that this makes finding information with JSON Path a pain (oh, and round-tripping jsonld > rdf > jsonld). All the identifier properties get moved up a level, which means no $.identifier with a propertyID or value is found; they are at the top level.

{
    "@context": {"@vocab":"https://schema.org/"},
    "@type": "Dataset",
    "version": "1.0",
    "additionalType": ["geolink:Dataset", "vivo:Dataset"],
    "name": "Poleta Folds, southern Deep Springs Valley, California",
    "alternateName": "POLETA",
    "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052015.32611.1",
    "identifier": {
        "@id": "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052015.32611.1",
        "@type": "PropertyValue",
        "propertyID": "opentopoID",
        "value": "OTLAS.052015.32611.1"
    }
}

becomes

{'@context': {'@vocab': 'https://schema.org/',
              'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
              'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
              'schema': 'https://schema.org/',
              'xsd': 'http://www.w3.org/2001/XMLSchema#'},
 '@id': 'https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052015.32611.1',
 '@type': ['Dataset', 'PropertyValue'],
 'additionalType': ['geolink:Dataset', 'vivo:Dataset']
}
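A collision like this can be caught before flattening or framing. The following is a minimal sketch using only the Python standard library; the function names are made up for illustration. It compares @id strings literally (no relative-URI resolution, no keyword aliases) and only flags an @id that appears with more than one distinct @type, so a reuse with identical types would slip through.

```python
from collections import defaultdict

def collect_types_by_id(node, found=None):
    """Map each @id in the document to the set of @type values declared with it."""
    if found is None:
        found = defaultdict(set)
    if isinstance(node, dict):
        node_id = node.get("@id")
        if node_id is not None and "@type" in node:
            types = node["@type"]
            found[node_id].update([types] if isinstance(types, str) else types)
        for key, value in node.items():
            if key != "@context":  # skip term definitions
                collect_types_by_id(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types_by_id(item, found)
    return found

def find_reused_ids(doc):
    """Return {@id: sorted types} for every @id declared with conflicting types."""
    return {i: sorted(t) for i, t in collect_types_by_id(doc).items() if len(t) > 1}

# Trimmed version of the document above: both nodes share one @id.
shared = "https://portal.opentopography.org/lidarDataset?opentopoID=OTLAS.052015.32611.1"
doc = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "@id": shared,
    "identifier": {
        "@id": shared,
        "@type": "PropertyValue",
        "propertyID": "opentopoID",
        "value": "OTLAS.052015.32611.1",
    },
}

print(find_reused_ids(doc))
```

Here the report contains the shared @id with both `Dataset` and `PropertyValue`, which matches the merged @type seen after flattening.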

0024e35144d902d8b413ffd400ede6a27efe2146.jsonld.txt 0024e35144d902d8b413ffd400ede6a27efe2146_orig.jsonld.txt