BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

Markup should state its conformance #297

Closed AlasdairGray closed 1 year ago

AlasdairGray commented 5 years ago

It would be good if our markup included a statement about what profile and version it conforms with. This would make it easier to maintain and validate versions.

There is no schema.org property for this. A property that could be used is dcterms:conformsTo.

This would apply to every profile and should be included in much the same way that @id and @type are used.
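
As a minimal sketch of what such a statement might look like, the Python snippet below builds one JSON-LD node that carries dct:conformsTo next to @id and @type. The node identifier, the helper name with_conformance, the dct prefix mapping in the context, and the version segment of the profile IRI are illustrative assumptions, not values taken from a released Bioschemas profile.

```python
import json

# Hypothetical profile IRI: the version segment is a placeholder, not a real release.
PROFILE_IRI = "https://bioschemas.org/profiles/Protein/EXAMPLE-VERSION/"


def with_conformance(node_id, node_type, profile_iri):
    """Return a JSON-LD node that states which profile it claims to conform to."""
    return {
        "@context": {
            "@vocab": "https://schema.org/",
            # Prefix for Dublin Core terms, so dct:conformsTo expands to a full IRI.
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": node_id,
        "@type": node_type,
        "dct:conformsTo": {"@id": profile_iri},
    }


# The node identifier below is made up for illustration.
node = with_conformance("https://example.org/protein/P12345", "Protein", PROFILE_IRI)
markup = json.dumps(node, indent=2)
print(markup)
```

Declaring the prefix in the context is only one option; spelling out the full property IRI would work as well.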

stain commented 3 years ago

I tried expanding on this for RO-Crate, but ended up in situations where dct:conformsTo becomes misleading.

For instance we have in https://www.researchobject.org/ro-crate/1.1/workflows.html#complying-with-bioschemas-computational-workflow-profile

{ "@id": "workflow/alignment.knime",  
  "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
  "conformsTo": 
    {"@id": "https://bioschemas.org/profiles/ComputationalWorkflow/0.5-DRAFT-2020_07_21/"},
  "..": ""
},
{
  "@id": "#36aadbd4-4a2d-4e33-83b4-0cbf6a6a8c5b",
  "@type": "FormalParameter",
  "conformsTo": 
    {"@id": "https://bioschemas.org/profiles/FormalParameter/0.1-DRAFT-2020_07_21/"},
  "..": ""
}

But when writing up how profiles should be indicated globally for RO-Crate and for other types of files in ResearchObject/ro-crate#154, it occurred to me that my use of conformsTo on the .knime file here is very misleading: if you download the file, it is just an XML file with no mention of schema.org at all.

This is perhaps less of an issue for abstract things like the formal parameter #36aadbd4-4a2d-4e33-83b4-0cbf6a6a8c5b, which only exists in the structured data. But for file-like resources one would expect to be able to use dct:conformsTo to describe the profile of the file itself, e.g.:

{
    "@id": "biosketch.docx",
    "@type": "File",
    "name": "NIH Biosketch for Alice W Land",
    "encodingFormat": [
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        {"@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/412"}
    ],
    "conformsTo": {"@id": "https://grants.nih.gov/grants/forms/biosketch.htm"}
},
{
    "@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/412",
    "@type": "WebSite",
    "name": "Microsoft Word for Windows 2007 onwards"
},
{
    "@id": "https://grants.nih.gov/grants/forms/biosketch.htm",
    "name": "Biosketch Format Pages, Instructions and Samples"
}

That's why I tried introducing sdConformsTo as a way to say that this particular RDF subject has structured data (here) that conforms to a particular profile. This is analogous to how https://schema.org/sdLicense differs from https://schema.org/license.

I now think it makes more sense to list all the profiles conformed to once, on the top-level subject of the page itself (@id: ""), and to avoid the very fine-grained detail of sdConformsTo or conformsTo on every @id in a page.

If a page describes multiple things, then it should be a CreativeWork that is about the other things; but in the more typical schema.org usage, the 'page' and the 'thing' are the same.

AlasdairGray commented 3 years ago

Agreed, it would be more convenient to have a schema.org property for this, such as sdConformsTo as you suggest. This would eliminate confusion over whether to use prefixes or full URLs.

AlasdairGray commented 3 years ago

I'm not sure I follow the argument for having all the declarations at the top level. I would think that any validation tool would need to know which profile a specific node in the graph is meant to conform with.

stain commented 3 years ago

> I'm not sure I follow the argument for having all the declarations at the top level. I would think that any validation tool would need to know which profile a specific node in the graph is meant to conform with.

Agreed on that. My thought is that it is at least simple to add multiple profiles to the top-level object, but then that object can only be a CreativeWork "metadata" HTML page that is about, say, a ChemicalSubstance. That forces a split between the information page and whatever it describes, which is not the typical schema.org approach (though it is appropriate where existing identifiers like ORCID and DOI are reused as @id; the page is then the implicit @id: "" for the current page).

For a non-information resource like ChemicalSubstance, it would perhaps make more sense to use conformsTo directly on the same object, as you can't download a substance. But then it would be weird to say the substance also complies with the DataCatalog profile just because a DataCatalog is mentioned deeper in the markup.

It could also be that one Bioschemas object has only a partial description of another (e.g. via isBasedOn), while only the top-level object is intended to comply with the profile; the consumer would then need to follow the link to see the conforming description of the second-level object.

So overall the cleanest option is a new sdConformsTo, which can be used consistently across all types. Information resources can then themselves follow other profiles for their particular file formats.

If schemaorg/schemaorg#1516 were to add both, then once Bioschemas is updated to reflect that, we could recommend sdConformsTo in all cases.

stain commented 3 years ago

schemaorg/schemaorg#2887 instead suggests a new structuredData property – I am now in favour of that instead of more sd* properties.

In that case there would be no conflicts. For instance, this Knime file is an XML file conforming to an XSD, but it also has structured data (this JSON-LD) that conforms to a Bioschemas profile.

{ "@id": "workflow/alignment.knime",  
  "@type": ["MediaObject", "SoftwareSourceCode", "ComputationalWorkflow"],
  "encodingFormat": "application/xml",
  "conformsTo": "http://www.knime.org/XMLConfig_2008_09.xsd",
  "structuredData": {
      "conformsTo": "https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE/"
  },
  "…": ""
},
{
  "@id": "#36aadbd4-4a2d-4e33-83b4-0cbf6a6a8c5b",
  "@type": "FormalParameter",
  "structuredData": {
      "conformsTo": "https://bioschemas.org/profiles/FormalParameter/1.0-RELEASE/"
  },
  "…": ""
}

The structuredData object (if accepted in schemaorg/schemaorg#2887) should also be useful for describing provenance of the metadata (isBasedOn), which may be applicable to many Bioschemas uses.
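
A rough sketch of how a consumer might read the two levels apart, assuming the structuredData property were accepted in the shape shown above (it is only a proposal in schemaorg/schemaorg#2887). The function name conformance and its return shape are hypothetical, chosen for illustration.

```python
def conformance(node):
    """Split conformance claims into (resource-level, structured-data-level) lists."""

    def as_list(value):
        # JSON-LD allows a single value or a list; normalise to a list.
        if value is None:
            return []
        return value if isinstance(value, list) else [value]

    def iri(value):
        # Accept both a plain string and an {"@id": ...} reference.
        return value["@id"] if isinstance(value, dict) else value

    # Claims about the resource itself, e.g. the XML file and its XSD.
    resource = [iri(v) for v in as_list(node.get("conformsTo"))]
    # Claims about the metadata describing the resource (proposed property).
    sd = node.get("structuredData") or {}
    structured = [iri(v) for v in as_list(sd.get("conformsTo"))]
    return resource, structured


workflow = {
    "@id": "workflow/alignment.knime",
    "@type": ["MediaObject", "SoftwareSourceCode", "ComputationalWorkflow"],
    "encodingFormat": "application/xml",
    "conformsTo": "http://www.knime.org/XMLConfig_2008_09.xsd",
    "structuredData": {
        "conformsTo": "https://bioschemas.org/profiles/ComputationalWorkflow/1.0-RELEASE/"
    },
}

file_claims, sd_claims = conformance(workflow)
print("file:", file_claims)
print("structured data:", sd_claims)
```

With this split, a validator for the XSD and a validator for the Bioschemas profile would each read only the list that concerns them.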

AlasdairGray commented 3 years ago

Related to #294

AlasdairGray commented 2 years ago

Rereading this, I don't agree with you @stain. I think regardless of whether we have sdConformsTo or dct:conformsTo, we need it on the resource that is identified and typed, at least in Bioschemas where we have profiles at a type level. For example, a page describing a protein would conform to the Protein Profile, but the description of the Genes that it encodes within the markup would conform to the Gene Profile. Validation tools need to know which profile a specific resource conforms to; otherwise they will be back to guessing.
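
To make the validation point concrete, here is a small sketch of how a tool might look up the profile for each node when the declaration sits on the node itself. The function name profiles_by_node, the node identifiers, and the version segments of the profile IRIs are all hypothetical; the sketch only reads the conformsTo key and does no actual validation.

```python
def profiles_by_node(graph):
    """Map each node's @id to the profile IRIs it declares via conformsTo.

    Returns (declared, undeclared): a dict of @id -> list of profile IRIs,
    and a list of @ids for nodes that carry no declaration at all.
    """
    declared, undeclared = {}, []
    for node in graph:
        claims = node.get("conformsTo")
        if claims is None:
            # Without a declaration the tool would have to guess from @type.
            undeclared.append(node.get("@id"))
            continue
        if not isinstance(claims, list):
            claims = [claims]
        declared[node.get("@id")] = [
            c["@id"] if isinstance(c, dict) else c for c in claims
        ]
    return declared, undeclared


# Hypothetical graph: placeholder identifiers and profile versions.
PROTEIN_PROFILE = "https://bioschemas.org/profiles/Protein/EXAMPLE-VERSION/"
GENE_PROFILE = "https://bioschemas.org/profiles/Gene/EXAMPLE-VERSION/"
graph = [
    {"@id": "#protein", "@type": "Protein", "conformsTo": {"@id": PROTEIN_PROFILE}},
    {"@id": "#gene", "@type": "Gene", "conformsTo": {"@id": GENE_PROFILE}},
    {"@id": "#organism", "@type": "Taxon"},
]

declared, undeclared = profiles_by_node(graph)
print(declared)
print(undeclared)
```

If the same two profiles were listed only on a top-level page object, the tool would know that both apply somewhere in the graph but not which node each one governs.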

stain commented 2 years ago

See also ResearchObject/ro-crate#187 where we suggest going back to having conformsTo directly on various @id objects within the same JSON-LD, even if this is a fudge.

This is a liberal interpretation of conformsTo as it is the structured data about the workflow (this JSON-LD object) that conforms to the ComputationalWorkflow profile, not the file content of a workflow data entity (workflow/alignment.knime). Instead of introducing a sdConformsTo similar to sdPublisher, we here follow the current Bioschemas convention of indicating profile conformance when the JSON-LD is embedded within HTML pages.

AlasdairGray commented 1 year ago

If an sdConformsTo property is added in the future, then we will reconsider our approach. For now we will continue with dct:conformsTo.