BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

What are the actual canonical URLs of terms in bioschemas? #653

Open ptsefton opened 10 months ago

ptsefton commented 10 months ago

In the bioschemas schema the @context has:

    "bioschemas": "https://discovery.biothings.io/view/bioschemas/",

And Classes and properties are defined below. Eg:


    {
      "@id": "bioschemas:ComputationalWorkflow",
      "@type": "rdfs:Class",
      "rdfs:comment": "A computational workflow ...",
      "rdfs:label": "ComputationalWorkflow",
      "rdfs:subClassOf": {
        "@id": "bioschemastypes:ComputationalWorkflow"
      },

Which would mean that the URL to use for ComputationalWorkflow should be https://discovery.biothings.io/view/bioschemas/ComputationalWorkflow, right? This does in fact resolve to some documentation as does the property https://discovery.biothings.io/view/bioschemas/input albeit not with a very good description of input.
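To make the expansion step explicit, here is a minimal Python sketch of how a compact IRI such as bioschemas:ComputationalWorkflow resolves against the @context prefix quoted above. The expand_compact_iri helper is hypothetical and handles only simple string prefixes; a real JSON-LD processor does considerably more.

```python
# Simplified illustration of JSON-LD compact IRI expansion.
# `expand_compact_iri` is a hypothetical helper, not part of any library.

CONTEXT = {
    "bioschemas": "https://discovery.biothings.io/view/bioschemas/",
}


def expand_compact_iri(value: str, context: dict) -> str:
    """Expand 'prefix:suffix' using a context of simple prefix mappings."""
    prefix, sep, suffix = value.partition(":")
    # Leave absolute IRIs (e.g. 'https://...') and unknown prefixes unchanged.
    if not sep or suffix.startswith("//") or prefix not in context:
        return value
    return context[prefix] + suffix


print(expand_compact_iri("bioschemas:ComputationalWorkflow", CONTEXT))
print(expand_compact_iri("bioschemas:input", CONTEXT))
```

Under this reading, every term in the schema file gets an identifier in the discovery.biothings.io namespace, whatever the bioschemas.org pages say.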

BUT this URL also resolves to some documentation: https://bioschemas.org/ComputationalWorkflow which says that the canonical URL for ComputationalWorkflow is https://bioschemas.org/ComputationalWorkflow though https://bioschemas.org/input doesn't resolve.

(I came here from the RO-Crate project and Crate-O, trying to sort out a bug related to this, which arose from what I think is an error in our default context, where input and output are linked to the ComputationalWorkflow page:

          "input": "https://bioschemas.org/ComputationalWorkflow#input",
          "output": "https://bioschemas.org/ComputationalWorkflow#output",

But a colleague, @alex-ip, working with the Bioschemas Schema definition (linked above), has been using https://discovery.biothings.io/view/bioschemas/input.)

Is there a standard context for bioschemas that includes all these terms as there is for schema.org and RO-Crate?

My colleague @alex-ip found this example: https://bio.tools/api/blast?format=jsonld

This uses the following @context definitions, based, I assume, on the assumption that the canonical URL for the schema is bioschemas.org:

        "bsc": "http://bioschemas.org/",
        "hasInputData": "bsc:input",

But as noted above http://bioschemas.org/input does not resolve.

So, what are the IDs of these Classes and Properties?

gtsueng commented 10 months ago

Bioschemas has Specifications (classes), which are either Profiles (which have additional property-related constraints) or Types (which do not). Hence, to resolve to the appropriate specification on the Bioschemas website, you would include that as part of the base URL (e.g. 'https://bioschemas.org/types/' or 'https://bioschemas.org/profiles/'). To my knowledge, the bioschemas site does not resolve properties, which made it tricky to generate the JSON file you referenced.

The schema (JSON) files for Bioschemas were generated using (and registered to) the Data Discovery Engine (DDE), hence the URLs that point there should resolve regardless of whether the term is a property or a class. That said, the tool has limitations: multiple classes with the same name are not allowed in the same namespace. Hence, only the latest version of a draft and the latest version of a release are available from the DDE site. This is also why there are multiple namespaces, resulting in multiple URLs on the DDE (e.g. /bioschemastypes/ for released Types, /bioschemas/ for released Profiles, and another set for drafts).

ptsefton commented 10 months ago

Thanks for getting back to me @gtsueng

For our current purposes I am not concerned with how to identify profiles, just Types (AKA Classes) and Properties.

My understanding is that Types in schema.org are essentially the same as Classes - the Schema.org schema defines the schema.org Types using rdfs:Class definitions, while Properties are called the same thing on both the Schema.org websites and in the schema definition (rdf:Property). In order for bioschemas to work in a linked data context and for people to use these schemas in the way they use Schema.org it has to be clear what the URL is that identifies each class and property. Ideally (IMO) these URLs should resolve to something useful, as they do in Schema.org but not all linked-data communities care about that -- some are happy to have URLs resolve to RDF documents or technical stuff.

You are correct that it is tricky to generate JSON-LD with these schemas, as it is not clear what implementers are supposed to do to generate bioschemas documents, but it is not possible to implement systems that create bioschemas markup without knowing the answer. How are your Types (Classes) and Properties identified using URLs? Is it as per the schema, which resolves to the DDE documentation, or as per the bioschemas website, which has Type (Class) documentation?

gtsueng commented 10 months ago

Since all the projects I work on also use the DDE, I usually resolve any classes or properties for those projects using the DDE.

ptsefton commented 10 months ago

For interop it's important that everyone does it the same way; otherwise there is no way to tell that two documents are using the same terms. This is fundamental to linked data and to the schema.org approach on which bioschemas is based.
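As a rough illustration of the interop problem, the sketch below collects the three IRIs that have appeared in this thread for the term input and compares them the way a linked-data consumer would, as plain strings. The dictionary keys are labels invented for this example.

```python
# Three IRIs used for "input" in this thread. The keys are descriptive labels
# made up for this example; the values are the IRIs quoted above.
CANDIDATE_IRIS = {
    "RO-Crate default context": "https://bioschemas.org/ComputationalWorkflow#input",
    "DDE schema namespace": "https://discovery.biothings.io/view/bioschemas/input",
    "bio.tools bsc prefix": "http://bioschemas.org/input",
}


def same_term(iri_a: str, iri_b: str) -> bool:
    """Linked-data consumers treat two terms as identical only if their IRIs match exactly."""
    return iri_a == iri_b


distinct = set(CANDIDATE_IRIS.values())
print(f"{len(CANDIDATE_IRIS)} producers, {len(distinct)} distinct IRIs for 'input'")
```

Because none of these strings match, a consumer has no way of knowing the three documents mean the same property unless it is given an explicit mapping.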

marco-brandizi commented 10 months ago

So far, I've just used https://bioschemas.org/$type, knowing that these URIs don't resolve to machine-readable data (RDF or JSON-LD) and assuming they eventually will go under the schema.org namespace.

This ticket tells me Bioschemas should decide on a policy about URIs and take corresponding actions, such as redirecting canonical URIs to the DDE.

I'm in favour of a simple canonical namespace like bioschemas.org, without paths for types/profiles/drafts/properties/etc., because the latter is more complicated to manage (you need to declare multiple namespaces every time and remember which one you need).

Applications can know exactly what a type is (including whether it's stable or draft) by resolving its URI. If one needs to refer to a given version of a type, we might have a canonical URI that always points to the latest (stable?) version, plus versioned URIs, e.g. bioschemas.org/ComputationalWorkflow -> bioschemas.org/ComputationalWorkflow/2.0, with bioschemas.org/ComputationalWorkflow/1.0 existing too.
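A minimal sketch of that redirect policy, assuming a hypothetical in-memory registry of released versions. The version numbers are the illustrative ones above, not actual Bioschemas releases, and the function names are made up.

```python
# Hypothetical registry: type name -> released versions (illustrative values only).
BASE = "https://bioschemas.org/"
VERSIONS = {"ComputationalWorkflow": ["1.0", "2.0"]}


def versioned_uri(type_name: str, version: str) -> str:
    """Build the versioned URI for a type, following the pattern proposed above."""
    return f"{BASE}{type_name}/{version}"


def resolve_canonical(type_name: str) -> str:
    """Return the versioned URI that the unversioned canonical URI would redirect to."""
    # Compare versions numerically so that, say, 10.0 sorts after 2.0.
    latest = max(
        VERSIONS[type_name],
        key=lambda v: tuple(int(part) for part in v.split(".")),
    )
    return versioned_uri(type_name, latest)


print(resolve_canonical("ComputationalWorkflow"))
```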

Obviously, I'm not saying anything new, similar policies have been applied for years in ontology and linked data projects.

stain commented 10 months ago

The use of https://discovery.biothings.io/ns/bioschemas/ as a temporary namespace seems to be a new thing due to how the DDE editor works, and should not be how Bioschemas' profiles are published. For one thing, this namespace is not under the control of the Bioschemas community, but of biothings.io. Secondly, if a PID is to be established, it should be by redirection from a PURL service, not by leading directly into the UI of however the service works today.

For compatibility with schema.org I would also have expected https://bioschemas.org/input etc. for the properties, but in reality these property links don't work, only for the types.

Some of the types HTML (but not ComputationalWorkflow) do have their id=property HTML tags, so for instance https://bioschemas.org/BioSample#custodian works as you would expect, going to the right row in the table.

Types and properties should not be versioned in their PID, because a 1.0 ComputationalWorkflow is semantically also a 2.0 ComputationalWorkflow - however a profile from conformsTo would show the version of conformance.

Some properties are shared by multiple types. For instance, BioSample extends BioChemEntity, which before it was merged into schema.org proper would have had properties like https://bioschemas.org/BioChemEntity#associatedDisease (now http://schema.org/associatedDisease). But few would probably set up their @context correctly, as there is no common JSON-LD context for Bioschemas, so anyone using the non-settled types will invariably do it in many different ways, as it's not yet documented which URIs they have.

marco-brandizi commented 10 months ago

For compatibility with schema.org I would also have expected https://bioschemas.org/input etc. for the properties, but in reality these property links don't work, only for the types.

I can't see the problem: if Bioschemas needs to add a new property, it can adopt the https://bioschemas.org/<propertyName> pattern, like the classes, and things can be set up so that data about the property or HTML about it is returned (as usually, via content negotiation).
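For illustration, the content-negotiation step could look roughly like the following sketch, which picks a representation from the Accept header. The function name and the set of supported media types are assumptions, and wildcards such as */* are not handled; they simply fall back to the default.

```python
# Media types a term URI might serve; this list is an assumption for the example.
SUPPORTED = ("text/html", "application/ld+json")


def choose_representation(accept_header: str, default: str = "text/html") -> str:
    """Pick the supported media type with the highest q-value in an Accept header."""
    best_type, best_q = default, 0.0
    for part in accept_header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0  # q defaults to 1.0 when not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        # Earlier entries win ties because the comparison is strict.
        if media_type in SUPPORTED and q > best_q:
            best_type, best_q = media_type, q
    return best_type


print(choose_representation("application/ld+json"))
print(choose_representation("text/html,application/ld+json;q=0.9"))
```

A browser would then get the HTML documentation for https://bioschemas.org/<propertyName>, while a JSON-LD client asking for application/ld+json would get the machine-readable definition from the same URI.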

Types and properties should not be versioned in their PID, because a 1.0 ComputationalWorkflow is semantically also a 2.0 ComputationalWorkflow - however a profile from conformsTo would show the version of conformance.

I'm not sure what PID is. Apart from that, ComputationalWorkflow v2 might not be completely semantically equivalent to ComputationalWorkflow v1 at a formal specification level (not even considering subsumption), for you might have inconsistent specifications or one more general than the other, roughly, as it happens for a class or function name in a Java or Python library. Certainly, we should have a short name like ComputationalWorkflow, which shouldn't include the version, and also bioschemas.org/2.0/ComputationalWorkflow/ is better than bioschemas.org/ComputationalWorkflow/2.0, contrary to what I initially wrote.

ptsefton commented 10 months ago

So is there someone at Bioschemas who can make a determination on this?

stain commented 9 months ago

I would also wish for https://bioschemas.org/{propertyName} to work, but the current structure does not have a page per property like at schema.org, so it would have to redirect to https://schema.org/TypeThatFirstIntroducedIt#{propertyName} or we make such pages.

marco-brandizi commented 9 months ago

I would also wish for https://bioschemas.org/{propertyName} to work, but the current structure does not have a page per property like at schema.org, so it would have to redirect to https://schema.org/TypeThatFirstIntroducedIt#{propertyName} or we make such pages.

'first type that introduced it' might not be ambiguous (it could be based on the creation date), but defining a property description to feed a URL is a cleaner path and such a description could be more informative than landing on some usage example.

By the way, how many new bioschemas properties do we have? I don't remember very many.

albangaignard commented 9 months ago

Hi all, let's take the example of https://bioschemas.org/input property. This was introduced because it's not (yet) part of the Schema.org spec. The issue is that it cannot be de-referenced.

I would go for creating a page in the Bioschemas for each of these "dead" links. I think it would be less confusing than introducing another namespace.

Would that be ok ?

ptsefton commented 9 months ago

@albangaignard as an outsider that makes perfect sense to me -- we could then add these terms to our RO-Crate context and change them if/when the terms are added to schema.org. It would also be helpful if the documentation made it clear how to refer to the terms, as it is clear that some bioschemas community members are using different URIs for the terms including in the which defeats the purpose of using Linked Data.

BTW, also as an outsider though, the properties input and output in particular ring alarm bells for me; these are semantically the same as, or very close to, object and result on http://schema.org:CreateAction, as used here: https://www.researchobject.org/workflow-run-crate/profiles/process_run_crate. Not my project, but do you really need new terms?

ljgarcia commented 9 months ago

@ptsefton I thought (my recollection, could be wrong) I had replied via the Slack integration but my reply is not here at all, so adding it now. Thanks for bringing this up. We are aware of the namespace issues for new types and properties. We are in the process of getting help to move this forward. I will share the strategy once defined so we can also get feedback.

As for the input/output, I am moving your comment to a new discussion, dedicated to those two properties.

ljgarcia commented 9 months ago

@albangaignard as an outsider that makes perfect sense to me -- we could then add these terms to our RO-Crate context and change them if/when the terms are added to schema.org. It would also be helpful if the documentation made it clear how to refer to the terms, as it is clear that some bioschemas community members are using different URIs for the terms including in the which defeats the purpose of using Linked Data.

BTW, also as an outsider though, the properties input and output in particular ring alarm bells for me; these are semantically the same as, or very close to, object and result on http://schema.org:CreateAction, as used here: https://www.researchobject.org/workflow-run-crate/profiles/process_run_crate. Not my project, but do you really need new terms?

@ptsefton New discussion for input/output vs object/result at https://github.com/BioSchemas/specifications/issues/655