BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

Best Practice: DefinedTerm from Ontologies #652

Open sneumann opened 10 months ago

sneumann commented 10 months ago

Hi, several people are representing terms from ontologies via DefinedTerm (example), but I guess there are different flavors out there of how exactly to do that. Hence I would like to call for 1) better documentation, e.g. on our Getting Started tab, and/or 2) even a profile for an ontology-backed DefinedTerm. The main rationale is that I see validators and harvesters starting to connect to terminology services, so we should make it easy for them to recognise and follow ontology terms.

So, as a start towards better documentation, can we come up with examples and promises on how to represent a DefinedTerm?

{
    "@type": "DefinedTerm",
    "@id": "http://purl.bioontology.org/ontology/NCBITAXON/9606",
    "termCode": "9606",
    "url": "http://purl.bioontology.org/ontology/NCBITAXON/9606",
    "inDefinedTermSet":
    {
        "@type": "DefinedTermSet",
        "name": "NCBI taxon",
        "url": "https://bioportal.bioontology.org/ontologies/NCBITAXON"
    },
    "sameAs":
    [
        "http://purl.uniprot.org/taxonomy/9606",
        "https://identifiers.org/taxonomy:9606",
        "http://purl.obolibrary.org/obo/NCBITaxon_9606"
    ]
}

I am most concerned about our recommendations for @id, identifier, url, termCode, all of which somehow identify/lead to the ontology term.

Similarly, we might want recommendations for the DefinedTermSet. Above we have:

    {
        "@type": "DefinedTermSet",
        "name": "NCBI taxon",
        "url": "https://bioportal.bioontology.org/ontologies/NCBITAXON"
    }

Is that enough as minimum information? Very often we have @context, @id and, for profiles, dct:conformsTo at marginality Minimum. How do we tell validators that there is an external controlled vocabulary/ontology behind a term, and not just a flat list of hasDefinedTerm in the set? Do we specify the ontology lookup service as url? Anything else we'd need for DefinedTermSet?

Yours, Steffen

sneumann commented 8 months ago

An activity at the BH_2023 was to analyse what DefinedTerm definitions we find in the wild, specifically in the live-deploy's exampleURLs. Please see https://github.com/elixir-europe/biohackathon-projects-2023/tree/main/7/DefinedTerms for the collection and analysis of 30 DefinedTerms

ivanmicetic commented 8 months ago

Expanding on the format of DefinedTerm and DefinedTermSet, I would suggest a few modifications to the above example:

Things to note or possible issues:

Examples:

sneumann commented 7 months ago

Hi @ivanmicetic , I am unsure about "termCode": "0005670" without prefix. The termCode alone is used nowhere in the owl:

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/ECO_0005670">
        <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">ECO:0005670</oboInOwl:id>
        <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/ECO_0005670"/>

While there is heavy confusion about ECO_0005670 and ECO:0005670, the above example does not mention what (here: ECO) goes before the : or _. Unfortunately, there is no termPrefix property in the DefinedTermSet :-( Without prefix, it is difficult to impossible to build any URLs or parameters to some established services like OLS.
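To illustrate why the prefix matters, here is a minimal Python sketch that builds an OBO-style PURL from a prefixed term code. The helper name `obo_purl_from_code` is hypothetical, and the `http://purl.obolibrary.org/obo/PREFIX_LOCALID` pattern is taken from the ECO excerpt above; that other ontologies follow the same pattern is an assumption.

```python
import re

# Assumption: base URI as seen in the ECO owl excerpt above.
OBO_BASE = "http://purl.obolibrary.org/obo/"

# Accepts "ECO:0005670" or "ECO_0005670"; a bare "0005670" has no prefix.
_PREFIXED_CODE = re.compile(r"^([A-Za-z]+)[:_]([0-9]+)$")


def obo_purl_from_code(term_code: str) -> str:
    """Build an OBO-style PURL from a prefixed term code (hypothetical helper)."""
    match = _PREFIXED_CODE.match(term_code)
    if match is None:
        # Without a prefix there is nothing to tell us which ontology is meant.
        raise ValueError(f"no ontology prefix in term code: {term_code!r}")
    prefix, local_id = match.groups()
    return f"{OBO_BASE}{prefix}_{local_id}"


print(obo_purl_from_code("ECO:0005670"))
```

Calling `obo_purl_from_code("0005670")` raises `ValueError`, which is the situation described above: the bare code alone cannot be turned into a URL.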

The DefinedTerm documentation says "termCode: A code that identifies this DefinedTerm within a DefinedTermSet.", which is indeed a bit vague ...

If we'd stick with "termCode should not have the prefix of the ontology", I'd love to have a pointer to a resource that recommends this. Maybe that could be the definition from identifiers.org ? Does that work for all ontologies we have in OBO and bioportal ?

Yours, Steffen

marco-brandizi commented 7 months ago

These things can never be made straight and we always have to live with them. In the KnetMiner project, we treat IDs like ECO_0005670 as accessions, usually attaching the source (GO, ECO, ENSEMBL, etc), and associating an item to the multiple accessions and accession variants it might have (ECO_XXX, ECO:XXX, etc).

termCode is a good property to represent such accessions, including the prefix, whatever separator is used for it, so, I usually do termCode = ECO_0005670 or termCode = ECO:0005670, usually depending on the data I import.

We rarely need to extract the 'term code' in the sense of the numerical part. To me, it doesn't mean much, apart from rare and peculiar use cases.

One case where we consider the composition is when we try to merge entities with the same or very similar accessions, eg, if one term has ECO0005670 as accession and another ECO:0005670, then they're very likely the same, and this can be detected with a merge/normalisation tool, using a regex like `/[a-z]+[:\b-]?[0-9]+/i`.

Apart from that case, we never consider the numerical part alone and I've never felt the need to store it in cleaned/published data. There might be use cases where you actually want it, but adopting the idea that termCode isn't just for numerical codes, you can do: x termCode 'ECO:0005670', 'ECO_0005670', '0005670'.
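A minimal sketch of the merge/normalisation idea in Python, not KnetMiner's actual tool. The function name `normalise_accession` and the choice of `PREFIX:NUMBER` as the canonical form are assumptions. Note that inside a character class `\b` means backspace in most regex engines, so this sketch uses `[:_-]?` instead, on the assumption that the underscore variant was meant to be covered.

```python
import re
from typing import Optional

# Prefix letters, an optional ':', '_' or '-' separator, then the digits.
_ACCESSION = re.compile(r"^([a-z]+)[:_-]?([0-9]+)$", re.IGNORECASE)


def normalise_accession(accession: str) -> Optional[str]:
    """Return 'PREFIX:NUMBER' for a recognised accession variant, else None."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None  # e.g. a bare number without any prefix
    prefix, number = match.groups()
    return f"{prefix.upper()}:{number}"


# Variants that normalise to the same key are merge candidates.
variants = ["ECO0005670", "ECO:0005670", "ECO_0005670"]
print({normalise_accession(v) for v in variants})
```

All three variants collapse to a single key, while a bare `0005670` returns `None` because there is no prefix to anchor it.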

ivanmicetic commented 7 months ago

@sneumann

> If we'd stick with "termCode should not have the prefix of the ontology", I'd love to have a pointer to a resource that recommends this. Maybe that could be the definition from identifiers.org ?

The only place where termCode is separated and possibly defined from termPrefix is identifiers.org:

Resource        Local Unique Identifier (LUI) pattern    Prefix embedded in LUI
EDAM            ^(data|topic|operation|format)_\d{4}$    No
NCBI taxonomy   ^\d+$                                    No
ECO             ^ECO:\d{7}$                              Yes
GO              ^GO:\d{7}$                               Yes

Note that this applies to compact identifiers or sample URLs for identifiers.org identifiers and not elsewhere, since ECO itself uses both : and _. GO behaves more coherently and uses only :.
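As a rough illustration, a validator could check a termCode against these LUI patterns. The sketch below is in Python; the dictionary `LUI_PATTERNS` and the function `matches_lui` are hypothetical names, and the patterns are copied from the four identifiers.org entries listed above rather than fetched from the registry.

```python
import re

# Patterns copied from the identifiers.org entries quoted above.
LUI_PATTERNS = {
    "EDAM": re.compile(r"^(data|topic|operation|format)_\d{4}$"),
    "NCBI taxonomy": re.compile(r"^\d+$"),
    "ECO": re.compile(r"^ECO:\d{7}$"),
    "GO": re.compile(r"^GO:\d{7}$"),
}


def matches_lui(resource: str, term_code: str) -> bool:
    """Check whether term_code is a valid LUI for the given resource."""
    return LUI_PATTERNS[resource].fullmatch(term_code) is not None


print(matches_lui("ECO", "ECO:0005670"))   # prefix is part of the LUI
print(matches_lui("ECO", "0005670"))       # bare number is not a valid ECO LUI
print(matches_lui("NCBI taxonomy", "9606"))
```

For ECO and GO the prefix is part of the LUI, so a prefix-less termCode would not pass this check, whereas for NCBI taxonomy the bare number is the LUI.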

Here you can find a spreadsheet with the summary of proposed solutions/recommendations for DefinedTerm discussed in this issue. We could use it to see the most favored solution and to monitor the evolution/progress of this new profile (if you find it useful).

Regards, Ivan

sneumann commented 7 months ago

Hi,

my initial urge was to say "duh, the local identifiers without prefix are useless, since there is no context and I wouldn't know how to use 'em then", similar to @marco-brandizi's comment above. Hence, I had really hoped we'd find a way to document this as a CURIE (https://en.wikipedia.org/wiki/CURIE). There are some recommendations in the documentation of the curies package: https://curies.readthedocs.io/en/latest/ We could recommend using the CURIE as it is registered in the Bioregistry, e.g. https://bioregistry.io/registry/chmo .

Yours, Steffen

sneumann commented 7 months ago

Hi again, as part of the discussion I hacked a jq script to reshape the response from the OLS-based terminology service into a DefinedTerm. The mapping would be

{
  "@type": "DefinedTerm",
  "@id": ._embedded.terms[0].iri,
  "termCode": ._embedded.terms[0].obo_id,
  "name": ._embedded.terms[0].label,
  "url": ("https://terminology.nfdi4chem.de/ts/ontologies/"+._embedded.terms[0].ontology_name+"/terms?iri="+._embedded.terms[0].iri),
  "inDefinedTermSet": {
    "@type": "DefinedTermSet",
    "@id": ._embedded.terms[0].ontology_iri,
    "name": ._embedded.terms[0].ontology_name,
  }
}

or the equivalent command line:

wget -q -O- 'https://service.tib.eu/ts4tib/api/ontologies/chmo/terms?CURIE=CHMO:0000470' |\
 jq '{ "@type": "DefinedTerm", "@id": ._embedded.terms[0].iri, "termCode": ._embedded.terms[0].obo_id, "name": ._embedded.terms[0].label, "url": ("https://terminology.nfdi4chem.de/ts/ontologies/"+._embedded.terms[0].ontology_name+"/terms?iri="+._embedded.terms[0].iri), "inDefinedTermSet": { "@type": "DefinedTermSet", "@id": ._embedded.terms[0].ontology_iri, "name": ._embedded.terms[0].ontology_name, } } '

resulting in

{
  "@type": "DefinedTerm",
  "@id": "http://purl.obolibrary.org/obo/CHMO_0001921",
  "termCode": "CHMO:0001921",
  "name": "fluorescence anisotropy decay curve",
  "url": "https://terminology.nfdi4chem.de/ts/ontologies/chmo/terms?iri=http://purl.obolibrary.org/obo/CHMO_0001921",
  "inDefinedTermSet": {
    "@type": "DefinedTermSet",
    "@id": "http://purl.obolibrary.org/obo/chmo.owl",
    "name": "chmo"
  }
}

And yes, this comment is also here so that I know where to find these code snippets later :-) The information about the DefinedTermSet above is a bit poor, and would require a second call to obtain a bit more FAIR information.

Yours, Steffen
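For readers who prefer Python over jq, the same mapping could be sketched as follows. The function name `ols_term_to_defined_term` is hypothetical, and `SAMPLE_RESPONSE` is a hand-written stand-in containing only the fields the mapping reads, with values taken from the result shown above; it is not a real service response.

```python
import json

# Base URL of the term page, as used in the jq mapping above.
TERM_PAGE_BASE = "https://terminology.nfdi4chem.de/ts/ontologies/"


def ols_term_to_defined_term(response: dict) -> dict:
    """Reshape the first term of an OLS-style response into a DefinedTerm."""
    term = response["_embedded"]["terms"][0]
    return {
        "@type": "DefinedTerm",
        "@id": term["iri"],
        "termCode": term["obo_id"],
        "name": term["label"],
        "url": TERM_PAGE_BASE + term["ontology_name"] + "/terms?iri=" + term["iri"],
        "inDefinedTermSet": {
            "@type": "DefinedTermSet",
            "@id": term["ontology_iri"],
            "name": term["ontology_name"],
        },
    }


# Stand-in for the JSON returned by the service (assumption, trimmed by hand).
SAMPLE_RESPONSE = {
    "_embedded": {
        "terms": [
            {
                "iri": "http://purl.obolibrary.org/obo/CHMO_0001921",
                "obo_id": "CHMO:0001921",
                "label": "fluorescence anisotropy decay curve",
                "ontology_name": "chmo",
                "ontology_iri": "http://purl.obolibrary.org/obo/chmo.owl",
            }
        ]
    }
}

print(json.dumps(ols_term_to_defined_term(SAMPLE_RESPONSE), indent=2))
```

In real use the dictionary would come from parsing the service response, e.g. with `json.loads`, instead of the hand-written sample.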

ivanmicetic commented 7 months ago

@sneumann, I agree that the term code without prefix is quite useless and would favour the use of CURIEs. I had a quick look at the curies package and I like how they solved the standardization of CURIEs in order to handle multiple prefix synonyms as well as URI prefix synonyms:

from curies import Converter, Record

converter = Converter([
    Record(
        prefix="GO",
        prefix_synonyms=["gomf", "gocc", "gobp", "go", ...],
        uri_prefix="http://purl.obolibrary.org/obo/GO_",
        uri_prefix_synonyms=[
            "http://amigo.geneontology.org/amigo/term/GO:",
            "https://identifiers.org/GO:",
            ...
        ],
    ),
    # And so on
    ...
])

>>> converter.standardize_prefix("gomf")
'GO'
>>> converter.standardize_curie('gomf:0032571')
'GO:0032571'
>>> converter.standardize_uri('http://amigo.geneontology.org/amigo/term/GO:0032571')
'http://purl.obolibrary.org/obo/GO_0032571'

Maybe we could translate this concept to DefinedTerm, or at least force the use of standardized CURIEs (and standardized IRIs) in our profiles?
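One way this could look for DefinedTerm, sketched in plain Python without the curies package: standardise the prefix first, then emit the standard CURIE as termCode and the standard IRI as @id. The names `PREFIX_SYNONYMS`, `URI_PREFIXES` and `standardised_defined_term` are hypothetical, the mapping tables only contain the GO entries from the example above, and putting the CURIE into termCode is just one of the options discussed in this issue.

```python
# Lower-cased prefix synonym -> standard prefix (GO entries from the example above).
PREFIX_SYNONYMS = {"go": "GO", "gomf": "GO", "gocc": "GO", "gobp": "GO"}

# Standard prefix -> standard URI prefix.
URI_PREFIXES = {"GO": "http://purl.obolibrary.org/obo/GO_"}


def standardised_defined_term(curie: str) -> dict:
    """Build a DefinedTerm with a standardised CURIE and IRI (sketch only)."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    standard_prefix = PREFIX_SYNONYMS.get(prefix.lower())
    if standard_prefix is None:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return {
        "@type": "DefinedTerm",
        "@id": URI_PREFIXES[standard_prefix] + local_id,
        "termCode": f"{standard_prefix}:{local_id}",
    }


print(standardised_defined_term("gomf:0032571"))
```

A profile could then require that termCode and @id are always in this standardised form, leaving the synonym handling to tools.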