Standard way to reference "Information Resources" in Translator data

biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology

https://biolink.github.io/biolink-model/

Standard way to reference "Information Resources" in Translator data #717

Closed mbrush closed 3 years ago

mbrush commented 3 years ago

SRI, in collaboration with other Translator teams, is working to provide standard, computable IRIs and representations of Information Resources that provide knowledge to and within the Translator system.

An Information Resource is defined as a database or knowledgebase and its supporting ecosystem of interfaces and services that deliver content to consumers (e.g. web portals, APIs, query endpoints, streaming services, data downloads, etc.). A single Information Resource by this definition may span many different datasets or databases, and include many access endpoints and user interfaces. Information Resources include Translator KPs, and external community knowledgebases like ChEMBL or DGIdb.

Most developers felt it important that references to Info Resources in the data be human readable (mainly for development and testing purposes).

Several proposals have been put forth about the source and pattern for referencing Information Resources in the data:

1. Mint/register new IRIs using w3id for all resources (internal KPs and external community resources). The id component of the IRIs here could be opaque or human readable.
2. Mint/register w3id IRIs for Translator resources, and use Wikidata IRIs for community resources (this will require Wikidata creating new IRIs for resources not currently registered; note also that Wikidata IRIs are not human readable).
3. Use Wikidata for all resources (this will require registering Translator KPs to get a Wikidata ID for them).
4. Punt on the question of formal IRIs for now, and just define free-text strings as enumerations, e.g. "chembl", "text-mining_kp". These can be mapped to IRIs / identifiers in an external location, as a separable concern/activity from enum generation and use in the data.

Given the requirements for rapid support and human readability, the preferred approach is either (1) with a human readable id, or (4). A spreadsheet template is being developed here to hold ids/enums and metadata about Information Resources relevant to Translator.
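To make approach (4) concrete, here is a minimal Python sketch of free-text enum values used in the data, with the mapping to IRIs kept in a separate table. The class name `InfoResource`, the helper `lookup_iri`, and the `example.org` IRI are placeholders invented for illustration; nothing here reflects an agreed Translator implementation.

```python
from enum import Enum
from typing import Dict, Optional


class InfoResource(str, Enum):
    """Free-text enum values that would appear directly in the data."""
    CHEMBL = "chembl"
    TEXT_MINING_KP = "text-mining_kp"


# Maintained separately from the enum, so identifiers can be assigned later.
# The IRI below is a placeholder, not a registered identifier.
IRI_BY_RESOURCE: Dict[InfoResource, str] = {
    InfoResource.CHEMBL: "https://example.org/infores/chembl",
}


def lookup_iri(value: str) -> Optional[str]:
    """Return the mapped IRI for an enum string, or None if not yet mapped.

    Raises ValueError when the string is not a known enum value.
    """
    resource = InfoResource(value)
    return IRI_BY_RESOURCE.get(resource)
```

Keeping the table apart from the enum is what makes the IRI question a "separable concern": data can be produced and validated against the enum before any IRIs are minted.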

This ticket is to track feedback on and development of this proposal.

nlharris commented 3 years ago

This is being discussed at a meeting right now.

mbrush commented 3 years ago

Some folks in SRI in particular are uncomfortable minting IRIs with 'meaningful' id components (e.g. https://w3id.org/translator/resource/dgidb) - as this is generally considered bad practice given the potential for names of things to change. Also, we will likely at some point want to spin up a KG that holds/serves info about Information Resources, and we'll need reliable identifiers for nodes representing resources in this graph. Finally, we will likely want to mint IRIs for specific versions of Info Resources, and baking version information into a consistent and unambiguous human readable id could be challenging.

SRI folks are pushing for approach (1) above, where minted w3id-based IRIs have a numeric identifier component for this purpose (e.g. something like https://w3id.org/infores/0000001). As we do for Biolink classes, we can map these to any existing identifiers that may exist for a resource in places like Wikidata or FAIRSharing.org.

I know a few people have strongly advocated for a human readable way to reference Information Resources in our data - for purposes of being able to easily review / QA / debug data in fix-it sessions. Richard and others have proposed the idea of a simple utility that can add labels to curies in dev data for this purpose. Another possible solution would be to recommend KPs populate the 'description' field of any Attribute reporting source provenance metadata with a human readable name of the resource (along with any other info they want to report about it). e.g.

    {
      "attribute_type_id": "biolink:aggregator_knowledge_source",
      "value": "infores:0012345",
      "value_type_id": "biolink:InformationResource",
      "value_url": "https://www.ebi.ac.uk/chembl",
      "description": "ChEMBL.  This resource is a manually curated database of bioactive molecules...",
      "attribute_source": "infores:0045678"
    }

This could help human users easily discern the source asserted in an Attribute, but notably it doesn't help with the resource asserted as the 'attribute_source'.
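For the other idea mentioned above, a simple utility that adds labels to CURIEs in dev data, a rough Python sketch might look like the following. The label table, the `_label` key suffix, and the function name are all assumptions made for illustration (the `_label` keys are not part of any Translator or TRAPI schema), and the label for `infores:0045678` is invented since the example does not name that resource.

```python
from typing import Any, Dict

# Hypothetical lookup table; in practice this might come from a catalog.
INFORES_LABELS: Dict[str, str] = {
    "infores:0012345": "ChEMBL",
    "infores:0045678": "Example KP (invented label)",
}

EXAMPLE_ATTRIBUTE: Dict[str, Any] = {
    "attribute_type_id": "biolink:aggregator_knowledge_source",
    "value": "infores:0012345",
    "value_type_id": "biolink:InformationResource",
    "attribute_source": "infores:0045678",
}


def annotate_attribute(attribute: Dict[str, Any],
                       labels: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of the attribute with '<field>_label' keys added.

    Only 'value' and 'attribute_source' are inspected; fields whose CURIE
    has no known label are left without a label key.
    """
    annotated = dict(attribute)
    for field in ("value", "attribute_source"):
        curie = attribute.get(field)
        if isinstance(curie, str) and curie in labels:
            annotated[f"{field}_label"] = labels[curie]
    return annotated


if __name__ == "__main__":
    print(annotate_attribute(EXAMPLE_ATTRIBUTE, INFORES_LABELS))
```

Unlike the 'description' approach, a utility like this would label the attribute_source as well, and it leaves the original data untouched.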

mbrush commented 3 years ago

The pendulum continues to swing . . . the prevailing thought now is that we should explore whether human readable w3ids could work. The rationale here is that the convention of opaque IRIs is customary in the ontology world, but Biolink is more of a schema - and here it is customary to have human-readable ids. Our Info Resource identifiers will be used in the Biolink schema space - so human readable ids may be preferred.

One complication with this could arise when minting human readable ids for specific versions of Info Resources - given that different resources may use different conventions for specifying versions (semantic versioning, ad hoc versioning, dates instead of versions). Our next task is to assemble a catalog of examples of human readable w3id IRIs for general and version level resource IRIs - to understand if this could work in practice.

mbrush commented 3 years ago

Some examples illustrating diversity of versioning approaches:

ChEMBL: "ChEMBL28" (https://www.ebi.ac.uk/chembl/)

https://w3id.org/infores/chembl, https://w3id.org/infores/chembl28

CIViC: continually updated, but downloads have release dates, e.g. 01-May-2021 (see https://civicdb.org/releases)

https://w3id.org/infores/civic, https://w3id.org/infores/civic01-May-2021

DGIdb: "DGIdbv4.2.0" (see https://www.dgidb.org/)

https://w3id.org/infores/dgidb, https://w3id.org/infores/dgidbv4.2.0

Monarch: downloads have release dates, e.g. 13-Apr-2021 14:18 (see https://archive.monarchinitiative.org/latest/)

https://w3id.org/infores/monarchinitiative, https://w3id.org/infores/monarchinitiative13-Apr-202114:18

MONDO: "Release v2021-03-03"

https://w3id.org/infores/mondo, https://w3id.org/infores/mondov2021-03-03

DrugBank: "DrugBank Release Version 5.1.8" (see https://go.drugbank.com/releases/latest)

https://w3id.org/infores/drugbank, https://w3id.org/infores/drugbankv5.1.8

GO Annotations: Release dates, e.g. "2021-05-01" (see http://geneontology.org/)

https://w3id.org/infores/goa, https://w3id.org/infores/goa2021-05-01

Handling resource versions:

create version-specific identifiers?
capture version in a separate attribute (e.g. a nested attribute) - punt for now

Identifier component of a curie:

base resource: lowercase, underscore separated, short form of resource name (e.g. dgidb, text_mining_kp, aragorn_ara)
version info (if we create version specific IRIs) . . . see above

Options for w3id namespace:

https://w3id.org/infores/
https://w3id.org/resource/

Options for prefix:

infores (infores:dgidb)
resource (resource:dgidb)

mbrush commented 3 years ago

On the 5-21 SRI call, we settled on minting IRIs in the w3id namespace that would give us the following (using DGIdb and Monarch as examples):

General resource IRI: https://w3id.org/infores/dgidb, https://w3id.org/infores/monarchinitiative
Specific resource version IRI: https://w3id.org/infores/dgidb.4.2.0, https://w3id.org/infores/monarchinitiative.13-Apr-2021
CURIEs: infores:dgidb, infores:dgidb.4.2.0, infores:monarchinitiative, infores:monarchinitiative.13-Apr-2021
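The relationship between these CURIEs and IRIs can be sketched in a few lines of Python. This assumes the `infores` prefix simply expands to the `https://w3id.org/infores/` namespace, as the examples above suggest; the function names are illustrative only.

```python
INFORES_PREFIX = "infores"
INFORES_NAMESPACE = "https://w3id.org/infores/"


def expand_curie(curie: str) -> str:
    """Expand an infores CURIE (e.g. 'infores:dgidb') to its full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if sep != ":" or prefix != INFORES_PREFIX or not local_id:
        raise ValueError(f"not an infores CURIE: {curie!r}")
    return INFORES_NAMESPACE + local_id


def contract_iri(iri: str) -> str:
    """Contract a full infores IRI back to its CURIE form."""
    local_id = iri[len(INFORES_NAMESPACE):]
    if not iri.startswith(INFORES_NAMESPACE) or not local_id:
        raise ValueError(f"not an infores IRI: {iri!r}")
    return f"{INFORES_PREFIX}:{local_id}"
```

For example, `expand_curie("infores:dgidb.4.2.0")` gives `https://w3id.org/infores/dgidb.4.2.0`, and `contract_iri` reverses it.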

Re: the identifier component of a curie:

base resource name: short form, lowercase; use the base URL of the web address where possible and, if needed, a hyphen to separate words (e.g. dgidb, text-mining-kp, aragorn-ara, monarchinitiative)
version info: appended to end, following a '.' separator as in the examples above (TO DO: confirm this is the separator we want to use)

The way resources version their releases varies widely of course, as documented above, so versioned IRIs would be highly variable. And obviously there were concerns raised about ids based on resource names, rather than opaque strings - given the potential for names to change over time. But after much debate, this is where things landed. Versions will be either ISO8601-formatted dates, or semantic-versioning-compliant version numbers, separated from the base resource name by a '.'.
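As a sanity check on these conventions, here is a small Python sketch of a validator for the identifier component. It reads the rules as: a lowercase base name with optional hyphen-separated words, optionally followed by '.' and either an ISO 8601 calendar date (YYYY-MM-DD) or a plain MAJOR.MINOR.PATCH version. The regular expressions and names are my own interpretation, not an agreed specification. Note that under this reading a date written as 13-Apr-2021 would need to be rewritten as 2021-04-13.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

BASE_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")   # e.g. text-mining-kp
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")            # e.g. 4.2.0
ISO_DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")      # e.g. 2021-05-01


class InforesId(NamedTuple):
    base: str
    version: Optional[str]


def is_iso_date(text: str) -> bool:
    """True for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE_RE.match(text):
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


def parse_infores_id(identifier: str) -> InforesId:
    """Split an identifier into base name and optional version, or raise."""
    # The base name contains no '.', so the first '.' starts the version.
    base, sep, version = identifier.partition(".")
    if not BASE_RE.match(base):
        raise ValueError(f"invalid base resource name: {base!r}")
    if not sep:
        return InforesId(base, None)
    if SEMVER_RE.match(version) or is_iso_date(version):
        return InforesId(base, version)
    raise ValueError(f"invalid version component: {version!r}")
```

Splitting on the first '.' works because semantic versions contain dots but base names, under the conventions above, do not.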

nlharris commented 3 years ago

Discussed in Biolink Help Desk on 7-Jun-2021.

mbrush commented 3 years ago

June 23 Update: An 'Infores Catalog' has been created to store Info Resource identifiers, along with metadata about the Information Resources they specify. This catalog is currently split between two spreadsheets in the Google document here.

'Translator Resources' includes resources generated by Translator teams, and registered in the smartAPI Translator Registry. Initially only version-specific IRIs will be created (as each registry entry is tied to a version).
'External Resources' covers pre-existing, external resources from which one or more Translator resource derives its content. The initial set of resources was seeded with content from Bioregistry.org, and only version-agnostic identifiers are created for now.

Action Item for KP and ARA reps:

Review Source Retrieval Provenance documentation - see here
Review the 'name' (column I) assigned to each resource in the 'Translator Resources' sheet of the Infores Catalog that is affiliated with your Team (column M) - see here
  a. The 'name' column holds a short form name that will be the identifier component of an Infores IRI. The conventions/rules used to guide 'name' creation are described in Appendix 2 and 3 of the documentation here.
  b. Share any questions / feedback with SRI - you can add a note to column D of the Infores Catalog, post to Slack (mention @Matthew Brush), or comment in this ticket.

nlharris commented 3 years ago

@mbrush can this be closed now?

nlharris commented 3 years ago

Closing as done. Can be reopened if I closed it prematurely.