biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

What kind of 'type' should be captured in the Attribute.value_type_id field? #1106

Open mbrush opened 1 year ago

mbrush commented 1 year ago

The TRAPI Attribute object includes a value_type_id field to indicate the "type" of the thing reported in the Attribute.value field.

We need to decide whether this field should capture the more foundational/technical data type of what is in the value field (e.g. CURIE, string, float, ...), or the more semantic/ontological type of thing the value represents (e.g. "InformationResource", "Publication", "Person", "p-value", ...).

The example that spurred this question on the 10-13-22 Data Modeling call is below; it concerns the value_type_id of "biolink:InformationResource".

{
  "edges": [
    {
      "id": "Association001",
      "subject": "CHEBI:3215",
      "predicate": "biolink:interacts_with",
      "object": "NCBIGene:51176",
      "attributes": [
        {
          "attribute_type_id": "biolink:primary_knowledge_source",
          "value": "infores:clinical-trials-gov",
          "value_type_id": "biolink:InformationResource",      # More useful here to capture "CURIE" in this field?
          "value_url": "https://www.clinicaltrials.gov",      
          "description": "ClinicalTrials.gov is...",
          "attribute_source": "infores:chembl"
        }
      ]
    }
  ]
}

Many felt it would be more useful to capture a more foundational "type" here (e.g. "CURIE", since the value here is represented as a CURIE), especially since the semantic/ontological type of the value will usually be knowable from the range of the Biolink edge property in the attribute_type_id field, or from the name of the edge property itself (e.g. biolink:p-value).


While this concerns elements of the TRAPI schema, it is a broader issue of modeling conventions, and Biolink support may ultimately be needed to implement our decision (e.g. an enumeration of foundational data types to constrain this field). Tagging @edeutsch and @sierra-moxon for their input.
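As a rough illustration of what such an enumeration could look like in practice, here is a minimal Python sketch. The class name `FoundationalType`, its members, the helper `check_value_type`, and the bare label "CURIE" are all hypothetical placeholders chosen for this example; none of them is an agreed Biolink or TRAPI vocabulary.

```python
from enum import Enum


class FoundationalType(str, Enum):
    """Hypothetical closed list of data types allowed in value_type_id."""
    CURIE = "CURIE"
    URI = "URI"
    STRING = "string"
    FLOAT = "float"
    INTEGER = "integer"
    BOOLEAN = "boolean"


def check_value_type(attribute: dict) -> bool:
    """Return True if the attribute's value_type_id is one of the enumerated types."""
    allowed = {t.value for t in FoundationalType}
    return attribute.get("value_type_id") in allowed


# The attribute from the example above, with the value_type_id swapped to the
# placeholder label "CURIE" (an assumption, not a settled identifier).
attribute = {
    "attribute_type_id": "biolink:primary_knowledge_source",
    "value": "infores:clinical-trials-gov",
    "value_type_id": "CURIE",
}

print(check_value_type(attribute))
print(check_value_type({"value_type_id": "biolink:InformationResource"}))
```

Under this sketch the revised attribute passes the check, while the original "biolink:InformationResource" value would be rejected because it is a semantic class rather than one of the enumerated data types.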

edeutsch commented 1 year ago

I think you captured this very well. The attribute_type_id conveys to the reader that the value will be a primary_knowledge_source (which is_a knowledge_source). So all the reader then needs to know is how you will convey the identity of this knowledge source in the value field: via a URI? Via a CURIE? Via a proper name? All three are strings, but the value types of those strings are different, and which form is used matters a lot to the reader when interpreting the string. I'd contend that the current value_type_id = biolink:InformationResource isn't useful because the reader already knew that from the attribute_type_id. What it needs to learn is how to interpret the string value. To me, that means URI or CURIE or proper name in this case. That is what I had in mind with value_type_id.
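To make the point concrete, the sketch below shows what a consumer would have to do if value_type_id did not say how to read the string: guess the form from its shape. The names `ValueForm` and `guess_value_form` and both regular expressions are assumptions made for illustration; they are simplified heuristics, not the CURIE or URI specifications and not part of TRAPI.

```python
import re
from enum import Enum


class ValueForm(Enum):
    """Hypothetical labels for how a string value should be read."""
    URI = "URI"
    CURIE = "CURIE"
    NAME = "name"


# Simplified shapes, for illustration only.
# URI: a scheme followed by "://" and some non-whitespace characters.
_URI_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://\S+$")
# CURIE: a prefix, a colon, then a local part that has no whitespace
# and does not start with a slash.
_CURIE_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*:[^\s/]\S*$")


def guess_value_form(value: str) -> ValueForm:
    """Guess whether a string is a URI, a CURIE, or a plain name."""
    if _URI_RE.match(value):
        return ValueForm.URI
    if _CURIE_RE.match(value):
        return ValueForm.CURIE
    return ValueForm.NAME


for candidate in (
    "https://www.clinicaltrials.gov",
    "infores:clinical-trials-gov",
    "ClinicalTrials.gov",
):
    print(candidate, "->", guess_value_form(candidate).value)
```

A heuristic like this is fragile (a proper name containing a colon would be misread as a CURIE), which is the argument for stating the form explicitly in value_type_id rather than leaving it to the reader.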

sierra-moxon commented 1 year ago

Noting that biolink already has a controlled vocabulary of types that it inherits from LinkML: https://github.com/linkml/linkml-model/blob/main/linkml_model/model/schema/types.yaml

sierra-moxon commented 1 year ago

Using "CURIE" as the type for this makes sense to me. Do we need anyone else to weigh in on this, or shall we declare it to be the type of thing in the value field?