biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

modeling observations/measurements/assays #858

Open sierra-moxon opened 3 years ago

sierra-moxon commented 3 years ago

Some use cases: 1) recording the measurement process for a specific trait; 2) recording the methods used to elicit a finding (e.g., an in-situ hybridization assay was used in a gene expression study).

We already have the biolink:ClinicalMeasurement class and the 'biolink:HasAttribute' slot, but does a clinical measurement include the measuring process itself?

One suggestion from @ramonawalls is an "observation process" class as a parent to an "assay" class.

sierra-moxon commented 3 years ago

Alliance uses the MMO (measurement method ontology) to describe particular assays.

ramonawalls commented 3 years ago

We use observing process as part of our core model in BCO. For biodiversity informatics, the two key ways of getting information about an organism are observing it and collecting it, so observing process and specimen collection process (from OBI) are central terms in BCO. The first has data as a specified output and the second has a specimen as a specified output. Both are kinds of OBI:planned process, so observing process is a sibling to assay, rather than a parent.

Why is it important to include the process for trait/phenotype data? Unlike mutant phenotypes recorded in a lab, most phenotypic data are affected by the situation in which they are recorded (i.e., there is an environmental component), and the observing/measuring process often affects the data value (think about all the different ways to measure organism length or height). Thus we need to have both the process and the location as part of our model.

One could have a model in which a trait/phenotype value has contextual information attached to it through object properties, such as phenotype measured_in some location plus location has_characteristic some characteristic. In my experience, though, it makes more sense to model the phenotype, process, and location as classes, so that 1) we can create instances of processes and link multiple data points resulting from the same observing process, and 2) we can associate a protocol with the process.

We have been using a basic model that includes:

observing processX has_input some entityY
observing processX has_output some (data item about some traitZ)
traitZ characteristic_of entityY
observing processX occurs_in locationA
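
To make the pattern concrete, the four statements above could be sketched as plain Python dataclasses. This is only an illustration under stated assumptions: the class and field names (`Entity`, `Trait`, `DataItem`, `Location`, `ObservingProcess`, `protocol`, `record`) are hypothetical and are not Biolink, BCO, or OBI identifiers, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """entityY: the thing being observed, e.g. an organism."""
    label: str


@dataclass
class Trait:
    """traitZ, linked to its bearer (traitZ characteristic_of entityY)."""
    label: str
    characteristic_of: Entity


@dataclass
class DataItem:
    """A data item about some trait; the value may be continuous or categorical."""
    about: Trait
    value: object
    unit: Optional[str] = None


@dataclass
class Location:
    """locationA: where the observing process occurred."""
    label: str


@dataclass
class ObservingProcess:
    """observing processX with its input, outputs, location, and protocol."""
    has_input: Entity
    occurs_in: Optional[Location] = None
    protocol: Optional[str] = None  # hypothetical free-text protocol reference
    has_output: List[DataItem] = field(default_factory=list)

    def record(self, trait_label: str, value: object, unit: Optional[str] = None) -> DataItem:
        """Attach one data point to this process instance."""
        trait = Trait(trait_label, characteristic_of=self.has_input)
        item = DataItem(about=trait, value=value, unit=unit)
        self.has_output.append(item)
        return item


# Invented example: one observing process yields two data points.
plant = Entity("plant specimen 1")
process = ObservingProcess(
    has_input=plant,
    occurs_in=Location("field plot A"),
    protocol="ruler, soil surface to apex",
)
process.record("height", 42.0, "cm")      # continuous value
process.record("flower color", "white")   # categorical value
```

Because both data items hang off the same `process` instance, they share its input entity, location, and protocol, which is the linking benefit that modeling the process as a class is meant to provide.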

We have used the model successfully in multiple projects with both continuous and categorical phenotype data. Sometimes we take a shortcut and do not make an instance of location, but rather just add data properties to the observing process to record the location. The location is actually less important in the model, because you can always query for things that happened in the same place based on the data. This is also essentially the same model that is being used widely in the Earth Sciences community (https://www.w3.org/TR/vocab-ssn/).
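
The location shortcut described above might look like the following sketch, in which the location is stored as data properties on the observing process and "same place" is recovered by a query over those values. The record type, the `group_by_place` helper, the rounding precision, and the coordinates are all assumptions made for illustration, not part of BCO or Biolink.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ObservingProcessRecord:
    """An observing process whose location is stored as plain data properties."""
    process_id: str
    latitude: float
    longitude: float


def group_by_place(
    records: List[ObservingProcessRecord], precision: int = 3
) -> Dict[Tuple[float, float], List[str]]:
    """Group process ids by coordinates rounded to `precision` decimal places."""
    groups: Dict[Tuple[float, float], List[str]] = defaultdict(list)
    for rec in records:
        key = (round(rec.latitude, precision), round(rec.longitude, precision))
        groups[key].append(rec.process_id)
    return dict(groups)


# Invented coordinates: the first two processes happened at the same place.
records = [
    ObservingProcessRecord("obs-1", 32.2319, -110.9501),
    ObservingProcessRecord("obs-2", 32.2319, -110.9501),
    ObservingProcessRecord("obs-3", 40.0150, -105.2705),
]
same_place = group_by_place(records)
```

No location instance is ever created here; co-located processes are found purely from the recorded values.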

Sure, there are ways that you can model healthcare or biodiversity data without including the measurement/observing process, but including the process makes the model more robust and generalizable, because it better reflects reality. I think the fact that at least three different groups (SOSA, OBOE, and BCO) came up with the same model independently is an indication of its value.