Closed ppavlidis closed 11 months ago
From what I posted on Slack:
Instead of a bag of words, what about supporting ordered terms to construct (subject, predicate, object) triplets? These would fit in the current data model if we add an order to the factor value's characteristics collection.
Fixing existing bags would be complicated because we would need a way to tell which term is a subject and which term is an object.
For backward compatibility, we could add a flag on the factor value to indicate whether it has triplet semantics attached.
It could scale beyond a single triplet, but is that even a thing we need to consider?
Yeah, the datamodel part isn't hard to envision. I am more concerned about how to make this work in the UI (and the overall cost-benefit of setting this up). The curators' thoughts on this have been solicited.
As for backporting: Since there are a large number of individual cases covered by a small number of patterns, I suspect that doing provisional mapping with manual screening will get us most of the way there. For example, the category Dose with free text occurring with a Treatment from CHEBI is one such rule.
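A rough sketch of what one such backporting rule could look like, just to make the idea concrete. The `Characteristic` class, the `propose_dose_mapping` function and the "needs extra review" flag are all hypothetical, not Gemma code, and the CHEBI check assumes terms are stored with their OBO PURL. Every proposal would still go through manual screening.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# OBO PURL prefix for CHEBI terms; assumed to be how CHEBI URIs are stored.
CHEBI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"


@dataclass
class Characteristic:
    category: str
    value: str
    value_uri: Optional[str] = None  # None means free text


def propose_dose_mapping(
    characteristics: List[Characteristic],
) -> Optional[Tuple[Characteristic, Characteristic, bool]]:
    """Propose (treatment, dose, needs_extra_review), or None if the rule does not apply."""
    treatments = [
        c for c in characteristics
        if c.category.lower() == "treatment"
        and c.value_uri is not None
        and c.value_uri.startswith(CHEBI_PREFIX)
    ]
    doses = [
        c for c in characteristics
        if c.category.lower() == "dose" and c.value_uri is None
    ]
    # Only propose when the pairing is unambiguous: one CHEBI treatment, one free-text dose.
    if len(treatments) != 1 or len(doses) != 1:
        return None
    # Other characteristics in the bag make the proposal less certain.
    needs_extra_review = len(characteristics) > 2
    return treatments[0], doses[0], needs_extra_review


drug = Characteristic("treatment", "DrugX", CHEBI_PREFIX + "0000000")  # placeholder ID
dose = Characteristic("dose", "Yug/kg")
print(propose_dose_mapping([drug, dose]))
```

Bags with two treatments or two doses fall through to `None` here, which would leave them for manual curation.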
Of course, the possibility of being able to apply such rules raises the idea of just doing this 'post-hoc' without any curation or data model change (i.e., just change the way the data are presented on the fly), but it would be better to get it done properly.
What I would suggest for the data model is quite a bit like your idea but a little simpler: a Characteristic can have an associated "modifier" Characteristic, and the nature of the modification could be enumerated as part of the Characteristic (if the semantics of the modifier actually matters). That's somewhat constraining (only one modifier, though they could be chained...), but it seems likely to satisfy the main use case. Example: Treatment DrugX is modified by Dose Yug/kg. That binds them together and also enforces an ordering for display ("Yug/kg DrugX"). I could well be ignoring some important angle but I'd start there; counter-examples that would motivate something more are welcome.
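To illustrate the modifier idea, here is a minimal sketch, assuming a single optional modifier per Characteristic. The class layout, the `modifier_relation` field and the `has_dose` value are made up for the example and do not reflect the actual Gemma model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Characteristic:
    category: str                                 # e.g. "treatment", "dose"
    value: str                                    # ontology label or free text
    modifier: Optional["Characteristic"] = None   # at most one modifier; modifiers can chain
    modifier_relation: Optional[str] = None       # hypothetical enumerated relation


def display_label(c: Characteristic) -> str:
    """Render each modifier before the term it modifies."""
    parts = []
    node: Optional[Characteristic] = c
    while node is not None:  # assumes the chain has no cycles
        parts.append(node.value)
        node = node.modifier
    return " ".join(reversed(parts))


dose = Characteristic(category="dose", value="Yug/kg")
treatment = Characteristic("treatment", "DrugX", modifier=dose, modifier_relation="has_dose")
print(display_label(treatment))  # Yug/kg DrugX
```

The single-modifier constraint is what makes the display order unambiguous; anything graph-shaped would need an explicit ordering rule.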
The situation is usually going to be simple if there are <3 characteristics for a given factorvalue.
The trivial case is N(characteristics) = 1, not discussing that. For N(char) = 2, this will typically be of the form "thing" and "more information about that thing", so it is not so difficult to handle.
When there are 3 (or more), the relationships among the terms become increasingly hard to generalize.
There are currently over 4000 factor values that have more than 2 characteristics. About 1200 have more than 3. 175 have more than 5. So this is not a rare situation.
In many cases, these can be considered as two parallel one-to-one relations like "drug A at this dose AND also drug B at this other dose".
Some extreme cases: GSE10784: a very diligent curator annotated all the genes in a deleted chromosome region as characteristics of that factorvalue AND also put a free-text annotation for the deletion. So there are 29 characteristics in this factor value. We are not consistent in our annotation of things like this! "Trisomy 21" does not follow with a list of hundreds of genes on chromosome 21 that are duplicated. This is similar as is this one - especially the latter seems like it really is appropriate, though.
I'm currently exploring the possibility of using OWL/RDF for annotating a factor value. I quickly came to the realization that what we are doing is essentially instantiating classes (i.e. a treatment, a genotype, a gene fusion, etc.) and defining a bunch of relations between the resulting instances.
The terms that act as "modifier" can be replaced by object properties from the Relation Ontology. For example, "treatment with dose doxycycline":
The classes we need to instantiate: treatment, doxycycline and "5mg" (as a free-text term).
Or a gene fusion:
The classes we have to instantiate here: genotype, gene fusion, gene A, gene B and gene C.
Needless to say, all the examples above are a natural fit for RDF/OWL. It could be stored as an XML blob in the FACTOR_VALUE table, and Jena could be used to generate it and later parse it. It's also possible to store them in a relational database.
create table FACTOR_VALUE_TRIPLETS (
    FACTOR_VALUE_FK int,
    SUBJECT_FK int references CHARACTERISTIC(ID),
    PREDICATE_URI varchar(255),
    OBJECT_FK int references CHARACTERISTIC(ID),
    UNIQUE(FACTOR_VALUE_FK, SUBJECT_FK, PREDICATE_URI, OBJECT_FK)
);
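For anyone who wants to poke at the relational variant, here is a small self-contained sketch using an in-memory SQLite database. The simplified CHARACTERISTIC table (with a LABEL column), the example rows and the `http://example.org/has_dose` predicate are placeholders, not the real Gemma schema or a real ontology term.

```python
import sqlite3

HAS_DOSE = "http://example.org/has_dose"  # hypothetical predicate URI

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
create table CHARACTERISTIC (
    ID integer primary key,
    CATEGORY varchar(255),
    LABEL varchar(255)
);
create table FACTOR_VALUE_TRIPLETS (
    FACTOR_VALUE_FK int,
    SUBJECT_FK int references CHARACTERISTIC(ID),
    PREDICATE_URI varchar(255),
    OBJECT_FK int references CHARACTERISTIC(ID),
    UNIQUE(FACTOR_VALUE_FK, SUBJECT_FK, PREDICATE_URI, OBJECT_FK)
);
""")

# Two characteristics of one factor value, bound together by a single triplet.
conn.executemany(
    "insert into CHARACTERISTIC (ID, CATEGORY, LABEL) values (?, ?, ?)",
    [(1, "treatment", "DrugX"), (2, "dose", "Yug/kg")],
)
conn.execute(
    "insert into FACTOR_VALUE_TRIPLETS values (?, ?, ?, ?)",
    (1, 1, HAS_DOSE, 2),
)

# Resolve the triplets of factor value 1 back to readable labels.
rows = conn.execute(
    """
    select s.LABEL, t.PREDICATE_URI, o.LABEL
    from FACTOR_VALUE_TRIPLETS t
    join CHARACTERISTIC s on s.ID = t.SUBJECT_FK
    join CHARACTERISTIC o on o.ID = t.OBJECT_FK
    where t.FACTOR_VALUE_FK = ?
    """,
    (1,),
).fetchall()
print(rows)
```

The UNIQUE constraint rejects an identical second triplet, and with foreign keys switched on a triplet cannot point at a characteristic that does not exist.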
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:j.0="http://purl.obolibrary.org/obo/"
xmlns:j.1="http://www.ebi.ac.uk/efo/"
xmlns:owl="http://www.w3.org/2002/07/owl#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:nodeID="A0">
<j.0:RO_0002374>gene C</j.0:RO_0002374>
<j.0:RO_0002374>gene B</j.0:RO_0002374>
<j.0:RO_0002374>gene A</j.0:RO_0002374>
<rdf:type rdf:resource="http://purl.obolibrary.org/obo/SO_0001565"/>
</rdf:Description>
<rdf:Description rdf:about="http://gemma.msl.ubc.ca/ont/FV_000001">
<j.0:RO_0002200 rdf:nodeID="A0"/>
<rdf:type rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000513"/>
</rdf:Description>
</rdf:RDF>
This can be mapped to the UI by allowing curators to specify triplets. The objects and subjects are all terms in the bag, so there are only a few of them to propose. The predicate could be an autocomplete that would be pre-populated with RO terms.
It might be annoying to have to do that for each factor value though. We could implicitly handle some basic cases based on the classes of the terms.
The "has modifier" relation might be a good generic way of associating a term with a "modifier".
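As a quick illustration of how small the choice space is, the candidate (subject, object) pairs a curator would pick from can be enumerated straight from the bag. This is only a sketch; `candidate_pairs` is a hypothetical helper and the terms are plain strings here rather than Characteristic records.

```python
from itertools import permutations
from typing import List, Tuple


def candidate_pairs(terms: List[str]) -> List[Tuple[str, str]]:
    """All ordered (subject, object) pairs of distinct terms in the bag."""
    return list(permutations(terms, 2))


bag = ["genotype", "gene fusion", "gene A"]
pairs = candidate_pairs(bag)
print(len(pairs))  # 6
```

The count grows as n(n-1), so a three-term bag gives 6 pairs, while the 29-characteristic factor value from GSE10784 would give 812. That is one more reason to handle the basic cases implicitly.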
To make this more concrete, the RDF example above would have the following triplets:
C_{id} and FV_{id} refer to in-database records. If those were stored in the database, foreign keys to characteristics would be used.
I created a repository to work on a UI prototype which can be previewed at https://pavlidislab.github.io/gemma-fv-ui-prototype/.
I've been investigating the challenge of resolving these types of relationships for factorvalues that are already in the system.
For a somewhat random sample of experiments, when a factorvalue has just two characteristics, I can resolve this reasonably in most, but not all cases.
The problem is the factorvalues that have more than two characteristics, and the ones which are not easily resolved.
Examples: (the formatting is an artifact of my test code)
Factorvalue has two OrganismParts
FactorValue 204594: EFO_0000635:somatosensory cortex | motor cortex |
- c - organism part: somatosensory cortex http://purl.obolibrary.org/obo/UBERON_0008930
- c - organism part: motor cortex http://purl.obolibrary.org/obo/UBERON_0001384
Confusing mixtures of things with unclear relationships
FactorValue 119735: treatment:schistosomiasis | pulmonary hypertension |
- c - treatment: schistosomiasis http://purl.obolibrary.org/obo/MONDO_0015254
- c - disease: pulmonary hypertension http://purl.obolibrary.org/obo/MONDO_0005149
FactorValue 129217: treatment:glial cell | C57BL/6 |
- c - treatment: glial cell http://purl.obolibrary.org/obo/CL_0000125
- c - strain: C57BL/6 http://gemma.msl.ubc.ca/ont/TGEMO_00016
Controls that are described with two characteristics. This may be addressable.
FactorValue 205437: genotype:wild type genotype | control |
- c - genotype: wild type genotype http://www.ebi.ac.uk/efo/EFO_0005168
- c - genotype: control http://www.ebi.ac.uk/efo/EFO_0001461
Organism part with a location modifier, but determining this programmatically may not be easy
FactorValue 137484: organism part:Ammon's horn | dorsal |
- c - organism part: Ammon's horn http://purl.obolibrary.org/obo/UBERON_0001954
- c - organism part: dorsal http://www.ebi.ac.uk/efo/EFO_0001656
Basically this is going to be a long list of special cases, and this doesn't even touch ones with >2 characteristics. There are >4000 factorvalues in the latter category. (>20,000 have exactly two; about 94,000 have just one, thankfully).
I guess we can mark this as completed!
When we originally designed the curation tools, we experimented with allowing more complex relationships among terms used to describe factorvalues. But the implied complexity of the user interface, and the lack of sufficient motivating use cases, led us to compromise on a bag-of-words approach.
To give one relevant situation, factor values describing application of a drug might have two Characteristics: one CHEBI term for the drug (category=treatment), and a second free-text term describing the dosage (category=dose). (It was another compromise to not have a more formal description of dosages, and I am not proposing any change there; we do have guidelines for how to standardize the free text.)
However, this creates some complications for downstream use. While a human can figure out what is intended, the lack of a formal relationship between the terms can create ambiguity in dealing with the separate terms in computational analyses. And it is just unaesthetic.
I am using this issue as a forum to discuss how we could improve things.
It is important that we don't add substantial curation burden, or negatively impact usability of the curation interface. Any solution has to be minimally intrusive and work within the framework we have (I don't want to have to rewrite the entire curation interface).
We would need a way to backport existing bags-of-words with minimal manual work (there are at least 9000 drug dosage statements in the system, for example).
A minimal necessity of a solution would be a way to connect Characteristics with each other more directly than by being related to the same FactorValue. We would need a set of semantic relations that can be added. Terms from the Relation Ontology might be considered, though I'm not sure it actually covers what we need.
Keeping this simple (constraining it to just one-to-one relations rather than arbitrarily complex graphs) would be a starting point. We would need specific use cases to motivate anything more complicated.
The curator could be offered a button to associate a term with another, and would get prompted to select the other term and the type of relationship. Trying to infer these automatically to limit the number of mouse clicks would be smart. For example, if there is a treatment and a dose, the relationship should be proposed automatically.
The curator interface needs to indicate the existence of the relationship.
Outside of the curation interface, terms bound by these relationships would be displayed together in the GUI and also grouped in serialized forms. We would use rules entailed by the relationships to do this, so we get "drug dose" not "dose drug".
Specific use scenarios (please add to this list): (the relationship concepts I am mentioning do not necessarily exist in any ontology)