PavlidisLab / Gemma

Genomics data re-analysis
Apache License 2.0

Support richer semantics for some factorvalue annotations #705

Closed ppavlidis closed 11 months ago

ppavlidis commented 1 year ago

When we originally designed the curation tools, we experimented with allowing more complex relationships among the terms used to describe factorvalues. But the implied complexity of the user interface, and the lack of sufficiently motivating use cases, led us to compromise on a bag-of-words approach.

To give one relevant situation, factor values describing application of a drug might have two Characteristics: one CHEBI term for the drug (category=treatment), and a second free-text Characteristic describing the dosage (category=dose). (It was another compromise not to have a more formal description of dosages, and I am not proposing any change there; we do have guidelines for how to standardize the free text.)

However, this creates some complications for downstream use. While a human can figure out what is intended, the lack of a formal relationship between the terms can create ambiguity in dealing with the separate terms in computational analyses. And it is just unaesthetic.

I am using this issue as a forum to discuss how we could improve things.

Specific use scenarios (please add to this list): (the relationship concepts I am mentioning do not necessarily exist in any ontology)

  1. Drug-dose (e.g. aspirin 1mg/kg) "has-dose" relationship
  2. Gene-genotype (e.g. P53 loss of function) "has-genotype" or "has-manipulation"
  3. Antibody-target (e.g. antibody targeting P53) "has-target" or "binds"?
  4. Cell-state (e.g. activated T-cell - this isn't the best example as there are CL terms but there are other situations that aren't covered) "has-state"
  5. Organismpart-region (e.g. anterior brain stem; again, if there isn't an existing term) "part-of"?
  6. Cancer-metastasis - (e.g. metastasis to kidney) because there aren't always terms for this, which is annoying.
  7. Organismpart-modifier (e.g. liver damage, though this might be too rare to need handling)
  8. Gene-gene fusions
  9. Resistance to a drug - [interesting case](https://gemma.msl.ubc.ca/experimentalDesign/showExperimentalDesign.html?eeid=7062)
  10. Treatment with (or of) an organism part that wasn't itself used for sampling of RNA - see here for discussion. e.g. "clamping of aorta" or "exposure of cell to serum".

arteymix commented 1 year ago

From what I posted on Slack:

Instead of a bag of words, what about supporting ordered terms to construct (subject, predicate, object) triplets? These would fit in the current data model if we add an order to the factor value's characteristics collection.

Fixing existing bags would be complicated because we would need a way to tell which term is a subject and which term is an object.

For backward compatibility, we could add a flag to the factor value to indicate whether it has triplet semantics attached.

It could scale beyond a single triplet, but is that even a thing we need to consider?
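
A minimal sketch of the ordered-terms idea, assuming the predicate is stored as one of the ordered terms and that every three consecutive terms form one triplet. The class `FactorValue`, the field `has_triplet_semantics` and the function `as_triplets` are hypothetical names for illustration, not Gemma's actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FactorValue:
    # Ordered characteristic values; in a legacy bag the order carries no meaning.
    characteristics: List[str] = field(default_factory=list)
    # Hypothetical backward-compatibility flag: False means "legacy bag of words".
    has_triplet_semantics: bool = False


def as_triplets(fv: FactorValue) -> List[Tuple[str, str, str]]:
    """Read (subject, predicate, object) triplets off the ordered terms."""
    if not fv.has_triplet_semantics:
        return []  # legacy bag: no relationships can be inferred from order
    terms = fv.characteristics
    if len(terms) % 3 != 0:
        raise ValueError("ordered terms must come in groups of three")
    return [(terms[i], terms[i + 1], terms[i + 2]) for i in range(0, len(terms), 3)]


legacy = FactorValue(["aspirin", "1mg/kg"])
ordered = FactorValue(["aspirin", "has-dose", "1mg/kg"], has_triplet_semantics=True)
print(as_triplets(legacy))
print(as_triplets(ordered))
```

Grouping by threes is only one way to scale beyond a single triplet; it keeps the existing collection but makes the order load-bearing, which is why the flag is needed for old records.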

ppavlidis commented 1 year ago

Yeah, the datamodel part isn't hard to envision. I am more concerned about how to make this work in the UI (and the overall cost-benefit of setting this up). The curators' thoughts on this have been solicited.

As for backporting: Since there are a large number of individual cases covered by a small number of patterns, I suspect that doing provisional mapping with manual screening will get us most of the way there. For example, the category Dose with free text occurring with a Treatment from CHEBI is one such rule.
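
The Dose/Treatment rule could be sketched roughly as follows. This is an illustration under assumptions: `Characteristic`, `propose_dose_link` and the "has-dose" label are hypothetical, and the rule only fires when exactly one CHEBI-backed treatment and one free-text dose are present, leaving everything else for manual screening.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

CHEBI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"


class Characteristic(NamedTuple):
    category: str
    value: str
    value_uri: Optional[str] = None  # None means free text


def propose_dose_link(
    chars: Sequence[Characteristic],
) -> Optional[Tuple[Characteristic, str, Characteristic]]:
    """Return a provisional (treatment, 'has-dose', dose) mapping, or None."""
    if len(chars) != 2:
        return None  # larger bags are left for manual curation
    treatments = [
        c for c in chars
        if c.category.lower() == "treatment"
        and (c.value_uri or "").startswith(CHEBI_PREFIX)
    ]
    doses = [c for c in chars if c.category.lower() == "dose" and c.value_uri is None]
    if len(treatments) == 1 and len(doses) == 1:
        return (treatments[0], "has-dose", doses[0])
    return None


# Illustrative input; the URI is only an example value.
aspirin = Characteristic("treatment", "aspirin", CHEBI_PREFIX + "15365")
dose = Characteristic("dose", "1mg/kg")
print(propose_dose_link([dose, aspirin]))
```

Returning `None` rather than guessing keeps the mapping provisional: anything the rule does not match stays in the bag until a curator looks at it.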

Of course, the possibility of being able to apply such rules raises the idea of just doing this 'post-hoc' without any curation or data model change (i.e., just change the way the data are presented on the fly), but it would be better to get it done properly.

What I would suggest for the data model is quite a bit like your idea but a little simpler: a Characteristic can have an associated "modifier" Characteristic, and the nature of the modification could be enumerated as part of the Characteristic (if the semantics of the modifier actually matter). That's somewhat constraining (only one modifier, though they could be chained...), but it seems likely to satisfy the main use case. Example: Treatment DrugX is modified by Dose Yug/kg. That binds them together and also enforces an ordering for display ("Yug/kg DrugX"). I could well be ignoring some important angle but I'd start there; counter-examples that would motivate something more are welcome.
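
To make the modifier idea concrete, here is a small sketch. The field names `modifier` and `modifier_relation` and the function `display` are assumptions for illustration; the only behaviour taken from the comment above is that a Characteristic has at most one modifier, that modifiers can be chained, and that the modifier is displayed first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Characteristic:
    category: str
    value: str
    # At most one modifier; a modifier may itself carry a modifier (chaining).
    modifier: Optional["Characteristic"] = None
    # Optional label for the nature of the modification, e.g. "has-dose".
    modifier_relation: Optional[str] = None


def display(c: Optional[Characteristic]) -> str:
    """Render a characteristic with its modifiers first, e.g. 'Yug/kg DrugX'."""
    parts = []
    while c is not None:
        parts.append(c.value)
        c = c.modifier
    return " ".join(reversed(parts))


drug = Characteristic(
    category="treatment",
    value="DrugX",
    modifier=Characteristic(category="dose", value="Yug/kg"),
    modifier_relation="has-dose",
)
print(display(drug))
```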

ppavlidis commented 1 year ago

The situation is usually going to be simple if there are <3 characteristics for a given factorvalue.

The trivial case is N(characteristics) = 1, not discussing that. For N(char) = 2, this will typically be of the form "thing" and "more information about that thing", so it is not so difficult to handle.

When there are 3 (or more), the relationships among the terms become increasingly hard to generalize.

There are currently over 4000 factor values that have more than 2 characteristics. About 1200 have more than 3. 175 have more than 5. So this is not a rare situation.

In many cases, these can be considered as two parallel one-to-one relations like "drug A at this dose AND also drug B at this other dose".

Some extreme cases: GSE10784: a very diligent curator annotated all the genes in a deleted chromosome region as characteristics of that factorvalue AND also put a free-text annotation for the deletion. So there are 29 characteristics in this factor value. We are not consistent in our annotation of things like this! "Trisomy 21" does not follow with a list of hundreds of genes on chromosome 21 that are duplicated. This is similar as is this one - especially the latter seems like it really is appropriate, though.

arteymix commented 1 year ago

I'm currently exploring the possibility of using OWL/RDF for annotating a factor value. I quickly came to the realization that what we are doing is essentially instantiating classes (i.e. a treatment, a genotype, a gene fusion, etc.) and defining a bunch of relations between the resulting instances.

The terms that act as "modifiers" can be replaced by object properties from the Relation Ontology. For example, "treatment with dose doxycycline":

The classes we need to instantiate: treatment, doxycycline and "5mg" (as a free-text term).

Or a gene fusion:

The classes we have to instantiate here: genotype, gene fusion, gene A, gene B and gene C.

Needless to say, all the examples above are a natural fit for RDF/OWL. The annotation could be stored as an XML blob in the FACTOR_VALUE table, and Jena could be used to generate it and later on parse it. It's also possible to store the triplets in a relational database:

create table FACTOR_VALUE_TRIPLETS (
   -- factor value that owns the triplet
   FACTOR_VALUE_FK int,
   -- subject and object are characteristics of that factor value
   SUBJECT_FK int references CHARACTERISTIC(ID),
   PREDICATE_URI varchar(255),
   OBJECT_FK int references CHARACTERISTIC(ID),
   UNIQUE(FACTOR_VALUE_FK, SUBJECT_FK, PREDICATE_URI, OBJECT_FK)
);
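
As a rough illustration of how such a table could be used, the sketch below loads the triplet table into an in-memory SQLite database and stores a gene fusion of three genes. SQLite, the minimal `CHARACTERISTIC` table with a `VALUE` column, the numeric IDs and the helper `readable_triplets` are all assumptions made for the example; the predicate URI is the one used in the RDF example in this thread.

```python
import sqlite3

SCHEMA = """
create table CHARACTERISTIC (ID integer primary key, VALUE varchar(255));
create table FACTOR_VALUE_TRIPLETS (
   FACTOR_VALUE_FK int,
   SUBJECT_FK int references CHARACTERISTIC(ID),
   PREDICATE_URI varchar(255),
   OBJECT_FK int references CHARACTERISTIC(ID),
   UNIQUE(FACTOR_VALUE_FK, SUBJECT_FK, PREDICATE_URI, OBJECT_FK)
);
"""

PREDICATE = "http://purl.obolibrary.org/obo/RO_0002374"

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Hypothetical characteristics belonging to factor value 1.
conn.executemany(
    "insert into CHARACTERISTIC (ID, VALUE) values (?, ?)",
    [(1, "gene fusion"), (2, "gene A"), (3, "gene B"), (4, "gene C")],
)
# One triplet per fused gene: (gene fusion, predicate, gene X).
conn.executemany(
    "insert into FACTOR_VALUE_TRIPLETS values (?, ?, ?, ?)",
    [(1, 1, PREDICATE, obj) for obj in (2, 3, 4)],
)


def readable_triplets(connection, factor_value_id):
    """Resolve subject/object foreign keys to characteristic values."""
    return connection.execute(
        """
        select s.VALUE, t.PREDICATE_URI, o.VALUE
        from FACTOR_VALUE_TRIPLETS t
        join CHARACTERISTIC s on s.ID = t.SUBJECT_FK
        join CHARACTERISTIC o on o.ID = t.OBJECT_FK
        where t.FACTOR_VALUE_FK = ?
        order by o.ID
        """,
        (factor_value_id,),
    ).fetchall()


for row in readable_triplets(conn, 1):
    print(row)
```

The UNIQUE constraint means the same triplet cannot be recorded twice for one factor value; a second identical insert would fail with an integrity error.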

https://github.com/PavlidisLab/Gemma/blob/f1eb24b63b9aac2f7f806f7938cc9b6c166b1aa1/gemma-core/src/test/java/ubic/gemma/core/ontology/FactorValueAsOntologyTest.java#L30-L45

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://purl.obolibrary.org/obo/"
    xmlns:j.1="http://www.ebi.ac.uk/efo/"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:nodeID="A0">
    <j.0:RO_0002374>gene C</j.0:RO_0002374>
    <j.0:RO_0002374>gene B</j.0:RO_0002374>
    <j.0:RO_0002374>gene A</j.0:RO_0002374>
    <rdf:type rdf:resource="http://purl.obolibrary.org/obo/SO_0001565"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://gemma.msl.ubc.ca/ont/FV_000001">
    <j.0:RO_0002200 rdf:nodeID="A0"/>
    <rdf:type rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000513"/>
  </rdf:Description>
</rdf:RDF>

This can be mapped to the UI by allowing curators to specify triplets. The objects and subjects are all terms in the bag, so there are only a few of them to propose. The predicate could be an autocomplete that would be pre-populated with RO terms.

It might be annoying to have to do that for each factor value though. We could implicitly handle some basic cases based on the classes of the terms.

arteymix commented 1 year ago

The has modifier relation might be a good generic way of associating a term to a "modifier".

arteymix commented 1 year ago

To make this more concrete, the RDF example above would have the following triplets:

image

C_{id} and FV_{id} refer to in-database records. If the triplets were stored in the database, foreign keys to the characteristics would be used.
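
The triplets implied by the RDF/XML example earlier in this thread can also be enumerated mechanically. The sketch below uses only the Python standard library and handles just the flat `rdf:Description` form shown in that example; it is not a general RDF/XML parser, and a library such as Jena would be the realistic choice. The function name `triples` is hypothetical, and blank nodes are written as `_:A0` rather than as C_{id} database identifiers.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="http://purl.obolibrary.org/obo/"
    xmlns:j.1="http://www.ebi.ac.uk/efo/">
  <rdf:Description rdf:nodeID="A0">
    <j.0:RO_0002374>gene C</j.0:RO_0002374>
    <j.0:RO_0002374>gene B</j.0:RO_0002374>
    <j.0:RO_0002374>gene A</j.0:RO_0002374>
    <rdf:type rdf:resource="http://purl.obolibrary.org/obo/SO_0001565"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://gemma.msl.ubc.ca/ont/FV_000001">
    <j.0:RO_0002200 rdf:nodeID="A0"/>
    <rdf:type rdf:resource="http://www.ebi.ac.uk/efo/EFO_0000513"/>
  </rdf:Description>
</rdf:RDF>"""


def triples(xml_text):
    """List (subject, predicate, object) for each property of each rdf:Description."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about")
        subject = about if about is not None else "_:" + desc.get(f"{{{RDF_NS}}}nodeID")
        for prop in desc:
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            predicate = prop.tag[1:].replace("}", "")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            node_id = prop.get(f"{{{RDF_NS}}}nodeID")
            if resource is not None:
                obj = resource
            elif node_id is not None:
                obj = "_:" + node_id
            else:
                obj = prop.text  # plain literal such as "gene A"
            result.append((subject, predicate, obj))
    return result


for triple in triples(RDF_XML):
    print(triple)
```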

arteymix commented 1 year ago

I created a repository to work on a UI prototype which can be previewed at https://pavlidislab.github.io/gemma-fv-ui-prototype/.

image

ppavlidis commented 1 year ago

I've been investigating the challenge of resolving these types of relationships for factorvalues that are already in the system.

For a somewhat random sample of experiments, when a factorvalue has just two characteristics, I can resolve this reasonably in most, but not all cases.

The problem is the factorvalues that have more than two characteristics, and the ones which are not easily resolved.

Examples: (the formatting is an artifact of my test code)

Factorvalue has two OrganismParts

FactorValue 204594: EFO_0000635:somatosensory cortex | motor cortex | 
 - c - organism part: somatosensory cortex http://purl.obolibrary.org/obo/UBERON_0008930
 - c - organism part: motor cortex http://purl.obolibrary.org/obo/UBERON_0001384

Confusing mixtures of things with unclear relationships

FactorValue 119735: treatment:schistosomiasis | pulmonary hypertension | 
 - c - treatment: schistosomiasis http://purl.obolibrary.org/obo/MONDO_0015254
 - c - disease: pulmonary hypertension http://purl.obolibrary.org/obo/MONDO_0005149

FactorValue 129217: treatment:glial cell | C57BL/6 | 
 - c - treatment: glial cell http://purl.obolibrary.org/obo/CL_0000125
 - c - strain: C57BL/6 http://gemma.msl.ubc.ca/ont/TGEMO_00016

Controls that are described with two characteristics. This may be addressable.

FactorValue 205437: genotype:wild type genotype | control | 
 - c - genotype: wild type genotype http://www.ebi.ac.uk/efo/EFO_0005168
 - c - genotype: control http://www.ebi.ac.uk/efo/EFO_0001461

Organism part with a location modifier, but determining this programmatically may not be easy

FactorValue 137484: organism part:Ammon's horn | dorsal | 
 - c - organism part: Ammon's horn http://purl.obolibrary.org/obo/UBERON_0001954
 - c - organism part: dorsal http://www.ebi.ac.uk/efo/EFO_0001656

Basically this is going to be a long list of special cases, and this doesn't even touch ones with >2 characteristics. There are >4000 factorvalues in the latter category. (>20,000 have exactly two; about 94,000 have just one, thankfully).

arteymix commented 11 months ago

I guess we can mark this as completed!