SciGraph / golr-loader

Convert SciGraph queries into json that can be loaded by Golr
Apache License 2.0

Indexing behavior when a sub-object pair is linked by multiple relations #35

Open kshefchek opened 7 years ago

kshefchek commented 7 years ago

Consider the following pattern:

(subject:gene)<-[has_locus]-(variant)-[relation]->(object:disease)

Where relation is one of:

  1. pathogenic
  2. likely pathogenic
  3. has phenotype
  4. marker/mechanism
  5. contributes to ...

In many cases, multiple variants of a single gene are linked to a disease via multiple relations (commonly pathogenic and likely pathogenic). Currently, the Solr loader seems to pick a relation at random (although it may in fact be deterministic for a given database).

This is also an issue when combining orthology statements from multiple sources (PANTHER and ZFIN), where PANTHER specifies whether two orthologs have a one-to-one relationship whereas ZFIN does not.

One option is to store the set of relations linking two nodes. Another option would be to configure a relation priority, where the relation with the highest priority is designated while the others are retrievable via the evidence graph.
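As a rough illustration of the two options, here is a minimal Python sketch (golr-loader itself is not written this way). It groups edges by subject-object pair, keeps the full set of relations, and designates one relation using a priority list. The function name `collapse_relations`, the document field names, and the priority order are all assumptions made for this example, not the loader's actual schema or configuration.

```python
from collections import defaultdict

# Hypothetical priority order (earlier = higher priority); not taken from golr-loader.
RELATION_PRIORITY = [
    "pathogenic",
    "likely pathogenic",
    "marker/mechanism",
    "has phenotype",
]


def _rank(relation):
    """Rank a relation by priority; unknown relations sort last, then alphabetically."""
    if relation in RELATION_PRIORITY:
        return (RELATION_PRIORITY.index(relation), relation)
    return (len(RELATION_PRIORITY), relation)


def collapse_relations(edges):
    """Collapse (subject, relation, object) triples into one document per pair.

    Each document stores the designated relation (option 2) alongside the
    full, sorted set of relations linking the pair (option 1).
    """
    grouped = defaultdict(set)
    for subject, relation, obj in edges:
        grouped[(subject, obj)].add(relation)

    docs = []
    for (subject, obj), relations in sorted(grouped.items()):
        docs.append({
            "subject": subject,
            "object": obj,
            "relation": min(relations, key=_rank),  # deterministic choice
            "relations": sorted(relations),
        })
    return docs
```

Because the designated relation comes from an explicit ranking with an alphabetical tie-break, the result no longer depends on the order in which edges are returned.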

@mbrush @selewis @cmungall thoughts?

cmungall commented 5 years ago

Why not just make different associations? Doesn't each have its own evidence/provenance etc?

kshefchek commented 5 years ago

@cmungall could you clarify your suggestion? One document per association could lead to a lot of additional documents since we infer across variants; some genes have a lot of causal variants for a disease (e.g. BRCA). One document per relation is possible, but IMO we'll still be showing too much duplication to the user (or operating on it in ontobio).

As a potential workaround for G2D, I have split up causal vs non-causal associations. This way they can be displayed separately to our end users. The downside is that there will be some redundancy between the two gene-disease lists, as CTD and Coriell will often report the causal gene in addition to those with more hypothetical evidence.

causal g2d

hypothetical g2d - gwas, ctd, coriell

cmungall commented 5 years ago

I think your solution is along the right lines. Having a smaller set of relationship types, where we separate evidence from relation ("likely pathogenic" should not be a relation), should in theory mean that high-quality resources do not generally conflict.

kshefchek commented 5 years ago

The relation that maps to ACMG likely_pathogenic is all in yaml file(s), so it's an easy change when we're ready.

Thinking about this from the UI perspective, should we have one list of causal genes, and one list of all genes so that the latter list fully subsumes the list of causal genes (instead of partially overlapping sets)?
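To make the "fully subsumes" idea concrete, a minimal Python sketch follows. It assumes each association has already been reduced to a `(gene, is_causal)` pair; that input shape and the name `gene_lists` are hypothetical and only serve the illustration.

```python
def gene_lists(associations):
    """Build a causal gene list and an all-genes list from (gene, is_causal) pairs.

    Every causal gene also appears in the all-genes list, so the second list
    fully subsumes the first instead of partially overlapping it.
    """
    associations = list(associations)  # allow generators; we iterate twice
    # dict.fromkeys de-duplicates while preserving first-seen order
    all_genes = list(dict.fromkeys(gene for gene, _ in associations))
    causal_genes = list(
        dict.fromkeys(gene for gene, is_causal in associations if is_causal)
    )
    return causal_genes, all_genes
```

A gene reported as causal by one source and as hypothetical by another then shows up once in each list, rather than being duplicated across two disjoint-looking tables.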

cmungall commented 5 years ago

I don't have strong opinions about the UI so long as it's clear.

I had envisioned the disease page showing the causal gene prominently (first entry in the table, if we have a table view), with the others beneath it.
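A small Python sketch of that ordering, assuming each table row is a dict with a boolean `causal` key (a hypothetical shape, not the actual UI data model):

```python
def order_for_display(rows):
    """Return rows with causal genes first.

    sorted() is stable, so rows keep their original relative order within
    the causal group and within the non-causal group.
    """
    # False sorts before True, so `not row["causal"]` puts causal rows first
    return sorted(rows, key=lambda row: not row["causal"])
```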


monicacecilia commented 5 years ago

Adding a little reminder that Chris' suggestion is still not implemented. Instead, we have a list of all genes, and the causal gene in this, our favorite example, shows up 6th on the list.

[Screenshot: Screen Shot 2019-07-01 at 5 43 17 PM]