Closed lindajiawenli closed 1 year ago
@jvwong Here's a TL;DR version of the notes I sent you:
Make sure to consider the ability to reconstruct the left side from the right side, if you decide to store the data this way (e.g. for visualisation use cases).
A ‘complex edge’, as you’ve defined it, lets you know there is a complex, but it is not sufficient on its own to account for, for example, the inter-complex edges in your second example. How do you know the edges are between the two complexes and not between the individual proteins?
@maxkfranz You're right, thank you, that's a good point.
I'm thinking I can add a property to all edges, `participantTypes`, that is:
- 'noncomplex-to-noncomplex' by default
- 'complex-to-complex' when both participants are complexes
- 'complex-to-noncomplex' when the source is a complex
- 'noncomplex-to-complex' when the source is a non-complex

Jeff and I discussed on Wednesday that the "hairball" method of exploding complexes is probably the best way to handle things, even if it is pretty inefficient. It's especially helpful because sometimes there are interactions between molecules in the complex like this:
and molecules in complexes that individually interact with other molecules:
Making the grounding id of Complex edges a unique UUID means that if there exist two different documents that involve the same complex (call it `complex-1`, a complex consisting of `entityA`, `entityB` and `entityC`), there are going to be duplicate edges:
That is, the orange and pink edges will hold the exact same information.
The id of Complex edges representing `complex-1` will be `"entityA+entityB+entityC"` instead of a UUID. This id will be made using some stable ordering (e.g. alphabetical order of entity ids).
A Complex edge with `id: 'A+B+C'`, `sourceId: 'A'`, `targetId: 'B'`, `allParticipants: ['A', 'B', 'C']` will be made if there exists no Complex edge between A and B with the id 'A+B+C'.
Q: Doesn't the id have to be unique?
A: No, not necessarily. The ids of Interaction edges are unique because we don't want to lose information: if one doc talks about interaction W between geneX and geneY and another doc talks about interaction G between geneX and geneY, we want two edges, one with information on interaction W and one on interaction G.
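The id scheme above can be sketched as follows. This is only an illustration, assuming an in-memory dict stands in for the database and that "between A and B" means the same source and target; `complex_edge_id` and `ensure_complex_edge` are hypothetical helper names, not functions from the project.

```python
def complex_edge_id(participant_ids):
    """Build an order-independent id such as 'A+B+C' from the member ids."""
    return "+".join(sorted(set(participant_ids)))


def ensure_complex_edge(edges, source_id, target_id, participant_ids):
    """Add a Complex edge only if the same (id, source, target) is absent.

    `edges` is a plain dict used as a stand-in for the graph database.
    """
    edge_id = complex_edge_id(participant_ids)
    key = (edge_id, source_id, target_id)
    if key not in edges:
        edges[key] = {
            "id": edge_id,
            "sourceId": source_id,
            "targetId": target_id,
            "allParticipants": sorted(set(participant_ids)),
        }
    return edges[key]


edges = {}
# Two documents mention the same complex, listing its members in a different order.
ensure_complex_edge(edges, "A", "B", ["A", "B", "C"])
ensure_complex_edge(edges, "A", "B", ["C", "A", "B"])
print(len(edges))                        # 1
print(complex_edge_id(["C", "B", "A"]))  # A+B+C
```

Because the id is derived from the sorted members, the second document does not create a duplicate edge.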
What would the "after" case for the example network look like? Also, do we need to have two edges for each relation?
@maxkfranz You're right, thank you, that's a good point.
I'm thinking I can add a property to all edges, `participantTypes`, that is:
- 'noncomplex-to-noncomplex' by default
- 'complex_A_Id-to-complex_B_Id' when the interaction is between complex A and complex B
- 'complex_A_Id-to-noncomplex' when the interaction is between complex A and a non-complex
- 'noncomplex-to-complex_A_Id' when the interaction is between a non-complex and complex A

> What would the "after" case for the example network look like? Also, do we need to have two edges for each relation?
@jvwong
I suppose we could have something like this instead, since my `searchByMoleculeId()` returns all edges regardless of direction:
Would need to be careful sorting the participants in a particular order and making the edges in a particular order, otherwise the following three would be considered different complexes:
The network might look something like this:
Orange edges are Interaction edges, Pink Edges are Complex edges.
Pink edges for the complex with molecules A, B and C would have:
- `id`: 'ncbigene:A+ncbigene:B+ncbigene:C'
- `allParticipants`: [ncbigene:A, ncbigene:B, ncbigene:C]

Orange edges for the top interaction would have:
- all the regular information for an edge that you'd expect
- `participantTypes`: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-noncomplex'

Orange edges for the bottom interaction would have:
- all the regular information for an edge that you'd expect
- `participantTypes`: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-ncbigene:X+ncbigene:Y'

etc.
Note: ncbigene OR chebi, I just type ncbigene because I'm lazy
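For illustration, the `participantTypes` values described above could be composed like this. This is a rough sketch, assuming each complex is passed in as a list of its member ids; the function name `participant_types` is made up for the example.

```python
def participant_types(source_complex=None, target_complex=None):
    """Compose a participantTypes string from optional complex member lists.

    Pass None (or an empty list) for a participant that is not a complex.
    """
    def label(members):
        # A complex is labelled by its sorted member ids joined with '+'.
        return "+".join(sorted(members)) if members else "noncomplex"

    return f"{label(source_complex)}-to-{label(target_complex)}"


abc = ["ncbigene:A", "ncbigene:B", "ncbigene:C"]
xy = ["ncbigene:X", "ncbigene:Y"]

print(participant_types())         # noncomplex-to-noncomplex
print(participant_types(abc))      # ncbigene:A+ncbigene:B+ncbigene:C-to-noncomplex
print(participant_types(abc, xy))  # ncbigene:A+ncbigene:B+ncbigene:C-to-ncbigene:X+ncbigene:Y
```

Anyone reading the value back has to split on '-to-' and '+', so the ids themselves must never contain those separators.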
Comments around naming:
- `component: []` over `allParticipants: []`
- `searchByMoleculeId(id)` could be `neighbourhood(id)`, but not a big deal.

Food for thought: Let's try to simplify as much as possible.
(1) The edges that indicate the existence of a complex need only be undirected. E.g. complex ABC only needs three edges (i.e. A-B, A-C, B-C). The direction of those three edges doesn't matter, since they're undirected. Having both A->B and B->A isn't more meaningful.
(2) Instead of a single combinatorial enum with strings you have to parse, it's probably simpler to separate things out. One field, one thing.
You already have:
You could simply add these for when at least one participant is a complex:
You can use another field for the fake edges that simulate a compound node itself:
You need to use the factoid complex node ID for these sorts of fields, because it's unique. The point of storing this extra information is to be able to merge the fake edges together so you could reconstruct the original left-side picture.
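A minimal sketch of that merge step, assuming each fake edge carries the factoid complex node id in a field that is called `complexId` here purely for illustration (the actual field name is not given in this thread):

```python
from collections import defaultdict


def rebuild_complexes(edges):
    """Group fake edges by complex id and collect the members of each complex."""
    members = defaultdict(set)
    for edge in edges:
        complex_id = edge.get("complexId")  # hypothetical field name
        if complex_id is not None:
            members[complex_id].update((edge["sourceId"], edge["targetId"]))
    return {cid: sorted(ids) for cid, ids in members.items()}


edges = [
    {"sourceId": "A", "targetId": "B", "complexId": "uuid-1"},
    {"sourceId": "A", "targetId": "C", "complexId": "uuid-1"},
    {"sourceId": "B", "targetId": "C", "complexId": "uuid-1"},
    {"sourceId": "A", "targetId": "X", "complexId": None},  # ordinary interaction
]

print(rebuild_complexes(edges))  # {'uuid-1': ['A', 'B', 'C']}
```

Grouping only works if the id is unique per complex, which is why the factoid complex node id is the natural key.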
Jeff's motivating examples don't strictly need any special handling, since they're already between particular proteins (i.e. no compound/complex node is a direct participant). If you really wanted, you could add two more fields to be explicit that the protein is within a complex:
Or we could not store any complex-specific information at all. Just keep in mind that the data would only be useful for non-visualisation use cases, and it may block some other use cases as well. No one wants to see hairballs. For a lot of app use cases, you don't need to see the network anyway: You just want something like 'give me the top 10 genes that interact with X (and maybe new papers about them)'. For the claims comparison use case, being able to reconstruct the original information may be important.
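The 'top 10 genes that interact with X' style of query needs nothing complex-specific. A rough sketch, assuming edges are plain dicts with `sourceId` and `targetId` and that "top" means "most edges shared with X" (both are assumptions made for this example); `neighbourhood` mirrors the direction-agnostic lookup discussed earlier in the thread:

```python
from collections import Counter


def neighbourhood(edges, molecule_id):
    """Return every edge touching molecule_id, regardless of direction."""
    return [e for e in edges if molecule_id in (e["sourceId"], e["targetId"])]


def top_interactors(edges, molecule_id, limit=10):
    """Rank the neighbours of molecule_id by how many edges they share with it."""
    counts = Counter()
    for edge in neighbourhood(edges, molecule_id):
        if edge["sourceId"] == molecule_id:
            other = edge["targetId"]
        else:
            other = edge["sourceId"]
        if other != molecule_id:  # ignore self-loops
            counts[other] += 1
    return counts.most_common(limit)


edges = [
    {"sourceId": "X", "targetId": "A"},
    {"sourceId": "A", "targetId": "X"},
    {"sourceId": "X", "targetId": "B"},
    {"sourceId": "A", "targetId": "B"},
]

print(top_interactors(edges, "X"))  # [('A', 2), ('B', 1)]
```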
We mentioned this last week - The emphasis should be discovering information, particularly across different documents. With that in mind, we shouldn't try to provide both a completely detailed/faithful representation and useful search.
@maxkfranz
> Food for thought: Let's try to simplify as much as possible.
> (1) The edges that indicate the existence of a complex need only be undirected.

Unfortunately Neo4j does not support undirected edges (all edges must have exactly one direction), otherwise I would do that in a heartbeat.

> (2) Instead of a single combinatorial enum with strings you have to parse, it's probably simpler to separate things out.

Just to be clear, you meant something like this?

@jvwong
> The emphasis should be discovering information, particularly across different documents. With that in mind, we shouldn't try to provide both a completely detailed/faithful representation and useful search.

@maxkfranz
> Or we could not store any complex-specific information at all

Does this mean I should do away with the pink edges/Complex edges? What about the documents that are just complexes? Ex.
(1) If it supports directed edges, then it supports undirected edges. An undirected edge is just a directed edge where you don't care about the direction, like a don't care in circuits.
(2) If we don't care about visualisation or claims for this project, then you can still simplify much more than what you've all outlined before.
Don't store any extra fields at all. No component. No participant types. Nothing. Call all of the magenta edges just regular 'binding' interactions. Call the orange edges basically just combinatorial duplicates of the original. And then call it a day.
@lindajiawenli, your ZHX2-HIF1a example would then just be three elements in the db:
Okay, I've drafted a new plan to deal with complexes using all the feedback I've received thus far. Please let me know if there are any changes I should make.
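| Case # | Biofactoid Visualization | Proposed Neo4j Description | Description |
|---|---|---|---|
| 1 | ![image](https://user-images.githubusercontent.com/66929920/226441790-61da0495-16a8-4f5c-a1d5-ed91ac0f760d.jpg) | ![image](https://user-images.githubusercontent.com/66929920/226442548-b9e01fbd-5992-4d80-acbc-f4fc4d801fcc.jpg) | Edges (1) and (2) have `id: factoidUUID-of-complex`. They have `type: 'complex'` instead of the usual 'phosphorylation', 'binding' etc. for non-complexes. Edge (1) has `components: [ ncbigene:A, ncbigene:B ]` instead of `null` for non-complexes. Similarly for Edge (2). |
| 2 | ![image](https://user-images.githubusercontent.com/66929920/226444196-3858475c-9014-4256-8f8f-2c4ad5a031cf.jpg) | ![image](https://user-images.githubusercontent.com/66929920/226444364-fdd3e0f3-7d13-499b-946c-a9e22d9522e3.jpg) | Edge (1) is a typical interaction edge: `component: null` and `type: 'phosphorylation'` or similar. See above for the description of Edge (2). |
| 3 | ![image](https://user-images.githubusercontent.com/66929920/226444982-2a06bd61-0124-4b90-9ab7-56ffc32cff72.jpg) | ![image](https://user-images.githubusercontent.com/66929920/226445116-78a1573b-8a60-4898-8fc2-938927b8807c.jpg) | Edge (1) is a typical interaction edge: `component: null` and `type: 'phosphorylation'` or similar. |
| 4 | ![image](https://user-images.githubusercontent.com/66929920/226449429-bb0cdab7-ed79-4419-8d31-62eb8491e32c.jpg) | ![image](https://user-images.githubusercontent.com/66929920/226449650-5c37d3c2-9e4f-4dc3-8939-18a148a5e565.jpg) | Edges (1), (2) and (3) have `sourceComplex: factoidUUID-of-complex` and `targetComplex: null`. `sourceComplex: null` and `targetComplex: null` are the defaults. |
| 5 | ![image](https://user-images.githubusercontent.com/66929920/226445343-207a53db-f142-4b50-8f4c-3403861763b9.jpg) | ![image](https://user-images.githubusercontent.com/66929920/226445419-84eebf8f-4c1d-4567-952f-36c0cdc8884c.jpg) | I am debating whether to go with Option A or Option B. Option A will have far fewer edges, but I am concerned about putting too much "weight" on the X node. The edges going between the two complexes will have non-null `sourceComplex` and `targetComplex` fields that consist of the factoid UUIDs for the respective complexes. |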
I decided not to do away completely with the extra fields because I think it would be nice for the researchers to at least know that there is a complex when they search for Gene A etc., and what molecules that complex includes.
There is no longer a special "Complex" edge, though I must say it is easy for me to make them; let me know if you think they're worth it. They would also be helpful if we later want to implement a filter that says "Give me genes that are in complexes with Gene A" or similar. I don't know if that's something a user would ever want, though.
Number of edges has been cut considerably.
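To make cases 4 and 5 concrete, here is one way the member-level edges could be generated. It is a sketch only: participants are assumed to be dicts with an `id` and, for complexes, a `members` list (a made-up shape for this example), and the complex-to-complex case is assumed to connect every source member to every target member, which may not match either Option A or Option B exactly.

```python
from itertools import product


def explode_interaction(source, target, interaction_type):
    """Expand an interaction into member-level edges.

    A complex contributes one edge endpoint per member; a non-complex
    contributes only itself. sourceComplex/targetComplex stay None unless
    that side of the interaction is a complex.
    """
    source_members = source.get("members") or [source["id"]]
    target_members = target.get("members") or [target["id"]]
    return [
        {
            "sourceId": s,
            "targetId": t,
            "type": interaction_type,
            "sourceComplex": source["id"] if source.get("members") else None,
            "targetComplex": target["id"] if target.get("members") else None,
        }
        for s, t in product(source_members, target_members)
    ]


abc = {"id": "uuid-abc", "members": ["ncbigene:A", "ncbigene:B", "ncbigene:C"]}
xy = {"id": "uuid-xy", "members": ["ncbigene:X", "ncbigene:Y"]}
z = {"id": "ncbigene:Z"}

print(len(explode_interaction(abc, z, "phosphorylation")))   # 3
print(len(explode_interaction(abc, xy, "phosphorylation")))  # 6
```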
It’s great that you’re elaborating the different cases. We can discuss these tomorrow if you like.
Re. whether we support mapping back to the vis.: Six of one. There are pros and cons either way (e.g. maybe claims are better handled in a different db in a different biofactoid project). Either way, best to simplify where possible.
It would be useful to think about the different cases and how they might apply to different user scenarios.
Some more details can be found in this pdf: Hairball Complex Notes.pdf
Basic Idea
A complex will be represented by a "hairball" of nodes and edges. Suppose n is the number of elements in the complex. We will be making n nodes (if they do not yet exist) and n(n-1) Complex edges.
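As a quick sanity check of the edge count, the hairball can be sketched with `itertools`. The edge fields follow the `type`, `id` and `complexParticipants` fields listed below; `sourceId`, `targetId` and the function name `hairball_edges` are assumptions made for the example.

```python
from itertools import combinations, permutations


def hairball_edges(complex_id, member_ids):
    """Return one directed Complex edge per ordered pair of distinct members."""
    members = sorted(set(member_ids))
    return [
        {
            "type": "COMPLEX",
            "id": complex_id,
            "sourceId": source,
            "targetId": target,
            "complexParticipants": members,
        }
        for source, target in permutations(members, 2)
    ]


members = ["ncbigene:A", "ncbigene:B", "ncbigene:C"]
directed = hairball_edges("factoid-uuid", members)
undirected_pairs = list(combinations(sorted(members), 2))

print(len(directed))          # 6, i.e. n(n-1) for n = 3
print(len(undirected_pairs))  # 3, i.e. n(n-1)/2 if direction is ignored
```

Treating the edges as undirected, as suggested elsewhere in the thread, halves the count.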
What will a Complex edge look like in Neo4j?
- `type`: COMPLEX
- `id`: UUID from factoid (this is the grounding id)
- `complexParticipants`: an array of all the node ids (e.g. 'ncbigene:201') that are in the complex

What will a Complex interacting with a non-complex look like?
What will a Complex interacting with a Complex look like?