Neo4j: Handling Complexes

lindajiawenli commented 1 year ago

Some more details can be found in this pdf: Hairball Complex Notes.pdf

Basic Idea

A complex will be represented by a "hairball" of nodes and edges. Suppose n is the number of elements in the complex. We will be making n nodes (if they do not yet exist) and n(n-1) Complex edges.

Screen Shot 2023-03-17 at 9 38 42 AM

What will a Complex edge look like in Neo4j?

type: COMPLEX
id: UUID from factoid (this is the grounding id)
complexParticipants: an array of all the node ids (ex. 'ncbigene:201') that are in the complex
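
A rough sketch of the explosion step, in Python for illustration only (it is not taken from the factoid code base). The function name `explode_complex`, the dict layout, and the ids other than 'ncbigene:201' are assumptions; only `type`, `id` and `complexParticipants` come from the description above.

```python
from itertools import permutations


def explode_complex(complex_id, participant_ids):
    """Return one directed COMPLEX edge per ordered pair of participants."""
    participants = list(participant_ids)
    return [
        {
            "type": "COMPLEX",
            "id": complex_id,  # grounding id (UUID from factoid)
            "sourceId": source,
            "targetId": target,
            "complexParticipants": participants,
        }
        for source, target in permutations(participants, 2)
    ]


# Placeholder ids: a complex with n = 3 elements
edges = explode_complex("example-uuid", ["ncbigene:201", "ncbigene:202", "ncbigene:203"])
print(len(edges))  # 6, i.e. n(n-1)
```

Every edge carries the full participant list, so any single edge is enough to tell which complex it belongs to.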

What will a Complex interacting with a non-complex look like?

Screen Shot 2023-03-17 at 9 41 56 AM

What will a Complex interacting with a Complex look like?

Screen Shot 2023-03-17 at 9 42 46 AM

lindajiawenli commented 1 year ago

@jvwong Here's a TL;DR version of the notes I sent you

maxkfranz commented 1 year ago

Make sure to consider the ability to reconstruct the left side from the right side, if you decide to store the data this way (e.g. for visualisation use cases).

A ‘complex edge’, as you’ve defined it, lets you know there is a complex, but it is not sufficient on its own to account for, for example, the inter-complex edges in your second example. How do you know the edges are between the two complexes and not between the individual proteins?

lindajiawenli commented 1 year ago

@maxkfranz You're right, thank you. That's a good point.

I'm thinking I can add a property in all edges: participantTypes that is:

'noncomplex-to-noncomplex' by default
'complex-to-complex' when both participants are complexes
'complex-to-noncomplex' when the source is a complex
'noncomplex-to-complex' when the source is a non-complex

Jeff and I discussed on Wednesday that the "hairball" method of exploding complexes is probably the best way to handle things, even if it is pretty inefficient. It's especially helpful because sometimes there are interactions between molecules in the complex like this:

Screen Shot 2023-03-17 at 11 54 07 AM

and molecules in complexes that individually interact with other molecules:

Screen Shot 2023-03-17 at 11 55 49 AM

lindajiawenli commented 1 year ago

ISSUE:

Making the grounding id of Complex edges a unique UUID means that if 2 different documents involve the same complex (call it complex-1, consisting of entityA, entityB and entityC), there are going to be duplicate edges:

SmartSelect_20230320_093702_OneNote

That is, the orange and pink edges will hold the exact same information.

PROPOSED CHANGE:

The id of Complex edges representing complex-1 will be "entityA+entityB+entityC" instead of a UUID. This id will be built from the entity ids in a deterministic order (ex. alphabetical order of entity ids), so the same complex always gets the same id.

A Complex edge with id: 'A+B+C', sourceId: 'A', targetId: 'B', allParticipants: ['A', 'B', 'C'] will be made if there exists no Complex edge between A and B with the id 'A+B+C'.

Q: Doesn't the id have to be unique? A: No, not necessarily. The ids of Interaction edges are unique because we don't want to lose information: if one doc talks about interaction W between geneX and geneY and another doc talks about interaction G between geneX and geneY, we want 2 edges, one with information on interaction W and one with information on interaction G.
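
A minimal sketch of the proposed change, in Python for illustration only. The helper names `complex_edge_id` and `add_complex_edge` and the in-memory `store` dict are assumptions standing in for whatever the real Neo4j layer does; the field names `id`, `sourceId`, `targetId` and `allParticipants` are the ones used above.

```python
def complex_edge_id(participant_ids):
    """Build an order-independent id such as 'A+B+C' from the participant ids."""
    return "+".join(sorted(set(participant_ids)))


def add_complex_edge(store, source_id, target_id, participant_ids):
    """Add a Complex edge unless one with the same id already links source to target."""
    edge_id = complex_edge_id(participant_ids)
    key = (edge_id, source_id, target_id)
    if key not in store:
        store[key] = {
            "id": edge_id,
            "sourceId": source_id,
            "targetId": target_id,
            "allParticipants": sorted(set(participant_ids)),
        }
    return store[key]


store = {}
add_complex_edge(store, "A", "B", ["A", "B", "C"])  # complex-1 as seen in document 1
add_complex_edge(store, "A", "B", ["C", "B", "A"])  # same complex, listed differently in document 2
print(len(store))  # 1
```

Because the id is derived from the sorted participants, the second document does not create a duplicate edge.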

jvwong commented 1 year ago

What would the "after" case for the example network look like? Also, do we need to have two edges for each relation?

lindajiawenli commented 1 year ago

@maxkfranz You're right, thank you. That's a good point.

I'm thinking I can add a property in all edges: participantTypes that is:

'noncomplex-to-noncomplex' by default

'complex-to-complex' when both participants are complexes

'complex-to-noncomplex' when the source is a complex

'noncomplex-to-complex' when the source is a non-complex

EDIT:

'noncomplex-to-noncomplex' by default
'complex_A_Id-to-complex_B_Id' when the interaction is between complex A and complex B
'complex_A_Id-to-noncomplex' when the interaction is between complex A and a non-complex
'noncomplex-to-complex_A_Id' when the interaction is between a non-complex and complex A
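
As a sketch of how the edited `participantTypes` value could be derived (Python for illustration; the function name and its parameters are hypothetical):

```python
def participant_types(source_complex_id=None, target_complex_id=None):
    """Build the participantTypes string; a missing complex id means 'noncomplex'."""
    source = source_complex_id if source_complex_id is not None else "noncomplex"
    target = target_complex_id if target_complex_id is not None else "noncomplex"
    return f"{source}-to-{target}"


print(participant_types())                                # noncomplex-to-noncomplex
print(participant_types("complex_A_Id", "complex_B_Id"))  # complex_A_Id-to-complex_B_Id
print(participant_types("complex_A_Id"))                  # complex_A_Id-to-noncomplex
print(participant_types(None, "complex_A_Id"))            # noncomplex-to-complex_A_Id
```

One thing to watch: reading the two ids back out means splitting on '-to-', which only works if no complex id contains that substring.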

lindajiawenli commented 1 year ago

What would the "after" case for the example network look like? Also, do we need to have two edges for each relation?

@jvwong I suppose we could have something like this instead, since my searchByMoleculeId() returns all edges regardless of direction:

SmartSelect_20230320_105554_OneNote

Would need to be careful sorting the participants in a particular order and making the edges in a particular order, otherwise the following three would be considered different complexes:

SmartSelect_20230320_110042_OneNote
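
One way to handle the ordering concern, sketched in Python for illustration (the function name is hypothetical): sort the participants once, then emit a single edge per unordered pair, always pointing from the earlier id to the later one.

```python
from itertools import combinations


def canonical_complex_edges(participant_ids):
    """One edge per unordered pair, always directed from the smaller id to the larger."""
    ordered = sorted(set(participant_ids))
    complex_id = "+".join(ordered)
    return [
        {"id": complex_id, "sourceId": source, "targetId": target, "allParticipants": ordered}
        for source, target in combinations(ordered, 2)
    ]


# The same complex listed in different orders yields identical edges
first = canonical_complex_edges(["A", "B", "C"])
second = canonical_complex_edges(["C", "A", "B"])
print(first == second)  # True
print(len(first))       # 3, i.e. n(n-1)/2 instead of n(n-1)
```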

lindajiawenli commented 1 year ago

The network might look something like this:

SmartSelect_20230320_105448_OneNote

Orange edges are Interaction edges, Pink Edges are Complex edges.

Pink edges for the complex with molecules A, B and C would have:

id: 'ncbigene:A+ncbigene:B+ncbigene:C'
allParticipants: [ncbigene:A, ncbigene:B, ncbigene:C]

Orange edges for the top interaction would have:

all the regular information for an edge that you'd expect
participantTypes: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-noncomplex'

Orange edges for the bottom interaction would have:

all the regular information for an edge that you'd expect
participantTypes: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-ncbigene:X+ncbigene:Y'

etc.

Note: ncbigene OR chebi, I just type ncbigene because I'm lazy

jvwong commented 1 year ago

Pink edges for the complex with molecules A, B and C would have:

id: 'ncbigene:A+ncbigene:B+ncbigene:C'
allParticipants: [ncbigene:A, ncbigene:B, ncbigene:C]

Orange edges for the top interaction would have:

all the regular information for an edge that you'd expect
participantTypes: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-noncomplex'

Orange edges for the bottom interaction would have:

all the regular information for an edge that you'd expect
participantTypes: instead of the default, which is 'noncomplex-to-noncomplex', it would have 'ncbigene:A+ncbigene:B+ncbigene:C-to-ncbigene:X+ncbigene:Y'

etc.

Note: ncbigene OR chebi, I just type ncbigene because I'm lazy

Comments around naming:

Use component: [] over allParticipants: []
I suppose searchByMoleculeId(id) could be neighbourhood(id), but not a big deal.

maxkfranz commented 1 year ago

Food for thought: Let's try to simplify as much as possible.

(1) The edges that indicate the existence of a complex need only be undirected. E.g. complex ABC only needs three edges (i.e. A-B, A-C, B-C). The direction of those three edges doesn't matter, since they're undirected. Having both A->B and B->A isn't more meaningful.

(2) Instead of a single combinatorial enum with strings you have to parse, it's probably simpler to separate things out. One field, one thing.

You already have:

source: the source node ID (always a single node)
target: the target node ID (always a single node)

You could simply add these for when at least one participant is a complex:

sourceComplex: the ID of the source complex (like the complex's unique factoid ID), null if the factoid interaction source isn't the complex/compound itself
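targetComplex: the ID of the target complex, null if the factoid interaction target isn't the complex/compound itself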

You can use another field for the fake edges that simulate a compound node itself:

complex: the factoid ID of the complex that this edge represents

You need to use the factoid complex node ID for these sorts of fields, because it's unique. The point of storing this extra information is to be able to merge the fake edges together so you could reconstruct the original left-side picture.
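
To illustrate the merge step, here is a minimal Python sketch (for illustration only; `reconstruct_complexes`, the dict layout and the id 'uuid-1' are assumptions) that groups the fake edges by their `complex` field to recover each complex's members:

```python
from collections import defaultdict


def reconstruct_complexes(edges):
    """Group fake edges by their `complex` field and collect the member nodes."""
    members = defaultdict(set)
    for edge in edges:
        complex_id = edge.get("complex")
        if complex_id is not None:
            members[complex_id].update((edge["source"], edge["target"]))
    return {complex_id: sorted(nodes) for complex_id, nodes in members.items()}


edges = [
    # fake edges simulating the compound node for complex 'uuid-1'
    {"source": "A", "target": "B", "complex": "uuid-1"},
    {"source": "A", "target": "C", "complex": "uuid-1"},
    {"source": "B", "target": "C", "complex": "uuid-1"},
    # a regular interaction whose source participant is the complex itself
    {"source": "A", "target": "X", "complex": None,
     "sourceComplex": "uuid-1", "targetComplex": None},
]
print(reconstruct_complexes(edges))  # {'uuid-1': ['A', 'B', 'C']}
```

Interaction edges are skipped here because their `complex` field is null; their `sourceComplex` and `targetComplex` fields say which reconstructed complex they attach to.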

Jeff's motivating examples don't strictly need any special handling, since they're already between particular proteins (i.e. no compound/complex node is a direct participant). If you really wanted, you could add two more fields to be explicit that the protein is within a complex:

withinSourceComplex: lets you know the source is within a particular complex but only the protein is the participant
withinTargetComplex: lets you know the target is within a particular complex but only the protein is the participant

Or we could not store any complex-specific information at all. Just keep in mind that the data would only be useful for non-visualisation use cases, and it may block some other use cases as well. No one wants to see hairballs. For a lot of app use cases, you don't need to see the network anyway: You just want something like 'give me the top 10 genes that interact with X (and maybe new papers about them)'. For the claims comparison use case, being able to reconstruct the original information may be important.

jvwong commented 1 year ago

Or we could not store any complex-specific information at all. Just keep in mind that the data would only be useful for non-visualisation use cases, and it may block some other use cases as well. No one wants to see hairballs. For a lot of app use cases, you don't need to see the network anyway: You just want something like 'give me the top 10 genes that interact with X (and maybe new papers about them)'. For the claims comparison use case, being able to reconstruct the original information may be important.

We mentioned this last week - The emphasis should be discovering information, particularly across different documents. With that in mind, we shouldn't try to provide both a completely detailed/faithful representation and useful search.

lindajiawenli commented 1 year ago

@maxkfranz

Food for thought: Let's try to simplify as much as possible.

(1) The edges that indicate the existence of a complex need only be undirected.

Unfortunately Neo4j does not support undirected edges (all edges must have exactly one direction), otherwise I would do that in a heartbeat.

(2) Instead of a single combinatorial enum with strings you have to parse, it's probably simpler to separate things out.

Just to be clear, you meant something like this?

SmartSelect_20230320_131652_OneNote

@jvwong

The emphasis should be discovering information, particularly across different documents. With that in mind, we shouldn't try to provide both a completely detailed/faithful representation and useful search.

@maxkfranz

Or we could not store any complex-specific information at all

Does this mean I should do away with the pink edges/Complex edges? What about the documents that are just complexes? Ex.

Screen Shot 2023-03-20 at 1 22 14 PM

maxkfranz commented 1 year ago

(1) If it supports directed edges, then it supports undirected edges. An undirected edge is just a directed edge where you don't care about the direction, like a don't care in circuits.

(2) If we don't care about visualisation or claims for this project, then you can still simplify much more than what you've all outlined before.

Don't store any extra fields at all. No component. No participant types. Nothing. Call all of the magenta edges just regular 'binding' interactions. Call the orange edges basically just combinatorial duplicates of the original. And then call it a day.

@lindajiawenli, your ZHX2-HIF1a example would then just be three elements in the db:

node zhx2
node hif1a
edge of type binding between zhx2 and hif1a
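
A small Python sketch of both points, for illustration only: the three-element example stored as plain data, and a lookup that treats the stored direction as a don't-care. The name `neighbourhood` follows the naming suggestion earlier in the thread; the data layout is an assumption.

```python
def neighbourhood(edges, node_id):
    """Return every edge touching node_id, regardless of its stored direction."""
    return [edge for edge in edges if node_id in (edge["source"], edge["target"])]


nodes = {"zhx2", "hif1a"}
edges = [{"source": "zhx2", "target": "hif1a", "type": "binding"}]

# The edge is found from either end, so its stored direction carries no meaning
print(neighbourhood(edges, "zhx2") == neighbourhood(edges, "hif1a"))  # True
```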

lindajiawenli commented 1 year ago

Okay, I've drafted a new plan to deal with complexes using all the feedback I've received thus far. Please let me know if there are any changes I should make.

Case #	Biofactoid Visualization	Proposed Neo4j Description	Description
1			Edge (1) and (2) have `id: factoidUUID-of-complex`. They have `type: 'complex'` instead of the usual 'phosphorylation', 'binding' etc. for non-complexes. Edge (1) has `components: [ ncbigene:A, ncbigene:B ]` instead of `null` for non-complexes. Similar with Edge (2)
2			Edge (1) is a typical interaction edge. `component: null` and `type: 'phosphorylation'` or similar. See above for description of Edge (2)
3			Edge (1) is a typical interaction edge. `component: null` and `type: 'phosphorylation'` or similar.
4			The Edges (1) (2) and (3) have `sourceComplex: factoidUUID-of-complex` and `targetComplex: null`. `sourceComplex: null` and `targetComplex: null` are the defaults
5			I am debating whether to go with Option A or Option B. Option A will have far fewer edges but I am concerned about putting too much "weight" on the X node. The edges going between the two complexes will have non-null `sourceComplex` and `targetComplex` fields that consist of the factoid UUIDs for the respective complexes.

I decided not to do away completely with the extra fields because I think it would be nice for the researchers to at least know that there is a complex when they search for Gene A etc., and what molecules that complex includes.

There is no longer a special "Complex" Edge, though I must say it is easy for me to make them, so let me know if you think they're worth it. It would also be helpful to have this if we want to later implement some filter that says "Give me genes that are in complexes with Gene A" or similar. I don't know if that's something a user would ever want though.

Number of edges has been cut considerably.
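
A hedged sketch of how an interaction involving a complex could be expanded into per-molecule edges with the `sourceComplex` and `targetComplex` fields, in Python for illustration only. The function name and signature are hypothetical, and the full source-by-target cross product used here is just one possible reading; it is not meant to settle Option A versus Option B in Case 5.

```python
from itertools import product


def expand_interaction(interaction_type, sources, targets,
                       source_complex=None, target_complex=None):
    """Create one edge per (source, target) pair, tagged with the complexes involved."""
    return [
        {
            "type": interaction_type,
            "source": source,
            "target": target,
            "sourceComplex": source_complex,  # null/None is the default
            "targetComplex": target_complex,
        }
        for source, target in product(sources, targets)
    ]


# Case 4: a complex of A, B and C interacting with a single molecule X
edges = expand_interaction(
    "phosphorylation",
    ["ncbigene:A", "ncbigene:B", "ncbigene:C"],
    ["ncbigene:X"],
    source_complex="factoidUUID-of-complex",
)
print(len(edges))  # 3
```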

maxkfranz commented 1 year ago

It’s great that you’re elaborating the different cases. We can discuss these tomorrow if you like.

Re. whether we support mapping back to the vis.: Six of one. There are pros and cons either way (e.g. maybe claims are better handled in a different db in a different biofactoid project). Either way, best to simplify where possible.

It would be useful to think about the different cases and how they might apply to different user scenarios.


PathwayCommons / factoid