SIF querying support in cPath2 and Paxtools

GoogleCodeExporter commented 9 years ago

The current framework runs a graph query on the BioPAX model and, if the user 
asks for SIF, converts the result to SIF. The problem is that SIF conversion is 
inherently slow, because there are many patterns to search for.

Ideally, instead of running a SIF query on the BioPAX model, we should run it 
on the large SIF network, using the desired SIF types. This requires keeping 
the SIF network in memory in cPath2, and Paxtools needs to support running 
queries on a SIF network.
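
To make the idea concrete, here is a minimal sketch of a neighborhood query run directly on an in-memory SIF network and restricted to the desired SIF types. It is not the Paxtools or cPath2 API: the function names `load_sif` and `neighborhood` and the toy network are assumptions for illustration, and edges are traversed in both directions for simplicity.

```python
from collections import defaultdict


def load_sif(lines):
    """Index tab-separated 'source<TAB>type<TAB>target' lines by node (hypothetical helper)."""
    index = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        triple = tuple(parts)
        index[triple[0]].append(triple)
        index[triple[2]].append(triple)
    return index


def neighborhood(index, seeds, sif_types, depth=1):
    """Collect edges of the given SIF types within `depth` steps of the seeds."""
    visited = set(seeds)
    frontier = set(seeds)
    result = set()
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for source, edge_type, target in index.get(node, []):
                if edge_type not in sif_types:
                    continue  # only follow the requested SIF types
                result.add((source, edge_type, target))
                other = target if source == node else source
                if other not in visited:
                    visited.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return result


# Toy network; the node names are made up for the example.
TOY_SIF = [
    "A\tcontrols-state-change-of\tB",
    "B\tin-complex-with\tC",
    "B\tcontrols-state-change-of\tD",
]
toy_index = load_sif(TOY_SIF)
one_step = neighborhood(toy_index, {"A"}, {"controls-state-change-of"}, depth=1)
two_steps = neighborhood(toy_index, {"A"}, {"controls-state-change-of"}, depth=2)
```

Because the index is built once and kept in memory, each query only touches the edges around the seeds and no BioPAX-to-SIF pattern search is needed at query time.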

Original issue reported on code.google.com by ozgunba...@gmail.com on 4 Mar 2015 at 9:10

IgorRodchenkov commented 9 years ago

FYI (I recently came across this): the NDEx project uses some, if not all, PC2 v6 data in SIF format (e.g., the normalized/merged NCI PID, IntAct, PANTHER, etc.). See, e.g., http://www.ndexbio.org/#/network/6f259e96-c4ad-11e4-bcc4-000c29cb28fb or search for something (like "brca2") in the top bar. You can then pick a source network, run queries on it (neighborhood queries, I suppose; e.g., try "brca2" and depth=1), and visualize the result, perhaps using Cytoscape.js.

IgorRodchenkov commented 7 years ago

Currently, a PC2 graph query accepts different ID types and URIs, and uses full-text search and id-mapping to find the physical entities and genes to use as seeds in the BioPAX graph query; the result (sub-network) is then converted to SIF format if required.

I do not see an easy and general solution for SIF based graph queries (in Paxtools) that would allow us to support several input ID types and URIs at the same time in PC2...

@ozgunbabur please have a look. See also BioPAX/Paxtools#21

ozgunbabur commented 7 years ago

If PC stores the big SIF where identifiers are URIs of EntityReferences, then the same id-mapping can be used for SIF queries as well. But this means 3 steps instead of 1 step:

1. Find related URIs.
2. Do the SIF query with URIs.
3. Replace URIs in the SIF with the desired ID type (some URIs will not map to a desired ID type and their relations will be dropped from the results).
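
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the two mapping dictionaries, and the toy URIs are all hypothetical, and step 2 is reduced to "edges touching a seed" rather than a real graph query.

```python
def sif_query_with_id_mapping(query_ids, id_to_uris, uri_sif, uri_to_output_id):
    """Hypothetical three-step SIF query over a URI-based SIF edge list."""
    # Step 1: find the URIs related to the query IDs.
    seeds = set()
    for query_id in query_ids:
        seeds.update(id_to_uris.get(query_id, ()))

    # Step 2: run the SIF query with URIs (here simply: edges touching a seed).
    hits = [t for t in uri_sif if t[0] in seeds or t[2] in seeds]

    # Step 3: replace URIs with the desired ID type; relations with an
    # unmappable endpoint are dropped from the results.
    results = []
    for source, edge_type, target in hits:
        if source in uri_to_output_id and target in uri_to_output_id:
            results.append(
                (uri_to_output_id[source], edge_type, uri_to_output_id[target])
            )
    return results


# Toy data; every identifier below is made up.
toy_id_to_uris = {"GENE1": ["uri:er1"]}
toy_uri_sif = [
    ("uri:er1", "controls-state-change-of", "uri:er2"),
    ("uri:er1", "in-complex-with", "uri:er3"),  # uri:er3 has no output ID
]
toy_uri_to_symbol = {"uri:er1": "GENE1", "uri:er2": "GENE2"}
toy_result = sif_query_with_id_mapping(
    ["GENE1"], toy_id_to_uris, toy_uri_sif, toy_uri_to_symbol
)
```

In the toy data the second relation is lost in step 3 because one endpoint has no mapping to the requested ID type.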

ozgunbabur commented 7 years ago

I just noticed that my comment above was badly mistaken. Yes, full-text search and ID mapping do not go well with SIF graphs. In fact, the SIF structure can differ depending on the ID type that is used. I mean, a SIF graph with gene symbols can have a different structure than the SIF graph with UniProt IDs. The same goes for the SIF graph with URIs. Those different structures can produce different graph query results.

For instance, let's say there is a path of length 2 from A to B in the SIF with gene symbols (A -> X -> B). This path may be missing in the SIF graph with URIs because there can be two different URIs corresponding to X (let's say URI-X1 and URI-X2). Then URI-A -> URI-X1 and URI-X2 -> URI-B are two disconnected edges in this graph, and the query won't find a path from URI-A to URI-B. It is better not to use URIs in SIF graphs.
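
The example above can be checked with a few lines of code. This is a toy sketch using a plain breadth-first search, not a Paxtools query; `has_path` is a hypothetical helper and the edge lists just restate the example.

```python
from collections import deque


def has_path(edges, start, goal):
    """Breadth-first search over directed (source, target) edges."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
    queue = deque([start])
    seen = {start}
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# SIF with gene symbols: X is a single node, so A reaches B.
symbol_edges = [("A", "X"), ("X", "B")]

# SIF with URIs: X is split into two nodes, so the path is broken.
uri_edges = [("URI-A", "URI-X1"), ("URI-X2", "URI-B")]

found_with_symbols = has_path(symbol_edges, "A", "B")
found_with_uris = has_path(uri_edges, "URI-A", "URI-B")
```

The same query succeeds on the symbol graph and fails on the URI graph, even though both were derived from the same underlying interactions.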

On the other hand, I don't consider that a big problem. It's OK if full-text search and ID mapping do not work for SIF queries. We can say that we support either gene symbols or UniProt IDs for SIF queries. And we can also tell users that they can end up with different query results when they use different ID options, because the gene symbol to UniProt mapping is not one-to-one. That is just a fact of life.

IgorRodchenkov commented 7 years ago

Also, how about querying our (PC8) SIF that uses UniProt/ChEBI IDs rather than the HGNC- or URI-based SIF data archives? Mapping query input IDs to UniProt is much easier and faster (by cpath2/PC2 design). Then, if a user requested HGNC-symbol SIF as the output, we'd map (expand) each SIF entry (interaction) to multiple lines that have the corresponding HGNC symbols (I'd even simply print them separated with ';' instead of making many new lines, one per gene name...).
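
Both output options mentioned above (one line per symbol combination, or a single line with symbols joined by ';') could look roughly like this. It is a sketch only: the function names, the mapping dictionary, and the accessions are hypothetical, and passing unmapped IDs (e.g. ChEBI) through unchanged is an assumption, not something decided in this thread.

```python
from itertools import product


def expand_to_symbol_lines(triple, uniprot_to_symbols):
    """Expand one UniProt-based SIF entry into one line per symbol pair."""
    source, edge_type, target = triple
    # Assumption: IDs without a mapping (e.g. ChEBI) are kept as they are.
    sources = uniprot_to_symbols.get(source, [source])
    targets = uniprot_to_symbols.get(target, [target])
    return [(s, edge_type, t) for s, t in product(sources, targets)]


def join_symbols_in_one_line(triple, uniprot_to_symbols):
    """Keep a single line and print all symbols separated with ';'."""
    source, edge_type, target = triple
    sources = uniprot_to_symbols.get(source, [source])
    targets = uniprot_to_symbols.get(target, [target])
    return (";".join(sources), edge_type, ";".join(targets))


# Made-up accessions and symbols, for illustration only.
toy_mapping = {"P00001": ["GENE1", "GENE1B"], "P00002": ["GENE2"]}
toy_entry = ("P00001", "controls-state-change-of", "P00002")

expanded = expand_to_symbol_lines(toy_entry, toy_mapping)
joined = join_symbols_in_one_line(toy_entry, toy_mapping)
```

The joined form keeps the output the same size as the UniProt-based result, while the expanded form grows with every one-to-many mapping.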

PathwayCommons / cpath2

SIF querying support in cPath2 and Paxtools #202