PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

Graph queries: no MolecularInteractions, dangling/duplicate proteins #198

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Please follow the topic on the PC developers' forum:
https://groups.google.com/forum/#!topic/pathway-commons-dev/kupVnm5Dugk

This probably needs to be fixed in Paxtools (paxtools-query and sif-converter).

Original issue reported on code.google.com by rod...@gmail.com on 8 Jan 2015 at 10:57

GoogleCodeExporter commented 9 years ago

Original comment by ozgunba...@gmail.com on 9 Jan 2015 at 4:44

GoogleCodeExporter commented 9 years ago
Added a cleanup method in Paxtools to remove disconnected SimplePhysicalEntity 
objects from the results. This should resolve the dangling proteins problem 
caused by psi-mi proteins.

There are still things to do. We need to design an undirected version of graph 
queries to be able to support molecular interactions.

Another issue that we should solve is the duplicated proteins from psi-mi data. 
We should merge those into a single protein (at least the ones without any 
feature) per protein reference.

Original comment by ozgunba...@gmail.com on 22 Jan 2015 at 3:49

GoogleCodeExporter commented 9 years ago
Thanks Ozgun,

Good, I'll use the updated Paxtools for the next cPath2/PC2 build.

Also, I am working to remove duplicate proteins from psi-mi data now...

Original comment by rod...@gmail.com on 22 Jan 2015 at 4:48

GoogleCodeExporter commented 9 years ago
Was the neighborhood query directed? (I think the default in most cases uses 
both directions, i.e., technically no direction, no?)

Original comment by rod...@gmail.com on 22 Jan 2015 at 4:50

GoogleCodeExporter commented 9 years ago
It is directed. If limit=1 and direction is bothstream (default parameters), 
then it is not different from an undirected neighborhood query. But if limit > 
1, then you will see the difference even if the direction is bothstream.

Imagine the below graph:

A --> B
C --> B
B --> D
X --> A

A neighborhood query from A will not reach C no matter what the parameters 
are. That is because there is no directed path from A to C. A neighborhood 
query from A with limit=2 and dir=bothstream will return X --> A --> B --> D.
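
To make the example above concrete, here is a minimal Python sketch of the toy graph. It is not the Paxtools implementation; it assumes that "bothstream" means two separate directed searches (one upstream, one downstream) whose results are unioned, which is how I read the explanation above. All function names are made up for illustration.

```python
from collections import deque

# Directed edges from the example above.
EDGES = [("A", "B"), ("C", "B"), ("B", "D"), ("X", "A")]


def build_adjacency(edges):
    """Return (downstream, upstream) adjacency maps for directed edges."""
    down, up = {}, {}
    for src, dst in edges:
        down.setdefault(src, set()).add(dst)
        up.setdefault(dst, set()).add(src)
    return down, up


def reach(source, adjacency, limit):
    """Breadth-first search: nodes within `limit` hops of `source`."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen


def neighborhood(source, limit=1, direction="bothstream", edges=EDGES):
    """Toy neighborhood query (assumed semantics, see the note above)."""
    down, up = build_adjacency(edges)
    if direction == "downstream":
        return reach(source, down, limit)
    if direction == "upstream":
        return reach(source, up, limit)
    if direction == "bothstream":
        # Two separate directed searches; a path never switches direction.
        return reach(source, down, limit) | reach(source, up, limit)
    if direction == "undirected":
        # Ignore edge direction entirely.
        merged = {}
        for node in set(down) | set(up):
            merged[node] = down.get(node, set()) | up.get(node, set())
        return reach(source, merged, limit)
    raise ValueError(f"unknown direction: {direction}")


print(sorted(neighborhood("A", limit=2)))
print(sorted(neighborhood("A", limit=2, direction="undirected")))
```

With limit=2, the bothstream call yields A, B, D and X, while the undirected variant also picks up C through B.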

Original comment by ozgunba...@gmail.com on 22 Jan 2015 at 5:02

GoogleCodeExporter commented 9 years ago
Will the default neighborhood query from A (limit=1) return X --> A --> B?

But I still don't quite understand how the original issue was related to 
directionality...
That default neighborhood query using 'MAX' used to return all MAX proteins 
(albeit with far too many duplicates) but did not return any MIs or their 
second participants...
To me, this issue was about us having excluded, intentionally or 
unintentionally, all MIs from the results.

Original comment by rod...@gmail.com on 22 Jan 2015 at 5:12

GoogleCodeExporter commented 9 years ago
Since the whole querying system is based on directed relations, it ignores 
undirected relations like MI.

Yes, the default neighborhood query will return X --> A --> B.

Original comment by ozgunba...@gmail.com on 22 Jan 2015 at 5:17

GoogleCodeExporter commented 9 years ago
Currently it's "fixed" on the test server (at Gary's lab).

The query 
http://webservice.baderlab.org:48080/graph?source=MAX&datasource=bind&kind=neighborhood
now returns no result (because MIs and dangling proteins are no longer 
returned).

Query http://webservice.baderlab.org:48080/graph?source=MAX&kind=neighborhood - 
returns some result.

Shall we close this issue (and open a new one about adding MIs back to the 
neighborhood results)?

Original comment by rod...@gmail.com on 5 Feb 2015 at 10:59

GoogleCodeExporter commented 9 years ago
Despite recent changes, also in the psimi-converter, there are still hundreds 
of "duplicate" proteins coming from BIND (potentially from any PSI-MI source). 
This is not quite a trivial issue. 

The good news is that, as before, most of the original "Max" PRs were merged 
into a single canonical P61244 PR; the bad news is that the corresponding 
Protein objects were not merged. 

Apparently, I did not fix the entire problem this time, but I now have a 
better understanding of it... 
In the original BIND XML, there are many identical or almost identical 
participants and interactors that have the name "Max" and two xrefs: 
entrezgene/locuslink 4149 and omim 154950, which become RelationshipXrefs in 
BioPAX. These cannot be UnificationXrefs (unlike a few other 'Max' cases where 
one of the xrefs is RefSeq). As a result, multiple 'Max' PRs and Proteins are 
generated, and technically they are never equivalent to each other (in 
Paxtools terms; see BioPAXElementImpl.isEquivalent, 
SequenceEntityReferenceImpl.semanticallyEquivalent, and 
SimplePhysicalEntity.semanticallyEquivalent).

Try these and see:
http://webservice.baderlab.org:48080/traverse?uri=http://identifiers.org/uniprot/P61244&path=EntityReference/entityReferenceOf:Protein
(most of these Proteins come from BIND)

http://pathwaycommons.baderlab.org/ProteinReference__1423006610558_Protein
(see the P61244 ProteinReference object there and how many PRs were merged 
into it, according to the comments)

I am thinking about what to do... Obviously, there was no need for the BIND db 
to repeat the definition of the same interactor hundreds of times, but it's 
probably too late to fix this in the original BIND PSI-MI.

Original comment by rod...@gmail.com on 6 Feb 2015 at 12:27

GoogleCodeExporter commented 9 years ago
Yesterday, I discovered that, instead of defining one interactor per protein 
type and then using interactorRef to refer to it from all the corresponding 
participants, the BIND PSI-MI XML fully defines exactly the same interactor for 
each interaction participant (e.g. "Max", hundreds of times; just search for 
the ">Max<" string in the xml to find all of them).
After converting the PSI-MI to BioPAX, cPath2 is well capable of 
mapping/merging these PRs into a canonical (P61244) PR, but it does not merge 
the corresponding Proteins (due to how the equals and semantic equivalence 
methods are designed in Paxtools, and also because auto-merging of Entity 
things is unsafe and not easy anyway).

We must eventually fix this (and also the previously known non-human taxonomy 
IDs issue, etc.) right in the official BIND PSI-MI XML file, which should also 
make it much smaller.

But I think I now know how to fix it in the psimi-converter. I may slightly 
change (once again) the way a PR's URI is generated. Currently, the converter 
generates a PR's URI using the primary unification xref if one exists; if not, 
it uses an auto-generated integer value instead of db_id. And this is what 
happens to most of those "Max" interactors and corresponding protein 
references, for they have two relationship xrefs (to entrezgene/locuslink, aka 
NCBI Gene, and OMIM) and no unification xrefs (we make a unification xref for a 
protein reference only if the psimi refType is 'identity' or 'identical 
object', which is usually the case for uniprot/refseq refs). 

Let me fix it by using whatever primary psimi xref is present there (e.g., 
entrezgene/locuslink) to generate a PR's URI (even though a Gene is not a 
Protein), which makes all those Proteins use the same PR, and then they can be 
merged right in the psimi-converter, before writing the final BioPAX OWL.
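
A rough Python sketch of the difference between the two URI strategies described above. This is not the psimi-converter code: the function names, the URI prefix and the "placeholder-relationship" refType value are all made up for illustration; only the xref ids and the 'identity' / 'identical object' rule come from this thread.

```python
import itertools

# PSI-MI refType values that, per this thread, lead to a unification xref.
UNIFICATION_TYPES = {"identity", "identical object"}


def pr_uri_current(xrefs, counter, prefix="ProteinReference_"):
    """Current behaviour as described: unification xref, else a fresh integer."""
    for db, ref_id, ref_type in xrefs:
        if ref_type in UNIFICATION_TYPES:
            return f"{prefix}{db}_{ref_id}"
    return f"{prefix}{next(counter)}"


def pr_uri_proposed(xrefs, counter, prefix="ProteinReference_"):
    """Proposed behaviour: fall back to the primary xref before the integer."""
    for db, ref_id, ref_type in xrefs:
        if ref_type in UNIFICATION_TYPES:
            return f"{prefix}{db}_{ref_id}"
    if xrefs:
        db, ref_id, _ = xrefs[0]  # the primary xref is listed first
        return f"{prefix}{db}_{ref_id}"
    return f"{prefix}{next(counter)}"


# Three identical "Max" interactors, each with two relationship-type xrefs.
# The refType string is a placeholder, not taken from the BIND file.
max_copy = [
    ("entrezgene/locuslink", "4149", "placeholder-relationship"),
    ("omim", "154950", "placeholder-relationship"),
]
interactors = [max_copy] * 3

current_uris = {pr_uri_current(x, c) for c in [itertools.count(1)] for x in interactors}
proposed_uris = {pr_uri_proposed(x, c) for c in [itertools.count(1)] for x in interactors}

print(len(current_uris), len(proposed_uris))
```

Under these assumptions the current strategy produces one URI per copy, while the proposed one produces a single shared URI, so the copies end up pointing at the same PR.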

Original comment by rod...@gmail.com on 6 Feb 2015 at 5:56

GoogleCodeExporter commented 9 years ago
We have to make sure that interactors that are merged do not carry state 
information. For example, we should not merge interactors in different 
subcellular compartments (under organism - very confusing).

Also, I am not sure how to merge attributes.
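
One possible way to express that constraint, as a sketch only: group proteins by a key that includes the state, so that only proteins with the same entity reference and the same (or no) state become merge candidates. Plain dicts stand in for Paxtools objects here, the field names are hypothetical, and the open question of how to merge attributes is not addressed.

```python
from collections import defaultdict


def merge_key(protein):
    """Proteins are merge candidates only if reference and state both match."""
    return (
        protein["entity_reference"],
        protein.get("cellular_location"),
        frozenset(protein.get("features", ())),
    )


def group_merge_candidates(proteins):
    """Map each merge key to the URIs of the proteins that share it."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[merge_key(protein)].append(protein["uri"])
    return dict(groups)


proteins = [
    {"uri": "p1", "entity_reference": "P61244"},
    {"uri": "p2", "entity_reference": "P61244"},
    {"uri": "p3", "entity_reference": "P61244", "cellular_location": "nucleus"},
]
groups = group_merge_candidates(proteins)
print(sorted(groups.values()))
```

Here p1 and p2 land in the same group, while p3 stays separate because it carries a location.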

Original comment by emekdemir on 6 Feb 2015 at 6:13

GoogleCodeExporter commented 9 years ago
Added undirected querying support to the graph query infrastructure. Now 
the QueryExecuter class in Paxtools supports the UNDIRECTED option for the 
direction, but only for the neighborhood query. This should solve the MI 
support problem.

The only unsolved portion of this thread is the duplicate protein issue, I guess.

Original comment by ozgunba...@gmail.com on 6 Feb 2015 at 8:46

GoogleCodeExporter commented 9 years ago
I am still fixing paxtools psimi-converter to avoid generating duplicate PEs 
and ERs, while also not merging any non-equivalent things... 

Found another BIND data issue: some xrefs have an id like "None" or "NONE", e.g.:

<secondaryRef db="omim" dbAc="MI:0480" id="None" refTypeAc="MI:0251" 
refType="gene product"></secondaryRef>
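
A small sketch of how such placeholder ids could be skipped when reading xrefs, using Python's standard ElementTree. The wrapping `<xref>` element and the `primaryRef` line are constructed for this example (real PSI-MI files also use an XML namespace, which is omitted here), and treating an empty id as a placeholder is my own assumption.

```python
import xml.etree.ElementTree as ET

# Ids treated as "no value"; compared case-insensitively.
PLACEHOLDER_IDS = {"none", ""}


def has_usable_id(ref):
    """True if the ref element carries a real identifier."""
    return (ref.get("id") or "").strip().lower() not in PLACEHOLDER_IDS


snippet = """<xref>
  <primaryRef db="entrezgene/locuslink" id="4149"/>
  <secondaryRef db="omim" dbAc="MI:0480" id="None" refTypeAc="MI:0251"
                refType="gene product"></secondaryRef>
</xref>"""

root = ET.fromstring(snippet)
usable = [ref for ref in root if has_usable_id(ref)]
print([(ref.tag, ref.get("db"), ref.get("id")) for ref in usable])
```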

Original comment by rod...@gmail.com on 8 Feb 2015 at 12:50

GoogleCodeExporter commented 9 years ago
See and compare the following query results:

http://pathwaycommons.baderlab.org/traverse?uri=http://identifiers.org/uniprot/P61244&path=EntityReference/entityReferenceOf:Protein
now returns only 32 URIs instead of the 943 returned before the fix.

Also compare:
http://webservice.baderlab.org:48080/search?q=name:%22MAX%22&type=protein&datasource=bind
(8 hits) vs.
http://www.pathwaycommons.org/pc2/search?q=name:%22MAX%22&type=protein&datasource=bind
(939 hits; there used to be many duplicates).

http://webservice.baderlab.org:48080/get?uri=http://identifiers.org/uniprot/P61244
- the canonical ProteinReference no longer contains hundreds of BioPAX comments 
about REPLACED old URIs (those from BIND).

Undirected neighborhood works:
http://webservice.baderlab.org:48080/graph?source=MAX&datasource=bind&kind=neighborhood&direction=undirected&format=BINARY_SIF
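
For scripting such checks, the undirected neighborhood URL can be assembled from its parameters rather than typed by hand. A minimal sketch using only the Python standard library; the parameter names and values are copied from the URL above, the helper name is made up, and no request is actually sent (the test server may not be reachable).

```python
from urllib.parse import urlencode

BASE = "http://webservice.baderlab.org:48080"


def graph_query_url(params, base=BASE):
    """Build a /graph query URL from an ordered dict of parameters."""
    return f"{base}/graph?{urlencode(params)}"


url = graph_query_url({
    "source": "MAX",
    "datasource": "bind",
    "kind": "neighborhood",
    "direction": "undirected",
    "format": "BINARY_SIF",
})
print(url)
```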

Original comment by rod...@gmail.com on 13 Feb 2015 at 2:59