Closed GoogleCodeExporter closed 9 years ago
Original comment by ozgunba...@gmail.com
on 9 Jan 2015 at 4:44
Added a cleanup method in Paxtools to remove disconnected SimplePhysicalEntity
objects from the results. This should resolve the dangling proteins problem due
to psi-mi proteins.
There are still things to do. We need to design an undirected version of graph
queries to be able to support molecular interactions.
Another issue that we should solve is the duplicated proteins from psi-mi data.
We should merge those into a single protein (at least the ones without any
feature) per protein reference.
Original comment by ozgunba...@gmail.com
on 22 Jan 2015 at 3:49
Thanks Ozgun,
Good, I'll use the updated Paxtools for the next cPath2/PC2 build.
Also, I am working to remove duplicate proteins from psi-mi data now...
Original comment by rod...@gmail.com
on 22 Jan 2015 at 4:48
Was the neighborhood query directed? (I thought that in most cases the default
uses both directions, i.e., technically no direction, no?)
Original comment by rod...@gmail.com
on 22 Jan 2015 at 4:50
It is directed. If limit=1 and direction is bothstream (default parameters),
then it is not different from an undirected neighborhood query. But if limit >
1, then you will see the difference even if the direction is bothstream.
Imagine the below graph:
A --> B
C --> B
B --> D
X --> A
A neighborhood query from A will not reach C no matter what the parameters
are. That is because there is no directed path from A to C. A neighborhood
query from A with limit=2 and dir=bothstream will return X --> A --> B --> D.
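To illustrate the example above, here is a minimal toy sketch in Python (not Paxtools code). It assumes that "bothstream" behaves like the union of a purely upstream search and a purely downstream search, which matches the results described in this comment; the function names are made up for illustration.

```python
from collections import deque

# Edges of the example graph from the comment above.
EDGES = [("A", "B"), ("C", "B"), ("B", "D"), ("X", "A")]


def neighbors(node, mode):
    """Nodes one step away from `node` when stepping in the given mode."""
    out = {t for s, t in EDGES if s == node}  # follow edges forward
    inc = {s for s, t in EDGES if t == node}  # follow edges backward
    return {"downstream": out, "upstream": inc, "undirected": out | inc}[mode]


def reach(source, limit, mode):
    """Breadth-first search of at most `limit` steps, always stepping in `mode`."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == limit:
            continue
        for nxt in neighbors(node, mode):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


def neighborhood(source, limit, direction):
    """Toy neighborhood query (assumption: bothstream = upstream + downstream)."""
    if direction == "bothstream":
        return reach(source, limit, "upstream") | reach(source, limit, "downstream")
    return reach(source, limit, direction)
```

In this toy model, C shows up for A only when the search is allowed to change direction at B, which is what an undirected query does and a bothstream query does not.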
Original comment by ozgunba...@gmail.com
on 22 Jan 2015 at 5:02
Will the default neighborhood query from A (limit=1) return X --> A --> B?
But I still don't quite understand how the original issue was related to
directionality...
The default neighborhood query for 'MAX' used to return all MAX proteins (far
too many duplicates, though) but did not return any MIs or their second
participants...
To me, this issue was that we had, intentionally or not, excluded all MIs from
the results.
Original comment by rod...@gmail.com
on 22 Jan 2015 at 5:12
Since the whole querying system is based on directed relations, it ignores
undirected relations like MI.
Yes, the default neighborhood query will return X --> A --> B.
Original comment by ozgunba...@gmail.com
on 22 Jan 2015 at 5:17
Currently it's "fixed" on the test server (at Gary's lab).
Query
http://webservice.baderlab.org:48080/graph?source=MAX&datasource=bind&kind=neighborhood
- now returns no result (because no MIs and no dangling proteins are
returned anymore).
Query http://webservice.baderlab.org:48080/graph?source=MAX&kind=neighborhood -
returns some result.
Shall we close this issue (and open a new one about adding MIs back to the
neighborhood results)?
Original comment by rod...@gmail.com
on 5 Feb 2015 at 10:59
Despite recent changes, also in the psimi-converter, there are still hundreds
of "duplicate" proteins coming from BIND (potentially from any PSI-MI source).
This is not quite a trivial issue.
The good news is that, as before, most original "Max" PRs were merged into
a single canonical P61244 PR; the bad news is that the corresponding Protein
objects were not merged.
Apparently, I did not fix this entire problem this time, but I now have a
better understanding...
In the original BIND XML, there are many identical or almost identical
participants and interactors that have name "Max" and two xrefs:
entrezgene/locuslink 4149 and omim 154950, which become RelationshipXref in
BioPAX. These cannot be UnificationXref (unlike a few other 'Max' cases where
one of the xrefs is RefSeq). As a result, multiple 'Max' PRs and Proteins are
generated, and technically they are never equivalent to each other (in Paxtools
terms; see BioPAXElementImpl.isEquivalent,
SequenceEntityReferenceImpl.semanticallyEquivalent, and
SimplePhysicalEntity.semanticallyEquivalent).
For example, see:
http://webservice.baderlab.org:48080/traverse?uri=http://identifiers.org/uniprot/P61244&path=EntityReference/entityReferenceOf:Protein
(most of these Proteins come from BIND)
http://pathwaycommons.baderlab.org/ProteinReference__1423006610558_Protein (see
the P61244 ProteinReference object there and how many PRs were merged into it,
according to the comments)
I am thinking about what to do... Obviously, there was no need for the BIND db
to repeat the definition of the same interactor hundreds of times, but it's
probably too late to fix this in the original BIND PSI-MI.
Original comment by rod...@gmail.com
on 6 Feb 2015 at 12:27
Yesterday, I discovered that, instead of defining one interactor per protein
type and then interactorRef to refer to that from all the corresponding
participants, BIND PSI-MI XML fully defines exactly the same interactor for
each interaction participant (e.g. "Max", hundreds of times - just search for
the ">Max<" string in the XML to find all of them).
After converting the PSI-MI to BioPAX, cPath2 is quite capable of
mapping/merging these PRs into a canonical (P61244) PR, but does not merge the
corresponding Proteins (due to how the equals and semantic equivalence methods
are designed in Paxtools, and also because auto-merging of Entity objects is
unsafe and not easy anyway).
We must eventually fix this (and also the previously known non-human taxonomy
IDs issue, etc.) right in the official BIND PSI-MI XML file, which should also
make it much smaller.
But I think I now know how to fix it in the psimi-converter. I may slightly
change (once again) the way a PR's URI is generated. Currently, the converter
generates a PR's URI using the primary unification xref if one exists; if not,
it uses an auto-generated integer value instead of db_id. And this is what
happens to most of those "Max" interactors and corresponding protein
references, for they have two relationship xrefs (to entrezgene/locuslink - aka
NCBI Gene - and OMIM) and no unification xrefs (we make a unification xref for
a protein reference only if the psimi refType is 'identity' or 'identical
object', which is usually the case for uniprot/refseq refs).
Let me fix it by using whatever primary psimi xref is present there (e.g.,
entrezgene/locuslink) to generate a PR's URI (even though a Gene is not a
Protein), which makes all those Proteins use the same PR, and then they can be
merged right in the psimi-converter, before writing the final BioPAX OWL.
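A rough sketch of the idea above, in Python rather than the converter's actual Java code. The tuple layout, the `urn:example:` prefix, the function names, and the "gene product" refType on the primary xref are all assumptions made up for illustration; only the rule itself (counter-based URI when there is no unification xref vs. a URI derived from whatever primary xref exists) comes from the description.

```python
from collections import defaultdict
from itertools import count
from urllib.parse import quote

# refTypes that yield a unification xref, per the description above.
UNIFICATION_TYPES = {"identity", "identical object"}
BASE = "urn:example:ProteinReference_"  # hypothetical URI prefix


def old_pr_uri(primary_xref, counter):
    """Current rule: db_id for unification xrefs, else an auto-generated integer."""
    db, xref_id, ref_type = primary_xref
    if ref_type in UNIFICATION_TYPES:
        return f"{BASE}{quote(db, safe='')}_{xref_id}"
    return f"{BASE}{next(counter)}"


def new_pr_uri(primary_xref):
    """Proposed rule: always derive the URI from the primary xref, whatever its type."""
    db, xref_id, _ref_type = primary_xref
    return f"{BASE}{quote(db, safe='')}_{xref_id}"


def group_by_uri(interactors, uri_fn):
    """Map each generated PR URI to the interactor ids that would share it."""
    groups = defaultdict(list)
    for interactor_id, primary_xref in interactors:
        groups[uri_fn(primary_xref)].append(interactor_id)
    return dict(groups)


# Three identical "Max" interactors, as repeated in the BIND XML (ids are made up).
MAX_XREF = ("entrezgene/locuslink", "4149", "gene product")
INTERACTORS = [("max_1", MAX_XREF), ("max_2", MAX_XREF), ("max_3", MAX_XREF)]

counter = count(1)
old_groups = group_by_uri(INTERACTORS, lambda x: old_pr_uri(x, counter))
new_groups = group_by_uri(INTERACTORS, new_pr_uri)
```

With the current rule each repeated interactor ends up under its own PR URI; with the proposed rule all three share one, so the Proteins that point to it become candidates for merging.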
Original comment by rod...@gmail.com
on 6 Feb 2015 at 5:56
We have to make sure that interactors that are merged do not carry state
information. For example, we should not merge interactors in different
subcellular compartments (under organism - very confusing).
Also not sure how to merge attributes..
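One conservative way to respect this constraint, sketched in Python with a hypothetical `Protein` record (not the Paxtools model): merge only proteins that have no features and that share both the protein reference and the cellular location, and leave everything else untouched. How to merge other attributes is left open here, as it is in the comment.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Protein:
    """Hypothetical stand-in for a BioPAX Protein; field names are assumptions."""
    uri: str
    reference: str                     # URI of the shared ProteinReference
    location: Optional[str] = None     # cellular location, if any
    features: FrozenSet[str] = frozenset()


def merge_stateless(proteins):
    """Return {kept_uri: [uris folded into it]}.

    Proteins with any feature are never merged; the rest are merged only when
    both the reference and the location match.
    """
    canonical = {}  # (reference, location) -> uri of the first stateless protein
    merged = {}
    for p in proteins:
        if p.features:
            merged[p.uri] = []
            continue
        key = (p.reference, p.location)
        if key in canonical:
            merged[canonical[key]].append(p.uri)
        else:
            canonical[key] = p.uri
            merged[p.uri] = []
    return merged


PROTEINS = [
    Protein("p1", "P61244"),
    Protein("p2", "P61244"),
    Protein("p3", "P61244", location="nucleus"),
    Protein("p4", "P61244", features=frozenset({"phosphorylated"})),
]
RESULT = merge_stateless(PROTEINS)
```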
Original comment by emekdemir
on 6 Feb 2015 at 6:13
Added undirected querying support to the graph query infrastructure. The
QueryExecuter class in Paxtools now supports the UNDIRECTED option for the
direction, but only for the neighborhood query. This should solve the MI
support problem.
The only unsolved portion of this thread is the duplicate protein issue, I guess.
Original comment by ozgunba...@gmail.com
on 6 Feb 2015 at 8:46
I am still fixing the Paxtools psimi-converter to avoid generating duplicate PEs
and ERs, while also not merging any non-equivalent things...
Found another BIND data issue - some xrefs have an id like "None", "NONE", e.g.:
<secondaryRef db="omim" dbAc="MI:0480" id="None" refTypeAc="MI:0251"
refType="gene product"></secondaryRef>
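A small Python sketch of how such placeholder ids could be detected before an xref is created. This is not the converter's code; the helper name and the decision to treat any case variant of "None" (or an empty id) as missing are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: any case variant of "None", or an empty string, means "no id".
PLACEHOLDER_IDS = {"none", ""}


def has_usable_id(xref_element):
    """True if the xref element carries a real id rather than a placeholder."""
    xref_id = (xref_element.get("id") or "").strip()
    return xref_id.lower() not in PLACEHOLDER_IDS


# The problematic element quoted above, and the same kind of ref with a real id.
BAD = ET.fromstring(
    '<secondaryRef db="omim" dbAc="MI:0480" id="None" refTypeAc="MI:0251" '
    'refType="gene product"></secondaryRef>'
)
GOOD = ET.fromstring(
    '<secondaryRef db="omim" dbAc="MI:0480" id="154950" refTypeAc="MI:0251" '
    'refType="gene product"></secondaryRef>'
)
```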
Original comment by rod...@gmail.com
on 8 Feb 2015 at 12:50
See and compare the following query results:
http://pathwaycommons.baderlab.org/traverse?uri=http://identifiers.org/uniprot/P61244&path=EntityReference/entityReferenceOf:Protein
now returns only 32 URIs instead of 943 before the fix.
Also compare:
http://webservice.baderlab.org:48080/search?q=name:%22MAX%22&type=protein&datasource=bind
- 8 hits vs.
http://www.pathwaycommons.org/pc2/search?q=name:%22MAX%22&type=protein&datasource=bind
- 939 hits (used to be many duplicates).
http://webservice.baderlab.org:48080/get?uri=http://identifiers.org/uniprot/P61244
- the canonical ProteinReference does not contain hundreds of BioPAX comments
about REPLACED old URIs (those from BIND) anymore.
Undirected neighborhood works -
http://webservice.baderlab.org:48080/graph?source=MAX&datasource=bind&kind=neighborhood&direction=undirected&format=BINARY_SIF
Original comment by rod...@gmail.com
on 13 Feb 2015 at 2:59
Original issue reported on code.google.com by
rod...@gmail.com
on 8 Jan 2015 at 10:57