PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

There're probably many duplicate Protein objects (from PSIMI conv?) #186

Closed (GoogleCodeExporter closed this issue 9 years ago)

GoogleCodeExporter commented 9 years ago
As Ozgun recently noticed (in an email): "There is one RelationshipXref object for 
gene symbol TP53, and one related ProteinReference. However, there are 3800 
related Protein objects. Looks like we are duplicating Protein objects for each 
molecular interaction. We should merge these proteins if they are equivalent in 
terms of protein modifications. This is probably a psi-mi converter issue, 
right?"

See also:
http://webservice.baderlab.org:48080/traverse?path=EntityReference/entityReferenceOf&uri=http://identifiers.org/uniprot/P04637
(P04637 is TP53, a ProteinReference)
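
For reference, the traverse link above has just two query parameters, `path` and `uri`. Here is a minimal sketch of building such links; the helper name `traverse_url` is hypothetical, and the host is simply the one used in the links in this issue:

```python
from urllib.parse import urlencode

# Host and endpoint copied from the links in this issue.
BASE = "http://webservice.baderlab.org:48080/traverse"


def traverse_url(path, uri, base=BASE):
    """Build a traverse query URL (hypothetical helper, not part of cpath2)."""
    # Keep ':' and '/' unescaped so the result matches the links as written here.
    query = urlencode({"path": path, "uri": uri}, safe=":/")
    return f"{base}?{query}"


TP53 = "http://identifiers.org/uniprot/P04637"
print(traverse_url("EntityReference/entityReferenceOf", TP53))
```

Changing only the `path` argument gives other property paths for the same ProteinReference.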

It's possible that this problem originates in, and can be fixed in, Paxtools' 
psimi-converter (otherwise, we'd try to fix it in the validator or here in cpath2).
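
The merge suggested in the quoted email (treat Protein objects as equivalent when they share a ProteinReference and the same modifications) could look roughly like the sketch below. This is only an illustration: `ProteinRecord`, `merge_key` and `group_equivalent` are hypothetical names, the records are simplified stand-ins rather than the Paxtools model classes, and the feature strings are made-up placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    """Hypothetical, simplified stand-in for a BioPAX Protein (not the Paxtools API)."""
    uri: str
    entity_reference: str              # URI of the ProteinReference
    features: frozenset = frozenset()  # modification features as plain strings


def merge_key(protein):
    # Assumption: equivalence means same reference and same set of modifications.
    return (protein.entity_reference, protein.features)


def group_equivalent(proteins):
    """Group proteins that could be merged into a single object."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[merge_key(protein)].append(protein)
    return list(groups.values())


TP53_REF = "http://identifiers.org/uniprot/P04637"
sample = [
    ProteinRecord("ex:p1", TP53_REF),
    ProteinRecord("ex:p2", TP53_REF),
    ProteinRecord("ex:p3", TP53_REF, frozenset({"modification-A"})),  # placeholder feature
]
groups = group_equivalent(sample)
print(len(groups))  # two groups: unmodified (p1, p2) and modified (p3)
```

Each resulting group would then be replaced by one representative Protein, with the interactions re-pointed to it; whether that belongs in the converter, the validator or cpath2 is the open question here.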

Original issue reported on code.google.com by rod...@gmail.com on 14 Aug 2014 at 7:25

GoogleCodeExporter commented 9 years ago
More info:

Those Protein objects come from 10 different original data sources (see: 
http://webservice.baderlab.org:48080/traverse?path=EntityReference/entityReferenceOf/dataSource&uri=http://identifiers.org/uniprot/P04637);

and there are only 19 distinct 'displayName' values (converted from the 
corresponding PSIMI 'shortLabel' property) and 97 distinct 'name' property 
values (which include the display and standard names);

and many of those proteins have an overly long 'standardName' (it reads like a 
comment), which is converted from the PSIMI 'fullName' property of an 
interaction participant...
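
Counts like these can be reproduced with a small helper once the Protein objects are available as plain records. The sketch below assumes each record is a dict with a string 'displayName' and a list of 'name' values; that layout, the function name `distinct_values` and the sample values are all hypothetical.

```python
def distinct_values(records, prop):
    """Collect the distinct values of one property across all records."""
    values = set()
    for record in records:
        value = record.get(prop)
        if value is None:
            continue
        if isinstance(value, str):
            values.add(value)      # single-valued property
        else:
            values.update(value)   # multi-valued property (list or set)
    return values


# Made-up sample records, for illustration only.
records = [
    {"displayName": "label-1", "name": ["label-1", "long name one"]},
    {"displayName": "label-1", "name": ["label-1", "long name two"]},
    {"displayName": "label-2", "name": ["label-2"]},
]
print(len(distinct_values(records, "displayName")))
print(len(distinct_values(records, "name")))
```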

Original comment by rod...@gmail.com on 14 Aug 2014 at 11:10

GoogleCodeExporter commented 9 years ago

Original comment by rod...@gmail.com on 10 Oct 2014 at 12:01