KEGG pathways share the same unification xref; one pathway has components, the other has none; both seem to be the same Thing

GoogleCodeExporter commented 9 years ago

Data source: kegg_hsa (KEGG hsa* translated with the KEGGTranslator)

E.g., UnificationXref_kegg_pathway_hsa00010 belongs to two pathways in the model 
(http://www.pathwaycommons.org/pc2/traverse?uri=http://purl.org/pc2/7/UnificationXref_kegg_pathway_hsa00010&path=Xref/xrefOf).
It looks like Pathway_307add3cea6530288cc1016267ec055b (A) is the same thing as 
Pathway_41e2e556906d0aafa01189286151a896 (B). However, it is B (the stub), not A 
(the full pathway), that is a pathwayComponent of 17 other pathways 
(http://www.pathwaycommons.org/pc2/traverse?uri=http://purl.org/pc2/7/Pathway_41e2e556906d0aafa01189286151a896&path=Pathway/pathwayComponentOf).
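
To make the symptom concrete, here is a minimal plain-Java sketch of how such pairs could be spotted: group pathways by their unification xref ID and flag groups that mix a full pathway with an empty stub. It does not use Paxtools; `PathwayStub`, `XrefDuplicateFinder`, and the component counts are hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a BioPAX Pathway; this is not the Paxtools class. */
class PathwayStub {
    final String uri;
    final String unificationXrefId; // e.g. "hsa00010"
    final int componentCount;       // number of pathwayComponent values

    PathwayStub(String uri, String unificationXrefId, int componentCount) {
        this.uri = uri;
        this.unificationXrefId = unificationXrefId;
        this.componentCount = componentCount;
    }
}

class XrefDuplicateFinder {
    /** Groups pathways by unification xref ID and keeps only groups with two or more members. */
    static Map<String, List<PathwayStub>> sharedXrefGroups(List<PathwayStub> pathways) {
        Map<String, List<PathwayStub>> byXref = new LinkedHashMap<>();
        for (PathwayStub p : pathways) {
            byXref.computeIfAbsent(p.unificationXrefId, k -> new ArrayList<>()).add(p);
        }
        byXref.values().removeIf(group -> group.size() < 2);
        return byXref;
    }

    /** True if the group contains at least one pathway with components and at least one without. */
    static boolean mixesFullAndStub(List<PathwayStub> group) {
        boolean full = false;
        boolean stub = false;
        for (PathwayStub p : group) {
            if (p.componentCount > 0) {
                full = true;
            } else {
                stub = true;
            }
        }
        return full && stub;
    }

    public static void main(String[] args) {
        // The component count of A (3) is made up; B has none, as in the report.
        List<PathwayStub> pathways = Arrays.asList(
                new PathwayStub("Pathway_307add3cea6530288cc1016267ec055b", "hsa00010", 3),
                new PathwayStub("Pathway_41e2e556906d0aafa01189286151a896", "hsa00010", 0));
        for (Map.Entry<String, List<PathwayStub>> e : sharedXrefGroups(pathways).entrySet()) {
            System.out.println(e.getKey() + " full+stub: " + mixesFullAndStub(e.getValue()));
        }
    }
}
```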

Apparently, the KEGGTranslator did the following: it generated, e.g., the 
hsa_00010.owl BioPAX file, where pathway A is fully defined with all its 
components. If a pathway has sub-pathways (refers to other hsa* IDs), then all 
of those were simply defined as black-box pathways inside the hsa_00010.owl 
file. I.e., if A is a component of another pathway (file), such as hsa00562, 
then a trivial black box B was generated in that corresponding (hsa00562) 
BioPAX model, and in other such models, and used instead of A.

Apparently, these things are worth merging. (Ideally, KEGGTranslator should have 
created only one BioPAX model from the selected input hsa* XML files instead of 
making multiple models.)

Within the current cPath2 importer design, we cannot merge all such "A"s and 
"B"s in a Cleaner, because cPath2 cleaners by design work with one input file 
at a time (the same goes for the normalizer and merger).
Also, we cannot simply normalize KEGG pathways to use 
http://identifiers.org/kegg.pathway/hsa00010 URIs 
(as we do in the cleaner for Reactome pathways, which have stable REACT_* 
unification xrefs) and then hope the Merger does the rest, because B 
might end up replacing A during simple URI-based merging.

So, I would either write special code (a hack) in the Merger or, better, a 
separate post-fix analysis to be run after all the data (not only KEGG) are 
merged. 

Shall we try to generalize this, i.e., make it useful beyond the KEGG data 
case?

Original issue reported on code.google.com by rod...@gmail.com on 12 Mar 2015 at 9:42

IgorRodchenkov commented 9 years ago

I modified Paxtools' SimpleMerger; using it with a filter, like

// SimpleMerger and a special Filter<BioPAXElement>
SimpleMerger merger = new SimpleMerger(SimpleEditorMap.L3, new Filter<BioPAXElement>() {
    public boolean filter(BioPAXElement object) {
        return object instanceof Pathway;
    }
});

helped close this issue (I've updated KeggHsaCleanerImplTest.java and Merger.java to test/use this new feature).
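
For readers unfamiliar with the change, the intended effect can be sketched in plain Java, without Paxtools. The sketch assumes that, for same-URI objects accepted by the filter, multi-valued properties such as pathwayComponent are combined rather than one copy being kept; that reading of the new feature is an assumption, and `PropertyUnionMerge` and the component names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PropertyUnionMerge {
    /**
     * Merges per-file models (pathway URI -> component URIs) by taking the
     * union of the component sets of all pathways that share a URI.
     * Input sets are copied, never modified.
     */
    static Map<String, Set<String>> mergeUnion(List<Map<String, Set<String>>> models) {
        Map<String, Set<String>> merged = new LinkedHashMap<>();
        for (Map<String, Set<String>> model : models) {
            for (Map.Entry<String, Set<String>> e : model.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new LinkedHashSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        String uri = "http://identifiers.org/kegg.pathway/hsa00010";

        Map<String, Set<String>> full = new LinkedHashMap<>();
        full.put(uri, new LinkedHashSet<>(Arrays.asList("reaction1", "reaction2"))); // made-up components

        Map<String, Set<String>> stub = new LinkedHashMap<>();
        stub.put(uri, new LinkedHashSet<>()); // black box, no components

        // The result no longer depends on which file is merged first.
        System.out.println(mergeUnion(Arrays.asList(stub, full)).get(uri));
        System.out.println(mergeUnion(Arrays.asList(full, stub)).get(uri));
    }
}
```

Note that this only helps when the full pathway and its stubs share the same URI.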

IgorRodchenkov commented 8 years ago

One last follow-up note. When a KEGG pathway with an ID such as hsa00010 is defined in its main BioPAX file (hsa_00010.owl), it has an absolute URI like "path:hsa00010"; but when the same pathway is referenced from other files, e.g., hsa_00562.owl, its simplified/dummy version has a different URI, "#pathhsa00010" (i.e., "http://www.ra.cs.uni-tuebingen.de/software/KEGGtranslator/pathhsa00010"), and the property looks like

<bp:pathwayComponent rdf:resource="#pathhsa00010" />

URIs of the same pathway must be exactly the same in all files, and should look like "http://identifiers.org/kegg.pathway/hsa00010" instead.
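
A minimal sketch of such a rewrite, assuming only the three URI shapes quoted above need handling. `KeggPathwayUriNormalizer` is a hypothetical helper, not part of cPath2 or Paxtools, and the five-digit hsa ID pattern is an assumption based on the examples in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeggPathwayUriNormalizer {
    private static final String PREFIX = "http://identifiers.org/kegg.pathway/";

    // Matches "path:hsa00010", "#pathhsa00010", and ".../KEGGtranslator/pathhsa00010".
    // Assumption: pathway IDs are "hsa" followed by exactly five digits.
    private static final Pattern KEGG_PATH =
            Pattern.compile("(?:^|[#/])path:?(hsa\\d{5})$");

    /** Returns the identifiers.org URI for a recognized KEGG pathway URI; other URIs are returned unchanged. */
    static String normalize(String uri) {
        Matcher m = KEGG_PATH.matcher(uri);
        return m.find() ? PREFIX + m.group(1) : uri;
    }

    public static void main(String[] args) {
        System.out.println(normalize("path:hsa00010"));
        System.out.println(normalize("#pathhsa00010"));
        System.out.println(normalize(
                "http://www.ra.cs.uni-tuebingen.de/software/KEGGtranslator/pathhsa00010"));
    }
}
```

Applied per file in a cleaner, a rewrite like this would give the full pathway and all of its stubs one shared URI, which is the precondition for merging them by URI later.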

PathwayCommons / cpath2 #205