PathwayCommons / cpath2

Biological pathway data integration and access platform (Pathway Commons)
http://www.pathwaycommons.org/pc2/
MIT License

Examples of large pathways #267

Closed: d2fong closed this issue 7 years ago

d2fong commented 7 years ago

@IgorRodchenkov @gbader

Here are some of the large pathways, sorted by number of nodes:

[ 
    [
        "http___pathwaycommons.org_pc2_Pathway_b3ac7d02900640422051a6eeda9fc5c3.xml",
        11673
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_be83809c98b3bd533c1eb159ac1e140f.xml",
        8843
    ],
    [
        "http___identifiers.org_kegg.pathway_hsa01100.xml",
        7871
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_8ceef61cf5c6bc7c8f8cc72a779c815f.xml",
        7413
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_2f027cba94185f4f2743c3cf1d3a8e5c.xml",
        6808
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_4c814fc3d6dc7fcea1239baba9b42dce.xml",
        6194
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_2c39e3007cbc93df7c32215a783fc9b2.xml",
        5878
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_846aa0b67388d6e3aee58bbfdf25b07b.xml",
        5700
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_e6b30fcf07413aa04c1fe30136f1e18b.xml",
        5062
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_857a841247f0de6bbd4345f37f9a90c7.xml",
        5043
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_14cbec1fe4c053859e09d6ad777965ad.xml",
        5025
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_7c48d51ca7d699ff282f0e522f3d653f.xml",
        4814
    ],
    [
        "http___pathwaycommons.org_pc2_Pathway_7b499992619fef06658d50a9cd91bede.xml",
        4452
    ],
    [
        "http___identifiers.org_reactome_R-HSA-452723.xml",
        4076
    ]
]

IgorRodchenkov commented 7 years ago

Let me explain what this is. The files listed above were generated by a shell script (using jq and wget) as follows. For each Pathway URI in the beta PC9 db, we got the BioPAX data using the beta.pathwaycommons.org/pc2/get?uri=.. web query, which returns the pathway's BioPAX model (excluding sub-pathways). File names were created from the corresponding pathway URIs by replacing '/' and ':' (with '_', as the names above show). The data were then converted to SBGN with java and the paxtools.jar toSBGN command (no layout applied).
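To make the naming and query steps concrete, here is a minimal Python sketch (the original was a shell script, which is not reproduced here). The function names `uri_to_filename` and `get_query_url` are hypothetical, and the choice to percent-encode the URI in the query string is an assumption; only the '/' and ':' replacement and the `get?uri=` endpoint come from the description above.

```python
from urllib.parse import urlencode

# Base of the beta web service "get" command, as quoted in the comment above.
BETA_GET = "http://beta.pathwaycommons.org/pc2/get"


def uri_to_filename(uri: str, ext: str = ".xml") -> str:
    """Turn a pathway URI into a flat file name by replacing '/' and ':' with '_'."""
    return uri.replace("/", "_").replace(":", "_") + ext


def get_query_url(uri: str, base: str = BETA_GET) -> str:
    """Build the get?uri=... query for one pathway URI (URI is percent-encoded)."""
    return base + "?" + urlencode({"uri": uri})


if __name__ == "__main__":
    uri = "http://identifiers.org/reactome/R-HSA-452723"
    print(uri_to_filename(uri))  # http___identifiers.org_reactome_R-HSA-452723.xml
    print(get_query_url(uri))
```

The resulting file name matches the last entry in the list above, which suggests the replacement character was indeed '_'.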

But the question here is simply why the pathways are "too large" ;) That does not sound like a problem to me, and it's unclear what exactly the question is here... But I have some ideas.

Well, to begin with, let's look into the first and the last pathway. I am going to analyse some properties and numbers for these two pathways to tell whether something suspicious is in the beta PC9 db but not in PC8, or in both.

To be continued...

IgorRodchenkov commented 7 years ago

1. http___pathwaycommons.org_pc2_Pathway_b3ac7d02900640422051a6eeda9fc5c3.xml

Pathway name is "CAGGTG_V$E12_Q6" (it's a pathway from, as we call it, TRANSFAC - MSigDB v5.2 C3 dataset): http://beta.pathwaycommons.org/pc2/traverse?path=Pathway/name&uri=http://pathwaycommons.org/pc2/Pathway_b3ac7d02900640422051a6eeda9fc5c3

We need to know its name to find and analyse the same pathway in both PC8 and PC9. URIs could differ between releases, but here they are in fact the same, because they were generated the same way from the same original MSigDB v5.2 C3 xml data. That is not always the case for other entities, i.e., it is not guaranteed unless the original URI was, or could obviously be set to, a standard valid Identifiers.org one:

http://www.pathwaycommons.org/pc2/search?q=name:%22CAGGTG_V$E12_Q6%22&type=pathway
http://beta.pathwaycommons.org/pc2/search?q=name:%22CAGGTG_V$E12_Q6%22&type=pathway

You can see that the number of processes is the same in both cases: 7830 ('size' in PC8 means the same as 'numProcesses' in PC9).
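When comparing search hits from the two releases in a script, the differing field names can be normalised with a small helper. This is only a sketch: the helper name `process_count` is hypothetical, and representing a hit as a plain dict (e.g. parsed from a JSON response) is an assumption; only the 'size' and 'numProcesses' field names come from the comment above.

```python
def process_count(hit: dict) -> int:
    """Return the number of processes from a PC9-style or PC8-style search hit."""
    if "numProcesses" in hit:  # field name used by beta PC9
        return int(hit["numProcesses"])
    if "size" in hit:  # field name used by PC8
        return int(hit["size"])
    raise KeyError("hit has neither 'numProcesses' nor 'size'")


if __name__ == "__main__":
    # Hand-written example hits carrying the count reported above.
    pc8_hit = {"name": "CAGGTG_V$E12_Q6", "size": 7830}
    pc9_hit = {"name": "CAGGTG_V$E12_Q6", "numProcesses": 7830}
    print(process_count(pc8_hit) == process_count(pc9_hit))  # True
```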

So, no surprises here so far.

PS: shall we update beta PC9 to use the MSigDB v6.0 C3 dataset?

To be continued (I'll look into the original data file)...

d2fong commented 7 years ago

Okay, I removed some of the smaller ones from this list.

IgorRodchenkov commented 7 years ago

(see previous comments) ... this pathway is the result of converting the following gene set:

<GENESET STANDARD_NAME="CAGGTG_V$E12_Q6" SYSTEMATIC_NAME="M5864"... DESCRIPTION_BRIEF="Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif CAGGTG which matches annotation for TCF3: transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47)"...

For each TFT gene set, the data converter creates a separate pathway, identified by its unique name; within that pathway, the transcription factor positively regulates the transcription of the targets listed in the corresponding gene set. I.e., for each TF(s), it generates a new TemplateReactionRegulation (TRR), where the same transcription factor (TF) controls a TemplateReaction (TR)... All these processes become pathway components of the pathway (CAGGTG_V$E12_Q6), which makes it very large.
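A toy model may help show why such a pathway grows so large. The sketch below uses plain Python dataclasses, not the Paxtools API or the actual converter code, and it assumes one TR plus one TRR per target gene, with both added as pathway components; that reading of the description above is an assumption, and all class and function names here are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateReaction:
    product: str  # target gene whose transcription is modelled


@dataclass
class TemplateReactionRegulation:
    controller: str  # the transcription factor
    controlled: TemplateReaction
    control_type: str = "ACTIVATION"  # "positively regulates"


@dataclass
class Pathway:
    name: str
    components: list = field(default_factory=list)


def gene_set_to_pathway(name: str, tf: str, targets: list) -> Pathway:
    """Build one pathway per gene set: the same TF controls a TR for every target."""
    pathway = Pathway(name)
    for gene in targets:
        tr = TemplateReaction(product=gene)
        trr = TemplateReactionRegulation(controller=tf, controlled=tr)
        pathway.components.extend([tr, trr])  # assumed: both become components
    return pathway


if __name__ == "__main__":
    # Placeholder target names; the real gene set lists thousands of genes.
    p = gene_set_to_pathway("CAGGTG_V$E12_Q6", "TCF3", ["GENE_A", "GENE_B", "GENE_C"])
    print(len(p.components))  # 6
```

Under these assumptions the component count grows linearly with the gene set size, so a gene set with thousands of targets yields thousands of processes without anything being wrong.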

I don't see a problem here, i.e., nothing fixable related to the large size...

PS: I do wonder, though, why the TFs (controllers of TRRs) are modelled there as Rna rather than Protein. It looks like a data converter issue to me, so I created PathwayCommons/msigdb-to-biopax#2.

IgorRodchenkov commented 7 years ago

2. http___identifiers.org_reactome_R-HSA-452723.xml

Pathway URI: http://identifiers.org/reactome/R-HSA-452723
Name: "Transcriptional regulation of pluripotent stem cells"

Find by xref/id in PC8 and PC9:

http://www.pathwaycommons.org/pc2/search?q=xrefid:R?HSA?452723&type=pathway
http://beta.pathwaycommons.org/pc2/search?q=xrefid:R?HSA?452723&type=pathway

To be continued...

IgorRodchenkov commented 7 years ago

Analysed the R-HSA-452723 pathway (got the BioPAX from PC9 and ran, e.g., the java paxtools.jar summarize and toSBGN commands).

Now, this issue reminds me of the old, annoying 'nextStep' problem once again... though it's not so trivial this time.

There are 519 PathwayStep objects in the beta PC9 R-HSA-452723 pathway's model, even though the only Pathway object in there has just a couple of dozen processes in its pathwayComponent and pathwayOrder properties. No good.

I suspected the Completer first, but the nextStep property in the interface has the @AutoComplete(forward=false) annotation, which also implies backward=false... so this should work. So, it might be a bug in the Cloner class...
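To illustrate the kind of effect suspected here, the toy sketch below shows how following nextStep links while collecting a pathway's objects can pull in steps that belong to other pathways. It is not the Paxtools Completer or Cloner code; the function name `collect_steps`, the step identifiers, and the dict-based graph are all hypothetical.

```python
from collections import deque


def collect_steps(pathway_order, next_step, follow_next_step):
    """Collect PathwayStep ids starting from a pathway's own pathwayOrder steps.

    next_step maps a step id to the ids of the steps it points to via nextStep.
    """
    seen = set(pathway_order)
    if not follow_next_step:
        return seen  # only the pathway's own steps
    queue = deque(pathway_order)
    while queue:
        step = queue.popleft()
        for nxt in next_step.get(step, ()):
            if nxt not in seen:  # avoid revisiting steps in cyclic chains
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    own_steps = ["s1", "s2"]
    # "s2" links into a chain of steps owned by some other pathway.
    links = {"s1": ["s2"], "s2": ["x1"], "x1": ["x2"], "x2": ["x3"]}
    print(len(collect_steps(own_steps, links, follow_next_step=False)))  # 2
    print(len(collect_steps(own_steps, links, follow_next_step=True)))   # 5
```

If some code path traverses nextStep despite the annotation, a sub-model for a pathway with a couple of dozen processes could end up with hundreds of PathwayStep objects, which is consistent with the 519 observed.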

Debugging...

IgorRodchenkov commented 7 years ago

It (the Reactome pathways case) is fixed in PC8 now...

IgorRodchenkov commented 7 years ago

Fixed in beta PC9