Closed by LucaCappelletti94 2 years ago
Dear Luca, sorry for the delayed response, most of our group was on holiday the past month.
Thank you very much for the suggestion and interest. Currently, the database is also available in GraphML (XML) format on the website, which is already easier to process than starting the Neo4j database. However, we are aware that GraphML isn't the most comfortable format to work with. A TSV version is definitely possible and we will evaluate formatting the heterogeneous node and edge properties in TSV tomorrow. We'll let you know once a TSV version is available for download!
When the TSV version is available, I will shortly provide a graph report (it often detects oddities that may require addressing), and afterwards, we can go for a few node embeddings! Looking forward to it!
Dear Luca,
we created a TSV version of the database with one file for the nodes and one for the edges. The most common properties of all nodes and edges respectively are directly available as columns and all remaining properties are provided as a column with JSON values.
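For readers who want to load such a file, here is a minimal sketch of parsing a node TSV whose last column holds JSON. The column names (`node_id`, `labels`, `name`, `properties`), the `|` label separator and the sample rows are assumptions for illustration only; the actual PharMeBINet files may be laid out differently.

```python
import csv
import io
import json

# Hypothetical sample: the real column names in the PharMeBINet TSV may differ.
SAMPLE_NODES_TSV = (
    "node_id\tlabels\tname\tproperties\n"
    '1\tGene\tTP53\t{"chromosome": "17"}\n'
    "2\tChemical|Compound\tAspirin\t{}\n"
)


def read_nodes(handle):
    """Yield one dict per node, splitting labels on '|' and decoding the JSON column."""
    # QUOTE_NONE keeps the double quotes inside the JSON column untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        row["labels"] = row["labels"].split("|")
        row["properties"] = json.loads(row["properties"]) if row["properties"] else {}
        yield row


nodes = list(read_nodes(io.StringIO(SAMPLE_NODES_TSV)))
for node in nodes:
    print(node["node_id"], node["labels"], node["properties"])
```

Replacing `io.StringIO(...)` with `open(path, newline="", encoding="utf-8")` would apply the same reader to a file on disk.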
We are looking forward to your report and general feedback and hope that the TSV database may be helpful in your research!
On it! Thank you!
Hello, I have tried to download the file hosted on Zenodo, now three times, but somehow:
Would it be possible to host it somewhere else? Also, consider using a .tar.gz extension, if possible.
The upload was very slow earlier as well, so maybe Zenodo had some issues. I downloaded the file (which was fast again) but noticed a header error in the zip file. I created a fresh tar.gz archive, uploaded a new version and rechecked the downloaded file from Zenodo. Everything should now be correct: 10.5281/zenodo.7011027
Ok, I also managed to download it! I should get back to you with the report and a first quick node embedding within the hour, barring explosions.
I assume it is a directed graph, right? Here is the graph report for the directed version. In the following comment, I will share the undirected version.
The directed multigraph PharMeBINet has 2.87M heterogeneous nodes and 15.88M heterogeneous edges. The RAM requirements for the nodes and edges data structures are 225.51MB and 42.87MB respectively.
The minimum node degree is 0, the maximum node degree is 34.28K, the mode degree is 0, the mean degree is 5.53 and the node degree median is 0.
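As a rough illustration of how degree statistics like these could be reproduced from an edge list, here is a small standard-library sketch. It assumes that "degree" in the directed report means the number of outgoing edges, which is a guess based on the mode and median being 0; `degree_summary` and the toy edges are hypothetical and not part of the report tooling.

```python
from collections import Counter
from statistics import mean, median, mode


def degree_summary(num_nodes, edges):
    """Summarise out-degrees for nodes 0..num_nodes-1 given (source, destination) pairs."""
    out_degree = Counter(src for src, _ in edges)
    # Nodes that never appear as a source get degree 0.
    degrees = [out_degree.get(node, 0) for node in range(num_nodes)]
    return {
        "min": min(degrees),
        "max": max(degrees),
        "mode": mode(degrees),
        "mean": mean(degrees),
        "median": median(degrees),
    }


# Toy multigraph: parallel edges (0, 2) are counted twice, as in a multigraph.
toy_edges = [(0, 1), (0, 2), (0, 2), (1, 2)]
summary = degree_summary(4, toy_edges)
print(summary)
```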
The nodes with the highest degree centrality are 2339298 (degree 34.28K and node types Chemical and Compound), 2339077 (degree 24.82K and node types Chemical and Compound), 2853717 (degree 23.85K and node type Anatomy), 2853780 (degree 23.58K and node type Anatomy) and 2853431 (degree 23.10K and node type Anatomy).
The graph has 58 node types, of which the 10 most common are Variant (1.54M nodes, 53.80%), GeneVariant (1.54M nodes, 53.75%), SingleNucleotideVariant (1.31M nodes, 45.79%), Interaction (637.48K nodes, 22.22%), Product (272.00K nodes, 9.48%), Chemical (184.51K nodes, 6.43%), Deletion (88.33K nodes, 3.08%), Gene (74.84K nodes, 2.61%), Phenotype (56.93K nodes, 1.98%) and Duplication (41.38K nodes, 1.44%). The node types are multi-label, and the node with most node types has 5 node types. The RAM requirement for the node types data structure is 183.65MB.
Singleton node types are node types that are assigned exclusively to a single node, making the node type relatively meaningless, as it adds no more information than the node name itself. The graph contains 2 singleton node types, which are DistinctChromosomes (2857429 (degree 2 and node types DistinctChromosomes, Genotype and Variant)) and TandemDuplication (2861979 (node types GeneVariant, TandemDuplication and Variant)).
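The definition above can be sketched in a few lines: count how many nodes carry each type and keep the types with a count of one. The function name and the toy mapping are illustrative assumptions, not the report's implementation.

```python
from collections import Counter


def singleton_node_types(node_types):
    """Return the node types that are assigned to exactly one node."""
    # set(labels) guards against a label being listed twice on the same node.
    counts = Counter(label for labels in node_types.values() for label in set(labels))
    return sorted(label for label, count in counts.items() if count == 1)


# Toy multi-label assignment: node id -> list of node types.
toy_types = {
    "a": ["Variant", "GeneVariant"],
    "b": ["Variant", "TandemDuplication"],
    "c": ["Gene"],
    "d": ["Gene"],
}
singletons = singleton_node_types(toy_types)
print(singletons)
```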
The graph has 208 edge types, of which the 10 most common are RESEMBLES_CrC (4.60M edges, 28.97%), HAS_GhGV (2.69M edges, 16.93%), MIGHT_SUBCELLULAR_LOCATES_ImslCC (1.77M edges, 11.18%), INTERACTS_CiC (1.38M edges, 8.70%), INTERACTS_PiI (637.48K edges, 4.01%), INTERACTS_IiP (637.48K edges, 4.01%), EXPRESSES_AeG (526.18K edges, 3.31%), HAS_ChPR (404.03K edges, 2.54%), REGULATES_GrG (265.67K edges, 1.67%) and DOWNREGULATES_CHdG (214.26K edges, 1.35%). The RAM requirement for the edge types data structure is 63.54MB.
Singleton edge types are edge types that are assigned exclusively to a single edge, making the edge type relatively meaningless, as it adds no more information than the name of the edge itself. The graph contains 8 edges with singleton edge types, which are REGULATES_CHrCH, INCREASES_DEGENERATION_GidCH, NOT_ACTS_UPSTREAM_OF_OR_WITHIN_NEGATIVE_EFFECT_PnauoowneBP, IS_ACTIVE_ON_DNA_OR_RNA_LEVEL_PiaodorlCH, PRECEDING_REACTION_RLEprPW, ASSOCIATES_TO_TOXICITY_ADR_VattaCH, DOWNREGULATES_GdCH and NOT_ACTS_UPSTREAM_OF_OR_WITHIN_NEGATIVE_EFFECT_GnauoowneBP.
A topological oddity is a set of nodes in the graph that may be derived by an error during the generation of the edge list of the graph and, depending on the task, could bias the results of topology-based models. Note that in a directed graph we only support the detection of isomorphic nodes. In the following paragraph, we will describe the detected topological oddities.
A singleton node is a node disconnected from all other nodes. We have detected 215.66K singleton nodes in the graph, involving a total of 215.66K nodes (7.52%). The detected singleton nodes are:
4592 (node type Gene)
6288 (node type Gene)
6368 (node type Gene)
7712 (node type Gene)
13056 (node type Gene)
15104 (node type Gene)
18048 (node type Gene)
18592 (node type Gene)
22528 (node type Gene)
24320 (node type Gene)
27040 (node type Gene)
31920 (node type Gene)
33040 (node type Gene)
51472 (node types BlackBoxEvent and ReactionLikeEvent)
77504 (node types GeneVariant, SingleNucleotideVariant and Variant)
And other 215.64K singleton nodes.
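A minimal sketch of how singleton nodes could be found independently of the report: collect every node that appears in at least one edge and return the rest. Treating a self-loop as a connection is an assumption made here for simplicity, and the names are illustrative.

```python
def singleton_nodes(nodes, edges):
    """Return the nodes that do not appear in any edge, in input order."""
    connected = set()
    for src, dst in edges:
        # Both endpoints count as connected, regardless of edge direction.
        connected.add(src)
        connected.add(dst)
    return [node for node in nodes if node not in connected]


toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 0)]
isolated = singleton_nodes(toy_nodes, toy_edges)
print(isolated)
```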
Isomorphic groups are nodes with exactly the same neighbours and node types (if present in the graph). Nodes in such groups are topologically indistinguishable, that is swapping their ID would not change the graph topology. We have detected 26.75K isomorphic node groups in the graph, involving a total of 212.75K nodes (7.41%) and 1.27M edges (8.00%), with the largest one involving 1.15K nodes and 6.13K edges. The detected isomorphic node groups, sorted by decreasing size, are:
Group with 42 nodes (degree 146 and node type Gene): 140420, 140426, 140442, 140441, 140458 and other 37.
Group with 1.15K nodes (degree 5 and node type Interaction): 360595, 359714, 359322, 359274, 359162 and other 1.14K.
Group with 566 nodes (degree 5 and node type Interaction): 693913, 695050, 694151, 694544, 694205 and other 561.
Group with 552 nodes (degree 5 and node type Interaction): 685866, 685966, 686221, 686351, 686515 and other 547.
Group with 465 nodes (degree 5 and node type Interaction): 516631, 516145, 516805, 516789, 516773 and other 460.
Group with 413 nodes (degree 5 and node type Interaction): 385360, 385428, 385417, 385433, 385449 and other 408.
Group with 310 nodes (degree 6 and node type Interaction): 694958, 694667, 694081, 693937, 694705 and other 305.
Group with 12 nodes (degree 154 and node type Gene): 24503, 22147, 2180, 9586, 18095 and other 7.
Group with 307 nodes (degree 6 and node type Interaction): 383975, 384094, 384001, 383880, 383766 and other 302.
Group with 333 nodes (degree 5 and node type Interaction): 358008, 357998, 358044, 358138, 358106 and other 328.
Group with 230 nodes (degree 7 and node type Interaction): 619194, 619197, 619086, 619081, 619294 and other 225.
Group with 314 nodes (degree 5 and node type Interaction): 481979, 482445, 482126, 482370, 482386 and other 309.
Group with 309 nodes (degree 5 and node type Interaction): 5376, 5555, 5723, 5705, 5648 and other 304.
Group with 303 nodes (degree 5 and node type Interaction): 372866, 372471, 372734, 372672, 372632 and other 298.
Group with 251 nodes (degree 6 and node type Interaction): 360137, 360063, 359758, 360296, 360525 and other 246.
And other 26.73K isomorphic node groups.
Analogous undirected version with more information (sorry for the repost, but I had forgotten to add the parameter to split the node types on the | separator).
The undirected multigraph PharMeBINet has 2.87M heterogeneous nodes and 15.88M heterogeneous edges. The graph contains 216.12K connected components (of which 215.66K are disconnected nodes), with the largest one containing 2.65M nodes and the smallest one containing a single node. The RAM requirements for the nodes and edges data structures are 225.51MB and 81.76MB respectively.
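For a sanity check on component counts like these, a union-find pass over the edge list is one common approach. The sketch below assumes nodes are dense integer IDs from 0 to `num_nodes - 1`; it is not necessarily how the report computes its numbers.

```python
from collections import Counter


def component_sizes(num_nodes, edges):
    """Return connected component sizes, largest first, treating edges as undirected."""
    parent = list(range(num_nodes))

    def find(node):
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    sizes = Counter(find(node) for node in range(num_nodes))
    return sorted(sizes.values(), reverse=True)


sizes = component_sizes(6, [(0, 1), (1, 2), (3, 4)])
print(len(sizes), "components; largest:", sizes[0], "singletons:", sizes.count(1))
```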
The minimum node degree is 0, the maximum node degree is 535.60K, the mode degree is 1, the mean degree is 11.07 and the node degree median is 1.
The nodes with the highest degree centrality are 323312 (degree 535.60K and node type CellularComponent), 323245 (degree 455.25K and node type CellularComponent), 323255 (degree 324.33K and node type CellularComponent), 323405 (degree 182.48K and node type CellularComponent) and 323313 (degree 149.46K and node type CellularComponent).
The graph has 58 node types, of which the 10 most common are Variant (1.54M nodes, 53.80%), GeneVariant (1.54M nodes, 53.75%), SingleNucleotideVariant (1.31M nodes, 45.79%), Interaction (637.48K nodes, 22.22%), Product (272.00K nodes, 9.48%), Chemical (184.51K nodes, 6.43%), Deletion (88.33K nodes, 3.08%), Gene (74.84K nodes, 2.61%), Phenotype (56.93K nodes, 1.98%) and Duplication (41.38K nodes, 1.44%). The node types are multi-label, and the node with most node types has 5 node types. The RAM requirement for the node types data structure is 183.65MB.
Singleton node types are node types that are assigned exclusively to a single node, making the node type relatively meaningless, as it adds no more information than the node name itself. The graph contains 2 singleton node types, which are DistinctChromosomes (2857429 (degree 2 and node types DistinctChromosomes, Genotype and Variant)) and TandemDuplication (2861979 (degree 1 and node types GeneVariant, TandemDuplication and Variant)).
The graph has 208 edge types, of which the 10 most common are RESEMBLES_CrC (9.20M edges, 28.97%), HAS_GhGV (5.38M edges, 16.93%), MIGHT_SUBCELLULAR_LOCATES_ImslCC (3.55M edges, 11.18%), INTERACTS_CiC (2.76M edges, 8.70%), INTERACTS_IiP (1.27M edges, 4.02%), INTERACTS_PiI (1.27M edges, 4.02%), EXPRESSES_AeG (1.05M edges, 3.31%), HAS_ChPR (808.06K edges, 2.54%), REGULATES_GrG (527.96K edges, 1.66%) and DOWNREGULATES_CHdG (428.53K edges, 1.35%). The RAM requirement for the edge types data structure is 127.04MB.
A topological oddity is a set of nodes in the graph that may be derived by an error during the generation of the edge list of the graph and, depending on the task, could bias the results of topology-based models. In the following paragraph, we will describe the detected topological oddities.
A singleton node is a node disconnected from all other nodes. We have detected 215.66K singleton nodes in the graph, involving a total of 215.66K nodes (7.52%). The detected singleton nodes are:
4592 (node type Gene)
6288 (node type Gene)
6368 (node type Gene)
7712 (node type Gene)
13056 (node type Gene)
15104 (node type Gene)
18048 (node type Gene)
18592 (node type Gene)
22528 (node type Gene)
24320 (node type Gene)
27040 (node type Gene)
31920 (node type Gene)
33040 (node type Gene)
51472 (node types BlackBoxEvent and ReactionLikeEvent)
77504 (node types GeneVariant, SingleNucleotideVariant and Variant)
And other 215.64K singleton nodes.
A node tuple is a connected component composed of two nodes. We have detected 196 node tuples in the graph, involving a total of 392 nodes (0.01%) and 196 edges. The detected node tuples are:
Node tuple containing the nodes 2351487 (node types Chemical and Compound) and 2625727 (node type Product).
Node tuple containing the nodes 2829127 (node types GeneVariant, SingleNucleotideVariant and Variant) and 2843527 (node types VariantAnnotation and VariantFunctionalAnalysisAnnotation).
Node tuple containing the nodes 2809611 (node type PharmacologicClass) and 2810667 (node type PharmacologicClass).
Node tuple containing the nodes 2625675 (node type Product) and 2351439 (node types Chemical and Compound).
Node tuple containing the nodes 2600635 (node type Product) and 2347079 (node types Chemical and Compound).
Node tuple containing the nodes 2600611 (node type Product) and 2347067 (node types Chemical and Compound).
Node tuple containing the nodes 2832957 (node types GeneVariant, SingleNucleotideVariant and Variant) and 158559 (node type Gene).
Node tuple containing the nodes 2831357 (node types GeneVariant, SingleNucleotideVariant and Variant) and 158207 (node type Gene).
Node tuple containing the nodes 2830877 (node types GeneVariant, SingleNucleotideVariant and Variant) and 164763 (node type Gene).
Node tuple containing the nodes 2820493 (node types Phenotype and SideEffect) and 2862284 (node type Phenotype).
Node tuple containing the nodes 2809773 (node type PharmacologicClass) and 2810653 (node type PharmacologicClass).
Node tuple containing the nodes 2809021 (node type PharmacologicClass) and 2809043 (node type PharmacologicClass).
Node tuple containing the nodes 2620973 (node type Product) and 2350011 (node types Chemical and Compound).
Node tuple containing the nodes 2600781 (node type Product) and 2347111 (node types Chemical and Compound).
Node tuple containing the nodes 2600461 (node type Product) and 2347031 (node types Chemical and Compound).
And other 181 node tuples.
Isomorphic groups are nodes with exactly the same neighbours and node types (if present in the graph). Nodes in such groups are topologically indistinguishable, that is swapping their ID would not change the graph topology. We have detected 5.13K isomorphic node groups in the graph, involving a total of 21.77K nodes (0.76%) and 589.58K edges (1.86%), with the largest one involving 1.69K nodes and 20.30K edges. The detected isomorphic node groups, sorted by decreasing size, are:
Group with 15 nodes (degree 1.35K and node types CopyNumberGain, GeneVariant and Variant): 2286585, 2288568, 2288455, 2314789, 2283894 and other 10.
Group with 8 nodes (degree 1.55K and node types CopyNumberGain, GeneVariant and Variant): 2282030, 2288578, 2287834, 2286127, 2288765 and other 3.
Group with 15 nodes (degree 820 and node types CopyNumberLoss, GeneVariant and Variant): 2338469, 2316715, 2289404, 2289400, 2295260 and other 10.
Group with 14 nodes (degree 821 and node types CopyNumberGain, GeneVariant and Variant): 2289365, 2332207, 2316704, 2303583, 2289418 and other 9.
Group with 8 nodes (degree 1.35K and node types CopyNumberLoss, GeneVariant and Variant): 2312556, 2287572, 2310146, 2288771, 2316247 and other 3.
Group with 21 nodes (degree 486 and node types CopyNumberGain, GeneVariant and Variant): 2284409, 2287545, 2285830, 2284080, 2284027 and other 16.
Group with 1.69K nodes (degree 5 and node type Product): 2541426, 2541239, 2541304, 2536703, 2541352 and other 1.69K.
Group with 4 nodes (degree 1.46K and node types CopyNumberGain, GeneVariant and Variant): 2285962, 2283118, 2288293 and 2287799.
Group with 5 nodes (degree 1.14K and node types CopyNumberGain, GeneVariant and Variant): 2294089, 2298060, 2294095, 2298058 and 2298181.
Group with 4 nodes (degree 1.35K and node types CopyNumberLoss, GeneVariant and Variant): 2316321, 2288865, 2282712 and 2311574.
Group with 5 nodes (degree 1.03K and node types CopyNumberGain, GeneVariant and Variant): 2300376, 2300375, 2294994, 2295008 and 2300373.
Group with 6 nodes (degree 811 and node types CopyNumberGain, GeneVariant and Variant): 2297898, 2294799, 2297958, 2294798, 2290948 and another one.
Group with 53 nodes (degree 87 and node types CopyNumberGain, GeneVariant and Variant): 2283559, 2315991, 2311900, 2282595, 2315362 and other 48.
Group with 5 nodes (degree 896 and node types CopyNumberGain, GeneVariant and Variant): 2294652, 2305639, 2294654, 2300737 and 2300764.
Group with 8 nodes (degree 552 and node types CopyNumberGain, GeneVariant and Variant): 2282038, 2283630, 2288475, 2286388, 2282819 and other 3.
And other 5.11K isomorphic node groups.
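The isomorphic-group definition above can be approximated by bucketing nodes on the pair (set of neighbours, set of node types). This is only a sketch for the undirected case: it ignores edge types and edge multiplicity, skips nodes without edges, and the names are illustrative.

```python
from collections import defaultdict


def isomorphic_groups(edges, node_types):
    """Group nodes sharing the same neighbour set and node types (undirected view)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    groups = defaultdict(list)
    for node, adjacent in neighbours.items():
        key = (frozenset(adjacent), frozenset(node_types.get(node, ())))
        groups[key].append(node)

    # Only buckets with at least two nodes are interesting; largest groups first.
    found = [sorted(group) for group in groups.values() if len(group) > 1]
    return sorted(found, key=len, reverse=True)


toy_edges = [(0, 2), (1, 2), (0, 3), (1, 3), (4, 2)]
toy_types = {0: ["Gene"], 1: ["Gene"], 2: ["X"], 3: ["X"], 4: ["Gene"]}
groups = isomorphic_groups(toy_edges, toy_types)
print(groups)
```

In the toy graph, nodes 0 and 1 share both neighbours and the type Gene, so swapping their IDs leaves the graph unchanged.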
A tree is a connected component with n nodes and n-1 edges. We have detected 2 trees in the graph, involving a total of 13 nodes and 22 edges, with the largest one involving 7 nodes and 12 edges. The detected trees, sorted by decreasing size, are:
Tree starting from the root node 2350364 (degree 2 and node types Chemical and Compound), and containing 7 nodes, with a maximal depth of 2, which are 2623961 (degree 3 and node type Product), 2623957 (degree 3 and node type Product), 2350618 (node types Chemical and Compound), 2350617 (node types Chemical and Compound) and 2350534 (node types Chemical and Compound). Its nodes have 3 node types, which are Chemical (4 nodes), Compound (4 nodes) and Product (2 nodes). Its edges have a single edge type, which is HAS_ChPR.
Tree starting from the root node 2600327 (degree 2 and node type Product), and containing 6 nodes, with a maximal depth of 2, which are 2346992 (degree 3 and node types Chemical and Compound), 2346993 (node types Chemical and Compound), 2600328 (node type Product), 2600329 (node type Product) and 2600330 (node type Product). Its nodes have 3 node types, which are Product (3 nodes), Chemical (2 nodes) and Compound (2 nodes). Its edges have a single edge type, which is HAS_ChPR.
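The tree criterion (n nodes, n-1 edges within one connected component) can be checked with a breadth-first search per component. The sketch below counts each undirected edge once and ignores parallel edges, which is an assumption; the reported figures of 7 nodes with 12 edges appear consistent with each undirected edge being counted in both directions, since 2 × (7 − 1) = 12.

```python
from collections import defaultdict, deque


def tree_components(edges):
    """Return the node sets of components that have exactly n - 1 distinct undirected edges."""
    adjacency = defaultdict(set)
    unique_edges = set()
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
        unique_edges.add(frozenset((src, dst)))  # (a, b) and (b, a) collapse into one

    seen, trees = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search to collect the component containing `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in adjacency[node] - component:
                component.add(other)
                queue.append(other)
        seen |= component
        # Simple but not optimised: scan all edges for each component.
        edge_count = sum(1 for edge in unique_edges if edge <= component)
        if edge_count == len(component) - 1:
            trees.append(component)
    return trees


toy_edges = [(0, 1), (1, 0), (1, 2), (3, 4), (4, 5), (5, 3)]
trees = tree_components(toy_edges)
print(trees)
```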
A dendritic tree is a tree-like structure starting from a root node that is part of another strongly connected component. We have detected 557 dendritic trees in the graph, involving a total of 33.12K nodes (1.15%) and 66.25K edges (0.21%), with the largest one involving 20.76K nodes and 41.52K edges. The detected dendritic trees, sorted by decreasing size, are:
Dendritic tree starting from the root node 2339077 (degree 25.56K and node types Chemical and Compound), and containing 20.76K nodes, with a maximal depth of 3, which are 2468112 (node type Product), 2468128 (node type Product), 2468144 (node type Product), 2468160 (node type Product) and 2468176 (node type Product). Its nodes have 3 node types, which are Product (20.76K nodes, 0.72%), Compound and Chemical. Its edges have a single edge type, which is HAS_ChPR.
Dendritic tree starting from the root node 23540 (degree 4.26K and node types Chemical and Compound), and containing 1.01K nodes, with a maximal depth of 2, which are 2397872 (node type Product), 2397888 (node type Product), 2398224 (node type Product), 2398240 (node type Product) and 2398256 (node type Product). Its nodes have 12 node types, of which the 10 most common are Product (1.00K nodes, 0.03%), VariantAnnotation (2 nodes), SideEffect (2 nodes), Phenotype (2 nodes), VariantDrugAnnotation (2 nodes), GeneVariant (2 nodes), SingleNucleotideVariant (2 nodes), Variant (2 nodes), Chemical and Compound. Its edges have 6 edge types, which are HAS_ChPR (1.00K edges, 99.01%), ASSOCIATES_VAaV (4 edges, 0.40%), CAUSES_CHcSE (2 edges, 0.20%), ASSOCIATES_VAaCH (2 edges, 0.20%), PART_OF_CpoSA and INCLUDES_PCiCH.
Dendritic tree starting from the root node 2347592 (degree 1.11K and node types Chemical and Compound), and containing 932 nodes, with a maximal depth of 3, which are 2618080 (node type Product), 2618096 (node type Product), 2618112 (node type Product), 2618128 (node type Product) and 2618144 (node type Product). Its nodes have 3 node types, which are Product (931 nodes, 0.03%), Chemical and Compound. Its edges have a single edge type, which is HAS_ChPR.
Dendritic tree starting from the root node 2339298 (degree 34.52K and node types Chemical and Compound), and containing 617 nodes, with a maximal depth of 2, which are 140144 (node type Gene), 2352208 (node types Chemical, Compound and Salt), 2386160 (node type Product), 2386176 (node type Product) and 2386192 (node type Product). Its nodes have 13 node types, of which the 10 most common are Product (512 nodes, 0.02%), GeneVariant (79 nodes), Variant (79 nodes), SingleNucleotideVariant (74 nodes), Gene (21 nodes), Deletion (3 nodes), Phenotype (3 nodes), SideEffect (3 nodes), Chemical (2 nodes) and Salt (2 nodes). Its edges have 8 edge types, which are HAS_ChPR (512 edges, 73.56%), HAS_GhGV (158 edges, 22.70%), IS_ACTIVE_IN_METABOLISM_CHiaimG (16 edges, 2.30%), DOWNREGULATES_CHdG (3 edges, 0.43%), CAUSES_CHcSE (2 edges, 0.29%), PART_OF_CpoSA (2 edges, 0.29%), UPREGULATES_CHuG (2 edges, 0.29%) and MIGHT_CAUSES_CHmcSE.
Dendritic tree starting from the root node 23084 (degree 4.54K and node types Chemical and Compound), and containing 615 nodes, with a maximal depth of 3, which are 2444624 (node type Product), 2444640 (node type Product), 2444656 (node type Product), 2444672 (node type Product) and 2444688 (node type Product). Its nodes have 13 node types, of which the 10 most common are Product (604 nodes, 0.02%), SingleNucleotideVariant (3 nodes), GeneVariant (3 nodes), VariantAnnotation (3 nodes), Variant (3 nodes), VariantPhenotypeAnnotation (2 nodes), Compound (2 nodes), Chemical (2 nodes), Phenotype (2 nodes) and Salt (2 nodes). Its edges have 7 edge types, which are HAS_ChPR (604 edges, 97.58%), ASSOCIATES_VAaV (6 edges, 0.97%), ASSOCIATES_VAaCH (3 edges, 0.48%), HAS_GhV (2 edges, 0.32%), PART_OF_CpoSA (2 edges, 0.32%), MIGHT_CAUSES_CHmcSE and CAUSES_CHcSE.
Dendritic tree starting from the root node 23517 (degree 3.84K and node types Chemical and Compound), and containing 528 nodes, with a maximal depth of 2, which are 2411104 (node type Product), 2411120 (node type Product), 2411136 (node type Product), 2411152 (node type Product) and 2411168 (node type Product). Its nodes have 6 node types, which are Product (522 nodes, 0.02%), Phenotype (4 nodes), Chemical (2 nodes), SideEffect (2 nodes), Compound (2 nodes) and Salt (2 nodes). Its edges have 4 edge types, which are HAS_ChPR (522 edges, 98.49%), EQUAL_PTeSE (4 edges, 0.75%), PART_OF_CpoSA (2 edges, 0.38%) and CAUSES_CHcSE (2 edges, 0.38%).
And other 551 dendritic trees.
A star is a tree with a maximal depth of one, where nodes with maximal unique degree one are connected to a central root node with a high degree. We have detected 247 stars in the graph, involving a total of 2.08K nodes (0.07%) and 3.66K edges (0.01%), with the largest one involving 49 nodes and 96 edges. The detected stars, sorted by decreasing size, are:
Star starting from the root node 2857435 (degree 48), and containing 49 nodes, with a maximal depth of 1, which are 168128, 182352, 177400, 182344 and 182424. Its nodes have a single node type, which is Gene. Its edges have a single edge type, which is PRODUCES_GpP.
Star starting from the root node 2347626 (degree 35), and containing 36 nodes, with a maximal depth of 1, which are 2618992, 2619008, 2618984, 2619000 and 2619016. Its nodes have a single node type, which is Product. Its edges have a single edge type, which is HAS_ChPR.
Star starting from the root node 2346635 (degree 26), and containing 27 nodes, with a maximal depth of 1, which are 2597232, 2597224, 2597240, 2597236 and 2597228. Its nodes have a single node type, which is Product. Its edges have a single edge type, which is HAS_ChPR.
Star starting from the root node 2346640 (degree 25), and containing 26 nodes, with a maximal depth of 1, which are 2597328, 2597320, 2597336, 2597316 and 2597332. Its nodes have a single node type, which is Product. Its edges have a single edge type, which is HAS_ChPR.
Star starting from the root node 2346650 (degree 25), and containing 26 nodes, with a maximal depth of 1, which are 2597536, 2597528, 2597544, 2597540 and 2597532. Its nodes have a single node type, which is Product. Its edges have a single edge type, which is HAS_ChPR.
Star starting from the root node 2346647 (degree 23), and containing 24 nodes, with a maximal depth of 1, which are 2597456, 2597472, 2597464, 2597460 and 2597452. Its nodes have a single node type, which is Product. Its edges have a single edge type, which is HAS_ChPR.
And other 241 stars.
A dendritic star is a dendritic tree with a maximal depth of one, where nodes with maximal unique degree one are connected to a central root node with high degree and inside a strongly connected component. We have detected 15.70K dendritic stars in the graph, involving a total of 1.55M nodes (53.87%) and 3.09M edges (9.74%), with the largest one involving 14.89K nodes and 29.79K edges. The detected dendritic stars, sorted by decreasing size, are:
Dendritic star starting from the root node 13491 (degree 15.33K and node type Gene), and containing 14.89K nodes, with a maximal depth of 1, which are 66464 (node types GeneVariant, SingleNucleotideVariant and Variant), 66480 (node types GeneVariant, SingleNucleotideVariant and Variant), 66496 (node types GeneVariant, SingleNucleotideVariant and Variant), 66512 (node types GeneVariant, SingleNucleotideVariant and Variant) and 66528 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 13 node types, of which the 10 most common are GeneVariant (14.89K nodes, 0.52%), Variant (14.89K nodes, 0.52%), SingleNucleotideVariant (11.02K nodes, 0.38%), Deletion (2.23K nodes, 0.08%), Duplication (770 nodes, 0.03%), Insertion (319 nodes, 0.01%), Indel (258 nodes), Microsatellite (226 nodes), ProteinOnly (50 nodes) and CopyNumberGain (8 nodes). Its edges have a single edge type, which is HAS_GhGV.
Dendritic star starting from the root node 24314 (degree 13.58K and node type Gene), and containing 12.57K nodes, with a maximal depth of 1, which are 66400 (node types GeneVariant, SingleNucleotideVariant and Variant), 66416 (node types GeneVariant, SingleNucleotideVariant and Variant), 66432 (node types GeneVariant, SingleNucleotideVariant and Variant), 66448 (node types GeneVariant, SingleNucleotideVariant and Variant) and 66608 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 12 node types, of which the 10 most common are GeneVariant (12.57K nodes, 0.44%), Variant (12.57K nodes, 0.44%), SingleNucleotideVariant (9.27K nodes, 0.32%), Deletion (1.92K nodes, 0.07%), Duplication (635 nodes, 0.02%), Insertion (351 nodes, 0.01%), Indel (190 nodes), Microsatellite (162 nodes), ProteinOnly (31 nodes) and Variation (12 nodes). Its edges have a single edge type, which is HAS_GhGV.
Dendritic star starting from the root node 11224 (degree 10.87K and node type Gene), and containing 10.17K nodes, with a maximal depth of 1, which are 52400 (node types GeneVariant, SingleNucleotideVariant and Variant), 52448 (node types GeneVariant, SingleNucleotideVariant and Variant), 68064 (node types GeneVariant, SingleNucleotideVariant and Variant), 68496 (node types GeneVariant, SingleNucleotideVariant and Variant) and 79968 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 10 node types, which are GeneVariant (10.17K nodes, 0.35%), Variant (10.17K nodes, 0.35%), SingleNucleotideVariant (8.91K nodes, 0.31%), Deletion (732 nodes, 0.03%), Duplication (273 nodes), Microsatellite (108 nodes), Indel (85 nodes), Insertion (63 nodes), Inversion (2 nodes) and Complex. Its edges have a single edge type, which is HAS_GhGV.
Dendritic star starting from the root node 35181 (degree 10.16K and node type Gene), and containing 9.53K nodes, with a maximal depth of 1, which are 68128 (node types GeneVariant, SingleNucleotideVariant and Variant), 79168 (node types GeneVariant, SingleNucleotideVariant and Variant), 79200 (node types GeneVariant, SingleNucleotideVariant and Variant), 79248 (node types GeneVariant, SingleNucleotideVariant and Variant) and 79264 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 11 node types, of which the 10 most common are Variant (9.53K nodes, 0.33%), GeneVariant (9.53K nodes, 0.33%), SingleNucleotideVariant (7.31K nodes, 0.25%), Deletion (1.32K nodes, 0.05%), Duplication (495 nodes, 0.02%), Microsatellite (130 nodes), Insertion (130 nodes), Indel (116 nodes), CopyNumberGain (26 nodes) and CopyNumberLoss (4 nodes). Its edges have a single edge type, which is HAS_GhGV.
Dendritic star starting from the root node 208 (degree 21.42K and node type Gene), and containing 9.04K nodes, with a maximal depth of 1, which are 71856 (node types GeneVariant, SingleNucleotideVariant and Variant), 71872 (node types GeneVariant, SingleNucleotideVariant and Variant), 71888 (node types GeneVariant, SingleNucleotideVariant and Variant), 71904 (node types GeneVariant, SingleNucleotideVariant and Variant) and 71920 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 11 node types, of which the 10 most common are GeneVariant (9.04K nodes, 0.31%), Variant (9.04K nodes, 0.31%), SingleNucleotideVariant (8.25K nodes, 0.29%), Deletion (433 nodes, 0.02%), Duplication (163 nodes), Microsatellite (116 nodes), Indel (40 nodes), Insertion (30 nodes), Inversion (3 nodes) and CopyNumberGain. Its edges have a single edge type, which is HAS_GhGV.
Dendritic star starting from the root node 29687 (degree 8.26K and node type Gene), and containing 7.74K nodes, with a maximal depth of 1, which are 73408 (node types GeneVariant, SingleNucleotideVariant and Variant), 73424 (node types GeneVariant, SingleNucleotideVariant and Variant), 73440 (node types GeneVariant, SingleNucleotideVariant and Variant), 73456 (node types GeneVariant, SingleNucleotideVariant and Variant) and 73472 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 11 node types, of which the 10 most common are Variant (7.74K nodes, 0.27%), GeneVariant (7.74K nodes, 0.27%), SingleNucleotideVariant (6.56K nodes, 0.23%), Deletion (637 nodes, 0.02%), Duplication (319 nodes, 0.01%), Microsatellite (103 nodes), Indel (76 nodes), Insertion (46 nodes), ProteinOnly and Inversion. Its edges have a single edge type, which is HAS_GhGV.
And other 15.70K dendritic stars.
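A rough way to spot star-like structures such as these is to group nodes that have exactly one distinct neighbour by that neighbour. This sketch is a simplification under stated assumptions: it does not check that the root belongs to a larger connected component, so it cannot tell plain stars from dendritic stars, and it is not the report's detection algorithm.

```python
from collections import defaultdict


def leaves_by_root(edges):
    """Map each root to its sorted leaves (nodes whose only distinct neighbour is that root)."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # self-loops are ignored in this sketch
            adjacency[src].add(dst)
            adjacency[dst].add(src)

    groups = defaultdict(list)
    for node, adjacent in adjacency.items():
        if len(adjacent) == 1:
            (root,) = adjacent
            groups[root].append(node)

    # Require at least two leaves so that isolated pairs are not reported.
    return {root: sorted(leaves) for root, leaves in groups.items() if len(leaves) > 1}


# Node 0 has two leaves (1 and 2) and is also attached to the cycle 3-4-5.
toy_edges = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5), (5, 3)]
stars = leaves_by_root(toy_edges)
print(stars)
```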
A dendritic tendril star is a dendritic tree with a depth greater than one, where the arms of the star are tendrils. We have detected 324 dendritic tendril stars in the graph, involving a total of 27.44K nodes (0.96%) and 54.89K edges (0.17%), with the largest one involving 2.74K nodes and 5.47K edges. The detected dendritic tendril stars, sorted by decreasing size, are:
Dendritic tendril star starting from the root node 2340462 (degree 3.81K and node types Chemical and Compound), and containing 2.74K nodes, with a maximal depth of 2, which are 2545936 (node type Product), 2545968 (node type Product), 2545984 (node type Product), 2546000 (node type Product) and 2546016 (node type Product). Its nodes have 3 node types, which are Product (2.73K nodes, 0.10%), Chemical and Compound. Its edges have a single edge type, which is HAS_ChPR.
Dendritic tendril star starting from the root node 2350055 (degree 5.50K and node types Chemical and Compound), and containing 2.07K nodes, with a maximal depth of 2, which are 2621056 (node type Product), 2621072 (node type Product), 2621088 (node type Product), 2621104 (node type Product) and 2621120 (node type Product). Its nodes have 3 node types, which are Product (2.07K nodes, 0.07%), Chemical and Compound. Its edges have a single edge type, which is HAS_ChPR.
Dendritic tendril star starting from the root node 2662 (degree 1.71K and node type Gene), and containing 1.60K nodes, with a maximal depth of 2, which are 82592 (node types GeneVariant, SingleNucleotideVariant and Variant), 91440 (node types GeneVariant, SingleNucleotideVariant and Variant), 876032 (node types GeneVariant, SingleNucleotideVariant and Variant), 886896 (node types GeneVariant, SingleNucleotideVariant and Variant) and 909104 (node types GeneVariant, SingleNucleotideVariant and Variant). Its nodes have 9 node types, which are GeneVariant (1.60K nodes, 0.06%), Variant (1.60K nodes, 0.06%), SingleNucleotideVariant (1.49K nodes, 0.05%), Deletion (50 nodes), Microsatellite (29 nodes), Duplication (18 nodes), Indel (7 nodes), Insertion (7 nodes) and Gene. Its edges have 2 edge types, which are HAS_GhGV (1.58K edges, 98.94%) and HAS_GhV (17 edges, 1.06%).
Dendritic tendril star starting from the root node 2346404 (degree 2.61K and node types Chemical and Compound), and containing 1.28K nodes, with a maximal depth of 2, which are 2578512 (node type Product), 2578528 (node type Product), 2578544 (node type Product), 2578560 (node type Product) and 2578576 (node type Product). Its nodes have 3 node types, which are Product (1.28K nodes, 0.04%), Chemical and Compound. Its edges have a single edge type, which is HAS_ChPR.
Dendritic tendril star starting from the root node 23580 (degree 6.03K and node types Chemical and Compound), and containing 1.16K nodes, with a maximal depth of 2, which are 2496240 (node type Product), 2496256 (node type Product), 2496272 (node type Product), 2496288 (node type Product) and 2496320 (node type Product). Its nodes have 6 node types, which are Product (1.16K nodes, 0.04%), VariantAnnotation, Variant, VariantPhenotypeAnnotation, GeneVariant and SingleNucleotideVariant. Its edges have 3 edge types, which are HAS_ChPR (1.16K edges, 99.74%), ASSOCIATES_VAaV (2 edges, 0.17%) and ASSOCIATES_VAaCH.
Dendritic tendril star starting from the root node 2339131 (degree 5.25K and node types Chemical and Compound), and containing 1.12K nodes, with a maximal depth of 2, which are 2399728 (node type Product), 2399744 (node type Product), 2399760 (node type Product), 2399776 (node type Product) and 2399792 (node type Product). Its nodes have 12 node types, of which the 10 most common are Product (1.11K nodes, 0.04%), Phenotype (5 nodes), SideEffect (5 nodes), Compound (3 nodes), Salt (3 nodes), Chemical (3 nodes), GeneVariant, Variant, VariantAnnotation and PharmacologicClass. Its edges have 6 edge types, which are HAS_ChPR (1.11K edges, 98.93%), CAUSES_CHcSE (5 edges, 0.45%), PART_OF_CpoSA (3 edges, 0.27%), ASSOCIATES_VAaV (2 edges, 0.18%), INCLUDES_PCiCH and ASSOCIATES_VAaCH.
And other 318 dendritic tendril stars.
A tendril is a path starting from a node of degree one, connected to a strongly connected component. We have detected 10.51K tendrils in the graph, involving a total of 10.75K nodes (0.37%) and 21.49K edges (0.07%), with the largest one involving 3 nodes and 6 edges. The detected tendrils, sorted by decreasing size, are:
Tendril starting from the root node 2626507 (degree 8), and containing 3 nodes, with a maximal depth of 3, which are 2627201, 2629047 and 2632367. Its nodes have 2 node types, which are Phenotype (3 nodes) and Symptom (3 nodes). Its edges have a single edge type, which is IS_A_SiaS.
Tendril starting from the root node 323928 (degree 153), and containing 3 nodes, with a maximal depth of 3, which are 325012, 325013 and 325511. Its nodes have a single node type, which is CellularComponent. Its edges have a single edge type, which is IS_A_CCiaCC.
Tendril starting from the root node 122236 (degree 22 and node types Disease and Phenotype), and containing 3 nodes, with a maximal depth of 3, which are 108161 (node types Disease and Phenotype), 117977 (node types Disease and Phenotype) and 132819 (node type Gene). Its nodes have 3 node types, which are Phenotype (2 nodes), Disease (2 nodes) and Gene. Its edges have 2 edge types, which are IS_A_DiaD (3 edges) and ASSOCIATES_DaG (2 edges).
Tendril starting from the root node 314508 (degree 61), and containing 3 nodes, with a maximal depth of 3, which are 315542, 315543 and 315544. Its nodes have a single node type, which is BiologicalProcess. Its edges have a single edge type, which is IS_A_BPiaBP.
Tendril starting from the root node 290330 (degree 9), and containing 3 nodes, with a maximal depth of 3, which are 286756, 289678 and 286757. Its nodes have a single node type, which is MolecularFunction. Its edges have a single edge type, which is IS_A_MFiaMF.
Tendril starting from the root node 302490 (degree 6), and containing 3 nodes, with a maximal depth of 3, which are 296836, 296843 and 296837. Its nodes have a single node type, which is BiologicalProcess. Its edges have a single edge type, which is IS_A_BPiaBP.
And other 10.50K tendrils.
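For readers who want to reproduce something similar outside of the report, here is a minimal sketch of how tendrils could be detected. It is a simplification and not the library's actual implementation: it treats the graph as undirected, takes a "tendril" to be a chain that starts at a degree-one node and runs through degree-two nodes until it reaches a node of degree three or more (the root), and it does not group sibling tendrils into stars. The function name `find_tendrils` is hypothetical.

```python
def find_tendrils(edges):
    """Return (root, path) pairs sorted by decreasing path length.

    Assumption: `edges` is an iterable of (src, dst) pairs, read as undirected.
    """
    adjacency = {}
    for src, dst in edges:
        if src == dst:
            continue  # self-loops are ignored in this sketch
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)

    tendrils = []
    for leaf, neighbours in adjacency.items():
        if len(neighbours) != 1:
            continue  # only degree-one nodes can start a tendril
        path = [leaf]
        previous, current = leaf, next(iter(neighbours))
        # Walk along the chain while the nodes have exactly two neighbours.
        while len(adjacency[current]) == 2:
            path.append(current)
            following = next(n for n in adjacency[current] if n != previous)
            previous, current = current, following
        # A chain ending in another degree-one node is an isolated path,
        # not a tendril attached to a larger component.
        if len(adjacency[current]) >= 3:
            tendrils.append((current, path))
    return sorted(tendrils, key=lambda item: len(item[1]), reverse=True)


# Toy graph: a triangle 0-1-2, a chain 0-3-4-5 and an isolated path 10-11.
toy_edges = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 5), (10, 11)]
toy_tendrils = find_tendrils(toy_edges)
```

On the toy graph, the only tendril found is the chain 5, 4, 3 hanging off root node 0.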
First-order LINE embedding. Do note that the embedding is unsupervised. That is, it has access to all edges in the graph but does not have access to node types or edge types. This means that any separability achieved here implies that this simple model can learn these characteristics.
t-SNE decomposition and properties distribution of the PharMeBINet graph using the First-order LINE node embedding: (a) Node degrees heatmap. (b) Node types: 'GeneVariant' in blue, 'Interaction' in orange, 'Product' in red, 'Chemical' in cyan, 'Deletion' in green, 'Gene' in yellow, 'Duplication' in purple, and Other 46 node types in pink. The node types do not appear to form recognizable clusters (Balanced accuracy: 43.75% ± 0.38%). (c) Existent and non-existent edges: 'Non-existent' in blue and 'Existent' in orange. The edge prediction forms some clusters (Balanced accuracy: 78.61% ± 2.17%). (d) Euclidean distance heatmap. This metric is a good edge prediction feature (Balanced accuracy: 88.14% ± 0.20%). (e) Cosine similarity heatmap. This metric is a good edge prediction feature (Balanced accuracy: 88.14% ± 0.21%). Do note that the cosine similarity has been shifted from the range of [-1, 1] to the range [0, 2] to be visualized in a logarithmic heatmap. (f) Adamic-Adar heatmap. This metric is a good edge prediction feature (Balanced accuracy: 69.69% ± 0.62%). (g) Jaccard Coefficient heatmap. This metric is a good edge prediction feature (Balanced accuracy: 70.17% ± 0.58%). (h) Preferential Attachment heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 57.26% ± 0.61%). (j) Resource Allocation Index heatmap. This metric is a good edge prediction feature (Balanced accuracy: 71.60% ± 0.52%). (k) Edge types: 'RESEMBLES CRC' in blue, 'MIGHT SUBCELLULAR LOCATES ImslCC' in orange, 'INTERACTS CiC' in red, 'INTERACTS PiI' in cyan, 'INTERACTS IiP' in green, 'EXPRESSES AeG' in yellow, 'REGULATES GRG' in purple, and Other 174 edge types in pink. The edge types form some clusters (Balanced accuracy: 71.68% ± 0.85%). (i) Euclidean distance distribution. Euclidean distance values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (l) Cosine similarity distribution. 
Cosine similarity values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (m) Adamic-Adar distribution. Adamic-Adar values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (n) Jaccard Coefficient distribution. Jaccard Coefficient values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (o) Preferential Attachment distribution. Preferential Attachment values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (p) Resource Allocation Index distribution. Resource Allocation Index values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.
In heatmaps a, d, e, f, g, h, and j, low and high values appear in red and blue hues, respectively. Intermediate values appear in either a yellow or cyan hue. The values are on a logarithmic scale. The separability considerations for figures b, c, d, e, f, g, h, j, and k derive from evaluating a Decision Tree trained on five Monte Carlo holdouts, with a 70/30 split between training and test sets. We have sampled 10.0 thousand existing and 10.0 thousand non-existing edges. To avoid biases, we have sampled the non-existent edges' source and destination nodes while excluding any disconnected nodes present in the graph.
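As a reference for the topological edge metrics named in the caption (Jaccard Coefficient, Adamic-Adar, Resource Allocation Index and Preferential Attachment), here is a small sketch using their textbook definitions on neighbour sets. It assumes an undirected reading of the graph and two distinct endpoints; the helper names `build_neighbours` and `edge_features` are made up for illustration, and the library may compute these values differently, notably on the directed version.

```python
import math


def build_neighbours(edges):
    """Map every node to the set of its neighbours (undirected, no self-loops)."""
    neighbours = {}
    for src, dst in edges:
        if src == dst:
            continue
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)
    return neighbours


def edge_features(neighbours, u, v):
    """Textbook edge metrics for the pair (u, v), assuming u != v."""
    nu, nv = neighbours[u], neighbours[v]
    common = nu & nv
    union = nu | nv
    return {
        # Share of neighbours that the two nodes have in common.
        "jaccard": len(common) / len(union) if union else 0.0,
        # Common neighbours weighted by the inverse log of their degree.
        # A common neighbour of two distinct nodes has degree >= 2, so log > 0.
        "adamic_adar": sum(1.0 / math.log(len(neighbours[z])) for z in common),
        # Common neighbours weighted by the inverse of their degree.
        "resource_allocation": sum(1.0 / len(neighbours[z]) for z in common),
        # Product of the two node degrees.
        "preferential_attachment": len(nu) * len(nv),
    }


toy_neighbours = build_neighbours(
    [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")]
)
toy_features = edge_features(toy_neighbours, "a", "d")
```

In the toy graph, nodes `a` and `d` share both of their neighbours, so their Jaccard Coefficient is 1 and their Preferential Attachment is 2 × 2 = 4.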
Here is the second-order LINE embedding; similar considerations apply.
t-SNE decomposition and properties distribution of the PharMeBINet graph using the Second-order LINE node embedding: (a) Node degrees heatmap. (b) Node types: 'GeneVariant' in blue, 'Interaction' in orange, 'Product' in red, 'Chemical' in cyan, 'Deletion' in green, 'Gene' in yellow, 'Duplication' in purple, and Other 46 node types in pink. The node types do not appear to form recognizable clusters (Balanced accuracy: 23.56% ± 0.08%). (c) Existent and non-existent edges: 'Non-existent' in blue and 'Existent' in orange. The edge prediction forms recognizable clusters (Balanced accuracy: 82.65% ± 0.19%). (d) Euclidean distance heatmap. This metric is a good edge prediction feature (Balanced accuracy: 73.68% ± 0.56%). (e) Cosine similarity heatmap. This metric is a good edge prediction feature (Balanced accuracy: 73.82% ± 0.45%). Do note that the cosine similarity has been shifted from the range of [-1, 1] to the range [0, 2] to be visualized in a logarithmic heatmap. (f) Adamic-Adar heatmap. This metric is a good edge prediction feature (Balanced accuracy: 69.69% ± 0.62%). (g) Jaccard Coefficient heatmap. This metric is a good edge prediction feature (Balanced accuracy: 70.17% ± 0.58%). (h) Preferential Attachment heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 57.26% ± 0.61%). (j) Resource Allocation Index heatmap. This metric is a good edge prediction feature (Balanced accuracy: 71.60% ± 0.52%). (k) Edge types: 'RESEMBLES CRC' in blue, 'MIGHT SUBCELLULAR LOCATES ImslCC' in orange, 'INTERACTS CiC' in red, 'INTERACTS PiI' in cyan, 'INTERACTS IiP' in green, 'EXPRESSES AeG' in yellow, 'REGULATES GRG' in purple, and Other 174 edge types in pink. The edge types form recognizable clusters (Balanced accuracy: 83.36% ± 4.07%). (i) Euclidean distance distribution. Euclidean distance values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (l) Cosine similarity distribution. 
Cosine similarity values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (m) Adamic-Adar distribution. Adamic-Adar values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (n) Jaccard Coefficient distribution. Jaccard Coefficient values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (o) Preferential Attachment distribution. Preferential Attachment values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale. (p) Resource Allocation Index distribution. Resource Allocation Index values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.
In heatmaps a, d, e, f, g, h, and j, low and high values appear in red and blue hues, respectively. Intermediate values appear in either a yellow or cyan hue. The values are on a logarithmic scale. The separability considerations for figures b, c, d, e, f, g, h, j, and k derive from evaluating a Decision Tree trained on five Monte Carlo holdouts, with a 70/30 split between training and test sets. We have sampled 10.0 thousand existing and 10.0 thousand non-existing edges. To avoid biases, we have sampled the non-existent edges' source and destination nodes while excluding any disconnected nodes present in the graph.
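To make the "Balanced accuracy: mean ± standard deviation" figures easier to interpret, here is a rough sketch of the evaluation protocol: five Monte Carlo holdouts with a 70/30 split, scored with balanced accuracy. To stay dependency-free it swaps the Decision Tree for a single-threshold decision stump on one feature, so it is a stand-in and not the model used for the plots. All function names and the toy data are assumptions for illustration.

```python
import random
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        recalls.append(sum(y_pred[i] == label for i in indices) / len(indices))
    return mean(recalls)


def predict_stump(xs, threshold, label_above):
    """Predict `label_above` at or above the threshold, the other label below."""
    return [label_above if x >= threshold else 1 - label_above for x in xs]


def fit_stump(xs, ys):
    """Pick the threshold and polarity with the best training balanced accuracy."""
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    best_score, best_threshold, best_label = -1.0, candidates[0], 1
    for threshold in candidates:
        for label_above in (0, 1):
            score = balanced_accuracy(ys, predict_stump(xs, threshold, label_above))
            if score > best_score:
                best_score, best_threshold, best_label = score, threshold, label_above
    return best_threshold, best_label


def monte_carlo_holdouts(xs, ys, holdouts=5, train_fraction=0.7, seed=42):
    """Return mean and standard deviation of the test balanced accuracy."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    scores = []
    for _ in range(holdouts):
        rng.shuffle(indices)  # a fresh random split for every holdout
        cut = int(train_fraction * len(indices))
        train, test = indices[:cut], indices[cut:]
        threshold, label_above = fit_stump(
            [xs[i] for i in train], [ys[i] for i in train]
        )
        predictions = predict_stump([xs[i] for i in test], threshold, label_above)
        scores.append(balanced_accuracy([ys[i] for i in test], predictions))
    return mean(scores), stdev(scores)


# Toy feature: "non-existent" edges (label 0) have small values,
# "existent" edges (label 1) have large values.
toy_xs = [i / 100 for i in range(50)] + [10 + i / 100 for i in range(50)]
toy_ys = [0] * 50 + [1] * 50
toy_mean, toy_std = monte_carlo_holdouts(toy_xs, toy_ys)
```

Because the two toy classes are cleanly separated, every holdout reaches a balanced accuracy of 1; on real embeddings the spread across holdouts is what the ± term in the captions reports.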
Do you have a Twitter handle? As soon as I integrate your graph with my library, I'd like to share an animation of it on the library's Twitter account.
Thank you for the analysis results. We will have an in-depth look at the results and your tool on Monday. My Twitter handle is the same as on GitHub: @AstrorEnales
As mentioned, I have shared a couple of animations of tasks (edge-label and edge prediction) executed on your graph on Twitter. You should find yourself tagged.
I have now integrated this version of your graph into my library's graph retrieval; please let me know when you publish future versions if you'd like to see them integrated. It's just a matter of a couple of minutes.
The graph can now be readily retrieved by using:
from grape.datasets.zenodo import PharMeBINet
graph = PharMeBINet()
You can find the tutorial to run what you see in the last few comments and the Twitter post here.
Hello, and thank you for creating this resource.
Could you make this resource available in formats more open than the Neo4j one, such as one TSV for the node list and one for the edge list? Having the resource available in TSV would let me use it with the plethora of graph processing tools other than Neo4j.
Something of the sort, for instance:
Edge list
Thank you, Luca