cthoyt / obo-foundry-graph

Demonstrate combining all OBO Foundry ontologies via Bioregistry, Bioontologies, and ROBOT
MIT License

CBOW embedding visualization #3

Open LucaCappelletti94 opened 2 years ago

LucaCappelletti94 commented 2 years ago

Analogous to the TransE visualization, but this time with CBOW (first-order random walk sampling).
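For readers unfamiliar with the term, here is a minimal sketch of what first-order random walk sampling and CBOW training pairs look like. It is not the code used to produce the embedding in this issue: the function names `first_order_walks` and `cbow_pairs`, the toy graph, and all parameter values are assumptions made for illustration only.

```python
import random


def first_order_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Sample uniform (first-order) random walks starting from every node.

    "First order" means the next node depends only on the current node,
    with every neighbour being equally likely.
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


def cbow_pairs(walk, window):
    """Build (context, target) pairs: CBOW predicts the target from its context."""
    pairs = []
    for i, target in enumerate(walk):
        context = walk[max(0, i - window):i] + walk[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


# Hypothetical toy graph, NOT the OBO Foundry graph.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
walks = first_order_walks(toy_graph, walk_length=5, walks_per_node=2)
pairs = cbow_pairs(walks[0], window=2)
```

Each pair would then feed a CBOW model, which learns node vectors by predicting the central node from the surrounding nodes of the walk.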

[Figure: CBOW embedding visualization, panels (a) through (q)]

t-SNE decomposition and property distributions of the OBO Foundry graph using the Node2Vec CBOW node embedding:

(a) Node degrees heatmap.

(b) Detected node ontologies: 'NCBI Taxonomy' in blue, 'Unknown' in orange, 'CHEBI' in red, 'Foundational Model of Anatomy Ontology (subset)' in cyan, 'NCBI Gene' in green, 'PRotein Ontology (PRO)' in yellow, 'Gene Ontology' in purple, and 53 other ontologies in pink. The node ontologies do not appear to form recognizable clusters (Balanced accuracy: 48.33% ± 1.40%).

(c) Connected components: 'Main component' in blue, 'Minor components' in orange, and 'Tuples' in red. The components do not appear to form recognizable clusters (Balanced accuracy: 52.48% ± 9.14%).

(d) Existent and non-existent edges: 'Existent' in blue and 'Non-existent' in orange. The existent and non-existent edges form some possible clusters (Balanced accuracy: 62.76% ± 0.58%).

(e) Euclidean distance heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 59.79% ± 0.15%).

(f) Cosine similarity heatmap. This metric is a good edge prediction feature (Balanced accuracy: 77.48% ± 0.21%). Note that the cosine similarity has been shifted from the range [-1, 1] to the range [0, 2] so that it can be visualized in a logarithmic heatmap.

(g) Adamic-Adar heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 56.50% ± 0.08%).

(h) Jaccard Coefficient heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 56.37% ± 0.11%).

(i) Edge types: 'RDFS subClassOf' in blue, 'HTTP //WWW.obofoundry.org/ro/ro.owl#has proper part' in orange, 'Ro 0002160' in red, 'PR#has gene template' in cyan, 'Ro 0000087' in green, 'Bfo 0000050' in yellow, 'Bfo 0000051' in purple, and 935 other edge types in pink. The edge types do not appear to form recognizable clusters (Balanced accuracy: 32.67% ± 1.45%).

(j) Preferential Attachment heatmap. This metric is not useful as an edge prediction feature (Balanced accuracy: 54.65% ± 0.11%).

(k) Resource Allocation Index heatmap. This metric may be considered an edge prediction feature (Balanced accuracy: 56.52% ± 0.09%).

(l) Euclidean distance distribution. Euclidean distance values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

(m) Cosine similarity distribution. Cosine similarity values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

(n) Adamic-Adar distribution. Adamic-Adar values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

(o) Jaccard Coefficient distribution. Jaccard Coefficient values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

(p) Preferential Attachment distribution. Preferential Attachment values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

(q) Resource Allocation Index distribution. Resource Allocation Index values are on the horizontal axis and edge counts are on the vertical axis on a logarithmic scale.

In the heatmaps (a, e, f, g, h, j, and k), low and high values appear in red and blue hues, respectively. Intermediate values appear in either a yellow or cyan hue. The values are on a logarithmic scale. The separability considerations for panels b through k derive from evaluating a Decision Tree trained on five Monte Carlo holdouts, with a 70/30 split between training and test sets. We sampled 20,000 existing and 20,000 non-existing edges. The source and destination nodes of the non-existent edges were sampled while avoiding any disconnected nodes present in the graph, to avoid biases.
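To make the balanced accuracy figures in the caption easier to interpret, the evaluation protocol can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that produced the numbers above: it uses only the standard library, swaps the Decision Tree for a single-threshold stump (the simplest possible tree), and runs on synthetic 2D vectors rather than the CBOW embedding. All function names here are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def shifted_cosine(u, v):
    """Cosine similarity shifted from [-1, 1] to [0, 2], as in panel (f)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm + 1.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, over the classes present in y_true."""
    recalls = []
    for label in (0, 1):
        indices = [i for i, y in enumerate(y_true) if y == label]
        if indices:
            hits = sum(1 for i in indices if y_pred[i] == label)
            recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def fit_stump(xs, ys):
    """Stand-in for the Decision Tree: best threshold t for `x >= t -> 1`."""
    best_t, best_score = None, -1.0
    for t in sorted(set(xs)):
        score = balanced_accuracy(ys, [int(x >= t) for x in xs])
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def monte_carlo_holdouts(xs, ys, holdouts=5, train_fraction=0.7, seed=0):
    """Repeat a random 70/30 split and score the held-out 30% each time."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    scores = []
    for _ in range(holdouts):
        rng.shuffle(indices)
        cut = int(train_fraction * len(indices))
        train, test = indices[:cut], indices[cut:]
        t = fit_stump([xs[i] for i in train], [ys[i] for i in train])
        preds = [int(xs[i] >= t) for i in test]
        scores.append(balanced_accuracy([ys[i] for i in test], preds))
    return mean(scores), stdev(scores)


# Synthetic data (assumption): "existent" edges join similar unit vectors,
# "non-existent" edges join vectors separated by a uniformly random angle.
rng = random.Random(42)
features, labels = [], []
for label in (1, 0):
    for _ in range(200):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        delta = rng.gauss(0.0, 0.5) if label == 1 else rng.uniform(0.0, 2.0 * math.pi)
        u = (math.cos(angle), math.sin(angle))
        v = (math.cos(angle + delta), math.sin(angle + delta))
        features.append(shifted_cosine(u, v))
        labels.append(label)

mean_score, std_score = monte_carlo_holdouts(features, labels)
print(f"Balanced accuracy: {mean_score:.2%} ± {std_score:.2%}")
```

A balanced accuracy near 50% means the feature separates the two classes no better than chance, which is how the "do not appear to form recognizable clusters" statements in the caption should be read.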

LucaCappelletti94 commented 2 years ago

Here is the compressed CSV with the CBOW embedding.
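A minimal sketch of reading such a file with the standard library follows. The layout is an assumption, since the attachment itself is not shown here: gzip compression, a header row, node names in the first column, and one embedding dimension per remaining column. The file name and the values written below are toy placeholders, not the real embedding.

```python
import csv
import gzip
import os
import tempfile


def load_embedding(path):
    """Read a gzip-compressed CSV into a {node name: vector} dictionary.

    Assumes a header row and the node name in the first column.
    """
    embedding = {}
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for row in reader:
            embedding[row[0]] = [float(value) for value in row[1:]]
    return embedding


# Write a tiny toy file so the example runs on its own (hypothetical values).
toy_path = os.path.join(tempfile.mkdtemp(), "toy_cbow_embedding.csv.gz")
with gzip.open(toy_path, mode="wt", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["node", "0", "1"])
    writer.writerow(["GO:0008150", 0.25, -0.5])
    writer.writerow(["CHEBI:24431", 1.0, 0.125])

embedding = load_embedding(toy_path)
print(len(embedding), "nodes,", len(embedding["GO:0008150"]), "dimensions")
```

If the attachment uses a different compression format or column layout, `load_embedding` would need to be adjusted accordingly.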